Interesting ideas interspersed with nonsense - RSS - by nikhil bhatla, -

P2P Lending + SVM = $$$?
Jul 7, 2010, 1:20p - Investing

Since 2004, I've been intrigued by microfinance. It all started when I read Muhammad Yunus' book "Banker to the Poor". In it, Yunus describes how he was able to improve the lives of poor women in Bangladesh by providing them with small loans (often $50 or less). Not only did the lenders get a reliable return, but the borrowers also got the means to pull themselves out of poverty. I'd never been that interested in investing (for a variety of reasons), but this win-win approach really appealed to me. When I looked into investing my own money, I found that most services for lenders positioned the loan more as a temporary donation (with almost 0% interest) than as a competitive investment. The money would also be sent far away, which made me uncomfortable, so ultimately I never lent any money.

Peer-to-Peer (P2P) Lending
When Prosper started providing peer-to-peer (P2P) loans in the United States in 2005, my interest was piqued again. With their service, anyone in the US could go online and request to borrow up to $25,000. Lenders could then go to the site, look at the borrower's credit grade and other listing information, and decide to lend as little as $50 per loan. If the borrower got enough investors, their loan amount would be fully funded and they'd receive the loan (a supposed "wisdom of crowds" approach). Prosper was set up as an auction, where the borrower specified the highest interest rate they were willing to borrow at, and lenders specified the lowest interest rate they were willing to lend at. The final interest rate ended up somewhere in between. The loan term was 3 years, with a fixed monthly payment of principal and interest.

By cutting the bank out as the middleman, borrowers were able to get unsecured loans at interest rates lower than what credit cards or banks offer (from 8-20%). Likewise, lenders were able to get interest rates higher than bank CDs and perhaps even the stock market. Again, a win-win situation: you help people buy a car or get an engagement ring, and you get a good interest rate in return.

So in 2008, I started lending money on Prosper. At first it seemed like it was going really well, with my loans averaging about 11% interest. I only lent to AA or A grade borrowers, so I thought that it was a relatively low-risk investment. Fast-forward to July 2010: even after diversifying with more than 100 loans, 13% have defaulted, leaving my annualized return at a meager 0.8%. So much for that idea...

Support Vector Machines (SVM)
As the defaults began to accumulate in 2009, I took a class on machine learning at MIT. Machine learning is a sub-field of computer science that focuses on developing software that can learn. What does that mean exactly? As an example, let's say that I want a computer to label faces in a photograph with each person's name. One approach would be to hand-code heuristics that identify each face. For example, any face with light brown skin, black hair, and thick eyebrows could be labeled "Nikhil". Of course, this approach is tricky and fraught with peril. Not only does a person have to figure out the feature values that uniquely identify an individual, but the computer also has to know what "hair" and "eyebrows" are, i.e. what those features actually mean. The first step is really time-consuming, brittle, and perhaps impossible as the number of unique faces increases. And the second step is still an unsolved problem in computer vision. In this heuristic approach, the computer doesn't learn: it's just programmed to label faces based on a set of human-specified features.

Contrast this to a machine learning approach. Rather than having a human figure out what sets of feature values go with each person's face, the computer would be "trained" in the same way that people are trained: by just being shown faces and the names attached to those faces. Then it's the computer's job to figure out what feature values go with the names. This type of learning is called "supervised" learning. After being shown a large number of face-name pairs, the program is "tested" by being presented with a face it's never seen before. The program's job is to decide, of the labeled faces it's already seen, which is most similar to the new face, and then label it with that person's name.
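This "find the most similar labeled example" idea can be sketched in a few lines of Python (the feature values and names below are invented for illustration; this is a 1-nearest-neighbor toy, not the SVM used later):

```python
import math

# Toy supervised learning: each "face" is a vector of hypothetical
# feature values (e.g. skin tone, hair darkness), paired with a name.
training_set = [
    ((0.8, 0.9), "Nikhil"),
    ((0.2, 0.1), "Alice"),
    ((0.3, 0.2), "Bob"),
]

def label_new_face(features, training_set):
    """Label a new face with the name of the most similar training
    face (1-nearest-neighbor by Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, name = min((dist(features, f), n) for f, n in training_set)
    return name

print(label_new_face((0.75, 0.85), training_set))  # → Nikhil
```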

I'm not sure if this sounds easy or hard to you, but in fact it is quite tricky and doesn't work very well, especially with complex visual stimuli such as faces. If you think about it, a face can be photographed from many different angles and under various lighting conditions. One amazing ability of the human brain is that it instantly identifies faces, while even the most cutting-edge computer programs are much less accurate and much slower.

So what does all of this have to do with P2P lending? What if instead of having to label faces (a hard problem of choosing from a set of n choices), the computer just had to identify loan requests that were likely to default (a simpler problem of choosing from a set of only 2 choices, will default or won't)? Labeling loans likely to default is a yes/no type of question, so maybe machine learning techniques could work in this more limited space. I was so excited by this idea that I decided to actually test it.

But before we get to the data analysis, let me explain the machine learning technique I used, which is called a support vector machine (SVM). I don't fully understand the math behind the technique, but I do understand graphically how it's supposed to work. So I'm going to use graphs to explain the concept.

For simplicity, let's just assume that we know only 2 things about each loan: the loan amount (with a range of $50-$25,000) and the borrower's credit score (with a range of 550-800). With this information, we can plot 10 imaginary loans on a 2-dimensional graph, as shown in Figure 1. 5 of the loans have defaulted (shown in red) and the remaining 5 are current or paid in full (shown in blue).


Figure 1: 10 imaginary loans plotted with respect to credit score and loan amount

Now, what a support vector machine does is find the line that best separates the 2 populations of points, as shown in Figure 2:


Figure 2: A support vector machine (SVM) finds the line that best separates the 2 data sets

The perfect separating line would have all defaulted loans on one side and all non-defaulted loans on the other. In the example above, perfect separation is impossible for a linear function, though possible with a more complex function that can wiggle around. To keep things simple, though, we'll focus on a linear separating function, which ends up looking like a straight line on a graph.
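For concreteness, here's what a linear decision rule looks like in code. The weights below are invented for illustration, not learned from any data (the original analysis used Matlab + libsvm):

```python
# Hypothetical linear separator: the line is w·x + b = 0, and the sign
# of w·x + b says which side of the line a loan falls on.
# These weights are made up for illustration, not fit to real data.
w = (-0.01, 0.0005)   # weights for (credit score, loan amount)
b = 0.0

def side_of_line(credit_score, loan_amount):
    """Positive result: the 'default' side; negative: 'no default'."""
    return w[0] * credit_score + w[1] * loan_amount + b

# A borrower with a 750 credit score asking for $5,000 lands on the
# 'no default' side of this particular line:
print(side_of_line(750, 5000) < 0)  # → True
```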

An SVM is a bit magical because it'll find the line that best separates the two categories of points, by minimizing the categorization error over the training set. The "training set" is the set of categorized points that the SVM gets to learn from before it draws its line (the 10 points in the current example). In the graph above, the error rate on the training set is 10%, because 1 of the 10 loans is on the wrong side of the line and miscategorized as not-defaulting, even though it did default.

Performance on the training set only means so much, though. An SVM's true utility is revealed on loans not included in the training set, where it can predict whether they will default or not depending on which side of the line they fall on. 2 new loans, whose fate the SVM does not know, are depicted as green and purple x's below. The green x is a $14,000 loan request from a borrower with a credit score of 690, and the purple x is a $6,000 loan request from a borrower with a credit score of 640.


Figure 3: Two new loans (the x's) plotted with respect to the separating line. The SVM predicts that the green x will default, while the purple x won't.

A naive investor might think that the larger loan to the more credit-worthy borrower is a better bet. But the SVM, having learned from the outcomes of the last 10 loans, would suggest skipping that loan and instead lending to the lower credit-score borrower, because they're predicted not to default.

In addition to categorizing a novel point, the SVM can also tell us how confident it is that its classification is correct. Graphically, this amounts to how far the point is from the separating line. The SVM is less confident about points closer to the line (near the category boundary), while more confident about points farther from the line. This subtlety will prove critical in the analysis described below.
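In code, that confidence measure is just the distance from the point to the line, |w·x + b| / ||w||. Using a hypothetical line whose weights I've invented so that the two Figure 3 loans land on the sides described above:

```python
import math

# Confidence as distance from a hypothetical separating line
# w·x + b = 0; the distance from a point x is |w·x + b| / ||w||.
# The weights are invented for illustration only.
w = (-0.01, 0.0005)   # (credit score weight, loan amount weight)
b = 0.0

def signed_distance(credit_score, loan_amount):
    raw = w[0] * credit_score + w[1] * loan_amount + b
    return raw / math.hypot(w[0], w[1])

green = signed_distance(690, 14000)    # $14,000 loan, 690 credit score
purple = signed_distance(640, 6000)    # $6,000 loan, 640 credit score

# green sits barely on the 'default' side; purple sits far from the
# line on the 'no default' side, so the SVM is more confident there.
print(green > 0, purple < 0, abs(purple) > abs(green))  # → True True True
```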

Hopefully this explanation of an SVM makes some sense. One beauty of the SVM is that it can find the best separating line across n dimensions, not just the 2 dimensions from the example. Prosper made all of their historical loan data available, which includes over 100 features per loan and whether or not the loan has defaulted. In my analysis, I used this feature set to train the SVM, and then used the remaining historical data to test the SVM's prediction accuracy.

Before we see how well it does, a few brief technical details. All of the following analysis was done in Matlab (R2008a) using libsvm-mat-2.89-1 (released April 2009). I simply normalized my data in Matlab and then fed it to libsvm, which did all the hard work for me.
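The normalization step can be sketched like this. Min-max scaling to [0, 1] is one common choice before feeding data to libsvm; the post doesn't specify exactly which normalization was used in Matlab, so treat this as an assumption:

```python
# Min-max scale each feature column to [0, 1], a typical preprocessing
# step before handing data to libsvm. (The exact normalization used in
# the original Matlab analysis isn't specified; this is one common choice.)
def min_max_scale(column):
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant feature: map it all to 0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

loan_amounts = [50, 5000, 12500, 25000]
scaled = min_max_scale(loan_amounts)   # smallest maps to 0.0, largest to 1.0
```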

Using an SVM to Predict Defaults on Prosper
Remember, our goal with all of this mathematical trickery is to create a system that will give a return on investment that's better than we'd get without it. This amounts to the SVM being able to identify loans that will default better than I can.

At the time I did this analysis (April 2009), 4.65% of my Prosper loans fell into the default category (which includes loans that are more than 1 month late). For me to benefit from an SVM, it should reduce this default rate significantly (ideally by half or more).

I trained the SVM on Prosper loan data from November 11, 2005 to October 1, 2007, the day before my first loan. Training was done on 35% of the total, feature-complete data set (7,108 of all 20,204 loans - more details in Appendix 1). Initial results were promising. If I had followed the SVM's default predictions for the loans in my portfolio, my default rate would have dropped by about 13% (from 4.65% to 4.03%). While a small effect, this was an encouraging start.

After further thinking, I realized that instead of just using the SVM's prediction alone to decide what to invest in, I could act only on those predictions that the SVM was confident in. If I only invested in loans that the SVM was more than 90% confident would not default, I could further reduce my default rate by 76% (from 4.65% to 1.12%). I would invest in far fewer loans, but earn a better return overall.
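The filtering logic itself is simple. The probabilities below are invented to illustrate it (libsvm can emit probability estimates alongside its predictions):

```python
# Only invest when the SVM is more than 90% confident a loan won't
# default. Each tuple is (SVM's probability of no default, whether the
# loan actually defaulted); the numbers are made up for illustration.
predictions = [
    (0.97, False), (0.95, False), (0.92, False), (0.91, True),
    (0.85, False), (0.70, True),  (0.60, True),  (0.96, False),
]

def default_rate(loans):
    return sum(defaulted for _, defaulted in loans) / len(loans)

confident = [p for p in predictions if p[0] > 0.9]

print(default_rate(predictions))  # → 0.375 (all loans)
print(default_rate(confident))    # → 0.2   (high-confidence loans only)
```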

Curious about whether this result was specific to my portfolio or generalized further, I retested on the broader set of all loans available after the training period (a total of 13,096 loans). I call this the "invest-in-everything" strategy. If I had invested in all loans the SVM predicted would not default with more than 90% confidence, my default rate would drop a whopping 85% (from 15.28% to only 2.33%)! Clearly the SVM is providing a significant improvement in accuracy, part of which is redundant with that ill-defined loan selection process that operates in my head. This is sketched out in Figure 4:


Figure 4: Proportion of Prosper loans that default with subsequent filters. If all loans are purchased, the default rate is 15%. If only loans in my portfolio are included, the default rate is 4.7%. If only the loans in my portfolio approved with high confidence by the SVM are included, the default rate is 1.1%. If all loans approved with high confidence by the SVM are included, the default rate is 2.3%.

It would be interesting to re-run this analysis today and see if the SVM still returns a significant advantage, as more than a year's worth of new data has accumulated. However, I can no longer find a way to download loan performance data from Prosper. It looks like they're no longer releasing it, which is unfortunate.

Using an SVM to Predict Defaults on Lending Club
After completing my analysis of Prosper in April 2009, I wasn't able to use it to invest in new loans because Prosper had temporarily shut down. They were revising their legal agreements to comply with SEC regulations, and no new loans would be processed during that time. So I turned my attention to another P2P lending service called Lending Club.

Unlike Prosper's auction method, Lending Club sets interest rates using a proprietary analysis of the loan application, purposefully avoiding the supposed "wisdom of crowds". In retrospect, this is a much better strategy. Although Lending Club loans have lower interest rates, the default rates are substantially lower: from June 2007 to April 2009, 7.6% of all Lending Club loans were 1 or more months late, while 21% of all Prosper loan dollars were 1 or more months late. By rejecting over 90% of borrower applications, Lending Club has improved performance for lenders.

So once again I focused my trusty SVM on a new set of data, to see if it could improve results much like it had for Prosper. In this case, though, I didn't have any investments with Lending Club, so I couldn't compare the SVM's results with my portfolio's results; I was only able to compare the SVM approach with an "invest-in-everything" strategy.

If the SVM is trained on the oldest 35% of loans and tested on the newest 65% (similar to what was done for the Prosper analysis above), the SVM default rate is no different than the naive default rate (of about 2.97%, or 79 of 2,662 loans defaulting). If a confidence threshold of 0.9 is used, the default rate actually increases to 3.85% (3 of 78 loans default)! (More details can be found in Appendix 2)

Note that this result depends on the training set used. If I train on the oldest 60% of loans (up to October 16, 2008) and test on the newest 40%, I am able to lower my default rate from 0.82% to 0.45% by only lending when the SVM is more than 90% confident that the loan won't default.

Both outcomes are diagrammed in Figure 5:


Figure 5: Proportion of Lending Club loans that default with 2 filters. SVM #1 refers to an SVM that is trained on the oldest 35% of data. This results in an increase in default rate. SVM #2 refers to an SVM that is trained on the oldest 60% of the data. This results in a decrease in default rate.

Overall, it appears as if the Lending Club SVM can either increase or decrease the default rate, depending perhaps on the time period examined or overall amount of training data used.

So what did I learn from all of this?
  1. Prosper has more defaulting loans, probably because it doesn't screen borrowers as well as Lending Club does.

  2. Investing on Prosper significantly benefits from using an SVM to make loan decisions, especially when a high confidence threshold (90%) is used.

  3. It's unclear whether investing on Lending Club benefits from an SVM: depending on the conditions, it can either help or hurt.

  4. The SVM is a useful computational tool for categorization, especially when there are only 2 categories to choose from.

So how has all of this analysis influenced my lending? I'm currently lending on Lending Club with new criteria I derived from my Prosper loans that defaulted. So far I have 0 defaults, but the loans haven't had much time to default (they range in age from one month to one year).

I have no intention of making new loans on Prosper even with the SVM, as my non-SVM return on investment has been abysmal, the company seems a bit shady (for reasons you can find by Googling), and Lending Club seems to be a less risky alternative.

Perhaps in the future I'll set up a service that will use a Lending Club SVM to automatically suggest loans to purchase. However, I'm a bit hesitant to use a program whose learned criteria I don't yet understand. Though maybe I just learned the hard way what the SVM could have told me all along...

If you've gotten this far and still have questions or comments, don't hesitate to email me. My current email address is at the top of the page.

Notes, credits, and links:
  • My synopsis and comments on Yunus' "Banker to the Poor"
  • Prosper - the first P2P lending service in the US. I got basically no return on my investment here, so I would not recommend them.
  • Lending Club - another P2P lending service in the US. I've only started investing here, and so far my experience has been positive.
  • Matlab - data analysis environment used
  • libsvm - SVM library used in Matlab for all analyses
  • All figures were generated with Google Chart Tools


    Appendix 1: Prosper Analysis Details
    I trained the Prosper SVM with 112 features (see below). The training data spanned November 11, 2005 to October 1, 2007 (7,108 loans with complete feature sets). I chose this start date because it's the first date data was available, and this end date because it's one day before my first loan. This data set included 4,931 examples of non-defaulted loans and 2,177 examples of defaulted loans (loans with incomplete feature sets were removed).

    The test data set comprised either (1) my portfolio of loans or (2) all loans made from October 2, 2007 to October 16, 2008 (the last day before Prosper shut down to comply with SEC regulations).

    A loan was classified as not defaulted if its Loan.Status was Origination delayed, Current, Late (less than 30 days), Payoff in progress, or Paid. Conversely, a loan was classified as defaulted if its Loan.Status was Charge-off, 1 month late, 2 months late, 3 months late, 4+ months late, Defaulted (Delinquency), Defaulted (Bankruptcy), or Defaulted (Deceased). A loan was not labeled if its Loan.Status was Repurchased or Cancelled.
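That categorization can be written as a small labeling function (a sketch in Python; the actual analysis did this in Matlab):

```python
# Map Prosper's Loan.Status values to SVM labels, following the
# categorization described above: +1 for defaulted, -1 for not
# defaulted, None for loans left unlabeled.
NOT_DEFAULTED = {
    "Origination delayed", "Current", "Late (less than 30 days)",
    "Payoff in progress", "Paid",
}
DEFAULTED = {
    "Charge-off", "1 month late", "2 months late", "3 months late",
    "4+ months late", "Defaulted (Delinquency)",
    "Defaulted (Bankruptcy)", "Defaulted (Deceased)",
}

def label_loan(status):
    if status in DEFAULTED:
        return 1
    if status in NOT_DEFAULTED:
        return -1
    return None   # Repurchased and Cancelled loans are not labeled
```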

    The following 112 features were used when training and testing the SVM:
      Features related to the borrower herself:
    1. CreditProfiles.AmountDelinquent
    2. CreditProfiles.BankcardUtilization
    3. CreditProfiles.CurrentCreditLines
    4. CreditProfiles.CurrentDelinquencies
    5. CreditProfiles.DelinquenciesLast7Years
    6. CreditProfiles.Income
    7. CreditProfiles.InquiriesLast6Months
    8. CreditProfiles.LengthStatusMonths
    9. CreditProfiles.OpenCreditLines
    10. CreditProfiles.PublicRecordsLast10Years
    11. CreditProfiles.PublicRecordsLast12Months
    12. CreditProfiles.RevolvingCreditBalance
    13. CreditProfiles.TotalCreditLines
    14. CreditProfiles.FirstRecordedCreditLine

      A borrower's credit grade. Only 1 of the following can be true, leaving the rest false:
    15. CreditProfiles.AA
    16. CreditProfiles.A
    17. CreditProfiles.B
    18. CreditProfiles.C
    19. CreditProfiles.D
    20. CreditProfiles.E
    21. CreditProfiles.HR
    22. CreditProfiles.NC

      A borrower's employment status. Only 1 of the following can be true, leaving the rest false:
    23. CreditProfiles.Full-time
    24. CreditProfiles.Part-time
    25. CreditProfiles.Self-employed
    26. CreditProfiles.Retired
    27. CreditProfiles.Not employed
    28. CreditProfiles.Not available

      A borrower's occupation. Only 1 of the following can be true, leaving the rest false:
    29. CreditProfiles.Accountant/CPA
    30. CreditProfiles.Administrative Assistant
    31. CreditProfiles.Analyst
    32. CreditProfiles.Architect
    33. CreditProfiles.Attorney
    34. CreditProfiles.Biologist
    35. CreditProfiles.Bus Driver
    36. CreditProfiles.Car Dealer
    37. CreditProfiles.Chemist
    38. CreditProfiles.Civil Service
    39. CreditProfiles.Clergy
    40. CreditProfiles.Clerical
    41. CreditProfiles.Computer Programmer
    42. CreditProfiles.Construction
    43. CreditProfiles.Dentist
    44. CreditProfiles.Doctor
    45. CreditProfiles.Engineer - Chemical
    46. CreditProfiles.Engineer - Electrical
    47. CreditProfiles.Engineer - Mechanical
    48. CreditProfiles.Executive
    49. CreditProfiles.Fireman
    50. CreditProfiles.Flight Attendant
    51. CreditProfiles.Food Service
    52. CreditProfiles.Food Service Management
    53. CreditProfiles.Homemaker
    54. CreditProfiles.Investor
    55. CreditProfiles.Judge
    56. CreditProfiles.Laborer
    57. CreditProfiles.Landscaping
    58. CreditProfiles.Medical Technician
    59. CreditProfiles.Military Enlisted
    60. CreditProfiles.Military Officer
    61. CreditProfiles.Nurse (LPN)
    62. CreditProfiles.Nurse (RN)
    63. CreditProfiles.Nurse's Aide
    64. CreditProfiles.Pharmacist
    65. CreditProfiles.Pilot - Private/Commercial
    66. CreditProfiles.Police Officer/Correction Officer
    67. CreditProfiles.Postal Service
    68. CreditProfiles.Principal
    69. CreditProfiles.Professional
    70. CreditProfiles.Professor
    71. CreditProfiles.Psychologist
    72. CreditProfiles.Realtor
    73. CreditProfiles.Religious
    74. CreditProfiles.Retail Management
    75. CreditProfiles.Sales - Commission
    76. CreditProfiles.Sales - Retail
    77. CreditProfiles.Scientist
    78. CreditProfiles.Skilled Labor
    79. CreditProfiles.Social Worker
    80. CreditProfiles.Student - College Freshman
    81. CreditProfiles.Student - College Sophomore
    82. CreditProfiles.Student - College Junior
    83. CreditProfiles.Student - College Senior
    84. CreditProfiles.Student - College Graduate Student
    85. CreditProfiles.Student - Community College
    86. CreditProfiles.Student - Technical School
    87. CreditProfiles.Teacher
    88. CreditProfiles.Teacher's Aide
    89. CreditProfiles.Tradesman - Carpenter
    90. CreditProfiles.Tradesman - Electrician
    91. CreditProfiles.Tradesman - Mechanic
    92. CreditProfiles.Tradesman - Plumber
    93. CreditProfiles.Truck Driver
    94. CreditProfiles.Waiter/Waitress
    95. CreditProfiles.Other

      Information that's supposed to be about the loan listing itself, not the borrower (though as you can see, some, like HasVerifiedBankAccount, are about the borrower):
    96. Listing.AmountRequested
    97. Listing.BankDraftFeeAnnualRate
    98. Listing.BorrowerMaximumRate
    99. Listing.LenderRate
    100. Listing.DebtToIncomeRatio
    101. Listing.FundingOption
    102. Listing.GroupLeaderRewardRate
    103. Listing.HasVerifiedBankAccount
    104. Listing.IsBorrowerHomeowner

      What the loan will be used for. Only 1 of the following can be true, leaving the rest false:
    105. Listing.Not available
    106. Listing.Debt consolidation
    107. Listing.Home Improvement Loan
    108. Listing.Business Loan
    109. Listing.Personal Loan
    110. Listing.Student Loan
    111. Listing.Auto Loan
    112. Listing.Other
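The "only 1 of the following can be true" groups above (credit grade, employment status, occupation, loan purpose) are one-hot encodings of categorical features. For example:

```python
# One-hot encoding for mutually exclusive categorical features like
# credit grade: exactly one of the resulting binary features is 1.
CREDIT_GRADES = ["AA", "A", "B", "C", "D", "E", "HR", "NC"]

def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

print(one_hot("B", CREDIT_GRADES))  # → [0, 0, 1, 0, 0, 0, 0, 0]
```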
    The following 36 features were ignored and NOT included when training or testing the SVM:
    1. CreditProfiles.CreationDate
    2. CreditProfiles.DatePulled

    3. Listing.AmountFunded
    4. Listing.AmountRemaining
    5. Listing.BidCount
    6. Listing.BidMaximumRate
    7. Listing.BorrowerCity
    8. Listing.BorrowerState
    9. Listing.CreationDate
    10. Listing.CreditGrade (duplicate of credit grade associated with the CreditProfile)
    11. Listing.Description
    12. Listing.Duration
    13. Listing.EndDate
    14. Listing.Images
    15. Listing.GroupKey
    16. Listing.Key
    17. Listing.BorrowerRate
    18. Listing.ListingNumber
    19. Listing.MemberKey
    20. Listing.PercentFunded
    21. Listing.Status
    22. Listing.StartDate
    23. Listing.Title
    24. Listing.LoanTermInMonths

    25. Loan.AgeInMonths
    26. Loan.AmountBorrowed
    27. Loan.BorrowerRate
    28. Loan.CreationDate
    29. Loan.CreditGrade
    30. Loan.DebtToIncomeRatio
    31. Loan.GroupKey
    32. Loan.Key
    33. Loan.LenderRate
    34. Loan.ListingKey
    35. Loan.OriginationDate
    36. Loan.Term
    Appendix 2: Lending Club Analysis Details
    I trained the Lending Club SVM with 20 features (see below). The first training data set spanned June 14, 2007 (the earliest available) through March 22, 2008 (1,430 loans). The end date was chosen so that the training data comprised 35% of all of the available data (as was the case with the Prosper training set). I thought this would be the best way to make the methods comparable. This data set included 1,199 examples of non-defaulted loans, and 231 examples of defaulted loans (after removing loans with incomplete feature sets).

    The first test data set comprised the remaining 65% of the data, spanning March 23, 2008 to April 28, 2009 (2,662 loans).

    A loan was categorized as not-defaulted if its Status was Issued, Current, In Grace Period, Late (16-30 days), or Fully Paid. A loan was categorized as defaulted if its Status was Late (31-120 days), Default, or Charged Off. A loan was not categorized or analyzed if its Status was Cancelled or Removed.

    The following 20 features were used to train and test the SVM:
    1. Req (loan amount requested)
    2. IntRate
    3. Credit
    4. DebtToIncomeRatio

    5. Home.RENT
    6. Home.OWN
    7. Home.MORTGAGE
    8. Home.NONE (Home.ANY gets all zeros)

    9. MoInc (monthly income)
    10. FICO (credit score)
    11. EarliestCredit
    12. OpenCreditLines
    13. TotalCreditLines
    14. RevolvingCreditBalance
    15. RevolvingLineUtilization
    16. InquiriesInTheLast6Months
    17. AccountsNowDelinquent
    18. DelinquentAmount
    19. DelinquenciesLast2Yrs
    20. PublicRecordsOnFile
    The following 17 features were left out and NOT used in the SVM:
    1. Bor (amount borrowed, usually redundant with Req)
    2. AppDate
    3. AppExp
    4. IssuedDate
    5. Title
    6. Description
    7. MoPayt
    8. RemainingPrincipal
    9. PaymentsToDate
    10. ScreenName
    11. City
    12. State
    13. Education
    14. Associations
    15. Code
    16. MonthsSinceLastDelinquency
    17. MonthsSinceLastRecord

    Comments (5)

    sundar - Jul 8, 2010, 11:51p
    yes i did get this far - great to see you continue to do interesting stuff, this at least is much better than those frikking worms you write about:)

    Tuomas Talola - Jul 9, 2010, 5:35a
    Have to say, I've never heard of Support Vector Machines. What you've done, I know as regression analysis. Nothing wrong with regression analysis itself, but using it to predict future returns or defaults is a little dubious. This has been the case in financial markets over and over again.

    However, the work you have done seems quite thorough, I appreciate it. I'd be interested in more detailed results.

    asenski - May 7, 2012, 12:57a
    Few questions:

    1. Have you tried applying non-linear SVM?

    2. Did you make sure you are not using training data that would not have been available for the test data set? i.e. whether a loan defaulted or not won't be known until 3 years from its origination date. You may be cheating unintentionally here!

    nikhil - May 10, 2012, 9:57p
    Hi asenski,

    1) No, I have not tried using the non-linear SVM. I think I just tried whatever was the default option in libsvm.

    2) If I recall accurately (it's been several years now), the training set always included loans issued before the loans in the test set. I defined a default as a loan that's more than 1 month late. So it's possible that a loan issued several years ago defaulted after a loan from the test set was issued. So if you were doing this analysis in current-time to predict the future, this might be construed as a form of "cheating" - i.e. you wouldn't know that a training set loan would default because it hasn't at the time you're considering making a new loan (which would be equivalent to an item from the test set). Good eyes! If I were to redo this analysis I would try to take care of this caveat, since it might make a difference.

    Dave - Jun 16, 2012, 2:34p
    You should really try non-linear. I'd try a radial basis function first.

