P2P Lending + SVM = $$$? - by nikhil bhatla

Interesting ideas interspersed with nonsense - by nikhil bhatla, nikhil@superfacts.org

P2P Lending + SVM = $$$?
Jul 7, 2010, 1:20p - Investing

Since 2004, I've been intrigued by microfinance. It all started when I read Muhammad Yunus' book "Banker to the Poor". In it, Yunus describes how he was able to improve the lives of poor women in Bangladesh by providing them with small loans (as little as $50 or less). Not only did the lenders get a reliable return, the borrowers got the means to pull themselves out of poverty. I'd never been that interested in investing (for a variety of reasons), but this win-win approach really appealed to me. When I looked into investing my own money, I found that most services for lenders positioned the loan more as a temporary donation (with almost 0% interest) than as a competitive investment. The money would also be sent far away, which made me uncomfortable, so ultimately I never lent any money.

Peer-to-Peer (P2P) Lending
When Prosper.com started providing peer-to-peer (P2P) loans in the United States in 2005, my interest was piqued again. With their service, anyone in the US could go online and request to borrow up to $25,000. Lenders could then go to the site, look at the borrower's credit grade and other listing information, and decide to lend as little as $50 per loan. If the borrower got enough investors, their loan amount would be fully funded and they'd receive the loan (a supposed "wisdom of crowds" approach). Prosper was set up as an auction, where the borrower specified the highest interest rate they're willing to borrow at, and lenders specified the lowest interest rate they're willing to lend at. So the final interest rate ended up somewhere in between. The loan term was for 3 years, with a fixed monthly payment of principal and interest.
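To get a feel for what "a fixed monthly payment of principal and interest" means on a 3-year loan, here is a small sketch using the standard annuity formula. It assumes the loans were amortized in the usual way, which I haven't confirmed against Prosper's own documentation. The $5,000 principal is made up, and the 11% rate is simply the average rate I mention below.

```python
def monthly_payment(principal, annual_rate, months=36):
    """Fixed payment for a fully amortizing loan (standard annuity formula)."""
    r = annual_rate / 12.0  # monthly interest rate
    if r == 0:
        return principal / months  # no interest: just split the principal evenly
    return principal * r / (1.0 - (1.0 + r) ** -months)


# Hypothetical example: a $5,000 loan at 11% annual interest over 36 months.
payment = monthly_payment(5000, 0.11)
total_interest = payment * 36 - 5000
print(f"monthly payment: ${payment:.2f}")
print(f"total interest over 3 years: ${total_interest:.2f}")
```

The lender's actual return is lower than the stated rate once defaults and fees are taken into account, which this sketch ignores.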

By cutting the bank out as the middleman, borrowers were able to get unsecured loans at interest rates lower than what credit cards or banks offer (from 8-20%). Likewise, lenders were able to get interest rates higher than bank CDs and perhaps even the stock market. Again, a win-win situation: you help people buy a car or get an engagement ring, and you get a good interest rate in return.

So in 2008, I started lending money on Prosper. At first it seemed like it was going really well, with my loans averaging about 11% interest. I only lent to AA or A grade borrowers, so I thought that it was a relatively low-risk investment. Fast-forward to July 2010: even after diversifying with more than 100 loans, 13% have defaulted, leaving my annualized return at a meager 0.8%. So much for that idea...

Support Vector Machines (SVM)
As the defaults began to accumulate in 2009, I took a class on machine learning at MIT. Machine learning is a sub-field of computer science that focuses on developing software that can learn. What does that mean exactly? As an example, let's say that I want a computer to label faces in a photograph with each person's name. One approach would be to hand-code heuristics that identify each face. For example, any face with light brown skin, black hair, and thick eyebrows could be labeled "Nikhil". Of course, this approach is tricky and fraught with peril. Not only does a person have to figure out the feature values that uniquely identify an individual, the computer has to know what "hair" and "eyebrows" are, what those features actually mean. The first step is really time-consuming, brittle and perhaps impossible as the number of different unique faces increases. And the second step is still an unsolved problem in computer vision. In this heuristic approach, the computer doesn't learn: it's just programmed to label faces based on a set of human-specified features.

Contrast this with a machine learning approach. Rather than having a human figure out what sets of feature values go with each person's face, the computer would be "trained" in the same way that people are trained: by just being shown faces and the names attached to those faces. Then it's the computer's job to figure out what feature values go with the names. This type of learning is called "supervised" learning. After being shown a large number of face-name pairs, the program is "tested" by being presented with a face it's never seen before. The program's job is to decide, of the labeled faces it's already seen, which is most similar to the new face, and then label it with that person's name.
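The "find the most similar labeled example" step is essentially what's called a nearest-neighbor classifier. Here is a minimal sketch, assuming each face has already been boiled down to a short list of numbers. The feature vectors and the names "Person B" and "Person C" are invented for illustration; real face features would be far more complicated.

```python
import math


def nearest_label(training_pairs, new_features):
    """Return the label of the training example closest to new_features."""
    best_label, best_dist = None, math.inf
    for features, label in training_pairs:
        dist = math.dist(features, new_features)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical training set: (feature vector, name) pairs.
training = [
    ([0.9, 0.1, 0.8], "Nikhil"),
    ([0.2, 0.7, 0.3], "Person B"),
    ([0.5, 0.5, 0.9], "Person C"),
]

# A face the program has never seen before.
guess = nearest_label(training, [0.85, 0.2, 0.75])
print(guess)
```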

I'm not sure if this sounds easy or hard to you, but in fact it is quite tricky and doesn't work very well, especially with complex visual stimuli such as faces. If you think about it, a face can be photographed from many different angles and under various lighting conditions. One amazing ability of the human brain is that it instantly identifies faces, while even the most cutting-edge computer programs are much less accurate and much slower.

So what does all of this have to do with P2P lending? What if instead of having to label faces (a hard problem of choosing from a set of n choices), the computer just had to identify loan requests that were likely to default (a simpler problem of choosing from a set of only 2 choices, will default or won't)? Labeling loans likely to default is a yes/no type of question, so maybe machine learning techniques could work in this more limited space. I was so excited by this idea that I decided to actually test it.

But before we get to the data analysis, let me explain the machine learning technique I used, which is called a support vector machine (SVM). I don't fully understand the math behind the technique, but I do understand graphically how it's supposed to work. So I'm going to use graphs to explain the concept.

For simplicity, let's just assume that we know only 2 things about each loan: the loan amount (with a range of $50-$25,000) and the borrower's credit score (with a range of 550-800). With this information, we can plot 10 imaginary loans on a 2-dimensional graph, as shown in Figure 1. 5 of the loans have defaulted (shown in red) and the remaining 5 are current or paid in full (shown in blue).

Figure 1: 10 imaginary loans plotted with respect to credit score and loan amount

Now, what a support vector machine does is find the line that best separates the 2 populations of points, as shown in Figure 2:

Figure 2: A support vector machine (SVM) finds the line that best separates the 2 data sets

The perfect separating line would have all defaulted loans on one side and all non-defaulted loans on the other. In the example above, perfect separation is impossible for a linear function, though possible with a more complex function that can wiggle around. To keep things simple, though, we'll focus on a linear separating function, which ends up looking like a straight line on a graph.

An SVM is a bit magical because it'll find the line that best separates the two categories of points. Roughly speaking, it does this by keeping categorization errors on the training set low while staying as far as possible from the nearest points on either side of the line. The "training set" is the set of categorized points that the SVM gets to learn from before it draws its line (the 10 points in the current example). In the graph above, the error rate on the training set is 10%, because 1 of the 10 loans is on the wrong side of the line and miscategorized as not-defaulting, even though it did default.
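For the curious, here is a rough sketch of how a linear SVM can be trained. This is not how libsvm works internally; it is a plain subgradient-descent version of a soft-margin linear SVM written in pure Python. The ten loans, the penalty `lam`, the step size, and the number of steps are all invented for illustration, and they are not the points from the figures.

```python
import math

# Ten hypothetical loans: (credit score, amount in dollars, label).
# Label +1 = current or paid in full, -1 = defaulted. All values are invented.
LOANS = [
    (760, 4000, 1), (740, 9000, 1), (720, 3000, 1), (700, 7000, 1), (650, 2000, 1),
    (600, 15000, -1), (620, 21000, -1), (580, 9000, -1), (660, 24000, -1), (710, 6000, -1),
]


def scale(score, amount):
    """Map both features to roughly [0, 1] so neither dominates the other."""
    return ((score - 550) / 250.0, amount / 25000.0)


def objective(w, b, data, lam):
    """Average hinge loss plus an L2 penalty on the weights."""
    hinge = sum(max(0.0, 1.0 - y * (w[0] * x[0] + w[1] * x[1] + b)) for x, y in data)
    return hinge / len(data) + 0.5 * lam * (w[0] ** 2 + w[1] ** 2)


def train_linear_svm(data, lam=0.01, steps=2000, eta0=0.1):
    """Batch subgradient descent; returns (best objective, weights, offset)."""
    w, b = [0.0, 0.0], 0.0
    best = (objective(w, b, data, lam), list(w), b)
    n = len(data)
    for t in range(1, steps + 1):
        gw, gb = [lam * w[0], lam * w[1]], 0.0
        for x, y in data:
            # Only points inside the margin or misclassified contribute.
            if y * (w[0] * x[0] + w[1] * x[1] + b) < 1.0:
                gw[0] -= y * x[0] / n
                gw[1] -= y * x[1] / n
                gb -= y / n
        eta = eta0 / math.sqrt(t)  # shrinking step size
        w = [w[0] - eta * gw[0], w[1] - eta * gw[1]]
        b -= eta * gb
        current = objective(w, b, data, lam)
        if current < best[0]:
            best = (current, list(w), b)
    return best


def training_error(w, b, data):
    """Fraction of training points on the wrong side of the line."""
    wrong = sum(1 for x, y in data if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0)
    return wrong / len(data)


DATA = [(scale(s, a), y) for s, a, y in LOANS]
best_obj, w, b = train_linear_svm(DATA)
print("weights:", w, "offset:", b)
print("training error:", training_error(w, b, DATA))
```

The last loan in the list is a deliberate troublemaker (a default sitting among the good loans), so no straight line can separate these points perfectly, much like the example in Figure 2.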

Performance on the training set only means so much, though. An SVM's true utility is revealed on loans not included in the training set, where it can predict whether they will default or not depending on which side of the line they fall on. 2 new loans, whose fate the SVM does not know, are depicted as green and purple x's below. The green x is a $14,000 loan request from a borrower with a credit score of 690, and the purple x is a $6,000 loan request from a borrower with a credit score of 640.

Figure 3: Two new loans (the x's) plotted with respect to the separating line. The SVM predicts that the green x will default, while the purple x won't.

A naive investor might think that the larger loan to the more credit-worthy borrower is a better bet. But the SVM, having learned from the outcomes of the last 10 loans, would suggest skipping that loan and instead lending to the lower credit-score borrower, because they're predicted not to default.

In addition to categorizing a novel point, the SVM can also tell us how confident it is that its classification is correct. Graphically, this amounts to how far the point is from the separating line. The SVM is less confident about points closer to the line (near the category boundary), while more confident about points farther from the line. This subtlety will prove critical in the analysis described below.
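Here is a small sketch of the "which side of the line, and how far from it" idea, applied to the green and purple loans. The weights and offset below are hypothetical: I picked them by hand so that the predictions agree with Figure 3, and they are not the line an SVM actually learned. Distance from the line is only a rough stand-in for confidence; turning it into a percentage takes an extra calibration step that isn't shown here.

```python
import math

# Hypothetical separating line over (scaled credit score, scaled loan amount).
W = (1.0, -1.5)
B = 0.1


def scale(score, amount):
    """Map credit score (550-800) and amount (up to $25,000) to roughly [0, 1]."""
    return ((score - 550) / 250.0, amount / 25000.0)


def signed_distance(score, amount):
    """Positive = predicted not to default; magnitude = distance from the line."""
    s, a = scale(score, amount)
    return (W[0] * s + W[1] * a + B) / math.hypot(W[0], W[1])


for name, score, amount in [("green", 690, 14000), ("purple", 640, 6000)]:
    d = signed_distance(score, amount)
    verdict = "won't default" if d > 0 else "will default"
    print(f"{name}: predicted {verdict}, distance from line {abs(d):.3f}")
```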

Hopefully this explanation of an SVM makes some sense. One beauty of the SVM is that it can find the best separating line (technically, a hyperplane) across n dimensions, not just the 2 dimensions from the example. At the time of my analysis, Prosper made available all of their historical loan data, which includes over 100 features per loan and whether or not the loan has defaulted. In my analysis, I use the older portion of this data to train the SVM, and then use the remaining historical data to test the SVM's prediction accuracy.

Before we see how well it does, I want to mention some brief technical details. All of the following analysis was done in Matlab (R2008a) using libsvm-mat-2.89-1 (released April 2009). I simply normalized my data in Matlab and then fed it to libsvm, and it did all the hard work for me.
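"Normalizing" can mean a few different things. One common choice is to rescale every feature to the range 0 to 1, sketched below in Python rather than Matlab. Treat it as an illustration of the idea and not a record of the exact normalization used in this analysis. The three example rows are made up.

```python
def min_max_normalize(rows):
    """Rescale each column of a list of equal-length numeric rows to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant columns become 0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return normalized


# Hypothetical rows: (credit score, loan amount in dollars).
rows = [[550, 50], [800, 25000], [675, 12525]]
scaled = min_max_normalize(rows)
print(scaled)
```

Without a step like this, a feature measured in dollars (up to 25,000) would swamp a feature measured in credit-score points (550-800). In a careful setup, the minimum and maximum would be taken from the training data only and then reused on the test data.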

Using an SVM to Predict Defaults on Prosper.com
Remember, our goal with all of this mathematical trickery is to create a system that will give a return on investment that's better than we'd get without it. This amounts to the SVM identifying loans that will default better than I can on my own.

At the time I did this analysis (April 2009), 4.65% of my Prosper loans fell into the default category (which includes loans that are more than 1 month late). For me to benefit from an SVM, it should reduce this default rate significantly (ideally by half or more).

I trained the SVM on Prosper loan data from November 11, 2005 to October 1, 2007, the day before my first loan. Training was done on 35% of the total, feature-complete data set (7,108 of all 20,204 loans - more details in Appendix 1). Initial results were promising. If I had followed the SVM's default predictions for the loans in my portfolio, my default rate would have dropped 14% (from 4.65% to 4.03%). While a small effect, this was an encouraging start.
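The train/test split here is by date, not random: everything up to a cutoff is used for training, and everything after it is used for testing. A minimal sketch of that idea follows, with invented loan records and a hypothetical `originated` field name.

```python
from datetime import date


def split_by_date(loans, cutoff):
    """Train on loans originated on or before the cutoff, test on the rest."""
    train = [loan for loan in loans if loan["originated"] <= cutoff]
    test = [loan for loan in loans if loan["originated"] > cutoff]
    return train, test


# Hypothetical loan records; only the cutoff date comes from the analysis.
loans = [
    {"originated": date(2006, 3, 14), "defaulted": True},
    {"originated": date(2007, 10, 1), "defaulted": False},
    {"originated": date(2007, 10, 2), "defaulted": False},
    {"originated": date(2008, 6, 30), "defaulted": True},
]

train, test = split_by_date(loans, cutoff=date(2007, 10, 1))
print(len(train), "training loans,", len(test), "test loans")
```

Splitting by date mimics how the tool would actually be used: learn from older loans, then decide about newer ones.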

After further thinking, I realized that instead of just using the SVM's prediction alone to decide what to invest in, I could act only on those predictions that the SVM had confidence in. If I only invested in loans that the SVM was more than 90% confident would not default, I could further reduce my default rate by 76% (from 4.65% to 1.12%). I would invest in far fewer loans, but earn a better return overall.
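The filtering step itself is simple once each loan has a confidence score attached. In the sketch below, each loan is a pair of (the model's confidence that it won't default, whether it actually defaulted). The eight loans are invented, so the resulting rates have nothing to do with my real portfolio.

```python
def default_rate(loans):
    """Fraction of loans that defaulted; loans are (p_no_default, defaulted) pairs."""
    if not loans:
        return 0.0
    return sum(1 for _, defaulted in loans if defaulted) / len(loans)


def high_confidence(loans, threshold=0.9):
    """Keep only loans the model is more than `threshold` sure will not default."""
    return [loan for loan in loans if loan[0] > threshold]


# Hypothetical portfolio: (confidence of no default, actually defaulted?).
portfolio = [
    (0.97, False), (0.95, False), (0.93, False), (0.91, True),
    (0.80, False), (0.72, True), (0.65, False), (0.55, True),
]

kept = high_confidence(portfolio)
print("all loans:", default_rate(portfolio))
print("high-confidence loans only:", default_rate(kept), f"({len(kept)} loans)")
```

Note the trade-off: raising the threshold usually lowers the default rate but also shrinks the pool of loans left to invest in.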

Curious about whether this result was specific to my portfolio or generalized further, I retested on the broader set of all loans available after the training period (a total of 13,096 loans). I call this the "invest-in-everything" strategy. If I had invested in all loans the SVM predicted would not default with more than 90% confidence, my default rate would drop a whopping 85% (from 15.28% to only 2.33%)! Clearly the SVM is providing a significant improvement in accuracy, part of which is redundant with that ill-defined loan selection process that operates in my head. This is sketched out in Figure 4:

Figure 4: Proportion of Prosper loans that default with subsequent filters. If all loans are purchased, the default rate is 15%. If only loans in my portfolio are included, the default rate is 4.7%. If only the loans in my portfolio approved with high confidence by the SVM are included, the default rate is 1.1%. If all loans approved with high confidence by the SVM are included, the default rate is 2.3%.
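The percentage drops quoted in this section are relative reductions: the change in default rate divided by the starting default rate. A quick way to recompute them from the rounded rates given in the text is below. Because the inputs are already rounded, a result can differ from the quoted figure by about a percentage point.

```python
def relative_drop(before, after):
    """Relative reduction, e.g. 0.25 means the rate fell by a quarter."""
    return (before - after) / before


# Default rates (in percent) as quoted in the text.
comparisons = [
    ("my portfolio, SVM prediction only", 4.65, 4.03),
    ("my portfolio, SVM with >90% confidence", 4.65, 1.12),
    ("all loans, SVM with >90% confidence", 15.28, 2.33),
]

for label, before, after in comparisons:
    print(f"{label}: {before}% -> {after}% is a {relative_drop(before, after):.1%} drop")
```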

It would be interesting to re-run this analysis today and see if the SVM still returns a significant advantage, as more than a year's worth of new data has accumulated. However, I can no longer find a way to download loan performance data from Prosper. It looks like they're no longer releasing it, which is unfortunate.

Using an SVM to Predict Defaults on LendingClub.com
After completing my analysis of Prosper in April 2009, I wasn't able to use it to invest in new loans because Prosper had temporarily shut down. They were revising their legal agreements to comply with SEC regulations, and no new loans would be processed during that time. So I turned my attention to another P2P lending service called LendingClub.com.

Unlike Prosper's auction method, Lending Club sets interest rates using a proprietary analysis of the loan application, purposefully avoiding the supposed "wisdom of crowds". In retrospect, this is a much better strategy. Although Lending Club loans have lower interest rates, the default rates are substantially lower: from June 2007 to April 2009, 7.6% of all Lending Club loans were 1 or more months late, while 21% of all Prosper loan dollars were 1 or more months late. By rejecting over 90% of borrower applications, Lending Club has improved performance for lenders.

So once again I focused my trusty SVM on a new set of data, to see if it could improve results much like it had for Prosper. In this case, though, I didn't have any investments with Lending Club, so I couldn't compare the SVM's results with my portfolio's results. I was only able to compare the SVM approach with an "invest-in-everything" strategy.

If the SVM is trained on the oldest 35% of loans and tested on the newest 65% (similar to what was done for the Prosper analysis above), the SVM default rate is no different from the naive default rate (of about 2.97%, or 79 of 2,662 loans defaulting). If a confidence threshold of 0.9 is used, the default rate actually increases to 3.85% (3 of 78 loans default)! (More details can be found in Appendix 2.)

Note that this result depends on the training set used. If I train on the oldest 60% of loans (up to October 16, 2008) and test on the newest 40%, I am able to lower my default rate from 0.82% to 0.45% by only lending when the SVM is more than 90% confident that the loan won't default.

Both outcomes are diagrammed in Figure 5:

Figure 5: Proportion of Lending Club loans that default with 2 filters. SVM #1 refers to an SVM that is trained on the oldest 35% of data. This results in an increase in default rate. SVM #2 refers to an SVM that is trained on the oldest 60% of the data. This results in a decrease in default rate.

Overall, it appears as if the Lending Club SVM can either increase or decrease the default rate, depending perhaps on the time period examined or overall amount of training data used.

Conclusions
So what did I learn from all of this?

Prosper has more defaulting loans, probably because it doesn't screen borrowers as well as Lending Club does.

Investing on Prosper significantly benefits from using an SVM to make loan decisions, especially when a high confidence threshold (90%) is used.

It's unclear whether investing on Lending Club benefits from an SVM: depending on the conditions, it can either help or hurt.

The SVM is a useful computational tool for categorization, especially when there are only 2 categories to choose from.

So how has all of this analysis influenced my lending? I'm currently lending on Lending Club with new criteria I derived from my Prosper loans that defaulted. So far I have 0 defaults, but the loans haven't had much time to default (their ages range from one month to one year).

I have no intention of making new loans on Prosper even with the SVM, as my non-SVM return on investment has been abysmal, the company seems a bit shady (for reasons you can find while Googling), and Lending Club seems to be a less risky alternative.

Perhaps in the future I'll set up a service that will use a Lending Club SVM to automatically suggest loans to purchase. However, I'm a bit hesitant to use a program whose learned criteria I don't yet understand. Though maybe I just learned the hard way what the SVM could have told me all along...

If you've gotten this far and still have questions or comments, don't hesitate to email me. My current email address is at the top of the page.

--
Notes, credits, and links:

My synopsis and comments on Yunus' "Banker to the Poor"

Prosper.com - the first P2P lending service in the US. I got basically no return on my investment here, so I would not recommend them.

LendingClub.com - another P2P lending service in the US. I've only started investing here, and so far my experience has been positive.

Matlab - data analysis environment used

libsvm - SVM library used in Matlab for all analyses

All figures were generated with Google Chart Tools

--

Appendix 1: Prosper Analysis Details
I trained the Prosper SVM with 112 features (see below). The training data spanned November 11, 2005 to October 1, 2007 (7,108 loans). I chose this start date because it's the first date data was available, and this end date because it's one day before my first loan. This data set included 4,931 examples of non-defaulted loans, and 2,177 examples of defaulted loans (after removing loans with incomplete feature sets).

The test data set consisted of either (1) my portfolio of loans or (2) all loans made from October 2, 2007 to October 16, 2008 (the last day before Prosper shut down to comply with SEC regulations).

A loan was classified as not defaulted if its Loan.Status was Origination delayed, Current, Late (less than 30 days), Payoff in progress, or Paid. Conversely, a loan was classified as defaulted if its Loan.Status was Charge-off, 1 month late, 2 months late, 3 months late, 4+ months late, Defaulted (Delinquency), Defaulted (Bankruptcy), or Defaulted (Deceased). A loan was not labeled if its Loan.Status was Repurchased or Cancelled.
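The labeling rule above amounts to a lookup table. A sketch in Python follows (the original analysis was done in Matlab). The status names are copied from the paragraph above; whether they match the raw Prosper export character for character is an assumption.

```python
NOT_DEFAULTED = {
    "Origination delayed", "Current", "Late (less than 30 days)",
    "Payoff in progress", "Paid",
}
DEFAULTED = {
    "Charge-off", "1 month late", "2 months late", "3 months late",
    "4+ months late", "Defaulted (Delinquency)", "Defaulted (Bankruptcy)",
    "Defaulted (Deceased)",
}
UNLABELED = {"Repurchased", "Cancelled"}


def label_loan(status):
    """Return 1 for defaulted, 0 for not defaulted, None if the loan is skipped."""
    if status in DEFAULTED:
        return 1
    if status in NOT_DEFAULTED:
        return 0
    if status in UNLABELED:
        return None
    raise ValueError(f"unknown Loan.Status: {status!r}")


for status in ["Current", "2 months late", "Repurchased"]:
    print(status, "->", label_loan(status))
```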

The following 112 features were used when training and testing the SVM:

Features related to the borrower herself:

CreditProfiles.AmountDelinquent
CreditProfiles.BankcardUtilization
CreditProfiles.CurrentCreditLines
CreditProfiles.CurrentDelinquencies
CreditProfiles.DelinquenciesLast7Years
CreditProfiles.Income
CreditProfiles.InquiriesLast6Months
CreditProfiles.LengthStatusMonths
CreditProfiles.OpenCreditLines
CreditProfiles.PublicRecordsLast10Years
CreditProfiles.PublicRecordsLast12Months
CreditProfiles.RevolvingCreditBalance
CreditProfiles.TotalCreditLines
CreditProfiles.FirstRecordedCreditLine

A borrower's credit grade. Only 1 of the following can be true, leaving the rest false:
CreditProfiles.AA
CreditProfiles.A
CreditProfiles.B
CreditProfiles.C
CreditProfiles.D
CreditProfiles.E
CreditProfiles.HR
CreditProfiles.NC
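
The "only 1 can be true" groups in this appendix are what's usually called one-hot encoding: a single categorical value becomes a row of 0s and 1s, one column per category. A minimal sketch for the credit grade follows. The column order is an assumption that simply follows the list above.

```python
GRADES = ["AA", "A", "B", "C", "D", "E", "HR", "NC"]


def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0 everywhere else."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


encoded = one_hot("A", GRADES)
print(encoded)
```

This keeps the SVM from treating grades as if they were evenly spaced numbers, at the cost of adding one feature per category.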

A borrower's employment status. Only 1 of the following can be true, leaving the rest false:
CreditProfiles.Full-time
CreditProfiles.Part-time
CreditProfiles.Self-employed
CreditProfiles.Retired
CreditProfiles.Not employed
CreditProfiles.Not available

A borrower's occupation. Only 1 of the following can be true, leaving the rest false:
CreditProfiles.Accountant/CPA
CreditProfiles.Administrative Assistant
CreditProfiles.Analyst
CreditProfiles.Architect
CreditProfiles.Attorney
CreditProfiles.Biologist
CreditProfiles.Bus Driver
CreditProfiles.Car Dealer
CreditProfiles.Chemist
CreditProfiles.Civil Service
CreditProfiles.Clergy
CreditProfiles.Clerical
CreditProfiles.Computer Programmer
CreditProfiles.Construction
CreditProfiles.Dentist
CreditProfiles.Doctor
CreditProfiles.Engineer - Chemical
CreditProfiles.Engineer - Electrical
CreditProfiles.Engineer - Mechanical
CreditProfiles.Executive
CreditProfiles.Fireman
CreditProfiles.Flight Attendant
CreditProfiles.Food Service
CreditProfiles.Food Service Management
CreditProfiles.Homemaker
CreditProfiles.Investor
CreditProfiles.Judge
CreditProfiles.Laborer
CreditProfiles.Landscaping
CreditProfiles.Medical Technician
CreditProfiles.Military Enlisted
CreditProfiles.Military Officer
CreditProfiles.Nurse (LPN)
CreditProfiles.Nurse (RN)
CreditProfiles.Nurse's Aide
CreditProfiles.Pharmacist
CreditProfiles.Pilot - Private/Commercial
CreditProfiles.Police Officer/Correction Officer
CreditProfiles.Postal Service
CreditProfiles.Principal
CreditProfiles.Professional
CreditProfiles.Professor
CreditProfiles.Psychologist
CreditProfiles.Realtor
CreditProfiles.Religious
CreditProfiles.Retail Management
CreditProfiles.Sales - Commission
CreditProfiles.Sales - Retail
CreditProfiles.Scientist
CreditProfiles.Skilled Labor
CreditProfiles.Social Worker
CreditProfiles.Student - College Freshman
CreditProfiles.Student - College Sophomore
CreditProfiles.Student - College Junior
CreditProfiles.Student - College Senior
CreditProfiles.Student - College Graduate Student
CreditProfiles.Student - Community College
CreditProfiles.Student - Technical School
CreditProfiles.Teacher
CreditProfiles.Teacher's Aide
CreditProfiles.Tradesman - Carpenter
CreditProfiles.Tradesman - Electrician
CreditProfiles.Tradesman - Mechanic
CreditProfiles.Tradesman - Plumber
CreditProfiles.Truck Driver
CreditProfiles.Waiter/Waitress
CreditProfiles.Other

Information that's supposed to be about the loan listing itself, not the borrower (though you can see some, like HasVerifiedBankAccount, is about the borrower):
Listing.AmountRequested
Listing.BankDraftFeeAnnualRate
Listing.BorrowerMaximumRate
Listing.LenderRate
Listing.DebtToIncomeRatio
Listing.FundingOption
Listing.GroupLeaderRewardRate
Listing.HasVerifiedBankAccount
Listing.IsBorrowerHomeowner

What the loan will be used for. Only 1 of the following can be true, leaving the rest false:
Listing.Not available
Listing.Debt consolidation
Listing.Home Improvement Loan
Listing.Business Loan
Listing.Personal Loan
Listing.Student Loan
Listing.Auto Loan
Listing.Other

The following 36 features were ignored and NOT included when training or testing the SVM:

Appendix 2: Lending Club Analysis Details
I trained the Lending Club SVM with 20 features (see below). The first training data set spanned June 14, 2007 (the earliest available) through March 22, 2008 (1,430 loans). The end date was chosen so that the training data comprised 35% of all of the available data (as was the case with the Prosper training set). I thought this would be the best way to make the methods comparable. This data set included 1,199 examples of non-defaulted loans, and 231 examples of defaulted loans (after removing loans with incomplete feature sets).

The first test data set consisted of the remaining 65% of data, spanning March 23, 2008 to April 28, 2009 (2,662 loans).

A loan was categorized as not-defaulted if its Status was Issued, Current, In Grace Period, Late (16-30 days), or Fully Paid. A loan was categorized as defaulted if its Status was Late (31-120 days), Default, or Charged Off. A loan was not categorized or analyzed if its Status was Cancelled or Removed.

The following 20 features were used to train and test the SVM:

Req (loan amount requested)
IntRate
Credit
DebtToIncomeRatio
Home.RENT
Home.OWN
Home.MORTGAGE
Home.NONE (Home.ANY gets all zeros)
MoInc (monthly income)
FICO (credit score)
EarliestCredit
OpenCreditLines
TotalCreditLines
RevolvingCreditBalance
RevolvingLineUtilization
InquiriesInTheLast6Months
AccountsNowDelinquent
DelinquentAmount
DelinquenciesLast2Yrs
PublicRecordsOnFile

The following 17 features were left out and NOT used in the SVM:

Bor (amount borrowed, usually redundant with Req)
AppDate
AppExp
IssuedDate
Title
Description
MoPayt
RemainingPrincipal
PaymentsToDate
ScreenName
City
State
Education
Associations
Code
MonthsSinceLastDelinquency
MonthsSinceLastRecord

Comments (5)

sundar - Jul 8, 2010, 11:51p
yes i did get this far - great to see you continue to do interesting stuff, this atleast is much better than those frikking worms you write about:)

Tuomas Talola - Jul 9, 2010, 5:35a
Have to say, I've never heard of Support Vector Machines. What you've done, I know as regression analysis. Nothing wrong with regression analysis itself, but using it to predict future returns or defaults is little dubious. This has been the case in financial markets over and over again.

However, the work you have done seems quite thorough, I appreciate it. I'd be interested in more detailed results.

asenski - May 7, 2012, 12:57a
Few questions:

1. Have you tried applying non-linear SVM?

2. Did you make sure you are not using training data that would not have been available for the test data set? i.e. whether a loan defaulted or not won't be known until 3 years from its origination date. You may be cheating unintentionally here!

nikhil - May 10, 2012, 9:57p
Hi asenski,

1) No, I have not tried using the non-linear SVM. I think I just tried whatever was the default option in libsvm.

2) If I recall accurately (it's been several years now), the training set always included loans issued before the loans in the test set. I defined a default as a loan that's more than 1 month late. So it's possible that a loan that issued several years ago defaulted after a loan from the test set was issued. So if you were doing this analysis in current-time to predict the future, this might be construed as a form of "cheating" - i.e. you wouldn't know that a training set loan would default because it hasn't at the time you're considering making a new loan (which would be equivalent to an item from the test set). Good eyes! If I were to redo this analysis I would try to take care of this caveat, since it might make a difference.

Dave - Jun 16, 2012, 2:34p
You should really try non-linear. I'd try a radial basis function first.

Creative Commons License