Abstract

This report contains an exploratory analysis of loan data from Prosper. Prosper (www.prosper.com) is a peer-to-peer lending marketplace in which prospective borrowers post loan listings while investors select specific listings to then invest in. Investors can select loans based on a variety of factors, such as: the loan amount, the loan rating, the borrower’s salary, or the borrower’s number of past inquiries. An analysis of this loan data is a great asset for anybody looking to maximize investment returns in the peer-to-peer lending marketplace.

Count and Volume of All Loans

There are 113,937 loans and 81 variables in the dataset. Prosper made significant changes in its underwriting process in July 2009. For example, we can see how changes in the servicing fee occurred in 2009, in the following figure.

The above figure shows the difference in Borrower Rate and Lender Yield on the y-axis, and the Borrower Rate on the x-axis, where Lender Yield is equal to the Borrower Rate minus Prosper’s servicing fee cut. Every loan in the dataset is depicted by a dot, with red dots coming from loans originating prior to 2009 and blue dots originating in or after 2009. We can see that while the servicing fee varied by quite a lot prior to 2009, Prosper then started to use a fixed servicing fee rate of 0.05 in 2009 and after. There are other issues like this in the Prosper dataset. For example, certain variables in the datafile are only defined after July 2009, such as the loan’s “Prosper Rating”. Thus, this report will focus only on loans originating in the Q3 2009 to Q1 2014 time period, for consistency reasons.

The next figure shows the total # of loans, the total loan volume (in $USD), and the average loan amount, for each financial quarter.

The top two histograms show that both the number of loans as well as the total loan volume grew from Q3 2009 to Q1 2014, with a small dip at the end of 2012 to the beginning of 2013. Q4 2013 had the highest loan volume ($160,139,500), and I would suspect Q1 2014 would have an even higher # of loans as well as loan volume than the previous quarter, if the dataset included the entire Q1 2014 period.

The bottom errorbar chart shows that the average loan size also steadily increased, from $3,782 in Q3 2009 to $11,912 in Q1 2014. The variation in loan size, as measured by the standard deviation of loan size for each quarter, also generally increased during the same time period, from $3,376 in Q3 2009 to $6,693 in Q1 2014. If we look at simple statistical summaries for the Q3 2009 and Q1 2014 periods (as shown below), we see that the range in loan size also increased, from $2,400 in Q3 2009 to $33,000 in Q1 2014.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    1500    2500    3732    5000   25000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2000    6500   10000   11910   15000   35000

Distribution of Prosper Rating for All Loans

Each loan is rated by Prosper’s proprietary algorithm, with grades ranging from “AA” (the best rating) to “HR” (the lowest rating). The next figure shows the distribution of ratings at each quarter.

We can see that the proportion of AA, A, and B rated loans all generally increased from Q3 2009 to Q1 2014. According to references, this is a direct result of Prosper tightening its loan underwriting process in 2009, and thus increasing the proportion (by loan count) of highly rated loans in the system.

Count and Volume of All Closed Loans

We can separate all of the loans in this dataset into two smaller subsets: loans that are closed (from either being completed or going into default or charge-off status) or loans that are still current and in progress. Closed loans are informative because we know what ultimately happened to these loans, and can therefore use this information to get a more accurate picture of which variables contribute to a loan’s final status and total return. The next figure shows the count, volume, and average loan size of the subset of all closed loans.

In the top two histograms depicting total number of loans and total volume of loans, we see that both these figures increase from Q3 2009 to Q4 2011, and then steadily decrease afterwards. The Prosper dataset ranges from 2006 to the Q1 2014 quarter, with most loans having 36 month terms. Thus, the majority of earlier loans should be closed, while the majority of latter loans will still be in progress. We see this in the two histograms below, depicting closed loans vs. open loans, with bars colored by the term length.

Loans can have terms of 12, 36, or 60 months. The majority of loans have 36-month terms, while a very small fraction of loans have 12-month terms. 12-month term loans only originated in the periods between Q4 2010 to Q1 2013 (inclusively). The majority of older loans are closed, while the majority of younger loans are still open.

Distribution of Loan Status for Closed Loans

Not all closed loans are good; rather, they fall into the following status categories: “completed”, “defaulted”, and “charged-off”. Both “defaulted” and “charged-off” are bad, from a lender’s perspective - a loan with this status is not expected to be paid off, causing lost principal as well as lost interest. From the above figure, it’s not clear whether the closed loans were completed or defaulted/charged-off. Because “defaulted” and “charged-off” are virtually the same, I created a new variable to reflect this essentially binary final status (“completed” vs. “defaulted/charged-off”). A histogram of closed loan count is shown in the next plot, with bars proportionally colored by their final status.

The top plot in the above figure depicts raw counts of completed vs. defaulted/charged-off loans for each quarter. Let’s plot the percentage of defaulted/charged-off loans instead, for better clarity.

Here, we can see that the ratio of completed to defaulted/charged-off loans in the earlier and latter quarters was quite high. In contrast, from Q2 2011 to Q2 2013, this ratio is significantly lower. Does the loan’s term length have something to do with this pattern?

We can immediately see that the proportion of default/charged-off loans for all 12-month term closed loans is very small regardless of the loan’s origination date. As we saw before, the bulk of the closed loans have 36-month terms. The default/charged-off rate is relatively high at almost all but the latter quarters. This may be because borrowers tend to default on their loans nearer to the middle or end of the loan life - if this is true, then we would expect to see the proportion of default/charged-off loans to increase, as the loans aged. The Prosper dataset only gives us a snapshot of all of Prosper’s loans, taken at some point in Q1 2014, so we can only hypothesize on how a loan evolves, how a borrower changes his/her paying behavior, etc. That said, let’s look at some other variables and make more observations, particularly as loan term length does not seem to be particularly informative.

Distribution of Prosper Rating for Closed Loans

Let’s look at the distribution of Prosper ratings for completed loans vs. defaulted/charged-off loans.

Here, we can see that most of the highly rated loans (AA and A) were completed. However, for B, C, D, E, and HR-rated loans, it looks like a toss up between whether the loan ultimately was completed or defaulted/charged-off.

Total Return of Investment on Closed Loans

Looking at whether a loan was just completed or defaulted/charged-off is rather simplistic. If a loan defaults right before its final principal payment, then the loan will still give a positive return, despite having a final defaulted status. Similarly, if a loan has a very low interest rate, and the borrower pays off this loan after just one day, then the investor may not actually make any profit after Prosper’s 5% servicing fee (paid by the investor) is taken into account. So, we should calculate each finished loan’s return on investment as an initial measure of a loan’s profitability.

Return on Investment (ROI) is a very simple calculation:

ROI = (Final Value - Initial Value) / Initial Value

From a lender’s point of view, the final value consists of the borrower’s principal and interest payments (net service fees and collection fees), while the initial value is the original loan amount (principal).

The above figure plots a point for each loan, colored by its Prosper rating. Loans with lower Prosper ratings are riskier; thus, lenders get paid a higher interest rate for taking on such risk. Conversely, loans with higher Prosper ratings are safer, and therefore come with a less lucrative interest rate. We can see that most loans in the closed loan subset gave a positive return, with higher ROIs associated with lower Prosper ratings, and lower ROIs associated with higher Prosper ratings. However, there were also loans that gave negative ROIs, particularly in the periods from Q4 2009 to Q2 2013.

Mean and Median ROIs are also plotted for each quarter, with mean ROIs depicted by connected black points and median ROIs depicted by connected blue points. While mean ROI and median ROI tracked each other at the beginning and at the end of the dataset timeframe, median ROI was much higher than mean ROI from ~Q4 2010 to Q3 2013. More significantly, median ROI was always a positive figure while mean ROI dipped below zero during Q2 2011 to Q2 2013. This trend matches what we saw before, where the ratio of defaulted/charged off to completed loans was much higher during this exact period of time. The above figure also implies that ROI must generally be negatively skewed, since the mean ROI value is less than or equal to the median ROI value for all periods Let’s look at a histogram of ROI, colored by Prosper rating.

There are clearly two distinct groups of ROIs. Common sense tells us that positive ROIs will generally be completed loans, while negative ROIs will be defaulted/charged-off loans. Let’s create the same histogram, but with bars colored by our final binary loan status variable.

As expected, every single loan with negative ROIs had defaulted/charged-off status, while the majority of loans with positive ROIs had completed status. However, we can see that some loans with positive ROIs actually had defaulted/charged-off status.

In the previous plot showing ROI vs. financial quarter, mean ROI was plotted for each quarter. This mean did not take into account different loan sizes. A more informative figure would be the dollar-weighted mean, with loan ROIs scaled by the loan principal amount. If we hypothetically invested in every single closed loan in the dataset, then this would be our actual return on the portfolio.

Surprisingly, the weighted mean ROI is not very different from the unweighted mean ROI. Let’s look at the distribution of loan amounts, colored by final loan status.

The set of completed loans and the set of defaulted/charged-off loans have similar looking loan amount distribution shapes. This would explain why the weighted and unweighted mean ROIs are approximately the same.

Annual Return of all Closed Loans

Return on investment is a simple calculation, in that it disregards time. However, a loan that gives a large ROI after one month should naturally be valued higher than a loan that gives an equal ROI after 5 years. To take the factor of time into account, we can calculate the annual return on each closed loan. Annual Return is calculated as:

Annual Return = (Final Value / Initial Value) ^ (1/# of years) - 1

Let’s plot annual return against time:

Median annual return was always equal or higher to mean annual return, which means that annual return was generally negatively skewed (similar to ROI). While median ROI steadily decreased from its high in Q4 2010 to Q1 2014, median annual return actually generally increased during this same time period. This is because while the actual dollar returns of the more recent loans were smaller, these loans were paid off more quickly. Annual return is a more useful way to assess an investment than return on investment, since the effect of time is non-trivial. We can compare the loans’ actual annual returns with the estimated returns that Prosper calculated when the loan listings were created (from my understanding, these two figures are comparable), where estimated return = borrower’s interest rate - servicing fee - estimated uncollected interest - estimated principal loss + estimated late fees

The above figure depicts annual return vs. estimated return for all of the closed loans, with subplots for each Prosper rating. The dashed line in each sub-plot is the y=x line with intercept=0, i.e. the line showing annual return equal to estimated return. For the AA, A, and B-rated loans, the points fall quite closely along this line, with a minority of loans having predicted estimated returns lower than their calculated annual returns. As loan rating progressively worsens, the spreads in both the annual return values and the estimated return values increase. For example, many HR-rated loans have significantly higher than estimated annual returns. However, many HR-rated loans also have significantly lower than estimated annual returns. Interestingly, many of the loans that had negative estimated returns actually ended up returning positive annual return figures. Let’s look at a simple statistical summary of annual return for all closed loans with negative estimated returns:

summary(subset(cloloan, EstimatedReturn < 0)$annualreturn)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.00000  0.07486  0.12780  0.02230  0.18290  0.39130

The mean annual return is -3.0% while the median annual return is 10.8%, which is very interesting and non-intuitive.

Estimated Return vs. Lender Yield

The above figure again plots EstimatedReturn against LenderYield, where points are colored by loan rating and sub-figures are created by year. EstimatedReturn is calculated as before, while LenderYield is the borrower’s rate minus the sevicing fee. Qualitatively, we see that the relationship between estimated return and lender yield becomes more regular in later years. In 2013 and 2014, the resulting curves look almost continuous regardless of loan grade/rating, with estimated return as a function of lender yield. There is also less overlap in lender yields between different Prosper rating scores. In contrast, the relationship between these two variables in 2009 and 2010 are quite disjoint and irregular, with separate curves for each loan grade. There is also a lot of overlap in lender yield and estimated return between the different loan ratings. Additionally, we can see that for some of the worst rated loans in 2009 and 2010, Prosper actually made negative estimated returns (and amazingly, lenders financed such loans). From 2011 and onward, however, all of the estimated returns become positive.

Prosper must have developed better predictive models for estimated return (based on different loan grades) from 2009 to 2014. Prosper must have also developed a better system for evaluating loans and assigning them rating. Both these factors would lead to a tighter relationship between EstimatedReturn, LenderYield, and loan rating.

Does loan size effect the loan’s final status?

The above figure shows a loan’s original amount vs. its origination quarter, with each loan depicted by a colored point that is red if the loan was completed and green if it went into defaulted/charge-off status. This figure alone is not very instructive

When does a loan first default?

For 36-month and 60-month term loans, most of the loans going into default/charged-off status do so earlier in their age, with the peak time of default at 8 months. The distributions for both terms look like they come from a poisson distribution, and it would be interesting if default behavior could be modeled as such. This plot contradicts my initial hypothesis that borrowers would generally make on-time payments early on, and then start dropping payments later as the loan matured.

Cumulative gross payments made by the borrower vs. original amount of loan

Of the loans that were finished (i.e. completed, chargedoff, or defaulted), how much money did the borrower actually pay? LP-CustomerPayments is the “Pre charge-off cumulative gross payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries”, while LoanOriginalAmount is the original loan amount. In the above plot, we can see that all completed loans had LP-CustomerPayments greater than or equal to the LoanOriginalAmount - borrowers were not only paying the principle (original loan amount) but additional interest. The additional amount of money the borrower paid over the original amount also looks to be a function of the original amount, and bounded. In other words, the interest rate is a function of the original amount, and has an upper bound. In contrast, for loans that went into default or were charged-off, the amount of money the borrower paid could be as high as slightly above the original loan amount, or as low as $0.00.

We can also observe that the majority of the defaulted/charged-off loans had original loan amounts of less than or equal to $15,000, while completed loans had original loan amounts ranging from $1,000 to $25,000. Perhaps the defaulted/charged-off loans had smaller loan amounts because the borrowers were deemed riskier, and thus given less credit? This would be reflected in the loan Prosper rating.

When we re-plot the previous figure, but with points colored and faceted by Prosper rating, we can see that indeed, the lower loan amounts are correlated with lower Prosper scores. Let’s plot the percentage of loan paid against the number of days between loan origination and ending.

The above-left figure depicts percentage of loan paid vs. the # of days between the loan origination date and the closing date, with each loan point color based on whether the loan was completed or defaulted/charged-off. The completed loans (in red) all have 100% payoff values, while the defaulted/charged-off loans (in blue) range from o to almost 100%. In the above-right figure, the ratio of borrower payment to loan amount is plotted against the # of days between loan origination date and closing date. In both plots, vertical dashed lines are positioned at 1, 3, and 5 years, to correspond to the ending dates for 1, 3, and 5 year terms. We can see that there are bursts of completed loans ending at the 1 and 5 year point. In contrast, uncompleted loans end anywhere from ~120 days (4 months) and later. The uncompleted loans ending at 4 months are probably the same loans that first default after 4 months, which we saw in a previous figure.

There are distinct bands within the two groups of finished loans, that most likely correspond to other variables.

The above figure shows ratio of total borrower payment to loan amount, on the y-axis, vs. the # of days between loan origination date and end date, on the x-axis, but with points colored by Prosper rating, and subplots based on loan term. As guessed earlier, the distinct bands do correspond with both loan term and Prosper rating.

Prosper’s collections rate on charged-off loans

Let’s look at loans that were charged-off. If a borrower does not pay off the loan, then Prosper attempts to recover the money owed from the borrower. The LP_GrossPrincipalLoss is the gross charged off amount of the loan (i.e. the money from the loan that is not paid back), while the LP_NetPrincipalLoss is the principal (original loan amount) that remains uncollected after any recoveries.

If the points fall along the line y=x, then there are little-to-none recoveries made from the charged-off loans. As we can see, this is generally the case. So, once an investor’s money is lost, it seems to be irretrievably lost.

Let’s look at open loans

The above figure depicts all of the open loans in the dataset in a graph of percent paid vs. # of months since origination. Although each point depicts a single loan, with all the loans depicted at a single point in time (sometime in Q1 2014), because we have so many loans in the system, we can hypothesize that most loans will follow this progression/trend. I.e. a 36-month term current loan at 1 year will have about 25% of the principal paid, while a 60-month term current loan at 1 year will have about 12.5% of its principal paid. Let’s compare the progress of current open loans to loans that ended in default/charged-off status.

Defaulted vs. Open

The above figure depicts percent paid vs. loan months since origination. There are 3 sub-plots corresponding to loan term (12-months on the left, 36-months in the center, and 60-months on the right). Dark-colored points correspond to current, open loans, while light-colored spots correspond to default/charged-off loans, where the x-coordinate is the month that the loan first went into default. We can see that the percent paid is generally higher for current loans than for defaulted loans, for all three month terms. Again, while each point depicts a single loan, and all of the information of the loan is taken at a single instance in time, we can again think of the trends between percent paid and loan months since origination as a rate or type of progress. If we interpret the above figure in such a way, then it can be very useful for a lender deciding between keeping a loan or selling it on the secondary market, i.e. whether a loan will ultimately be completed or go into default/charged-off mode.

Final Plots and Summary

Plot 1:

The above figure was the first plot in this report. It is important because it is a clear example of how Prosper drastically changed much of its loan servicing procedures in 2009. In my original, first-pass analysis of the Prosper loan data, I included every single loan. However, I soon realized that there were distinct differences in loans based on whether they originated before or after 2009. For example, the Prosper rating variables (“ProsperRating [numeric]” and “ProsperRating [Alpha]”), as well as “EstimatedEffectiveYield”, “EstimatedLoss”, and “EstimatedReturn”, are only applicable for loans originating in July 2009 (i.e. in Q3 2009) or later. In contrast, “CreditGrade” is only applicable for loans originating in 2008 or earlier.

In this figure, we can see that servicing fee (a new variable that I created) was originally a variable rate prior to 2009, and then became fixed at a set rate of 5% afterwards. This figure visibly demonstrates, and therefore justifies, that the loans in the Prosper dataset should be separated into two groups: those originating either prior to Q3 2009, or in/after Q3 2009. Because I only used loans originating in Q3 2009 or later for this entire report, I included this figure as one of the final plots.

Plot 2:

The above figure shows Annual Return vs. Estimated Return, with subplots and colors based on the Prosper rating. This is a great plot for several reasons. First, it shows how both Annual Return (one of the variables I calculated) and Estimated Return have a dependence on Prosper rating. Second, it shows that Annual Return and Estimated Return are very similar when the loan rating is high, but diverge greatly as loan rating becomes worse. Third, it shows that different ranges in a loan’s annual return, ranging from as low as -100% (i.e. the entire principal is lost, in addition to lost interest revenue) to as high as ~40% for some of the very risky loans (e.g. E and HR). Fourth, it shows that many loans that had negative Estimated Return values actually ended up having postive Annual Returns.

Because of the aforementioned reasons, I chose to include this figure as one of the final plots for this report. I think the main take-home message of this figure is that the actual annual return from a loan is not necessarily well-predicted by Prosper’s Estimated Return, particularly for lower rated loans.

Plot 3:

The above figure shows the percentage of each loan paid, on the y-axis, vs. the number of months since the loan originated, on the x-axis, with sub-figures for each loan term. As stated in the bottom of the figure, darker points depict open loans, while lighter points depict closed loans that ended up in defaulted or charged-off status. I included this figure as the last final plot because it showed that “good” loans can be differentiated from loans that ultimately went “bad” based on the borrower’s repayment progress. Looking at this figure, I can clearly see that with even a simple machine learning algorithm, we can determine whether a loan will eventually be completely paid back or not. If we plot the loan in question in the above figure, we can see whether it falls in the “good” “trend” or the “bad” “trend” (with “trend” in quotation marks, as these points represent loans captured at a single instance in time, rather than indicating some kind of rate or progress). From the analysis done in this report, borrower behavior (i.e. loan repayment progress) seems to be one of the most useful indicators of a loan’s final status.

Reflections

I did not find too many difficulties in the dataset, as it was relatively clean and I could use it quickly. I think the main difficulty was determining what sort of plots would be most useful, rather than plotting anything and everything and therefore overwhelming the reader. A big challenge for me was actually understanding the different financial terms. I wanted to get an idea of what a loan actually returned, and thus had to learn how to properly calculate Return On Investment (ROI) (which I hopefully did!) in order to create a new variable that reflected a loan’s profitability. Furthermore, I wanted to see whether Prosper’s predicted/estimated returns were actually correct or not, but again, I had to create a new variable for this. Thus, I read up on Annual Return, and hopefully calculated it accurately as well, and was correct in comparing Annual Return with Estimated Return.

In terms of how the analysis could be enriched in future work, the inclusion of machine learning models to predict whether a new or ongoing loan will ultimately be completed or go into default/charge-off would be very informative and useful. However, as this report is specifically supposed to be an exploratory data analysis, such a model would go beyond the scope of the report. Obviously having both more data and more up-to-date data is always useful - the Prosper dataset in this report ranged from 2006 to March 2014, with all the loan information presumably taken in March 2014. If the data was updated, then we would see how many of the older, “current” loans actually fared in the end, and thus enlarge the closed loan subset.