Data & methodology
Through a Facebook application, several thousand lines of information were received per user on average. The gathered information was then structured into clearly categorized database fields and further processed into relevant variables. For example, the sheer number of friends was not informative enough, so the percentage of friends who were connected via the same employer or had a similar education was analyzed instead. Loan outcomes are defined as good (no credit loss) or bad (credit loss or a very high probability of credit loss).
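As an illustration of how a raw friend list can be turned into such a variable, the following minimal sketch computes the share of friends whose profile field matches the user's own. The field names and the exact-match rule are assumptions made for this example; the study does not specify how "similar education" was determined.

```python
def share_of_friends_matching(user, friends, field):
    """Fraction of friends whose `field` equals the user's own value.

    Exact equality is an assumption; the study's similarity rule is not given.
    """
    own = user.get(field)
    if own is None or not friends:
        return 0.0
    matches = sum(1 for friend in friends if friend.get(field) == own)
    return matches / len(friends)


# Hypothetical profile data, for illustration only.
user = {"employer": "Acme", "education": "MSc"}
friends = [
    {"employer": "Acme", "education": "BSc"},
    {"employer": "Acme", "education": "MSc"},
    {"employer": "Globex", "education": "MSc"},
    {"employer": None, "education": "PhD"},
]

same_employer = share_of_friends_matching(user, friends, "employer")
same_education = share_of_friends_matching(user, friends, "education")
```

Expressing the feature as a share rather than a count keeps it comparable between users with very different numbers of friends.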
In the end, 80 features were selected from the thousands of lines of raw data (an average profile contains 5,000-10,000 lines). Examples of these features include age, gender, hometown, marital status, number of jobs, work location and time spent on Facebook, as well as volume information such as the number of likes, groups, interests, events, videos and so on. In addition, the selected features contained information about users’ friends, such as their education, average work time and number of languages. Multiple linear regression was selected as the modeling algorithm because it best explains the variance.
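For readers unfamiliar with the technique, multiple linear regression can be sketched in a few lines. The example below fits ordinary least squares by solving the normal equations; the two toy features and the target values are invented for illustration and are not taken from the study, which used 80 features.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of feature rows; an intercept column is added here.
    Returns [intercept, coef_1, ..., coef_k].
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, row):
    """Apply the fitted intercept and coefficients to one feature row."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Toy data generated from y = 0.1 + 0.5*x1 - 0.2*x2 (illustrative only).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [0.1, 0.6, -0.1, 0.4, 0.9]
coefs = fit_linear_regression(X, y)
```

In practice a numerical library would be used instead of hand-written elimination; the sketch only shows what the fitting step computes.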
All applications were scored by an existing credit loss scorecard, and the resulting score between 1 and 1000 was added to the customers’ Facebook data. For consistency, these scores were aggregated into score bands. The scores of the total population of applications, with or without Facebook data, are normally distributed. All Facebook data elements were preprocessed before the modeling phase: nominal variables were dichotomized, ordinal variables were regrouped to remove outliers, and scale variables were normalized with the hyperbolic tangent function. There is also the question of data validity, since application data is normally verified by an eligible authority. Facebook data is not verified, but the probability of a client faking years of Facebook usage is extremely low, so the data can be assumed to be valid in the majority of cases.
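Two of the preprocessing steps described above can be sketched as follows. The exact form of the hyperbolic tangent transformation is not given in the study; applying tanh to the standardized value is an assumption made here, and the variable names and sample values are hypothetical.

```python
import math
from statistics import mean, pstdev


def dichotomize(values):
    """Turn a nominal variable into 0/1 indicator variables, one per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def tanh_normalize(values):
    """Squash a scale variable into (-1, 1) via tanh of its z-score.

    Assumption: the study may have used a different tanh-based formula.
    """
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [math.tanh((v - mu) / sigma) for v in values]


# Hypothetical raw values for three users.
marital = ["single", "married", "single"]
likes = [10, 200, 5000]

dummies = dichotomize(marital)
scaled = tanh_normalize(likes)
```

A tanh-based scaling compresses extreme values, so a user with an unusually large number of likes does not dominate the regression.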
As a result, a credit scorecard based on Facebook data was created by dividing the results into 10 deciles and assigning each decile a score, with 10 being the best rating and 1 the worst. The following table shows a clear correlation between our Facebook-based ratings and the loan outcomes.
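The decile step can be sketched as below. The sketch assumes that a higher model output means lower risk and that ratings are assigned by rank; the sample scores and outcomes are generated for illustration and are not the study's data.

```python
def assign_deciles(scores):
    """Map each model score to a rating 1..10 by rank (10 = best).

    Assumption: a higher model score means lower credit risk.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ratings = [0] * n
    for rank, i in enumerate(order):
        ratings[i] = rank * 10 // n + 1
    return ratings


def bad_rate_by_rating(ratings, is_bad):
    """Share of bad loans within each rating."""
    totals, bads = {}, {}
    for r, b in zip(ratings, is_bad):
        totals[r] = totals.get(r, 0) + 1
        bads[r] = bads.get(r, 0) + b
    return {r: bads[r] / totals[r] for r in sorted(totals)}


# Illustrative data: 20 distinct scores, lowest scores marked as bad.
scores = [((i * 7) % 20) / 20 for i in range(20)]
is_bad = [1 if s < 0.3 else 0 for s in scores]

ratings = assign_deciles(scores)
rates = bad_rate_by_rating(ratings, is_bad)
```

A table of bad rates per rating, as produced by `bad_rate_by_rating`, is the kind of summary that shows whether the ratings line up with loan outcomes.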
The performance of the estimated original credit loss scorecard is measured with the Gini coefficient and the Kolmogorov-Smirnov (K-S) statistic. The respective values are Gini = 0.285 and K-S = 0.302. These values indicate that the efficiency of the Facebook scorecard is high. The real efficiency is most likely even higher, as all calculations were done using score bands rather than the real scores. The gain chart is provided below.
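Both statistics can be computed directly from scores and outcomes. The sketch below uses the common definitions, Gini = 2 * AUC - 1 and K-S as the largest gap between the cumulative distributions of bad and good loans; it assumes that a higher score means lower risk, and the eight sample loans are invented for illustration.

```python
def gini_coefficient(scores, is_bad):
    """Gini = 2 * AUC - 1, where AUC is the probability that a random
    good loan scores above a random bad loan (ties count one half)."""
    goods = [s for s, b in zip(scores, is_bad) if not b]
    bads = [s for s, b in zip(scores, is_bad) if b]
    wins = sum((g > b) + 0.5 * (g == b) for g in goods for b in bads)
    auc = wins / (len(goods) * len(bads))
    return 2 * auc - 1


def ks_statistic(scores, is_bad):
    """Largest gap between the cumulative shares of bads and goods."""
    n_bad = sum(1 for b in is_bad if b)
    n_good = len(is_bad) - n_bad
    # Count goods and bads per distinct score so ties move together.
    by_score = {}
    for s, b in zip(scores, is_bad):
        g, d = by_score.get(s, (0, 0))
        by_score[s] = (g + (0 if b else 1), d + (1 if b else 0))
    cum_good = cum_bad = 0
    best = 0.0
    for s in sorted(by_score):
        g, d = by_score[s]
        cum_good += g
        cum_bad += d
        best = max(best, abs(cum_bad / n_bad - cum_good / n_good))
    return best


# Illustrative sample: eight loans with scores 1..8.
sample_scores = [1, 2, 3, 4, 5, 6, 7, 8]
sample_is_bad = [1, 1, 0, 1, 0, 0, 1, 0]

gini = gini_coefficient(sample_scores, sample_is_bad)
ks = ks_statistic(sample_scores, sample_is_bad)
```

Because banded scores produce many ties, the half-credit rule for ties in the AUC matters; coarser bands generally lose some of the ranking information carried by the underlying scores.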
This study has found that it is possible to build an efficient credit scorecard based only on data gathered from Facebook. Moreover, adding that information to an existing credit application score significantly improves performance. It is also possible to implement a Facebook-based credit scoring model without changing the model in use, by creating a matrix that combines the application score and the Facebook credit score. This makes it possible to avoid unnecessary credit losses and, at the same time, to gain new clients through the improved ratings.
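One way such a matrix could look is sketched below. The band cutoffs, group boundaries and decisions are entirely hypothetical and chosen only to show the mechanism; the study does not publish its matrix.

```python
# Hypothetical decision matrix: outer key = application score band,
# inner key = Facebook rating group. All values are illustrative.
DECISION_MATRIX = {
    "low": {"low": "reject", "mid": "reject", "high": "review"},
    "medium": {"low": "reject", "mid": "approve", "high": "approve"},
    "high": {"low": "review", "mid": "approve", "high": "approve"},
}


def application_band(score):
    """Band an application score on the 1..1000 scale (cutoffs assumed)."""
    if score < 400:
        return "low"
    if score < 700:
        return "medium"
    return "high"


def facebook_group(rating):
    """Group a Facebook rating on the 1..10 scale (boundaries assumed)."""
    if rating <= 3:
        return "low"
    if rating <= 7:
        return "mid"
    return "high"


def decide(app_score, fb_rating):
    """Look up the combined decision without altering either scorecard."""
    return DECISION_MATRIX[application_band(app_score)][facebook_group(fb_rating)]
```

In this sketch, `decide(350, 9)` sends an applicant with a weak application score but a strong Facebook rating to manual review instead of rejecting outright, while `decide(800, 2)` flags a strong application with a weak Facebook rating. The existing application scorecard itself stays unchanged.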
For additional information contact us at firstname.lastname@example.org
The present study was made for informational purposes only; the real scorecards and algorithms used in the scoring solutions offered to our clients may vary depending on market and product specifics.