Created by Matheus Laranjeira (https://github.com/mathlaranjeira and https://www.linkedin.com/in/matheus-laranjeira-m-sc-1387a859/) & Guilherme Origo Fulop (https://github.com/GuilhermeFulop and https://www.linkedin.com/in/guilherme-origo-fulop/).
Insurance fraud, as defined by the California Department of Insurance, occurs when someone knowingly lies to obtain a benefit or advantage to which they are not otherwise entitled or someone knowingly denies a benefit that is due and to which someone is entitled. According to the Coalition Against Insurance Fraud, insurance fraud, as a whole, occurs in about 10% of property-casualty insurance losses and steals at least $308.6 billion every year from consumers in the United States. Medical care fraud alone accounts for an estimated cost of $60 billion every year.
Vehicles are also a major source of insurance fraud, which consists of false or exaggerated claims related to property damage or personal injuries. Common fraud practices include staged accidents, phantom passengers and exaggerated injuries. This project therefore focuses on fraudulent vehicle claims.
The main characteristic of this dataset is its class imbalance: fraud cases represent only 6% of the entire dataset. If we trained a model on the data as it is, the accuracy would be very high, because the model would get the non-fraudulent cases right while catching only a few frauds. This is not ideal for us. We want a model that correctly predicts almost all frauds in our dataset!
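To make the accuracy problem concrete, here is a minimal sketch in plain Python. It assumes a hypothetical dataset of 1,000 claims with the same 6% fraud rate and a "model" that always predicts non-fraud. The numbers are illustrative only, not taken from the project's data.

```python
# Hypothetical dataset: 1,000 claims, 6% of them fraudulent (label 1).
n_total = 1000
n_fraud = 60
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# A trivial "model" that labels every claim as non-fraud (label 0).
y_pred = [0] * n_total

# Accuracy: share of all claims labelled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total

# Recall on the fraud class: share of real frauds that were caught.
frauds_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = frauds_caught / n_fraud

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.94
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The accuracy looks good even though no fraud is detected, which is why a metric such as recall on the fraud class is more informative here.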
We'll show only the best result, which was achieved through undersampling. For this, we used ClusterCentroids, which runs k-means on the majority class and replaces its original samples with the resulting cluster centroids.
As the image above shows, after undersampling the number of non-fraud cases equals the number of fraud cases, which confirms that the dataset was successfully balanced.
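The idea behind centroid-based undersampling can be sketched in plain Python. This is a simplified illustration, not the imbalanced-learn implementation and not the project's code: the helper names `kmeans` and `undersample_with_centroids` and the toy 2-D data are assumptions made for the example. In practice, the `ClusterCentroids` class from `imblearn.under_sampling` does this through its `fit_resample` method.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


def undersample_with_centroids(X, y, majority=0, minority=1, seed=0):
    """Replace the majority class with as many k-means centroids as there are minority samples."""
    maj = [x for x, label in zip(X, y) if label == majority]
    mino = [x for x, label in zip(X, y) if label == minority]
    centroids = kmeans(maj, k=len(mino), seed=seed)
    return centroids + mino, [majority] * len(centroids) + [minority] * len(mino)


# Hypothetical toy data: 94 non-fraud points and 6 fraud points in two dimensions.
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(94)]
X += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(6)]
y = [0] * 94 + [1] * 6

X_res, y_res = undersample_with_centroids(X, y)
print(y_res.count(0), y_res.count(1))  # 6 6
```

The fraud samples are kept untouched, while the non-fraud class is summarized by synthetic centroid points rather than by a random subset of real claims.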
Our best model was RandomForestClassifier. First, we tuned its hyperparameters to find the combination that leads to the best result.
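A hyperparameter search of this kind might look like the sketch below, which uses scikit-learn's `GridSearchCV`. The synthetic data, the parameter grid and the choice of recall as the scoring metric are assumptions for illustration. The project may have used a different search strategy and different values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, roughly balanced data standing in for the resampled training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Hypothetical grid; the values actually searched in the project are not stated here.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="recall",  # favour catching frauds over raw accuracy
    cv=3,              # 3-fold cross-validation
)
search.fit(X, y)

print(search.best_params_)
best_model = search.best_estimator_  # refit on all of X, y with the best parameters
```

Scoring on recall rather than accuracy keeps the search aligned with the goal of catching as many frauds as possible.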
With the model prepared, we plotted the confusion matrix. As we can see, only 5 frauds were wrongly predicted, in contrast to 262 frauds correctly predicted, a hit rate of 98%!
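The 98% hit rate follows directly from the two counts above. The sketch below reproduces the arithmetic, assuming the 5 wrongly predicted frauds are false negatives (frauds labelled as non-fraud), so that the hit rate is the recall on the fraud class.

```python
# Counts reported from the confusion matrix.
true_positives = 262   # frauds correctly predicted as fraud
false_negatives = 5    # frauds wrongly predicted (assumed: labelled as non-fraud)

# Recall (hit rate) = frauds caught / all real frauds.
recall = true_positives / (true_positives + false_negatives)

print(f"{recall:.1%}")  # 98.1%
```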
Our decision was also based on the ROC curve, which is displayed below:
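For readers unfamiliar with how a ROC curve is built, here is a minimal sketch in plain Python. The labels and scores are hypothetical toy values, not the project's predictions, and the helper names `roc_points` and `auc` are made up for this example. Tied scores are not handled specially.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold one sample at a time."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    # Visit samples from the highest score to the lowest.
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical toy example: 1 = fraud, 0 = non-fraud, with predicted fraud scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
print(round(auc(points), 3))  # 0.889
```

An area close to 1 means the model ranks frauds above non-frauds almost every time, while 0.5 corresponds to random guessing.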
By some estimates, our method could save the insurance company US$ 434,022.47 per year, a 97.75% cut in expenses!
We also deployed the model. You can check it here: https://mathlaranjeira-vehicle-claim-fraud-detection-deploymain-s4auzj.streamlitapp.com/
Please check out our project! We tested several hypotheses, sampling methods (such as oversampling and undersampling), outlier detection and many more techniques.