Machine learning estimations for insurance companies
Started on Jan. 6, 2020
When a trial is over, a summary of the trial is published containing all the important information about the case that has just been judged.
This document is called jurisprudence in French.
In the case of a trial between a victim and an insurer, this document contains all the circumstances of the case, together with the medical and financial data, from the first injuries to the final compensation amounts.
At Predilex, we have “jurisprudence” data as text files and we want to build an algorithm to automate the extraction of the relevant information.
In this challenge, we want to extract from "jurisprudence" the sex of the victim, the date of the accident and the date of the consolidation of the injuries.
The inputs are “.txt” files containing a whole “jurisprudence” publication.
The documents come from different courts and different judges. The samples are taken over a period long enough to observe changes in the formulations, but the words and phrases are often similar.
The outputs are the data we want to extract: sex of the victim, date of accident, date of consolidation.
We provide them in a “.csv” file as columns. Each line of this file will be linked to a “jurisprudence” text file with its trial number and court.
The sex of the victim is always stated in the document and can take only two values: "homme" and "femme".
The date of the accident is the date when the accident happened. Except in very rare cases, it appears somewhere in the document (usually at the beginning). We expect a date in the format dd/mm/yyyy.
The date of consolidation is the date when the injuries of the victim became stable and were declared final by a physician. It should be present in most cases, but sometimes it is either missing (so we put "n.c." in the csv file) or not applicable (so we put "n.a." in the csv file) because the injuries did not stabilize before the death of the victim.
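The three possible forms of a date field described above (a dd/mm/yyyy date, "n.c." or "n.a.") can be checked with a small parser. This is only a sketch: the function name `parse_date_label` and the two constants are hypothetical and not part of the challenge material.

```python
from datetime import date, datetime
from typing import Union

MISSING = "n.c."         # date could not be found in the document
NOT_APPLICABLE = "n.a."  # injuries did not stabilize before the victim's death


def parse_date_label(value: str) -> Union[date, str]:
    """Return a date for dd/mm/yyyy values, or the marker string unchanged.

    Raises ValueError for anything else (hypothetical helper, for illustration).
    """
    value = value.strip()
    if value in (MISSING, NOT_APPLICABLE):
        return value
    return datetime.strptime(value, "%d/%m/%Y").date()
```

Running such a check over a prediction file before submitting it can catch format mistakes such as yyyy-mm-dd dates early.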
The sex of the victim was classified by counting the occurrences of a few key words in the text, such as "il" vs "elle", "monsieur" vs "madame", "né" vs "née"...
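The keyword-counting idea can be sketched in a few lines. The word lists below contain only the three pairs quoted above, and the tie-breaking rule (defaulting to "homme") is an arbitrary assumption; the actual benchmark may use more keywords and a different rule.

```python
import re
from collections import Counter

# Only the keyword pairs quoted in the description; a real list is likely longer.
MALE_WORDS = {"il", "monsieur", "né"}
FEMALE_WORDS = {"elle", "madame", "née"}

# \w matches accented letters for str patterns, so "née" stays a single token.
TOKEN = re.compile(r"\w+")


def classify_sex(text: str) -> str:
    """Return "femme" if female keywords outnumber male ones, else "homme"."""
    counts = Counter(TOKEN.findall(text.lower()))
    male = sum(counts[word] for word in MALE_WORDS)
    female = sum(counts[word] for word in FEMALE_WORDS)
    # Assumption: ties (including zero matches) fall back to "homme".
    return "femme" if female > male else "homme"
```

For example, `classify_sex("Madame X, née en 1970. Elle a été blessée.")` counts three female keywords and no male ones.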
For the extraction of the dates, we extracted all the sentences in the text that contain a date and classified those sentences based on a bag-of-words representation. The classification was done by an SVM classifier.
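The first half of this step, finding sentences that contain a date and turning them into word counts, might look like the sketch below. The date patterns (numeric dd/mm/yyyy and French "12 mars 2005" forms), the naive sentence splitter and the function names are all assumptions made for illustration; the SVM itself is left out.

```python
import re
from collections import Counter

MONTHS = ("janvier|février|mars|avril|mai|juin|juillet|août|"
          "septembre|octobre|novembre|décembre")

# Assumed date formats: 12/03/2005 or "12 mars 2005" / "1er mars 2005".
DATE_RE = re.compile(
    rf"\b\d{{1,2}}/\d{{1,2}}/\d{{4}}\b"
    rf"|\b(?:1er|\d{{1,2}})\s+(?:{MONTHS})\s+\d{{4}}\b",
    re.IGNORECASE,
)

# Naive splitter: breaks after ., ;, ! or ? followed by whitespace.
# Abbreviations such as "M." would be split too; a real pipeline needs more care.
SENTENCE_SPLIT = re.compile(r"(?<=[.;!?])\s+")


def date_sentences(text: str) -> list:
    """Return the sentences of `text` that contain at least one date."""
    return [s for s in SENTENCE_SPLIT.split(text) if DATE_RE.search(s)]


def bag_of_words(sentence: str) -> Counter:
    """Count the alphabetic tokens of a sentence (digits are ignored)."""
    return Counter(re.findall(r"[^\W\d_]+", sentence.lower()))
```

The resulting counts can then be converted to feature vectors and passed to any SVM implementation, with one classifier per field.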
In every file, and for each field ("date accident" and "date consolidation"), we ranked the sentences by their SVM score. Our prediction for each field is the date from the sentence with the best score, provided this score is above a threshold (otherwise we predict "n.c.").
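The ranking and thresholding rule can be written independently of the classifier. In this sketch, `score` stands for any function that maps a sentence to a number, such as the decision value of a trained SVM; the function name `pick_date`, the candidate format and the default threshold of 0.0 are assumptions, not values from the benchmark.

```python
from typing import Callable, List, Tuple


def pick_date(
    candidates: List[Tuple[str, str]],
    score: Callable[[str], float],
    threshold: float = 0.0,
) -> str:
    """Choose the date of the best-scoring sentence, or "n.c.".

    `candidates` holds (sentence, date string) pairs for one document and
    one field. The date is returned only if the top score exceeds `threshold`.
    """
    if not candidates:
        return "n.c."
    scored = [(score(sentence), found_date) for sentence, found_date in candidates]
    best_score, best_date = max(scored, key=lambda pair: pair[0])
    return best_date if best_score > threshold else "n.c."
```

The function would be called once per field, each time with the score function of the classifier trained for that field.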
The files are accessible once you are logged in and registered for the challenge.