Market surveillance
Started on Jan. 6, 2023
The EU taxonomy for sustainable activities is a classification system establishing a list of environmentally sustainable economic activities. It makes it possible to measure the "greenness" of an investment. Since 2022, issuers subject to this regulation have had to report their Key Performance Indicators (KPIs) related to the EU Taxonomy in their annual financial reports.
Since such reports are very long and complex (about 500 pages), manually identifying sections of interest, e.g. those that contain relevant KPIs, is cumbersome. To facilitate the decision-making process, it is essential to be able to swiftly locate the "Taxonomy KPIs" in annual financial reports.
To that end, the AMF has been implementing an NLP-based solution to automatically analyze whether several thousand issuers comply with the EU taxonomy regulation. Since compliance is measured with the help of the aforementioned KPIs, the solution must be able to detect each section in a document: where it starts (the title of the section) and where it ends (the title of the next section).
The goal of the challenge is to rebuild the Table of Contents (ToC) of annual financial reports, based on the text blocks of the document and their metadata (position, font, font size, etc.). This task may be split into two sub-tasks: first, determining whether a block of text is a title (binary classification of text blocks); then, determining the level of the title in the annotation (level 0 corresponds to the title of the document, level 1 to a section, level 2 to a sub-section, etc.). The screenshot below shows an example of the labelling tool developed by the AMF to annotate every text block of an annual financial report.
Figure 1: Annotation of titles in the document
It should be noted that annual financial reports are supposed to have been produced in a "machine-readable" format (XHTML) since the ESEF (European Single Electronic Format) regulation entered into force in 2022. In practice, however, documents are not machine-readable, except for the IFRS consolidated financial statements, for which ESEF defines specific tags. This means that it is not possible to identify the different sections of a document from the XHTML tags alone.
The dataset is stored in CSV files:
The label file contains the annotated titles, each identified by the file and the idx of its text block (the index is relative to the file), together with the annotation that contains the level of the title.
The train_data contains 1 072 345 rows (each row is a text block and its metadata), corresponding to 71 documents; the test_data contains 252 250 rows, corresponding to 21 documents.
The columns of train_data and test_data describe each text block together with its metadata (position, font, font size, etc.).
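As an illustration, the files can be loaded with pandas and the labels joined to the text blocks on the 'file' and 'idx' columns. The file name "y_train.csv" and the column name "annotation" below are assumptions; use the names provided on the challenge page.

```python
import pandas as pd

# Hypothetical file names; adjust to those provided with the challenge.
blocks = pd.read_csv("train_data.csv")  # one row per text block and its metadata
labels = pd.read_csv("y_train.csv")     # one row per annotated title

# A block is a title iff it appears in the label file;
# 'annotation' is assumed to hold the title level.
train = blocks.merge(labels, on=["file", "idx"], how="left")
train["is_title"] = train["annotation"].notna()
```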
The metric is the mean percentage of each section that can be retrieved. A section comprises all text blocks between a given title and the following title of the same or upper level. For instance, if the current text block is a level 2 title, then all following text blocks are included in the section until a level 2, 1 or 0 title is met (see the definition of a title level below).
tss = true section size, the number of blocks between the title and the next true title of the same or upper level.
pss = predicted section size, the number of blocks between the title and the next predicted title of the same or upper level.
NT = number of titles (after outer join between predicted and true)
$$\text{score} = \frac{1}{NT}\sum_{i=1}^{NT}\frac{\min(tss_i,\; pss_i)}{\max(tss_i,\; pss_i)}$$
This metric is computed by joining the predicted titles with the true titles based on 'file' and 'idx' columns. False positive and false negative titles count as 0 percent of the retrieved section.
Assuming all subsections are of equal length:
Title | Prediction | Label | Score |
---|---|---|---|
1 | 1 | 1 | 100% |
1.1 | 2 | 2 | 25% |
1.2 | 3 | 2 | 100% |
1.3 | 3 | 2 | 100% |
1.4 | 3 | 2 | 100% |
2 | 1 | 1 | 100% |
2.1 | 2 | 2 | 50% |
2.2 | 2 | | 0% |
2.3 | 2 | 2 | 100% |
In this example, the average section size recovered is 75%.
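For concreteness, here is a minimal sketch of the metric for a single document, assuming each title table has an 'idx' and a 'level' column (names assumed) and that section sizes are counted in text blocks:

```python
import numpy as np
import pandas as pd

def section_sizes(titles: pd.DataFrame, n_blocks: int) -> pd.Series:
    """Blocks between each title and the next title of the same or upper
    level (i.e. a smaller or equal integer), indexed by the title's idx."""
    t = titles.sort_values("idx").reset_index(drop=True)
    sizes = {}
    for i in range(len(t)):
        end = n_blocks  # the last section runs to the end of the document
        for j in range(i + 1, len(t)):
            if t.loc[j, "level"] <= t.loc[i, "level"]:
                end = t.loc[j, "idx"]
                break
        sizes[t.loc[i, "idx"]] = end - t.loc[i, "idx"]
    return pd.Series(sizes)

def toc_score(true_titles: pd.DataFrame, pred_titles: pd.DataFrame,
              n_blocks: int) -> float:
    tss = section_sizes(true_titles, n_blocks).rename("tss")
    pss = section_sizes(pred_titles, n_blocks).rename("pss")
    joined = pd.concat([tss, pss], axis=1)  # outer join on idx
    # A NaN on either side is a false positive / false negative and scores 0.
    ratio = np.where(joined.isna().any(axis=1), 0.0,
                     joined.min(axis=1) / joined.max(axis=1))
    return float(ratio.mean())
```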
The benchmark algorithm has two components:
The first model is a bidirectional GRU network that predicts whether a text block is a title based on the following features and architecture:
This model gets an f1-score of 0.71.
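Since the feature list is not reproduced here, the following is only a sketch of such a model in Keras, treating a document as a variable-length sequence of blocks, each described by a vector of numeric features (font size, position, etc.); the feature count and layer sizes are assumptions, not the benchmark's actual settings:

```python
import tensorflow as tf
from tensorflow.keras import layers

n_features = 8  # assumed number of numeric features per text block

# One document = one variable-length sequence of text blocks.
inputs = layers.Input(shape=(None, n_features))
x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(inputs)
# One title / not-title probability per block.
outputs = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
```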
A random forest classifier is then trained to predict whether a given title is at a lower level, the same level or an upper level from the previous title.
Note: definition of a title level
Next, we define an ordering relationship between titles within a document that reflects the way titles are hierarchically arranged. A section is hierarchically more important than a subsection, so we say that the section's level is higher than that of the associated subsection. For instance, the level of section 1 is greater than that of subsection 1.1.
In practice, the model does not predict the title's real level (i.e. whether the title is 'section 1' or 'subsection 1.1'). Instead, it predicts abstract levels represented by integers that describe the ordering relationship between titles. The only caveat is that the greater the integer representing a title level, the lower the title stands in the document's title hierarchy. That is, if the true level of 'section 1' is equal to 0, then the title level of 'subsection 1.1' will be equal to 1, that of 'sub-subsection 1.1.1' will be equal to 2, that of 'section 2' will be equal to 0, that of 'subsection 2.1' will be equal to 1, etc.
In what follows, when mentioning a title level, we will systematically refer to the hierarchical relationship between real title levels as opposed to the ordering relationship between abstract title levels, represented by integers.
A standard random forest model is used here. The input features are those mentioned above, and each row corresponds to a text block that was labeled as a title by the first model. The second model makes predictions in the set {-1, 0, +1}, where: -1 means the title is at a lower level than the previous title, 0 means it is at the same level, and +1 means it is at an upper level.
This model gets an f1-score of 0.87. (The 'average' parameter is set to 'micro'.)
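A minimal scikit-learn sketch of this step, with stand-in data (the real features are those of the text blocks; the targets take values in {-1, 0, +1}):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Stand-in data: one row per detected title, with numeric block features.
X = rng.normal(size=(1000, 8))
y = rng.choice([-1, 0, 1], size=1000)  # lower / same / upper level

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X[:800], y[:800])
print(f1_score(y[800:], rf.predict(X[800:]), average="micro"))
```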
Given predictions from the second model, one can reconstruct title levels sequentially. In other words, one wants to translate values from the set {-1, 0, +1} into integer title levels.
Title | Font size | RF prediction | Level |
---|---|---|---|
Document | 20px | 0 | 0 |
1. | 15px | -1 | 1 |
1.1 | 12px | -1 | 2 |
1.2 | 12px | 0 | 2 |
2. | 15px | 1 | 1 |
Looking at the last row, one predicts that the level of 'title 2.' is 1, because the font size closest to that of 'title 2.' is the one of 'title 1.' (i.e. 15px).
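One possible sequential decoding, consistent with the table above: -1 opens a level one deeper, 0 keeps the current level, and +1 jumps to the upper level whose last font size is closest to the current title's. Only the font-size tie-breaking is described in the text; the rest of this sketch is an assumption.

```python
def rebuild_levels(titles):
    """titles: list of (font_size, rf_prediction) pairs in document order.
    Returns one integer level per title."""
    levels = []
    size_at_level = {}            # last font size seen at each level
    for font_size, pred in titles:
        if not levels:            # the first title is the document root
            level = 0
        elif pred == -1:          # one level deeper than the previous title
            level = levels[-1] + 1
        elif pred == 0:           # same level as the previous title
            level = levels[-1]
        else:                     # pred == +1: pick the upper level whose
                                  # font size is closest to the current one
            upper = {l: s for l, s in size_at_level.items() if l < levels[-1]}
            level = (min(upper, key=lambda l: abs(upper[l] - font_size))
                     if upper else levels[-1])  # fallback if no upper level yet
        size_at_level[level] = font_size
        levels.append(level)
    return levels

# Reproduces the table above:
print(rebuild_levels([(20, 0), (15, -1), (12, -1), (12, 0), (15, 1)]))
# -> [0, 1, 2, 2, 1]
```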
This method (i.e. combining model 1 and model 2) reaches an average score of 51%. In other words, a little more than half of each section is retrieved by the algorithm on average.
Files are accessible once you are logged in and registered to the challenge.