Challenge data

Description

Competitive challenge

Economic sciences

Finance

Classification

Texts

10MB to 1GB

Advanced level

Dates

Started on Jan. 6, 2023

Challenge context

The EU taxonomy for sustainable activities is a classification system that establishes a list of environmentally sustainable economic activities. It makes it possible to measure the "greenness" of an investment. Since 2022, issuers subject to this regulation have had to report their Key Performance Indicators (KPIs) related to the EU Taxonomy in their annual financial reports.

Since such reports are very long and complex (about 500 pages), manually identifying sections of interest, e.g. those that contain relevant KPIs, is cumbersome. To facilitate the decision-making process, it is essential to be able to swiftly locate "Taxonomy KPIs" in annual financial reports.

To that end, the AMF has been implementing an NLP-based solution to automatically analyze whether several thousand issuers comply with the EU taxonomy regulation. Since compliance is measured with the help of the aforementioned KPIs, the solution must be able to detect every section in a document: where it starts (at the title of the section) and where it ends (at the title of the next section).

Challenge goals

The goal of the challenge is to rebuild the Table of Contents (ToC) of annual financial reports, based on the text blocks of the document and their metadata (position, font, font size, etc.). This task may be split into two sub-tasks: first, determining whether a block of text is a title (binary classification of text blocks), and then determining the level of the title, given in the annotation (level 0 corresponds to the title of the document, level 1 to a section, level 2 to a subsection, etc.). The screenshot below shows an example of the labelling tool developed by the AMF to annotate every text block of an annual financial report.

Figure 1: Annotation of titles in the document

It should be noted that annual financial reports have been required to be produced in a "machine-readable" format (XHTML) since the ESEF (European Single Electronic Format) regulation entered into force in 2022. In practice, however, the documents are not "machine-readable", except for the IFRS consolidated financial statements, for which ESEF defines specific "tags". This means that the different sections of a document cannot be identified from the XHTML tags.

Data description

The dataset is stored in CSV files:

train_label.csv
train_data.csv
test_data.csv

The label file lists the detected titles, each identified by its file and the idx of the text block (the index is relative to the file), together with an annotation that contains the level of the title. train_data contains 1,072,345 rows (each row is a text block and its metadata), corresponding to 71 documents; test_data contains 252,250 rows, corresponding to 21 documents.

The columns of train_data and test_data:

file : the name of the file (used to join with label)
page_num : the page number, as estimated by a model that aims to find the split between pages
idx : the index of the text block within the file (used to join with label)
txt : the actual text contained in the block
x0 : the x coordinate of the upper left point of the text block (rectangle)
y0 : the y coordinate of the upper left point of the text block (rectangle)
orig_y0 : the y coordinate from the beginning of the document of the upper left point of the block of text (rectangle)
x1 : the x coordinate of the lower right point of the text block (rectangle)
y1 : the y coordinate of the lower right point of the text block (rectangle)
orig_y1 : the y coordinate from the beginning of the document of the lower right point of the block of text (rectangle)
font : the font name used for the block
font_size : the size of the font
font_color : the color of the font, in CSS rgb or rgba format
font_style : the style used with the font
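
As an illustration of how the `file` and `idx` columns can be used, here is a minimal sketch that joins text blocks with their labels to derive a per-block title target. Tiny in-memory frames stand in for train_data.csv and train_label.csv, and the name of the level column (`annotation`) is an assumption based on the description above; the real header may differ.

```python
import pandas as pd

# Toy stand-ins for train_data.csv and train_label.csv (values are illustrative).
data = pd.DataFrame({
    "file": ["a.xhtml"] * 4,
    "idx": [0, 1, 2, 3],
    "txt": ["Annual report", "1. Overview", "Some body text", "1.1 Strategy"],
    "font_size": [20.0, 15.0, 10.0, 12.0],
})
labels = pd.DataFrame({
    "file": ["a.xhtml"] * 3,
    "idx": [0, 1, 3],
    "annotation": [0, 1, 2],  # assumed name of the title-level column
})

# Left join on (file, idx): blocks without a label row are not titles.
blocks = data.merge(labels, on=["file", "idx"], how="left")
blocks["is_title"] = blocks["annotation"].notna()
```

The `is_title` column can then serve as the target of the binary sub-task, and `annotation` as the target of the level sub-task.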

The metric is the mean percentage of each section that is retrieved. A section comprises all text blocks between a given title and the following title of the same or a higher level. For instance, if the current text block is a level 2 title, then all subsequent text blocks are included in the section until a level 2, 1 or 0 title is met (see "Definition of a title level").

tss = true section size, the number of blocks between the title and the next title of the same or upper level.

pss = predicted section size, the number of blocks between the predicted title and the next predicted title of the same or upper level.

NT = number of titles (after outer join between predicted and true)

$m = \frac{1}{NT} \sum_{i=1}^{NT} \left(1 - \min\left(1, \frac{\text{tss}_i - \text{pss}_i}{\text{tss}_i}\right)\right)$

This metric is computed by joining the predicted titles with the true titles on the 'file' and 'idx' columns. False positive and false negative titles count as 0% of the section retrieved.
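
The join step can be sketched with pandas as follows. This is only an illustration with made-up rows, not the official scoring code; the column name `level` and the frame names are hypothetical.

```python
import pandas as pd

# Illustrative true and predicted titles for one document.
true = pd.DataFrame({"file": ["a"] * 3, "idx": [0, 5, 9], "level": [0, 1, 1]})
pred = pd.DataFrame({"file": ["a"] * 3, "idx": [0, 5, 7], "level": [0, 1, 2]})

# Outer join on (file, idx); `_merge` records where each row came from.
joined = true.merge(pred, on=["file", "idx"], how="outer",
                    suffixes=("_true", "_pred"), indicator=True)

nt = len(joined)                                            # NT in the formula
false_negatives = (joined["_merge"] == "left_only").sum()   # true title missed
false_positives = (joined["_merge"] == "right_only").sum()  # spurious title
```

Rows flagged `left_only` or `right_only` are the ones that contribute a score of 0 to the mean.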

Prediction example

Assuming all subsections are of equal length:

For subsection 1.1, we retrieve 4 times as much text as required. Indeed, the model predicts that titles 1.2, 1.3 and 1.4 are at a lower level than title 1.1, whose predicted level is 2. Hence, subsections 1.2, 1.3 and 1.4 are treated as sub-subsections of subsection 1.1, and all text up to the next title whose level is equal to or higher than that of subsection 1.1 is attributed to subsection 1.1. In this case, we stop at title 2, whose predicted level is 1, which is a higher title level than the level 2 of subsection 1.1.
For subsection 2.1, we retrieve twice as much text as required, because the reconstruction includes subsections 2.1 and 2.2 (i.e. we stop at the title of subsection 2.3, when we should have stopped at the title of subsection 2.2).
For subsection 2.2, the title is not recovered, so 0% of the section is retrieved.

Title	Predicted level	True level	Score
1	1	1	100%
1.1	2	2	25%
1.2	3	2	100%
1.3	3	2	100%
1.4	3	2	100%
2	1	1	100%
2.1	2	2	50%
2.2		2	0%
2.3	2	2	100%

In this example, the average section size recovered is 75%.
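
The percentages in the table can be reproduced with a short script. This is a sketch under stated assumptions, not the official scoring code: titles are placed every 10 blocks (equal-length subsections), a section's size is counted from its title up to the next title of the same or a higher level, and the per-title score is taken as min(tss, pss) / max(tss, pss). That ratio is an assumption, chosen because it matches the 25% and 50% entries above; the formula as printed simplifies to pss/tss, which would exceed 100% when the predicted section is too long. The function names are hypothetical.

```python
def section_sizes(titles, n_blocks):
    """Map each title idx to its section size.

    titles: list of (idx, level) sorted by idx; a smaller integer means a
    higher level. A section ends at the next title whose level is the same
    or higher (numerically <=), or at the end of the document.
    """
    sizes = {}
    for pos, (idx, level) in enumerate(titles):
        end = n_blocks
        for next_idx, next_level in titles[pos + 1:]:
            if next_level <= level:
                end = next_idx
                break
        sizes[idx] = end - idx
    return sizes


def mean_score(true_titles, pred_titles, n_blocks):
    tss = section_sizes(true_titles, n_blocks)
    pss = section_sizes(pred_titles, n_blocks)
    scores = []
    for idx in sorted(set(tss) | set(pss)):  # outer join on idx
        if idx in tss and idx in pss:
            # Assumed per-title score: ratio of the smaller to the larger size.
            scores.append(min(tss[idx], pss[idx]) / max(tss[idx], pss[idx]))
        else:
            scores.append(0.0)  # false positive or false negative
    return scores, sum(scores) / len(scores)


# Titles 1, 1.1, 1.2, 1.3, 1.4, 2, 2.1, 2.2, 2.3 placed every 10 blocks.
true_titles = [(0, 1), (10, 2), (20, 2), (30, 2), (40, 2),
               (50, 1), (60, 2), (70, 2), (80, 2)]
# Same titles with the predicted levels; title 2.2 (idx 70) is missed.
pred_titles = [(0, 1), (10, 2), (20, 3), (30, 3), (40, 3),
               (50, 1), (60, 2), (80, 2)]

scores, mean = mean_score(true_titles, pred_titles, n_blocks=90)
```

With these inputs, `scores` follows the Score column of the table row by row and `mean` is 0.75.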

Benchmark description

The benchmark algorithm has two components:

Model 1: Title detection

The first model is a bidirectional GRU network that predicts whether a text block is a title based on the following features and architecture:

Features

The length of the text
The font size normalized per document
The usage of the font in the document
Ratio between font size and block width/height
The number of capital letters
The position of the block
A regex rule that aims to find title numeration.
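
A few of these features can be sketched with pandas as follows. The benchmark's exact definitions are not given, so everything here is an assumption made for illustration: the numbering regex, normalizing the font size by the per-document median, and measuring font usage as the share of blocks in the document that use the same font.

```python
import re
import pandas as pd

# Hypothetical numbering pattern: "1. ", "1.1 ", "2.3.4 " at the start of the text.
NUMBERING = re.compile(r"^\s*\d+(?:\.\d+)*\.?\s")


def basic_features(blocks: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=blocks.index)
    out["txt_len"] = blocks["txt"].str.len()
    out["n_caps"] = blocks["txt"].str.count(r"[A-Z]")
    out["has_numbering"] = blocks["txt"].map(lambda t: bool(NUMBERING.match(t)))
    # Font size relative to the median font size of the same document.
    median_size = blocks.groupby("file")["font_size"].transform("median")
    out["rel_font_size"] = blocks["font_size"] / median_size
    # Share of the document's blocks that use the same font.
    per_font = blocks.groupby(["file", "font"])["txt"].transform("count")
    per_file = blocks.groupby("file")["txt"].transform("count")
    out["font_share"] = per_font / per_file
    return out


blocks = pd.DataFrame({
    "file": ["a", "a", "a", "a"],
    "txt": ["1. Overview", "body text here", "RISK FACTORS", "more text"],
    "font": ["Bold", "Regular", "Bold", "Regular"],
    "font_size": [15.0, 10.0, 15.0, 10.0],
})
feats = basic_features(blocks)
```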

GRU architecture

a bidirectional model: for each text block, the 10 previous blocks and the 10 following blocks are used to make a prediction
number of hidden layers: 3 dense layers + 1 dropout layer
number of neurons per layer: 25
activation function: sigmoid
number of epochs: 5
optimizer: Adam
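
The context window of 10 previous and 10 following blocks can be prepared as a 3-D array before being fed to a recurrent model. The sketch below uses NumPy only and does not reproduce the benchmark network; the zero-padding at document boundaries and the function name `context_windows` are assumptions.

```python
import numpy as np


def context_windows(features: np.ndarray, k: int = 10) -> np.ndarray:
    """Return an array of shape (n_blocks, 2*k + 1, n_features).

    Row i holds the features of blocks i-k .. i+k of one document; positions
    that fall outside the document are filled with zeros.
    """
    n, d = features.shape
    padded = np.zeros((n + 2 * k, d), dtype=features.dtype)
    padded[k:k + n] = features
    return np.stack([padded[i:i + 2 * k + 1] for i in range(n)])


# Six blocks with two features each (illustrative values).
feats = np.arange(12, dtype=float).reshape(6, 2)
windows = context_windows(feats, k=10)
```

The function should be applied one document at a time so that a window never mixes blocks from two different reports.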

Score

This model achieves an F1 score of 0.71.

Model 2: Title level identification

A random forest classifier is then trained to predict whether a given title is at a lower level, the same level or a higher level than the previous title.

Note: definition of a title level

Next, we define an ordering relationship between titles within a document, that reflects the way titles are hierarchically arranged. A section is more important than a subsection, hierarchically speaking, therefore one will say that the section's level is higher than that of the associated subsection. For instance, the level of section 1 is greater than that of subsection 1.1.

In practice, the model does not predict the title's real level (i.e. is the title a 'section 1' or a 'subsection 1.1'?). Instead, it predicts abstract levels represented by integers that describe the ordering relationship between titles. The only caveat is that the greater the integer representing a title level, the lower the title stands in the document's title hierarchy. That is, if the level of 'section 1' is equal to 0, then the level of 'subsection 1.1' is equal to 1, that of 'sub-subsection 1.1.1' is 2, that of 'section 2' is 0, that of 'subsection 2.1' is 1, etc.

In what follows, when mentioning a title level, we will systematically refer to the hierarchical relationship between real title levels as opposed to the ordering relationship between abstract title levels, represented by integers.

Features

The difference between font sizes
The font size from text between titles
A regex that identifies numbers in titles

Model

A standard random forest model is used here. The input features are those mentioned above, and each row corresponds to a text block that was labeled as a title by the first model. The second model makes predictions in the set $\mathcal{Y} = \{-1, 0, 1\}$, where:

$-1$ : the current title level is lower than the previous title level
$0$ : the current title level is equal to the previous title level
$+1$ : the current title level is greater than the previous title level
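
To train such a classifier, the annotated integer levels have to be converted into these relative targets. A minimal sketch of that conversion is given below; the function name is hypothetical, and assigning 0 to the first title (which has no predecessor) is an assumption.

```python
def relative_labels(levels):
    """Convert integer title levels into targets in {-1, 0, +1}.

    A larger integer means a lower position in the hierarchy, so an
    increase in the integer level maps to -1 and a decrease maps to +1.
    """
    if not levels:
        return []
    labels = [0]  # first title: no previous title to compare with
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev:
            labels.append(-1)  # current title is lower in the hierarchy
        elif cur == prev:
            labels.append(0)   # same level
        else:
            labels.append(1)   # current title is higher in the hierarchy
    return labels


targets = relative_labels([0, 1, 2, 2, 1])
```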

Score

This model achieves an F1 score of 0.87 (the 'average' parameter is set to 'micro').

Reconstructing predicted title levels

Given predictions from the second model, one can reconstruct title levels sequentially. In other words, one wants to translate values from the set $\mathcal{Y} = \{-1, 0, 1\}$ into integers.

$-1$ : the current title level is below the previous title level (e.g. if the previous title level is 2, then the current title level will be 3). In this case, the current title value is equal to the previous title value + 1.
$0$ : the current title level is equal to the previous title level (e.g. if the previous title level is 1, then the current title level will be 1). In this case, the current title value is equal to the previous title value.
$+1$ : the current title level is greater than the previous title level (e.g. if the previous title value is 4, the current title value could be 0, 1, 2 or 3). In order to decide which level to pick, one can compare the current title's font size with the font sizes of the previous titles and assign a level to the current title based on the closest match in terms of font size. Let's look at an example:

Title	Font size	RF prediction	Level
Document	20px	0	0
1.	15px	-1	1
1.1	12px	-1	2
1.2	12px	0	2
2.	15px	1	1

Looking at the last row, the level of title '2.' is predicted to be 1, because the font size closest to that of title '2.' is that of title '1.' (i.e. 15px).
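
The sequential reconstruction can be sketched as below. It is one possible reading of the rules above rather than the benchmark code: the first title is given level 0, ties in font-size distance are broken in favour of the most recent title, and a +1 prediction with no higher-level candidate falls back to level 0. These three choices and the function name are assumptions.

```python
def reconstruct_levels(predictions, font_sizes):
    """Turn relative predictions (-1, 0, +1) into integer title levels."""
    levels = []
    for i, (pred, size) in enumerate(zip(predictions, font_sizes)):
        if i == 0:
            level = 0                  # first title: document level
        elif pred == -1:
            level = levels[-1] + 1     # one step down the hierarchy
        elif pred == 0:
            level = levels[-1]         # same level as the previous title
        else:
            # Candidates: earlier titles strictly higher than the previous one.
            # Sort key: font-size distance, then recency.
            candidates = [
                (abs(font_sizes[j] - size), i - j, levels[j])
                for j in range(i)
                if levels[j] < levels[-1]
            ]
            level = min(candidates)[2] if candidates else 0
        levels.append(level)
    return levels


# Rows of the table above: Document, 1., 1.1, 1.2, 2.
predictions = [0, -1, -1, 0, 1]
font_sizes = [20, 15, 12, 12, 15]
levels = reconstruct_levels(predictions, font_sizes)
```

On the rows of the table, this yields the Level column 0, 1, 2, 2, 1.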

Benchmark score

This method (i.e. combining model 1 and model 2) reaches an average score of 51%. In other words, a little more than half of each section is retrieved by the algorithm on average.

Files

Files are accessible when logged in and registered to the challenge

The challenge provider

Market surveillance

Congratulations to the winners of the challenge

1 David Gauthier
2 Tom Lefrere
3 yfe

You can find the whole list of winners of the season here

Challenge Data

Can you retrieve the Table of Content?
by Autorité des Marchés Financiers
