Unify, Unleash, Orchestrate Talent Data
Started on Jan. 10, 2024
Can you predict the professional evolution of an employee?
In many companies, the career evolution of individuals often follows a linear trajectory, marked by multiple stages or steps corresponding to distinct positions within the company's hierarchy. This progression can be likened to subway stations, where individuals move through these stages and may eventually halt, either by departing from the company or changing their career aspirations, similar to travelers disembarking at a specific station.
The goal is to anticipate the likely position an employee will attain within a company. This prediction relies on evaluating information about the employee, such as qualifications, experience, skills, or interests, along with details about the company, such as sector, business type, management style, or size.
It's important to note that this problem's definition is oversimplified and may not capture the genuine complexity of career progression. In reality, it is unlikely that prior information about the employee or the company fully determines career paths. Moreover, many companies lack a clearly defined vertical hierarchy, leading to more intricate and varied career trajectories.
The goal of this test is to employ sequential decision-making methods to predict the step at which an employee is likely to stop advancing through the positions in the company's hierarchy. There are arguably several valid frameworks for solving this, such as:
There are four different positions (or career steps) an employee can reach in a company: Assistant, Executive, Manager, and Director.
We assume that an employee must progress through these positions linearly. For instance, an employee who reaches the Manager position must previously have held the Executive position. Furthermore, we assume that an employee cannot reverse the order of the positions, meaning they cannot go from the Director position back to the Manager position. This sequential hypothesis can be used to propose a reinforcement learning solution to the problem.
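The linear-progression assumption above can be made concrete with a small sketch (the helper names below are hypothetical, not part of the challenge code):

```python
# Hypothetical helper illustrating the linear-progression assumption:
# an employee either stays in place or moves up exactly one step.
POSITIONS = ["Assistant", "Executive", "Manager", "Director"]
RANK = {position: i for i, position in enumerate(POSITIONS)}

def valid_transition(current, nxt):
    """Return True if moving from `current` to `nxt` respects the hypothesis."""
    return RANK[nxt] - RANK[current] in (0, 1)

valid_transition("Executive", "Manager")  # allowed: one step up
valid_transition("Director", "Manager")   # forbidden: going back down
```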
Participants are tasked with leveraging the provided dataset to develop predictive models. These models aim to forecast the target position for each career based on available information, encompassing details about the employee and the company.
The dataset provided for this challenge consists, for a given career, of information about the employee, the company, and the position the employee reached within that company. For simplicity and anonymisation purposes, embeddings are provided directly for employees and companies.
The dataset is structured as follows:
Career (X train / test):
Position (y train / test):
Vector embeddings capture textual information regarding both the employee and the company. These embeddings were generated using BERT, a state-of-the-art model renowned for its excellence in natural language understanding tasks.
The output format for predictions should conform to the provided format, associating each career ID with the predicted position.
For each component, you will find train and test files in .csv format with corresponding fields. These files can easily be loaded using the Pandas and NumPy packages. Note that the embeddings are serialized as strings. To deserialize them, you could use the following lines:
# deserialize one embedding (a JSON string) into a list of floats
import json

embedding = json.loads(df["employee embedding"][index])
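As a minimal, self-contained sketch of the full deserialization step (the toy rows below stand in for the real CSV, and the column name is taken from the snippet above):

```python
import json

import numpy as np
import pandas as pd

# toy rows standing in for the real train file
df = pd.DataFrame({"employee embedding": ["[0.1, 0.2]", "[0.3, 0.4]"]})

# deserialize every row, then stack into an (n_samples, dim) matrix
X = np.vstack(df["employee embedding"].apply(json.loads).tolist())
print(X.shape)  # (2, 2)
```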
The dataset comprises 36K careers, with 29K constituting the training set and 7K forming the test set. These careers originate from 22K distinct employees, with no overlap between the train and test sets, and involve 5K different companies.
It's worth noting the significant imbalance in the data among the four distinct positions. Specifically, "Assistant", "Executive", "Manager", and "Director" account for approximately 29.5%, 63%, 3.5%, and 4% of the careers, respectively.
The metric used for this challenge is the "macro" F1 score, a standard multi-class classification metric suited to cases where classes are highly imbalanced.
The F1 score is the harmonic mean of precision and recall; it therefore gives equal weight to false negatives and false positives, and yields a score between 0 and 1. In a multi-class setting, the F1 score of each class is computed and then averaged. A "macro" average gives the same weight to each class.
The goal of the challenge is thus to maximise this "macro" F1 score on the test set.
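For reference, scikit-learn computes this metric directly; the toy labels below are made up purely for illustration:

```python
from sklearn.metrics import f1_score

y_true = ["Assistant", "Executive", "Executive", "Manager", "Director"]
y_pred = ["Assistant", "Executive", "Manager", "Manager", "Executive"]

# the per-class F1 scores are averaged with equal weight for each class
score = f1_score(y_true, y_pred, average="macro")
print(round(score, 4))  # 0.5417
```

Note that the never-predicted "Director" class contributes an F1 of 0, which pulls the macro average down sharply; this is exactly why the metric penalises models that ignore the rare classes.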
The benchmark proposition for this challenge is an XGBoost classifier. XGBoost is a Python package of gradient-boosted tree models that has won several data science competitions.
For an unoptimised model, the F1 score is around 0.258.
It's worth noting that the true innovation in this project lies in the potential to improve results by exploiting the sequential nature of the careers. We therefore encourage participants who can to explore the reinforcement learning approach as much as possible, rather than limiting themselves to the multi-class framework.
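One hypothetical way to exploit that sequential structure, short of full reinforcement learning, is to recast the four-class problem as a chain of binary advance-or-stop decisions (all names below are illustrative):

```python
POSITIONS = ["Assistant", "Executive", "Manager", "Director"]

def final_position(advance_probs, threshold=0.5):
    """advance_probs[i]: predicted probability of advancing past step i.

    Walk the career chain and stop at the first step where the
    predicted probability of advancing falls below the threshold.
    """
    for i, p in enumerate(advance_probs):
        if p < threshold:
            return POSITIONS[i]
    return POSITIONS[-1]

final_position([0.9, 0.2, 0.8])  # stops at "Executive"
```

The three advance probabilities could come from any per-step classifier; the chain then enforces the same linear-progression hypothesis stated earlier.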
Files are accessible when logged in and registered to the challenge