Challenge Data

Real estate price prediction
by Institut Louis Bachelier

(15/01/2022) The first version of the reduced_images folder contained a corrupted image for listing ann_35876173. If you run into issues, you can either delete the problematic image from the folder or download the updated version of the reduced_images folder that is now available.




Started on Jan. 5, 2022

Challenge context

Institut Louis Bachelier (ILB) is a sponsored research network in Economics and Finance. It is an association under the French law of 1901, created in 2008 at the instigation of the Treasury and the Caisse des Dépôts et Consignations. Through its activities, it aims to involve academics, public authorities and companies in research projects as well as in scientific events and other forums of exchange. The dozens of research projects hosted at ILB focus on four societal transitions: environmental, digital, demographic and financial.

The ILB Datalab is a team of data scientists working alongside researchers of the ILB network on applied research projects for both public and private actors of our economic and financial ecosystem. The ILB Datalab recently collected an extensive amount of French real estate data and would like to conduct analyses and experiments with it. This challenge is an opportunity to do so.

Challenge goals

Real estate prices are usually predicted from numerical data: surface, location, etc. Can you do better by using photos?

Estimating housing prices is a common topic, with a substantial literature on estimating prices from standard data such as location, living area, land area, number of bedrooms, and building age. These approaches are usually sufficient to estimate a price range but lack precision. However, few studies have examined whether adding photos of the asset brings complementary information that would enable a more precise price estimate. The objective is thus to model French housing prices based on the usual hierarchical, tabular data and a few photos (between 1 and 6) for each asset, and to see whether this yields better performance than a model trained without the photos.

Data description

The output y represents the offered (listing) price of French housing assets, in euros.

The input X contains:

  • a listing identifier,
  • the property type (house, apartment, condo, mansion...),
  • the location (approximate latitude, approximate longitude, city, postal code, exposure, floor when applicable...),
  • the size (living area, and land area when applicable),
  • the number of rooms, bedrooms, bathrooms...,
  • energy performance indicators (energy consumption and greenhouse gas emissions),
  • the number of photos attached to the listing,
  • indicators for the presence of a cellar, a balcony, air conditioning...

Alongside this tabular data, we provide a compressed folder containing 1 to 6 photos per listing (see supplementary files). For each listing in the tabular data set, the corresponding photos are located in the folder named "ann_XX", where XX is the listing identifier.
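Under that layout, collecting the photo paths for one listing can be sketched as follows (a minimal sketch: the images root path and the .jpg extension are assumptions, not specified by the challenge):

```python
from pathlib import Path

def list_photo_paths(images_root, listing_id):
    """Return the photo file paths for one listing, assuming the
    challenge layout <images_root>/ann_<listing_id>/*.jpg."""
    folder = Path(images_root) / f"ann_{listing_id}"
    return sorted(folder.glob("*.jpg"))
```

For example, `list_photo_paths("reduced_images", 35876173)` would look inside `reduced_images/ann_35876173`.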

The overall sample size is around 50K listings and 300K photos. The tabular data set takes up ~10MB in CSV format; the images take up ~30GB.

We performed an 80%/20% train/test split, which gives ~40K listings for training and ~10K for testing.
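The challenge ships a fixed split, but for illustration an equivalent 80/20 split over listing indices could be drawn like this (a sketch only; the organizers' actual split procedure and seed are not published):

```python
import random

def split_indices(n, test_frac=0.2, seed=0):
    """Randomly split n listing indices into train/test subsets.
    Illustrative only: the challenge provides a fixed split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_frac)
    return idx[n_test:], idx[:n_test]  # train indices, test indices
```

With n = 50,000 this yields 40,000 training and 10,000 test indices, matching the proportions above.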

Benchmark description

Our benchmark consists in an XGBoost regression model taking as input the tabular features from the X_train.csv file alongside a simple embedding of the corresponding listing images.

The tabular features were preprocessed as follows:

  • imputing missing categorical features with an 'Unknown' modality
  • imputing missing numerical and binary features with the median value
  • simple label encoding of the categorical features

The images were embedded by concatenating the following values:

  • the average red, green and blue values across all pixels of the image (dominant color extraction),
  • histogram counts, quantiles and the mean of the pixel values of a grayscale version of the image,
  • the same counts, quantiles and mean of the pixel values of the image obtained by applying a Sobel edge-detection filter to the grayscale version of the image.

For listings with more than one photo, we considered the average of the image vector embeddings described above.
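The embedding described above can be sketched in pure Python as follows. Note the hedges: the exact quantiles and histogram binning used by the benchmark are not specified, so quartiles stand in here, histogram counts are omitted for brevity, and the image is assumed to arrive as a 2D list of (r, g, b) tuples:

```python
import statistics

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def embed_image(rgb):
    """Hand-crafted embedding of one image: mean RGB (dominant color),
    grayscale summary stats, and the same stats on the Sobel gradient
    magnitude of the grayscale image."""
    h, w = len(rgb), len(rgb[0])
    n = h * w
    mean_rgb = [sum(px[c] for row in rgb for px in row) / n for c in range(3)]
    gray = [[sum(px) / 3 for px in row] for row in rgb]

    def stats(img):
        flat = [v for row in img for v in row]
        qs = statistics.quantiles(flat, n=4)  # quartiles (an assumption)
        return [statistics.fmean(flat)] + qs

    # Sobel gradient magnitude on the interior grayscale pixels.
    sobel = []
    for y in range(1, h - 1):
        srow = []
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * gray[y - 1 + j][x - 1 + i]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * gray[y - 1 + j][x - 1 + i]
                     for j in range(3) for i in range(3))
            srow.append((gx * gx + gy * gy) ** 0.5)
        sobel.append(srow)
    return mean_rgb + stats(gray) + (stats(sobel) if sobel else [])

def embed_listing(images):
    """Average the per-image embeddings when a listing has several photos."""
    vecs = [embed_image(img) for img in images]
    return [sum(v) / len(vecs) for v in zip(*vecs)]
```

A production version would use NumPy and scikit-image (e.g. `skimage.filters.sobel`) rather than nested loops; the structure of the feature vector is the point here.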

The model was fine-tuned using a randomized parameter search grid and 5-fold cross-validation.
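That tuning loop can be sketched generically as below. The benchmark fits an XGBoost regressor (e.g. via scikit-learn's `RandomizedSearchCV`); here a placeholder `train_eval` callable stands in for fitting and scoring the model, and the stride-based folds and lower-is-better scoring are assumptions:

```python
import random

def random_search_cv(train_eval, param_grid, X, y, n_iter=10, k=5, seed=0):
    """Sample n_iter random parameter combinations from param_grid and
    score each with k-fold cross-validation; return the best combination.
    train_eval(params, X_tr, y_tr, X_va, y_va) -> score (lower = better)."""
    rng = random.Random(seed)
    folds = [list(range(i, len(X), k)) for i in range(k)]
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        scores = []
        for i in range(k):
            va = set(folds[i])
            tr = [j for j in range(len(X)) if j not in va]
            scores.append(train_eval(
                params,
                [X[j] for j in tr], [y[j] for j in tr],
                [X[j] for j in sorted(va)], [y[j] for j in sorted(va)],
            ))
        mean = sum(scores) / k
        if mean < best_score:
            best_params, best_score = params, mean
    return best_params, best_score
```

In the actual benchmark, `train_eval` would fit `xgboost.XGBRegressor(**params)` on the training fold and return a regression error on the validation fold.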


