Challenge Data 2023 | OWKIN
At the end of 2023, as part of the Statistical Learning course in the Master’s in Statistical Engineering at the University of Nantes (jointly with Centrale Nantes, Mathematics and Applications option), I had the opportunity to participate in the Challenge Data.
As a result, my teammates and I managed to achieve third place (out of 189 participants) in the challenge we chose.
I would like to warmly thank my teammates Noémie Matrat and Matthieu Trotreau, as well as the Collège de France for inviting us to the awards ceremony.
Jean Leray Mathematics Laboratory
Challenge Data
Each year, Data Science challenges are organized using data provided by public services, companies, or research laboratories. The seasons kick off in January, with the release of around ten new challenges. These challenges are presented as part of Stéphane Mallat’s course at the Collège de France in January. An awards ceremony recognizing the top participants from the previous season usually takes place in February at the Collège de France.
The Challenge Data platform is managed by the Data team (ENS Paris) in partnership with the Collège de France and the Data Lab of the Louis Bachelier Institute. It is supported by the CFM Chair, the PRAIRIE Institute, and the IDRIS of CNRS.
Context of the Challenge Proposed by Owkin
Anatomopathology
Anatomopathology is a medical specialty dedicated to the morphological study of abnormalities in biological tissues. The analysis of these tissues is a critical step for many diagnoses particularly in oncology where it serves as the gold standard (the reference clinical routine). Tissue samples are typically collected during surgery or biopsy . After preprocessing by expert technicians, pathologists examine the samples under a microscope to evaluate biomarkers such as tumor type, cancer stage, etc.
PIK3CA Mutation in Breast Cancer
Recent scientific studies have also shown that histopathology slides contain information underlying the tumor’s genotype . These slides could therefore be used to predict genomic alterations such as point mutations . One such alteration is the PIK3CA mutation in breast cancer.
This mutation is observed in approximately 30-40% of breast cancers, most commonly in estrogen receptor-positive (ER+) cancers . PIK3CA mutations have been associated with good prognoses. More importantly patients carrying this mutation, who are resistant to endocrine therapy, may respond to a targeted therapy class—PI3Kα inhibitors.
Objective of the Challenge
The current method for identifying PIK3CA mutations is DNA sequencing which requires technical and bioinformatics expertise that is not accessible in all laboratories.
An automated solution for detecting the PIK3CA mutation is clinically relevant as it could provide a fast and reliable screening tool, allowing more patients—especially in tertiary centers-to be eligible for personalized therapies leading to better outcomes.
The Owkin challenge is a weakly supervised binary classification problem: the goal is to predict, from a high-resolution digitized histological slide, whether a patient carries a PIK3CA gene mutation.
Weakly Supervised Learning in Digital Pathology
Weakly supervised learning is crucial in digital pathology due to the extremely large size of digitized histology slides (> 100,000 × 100,000 pixels). In their raw form, these slides cannot be directly processed by machine learning algorithms.
To address this, each slide is divided into smaller images (called tiles) of 224 × 224 pixels 112 µm²). Since a slide is labeled only at the global level (mutation present or absent), and is split into many tiles, the learning methods used must intelligently aggregate these tiles to produce a global label.
These methods fall within the field of multiple-instance learning (MIL). Specifically:
If at least one tile contains a mutation-associated pattern, the algorithm should predict mutation presence.
If none of the tiles show such patterns, the algorithm should predict mutation absence.
While weak supervision may introduce noise (some tiles contain no useful information for predicting the global label), it eliminates the need for manual annotation of histology slides by pathologists—a costly and time-consuming task.
Challenge Task
In this challenge, we aim to predict whether a patient carries a PIK3CA gene mutation from a digitized histology slide.
For computational reasons, we retained 1,000 tiles per slide, each selected to ensure it contains tissue.