Thesis is State-of-the-Art on MLE-bench

We're happy to share that Thesis has reached state-of-the-art performance on MLE-bench, OpenAI's benchmark for evaluating AI agents on machine learning engineering.

What is MLE-bench?

MLE-bench consists of 75 real Kaggle competitions spanning image classification, tabular data, time series, and more. Agents are given a task description, a dataset, and 24 hours to produce a submission. Performance is measured by how often an agent earns a medal, using Kaggle's own medal thresholds.
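As a rough illustration of the metric described above, here is a minimal sketch of a medal-rate calculation. It is not MLE-bench's grading code. The `Result` class, its field names, and the single `bronze_threshold` cutoff are hypothetical simplifications, and the scores are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One agent submission on one competition (hypothetical structure)."""
    competition: str
    score: float
    bronze_threshold: float      # assumed: weakest score that still earns a medal
    lower_is_better: bool = False  # e.g. error metrics such as RMSE


def earned_medal(result: Result) -> bool:
    """Return True if the score is at least as good as the medal cutoff."""
    if result.lower_is_better:
        return result.score <= result.bronze_threshold
    return result.score >= result.bronze_threshold


def medal_rate(results: list[Result]) -> float:
    """Fraction of competitions on which the agent earned any medal."""
    if not results:
        raise ValueError("need at least one result")
    return sum(earned_medal(r) for r in results) / len(results)


# Placeholder data for illustration only, not real benchmark scores.
example_results = [
    Result("image-task", score=0.97, bronze_threshold=0.95),
    Result("tabular-task", score=0.41, bronze_threshold=0.35, lower_is_better=True),
    Result("timeseries-task", score=0.80, bronze_threshold=0.78),
]
example_rate = medal_rate(example_results)
print(f"medal rate: {example_rate:.2%}")
```

The first and third placeholder results clear their cutoffs and the second does not, so the sketch reports two medals out of three competitions.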

It's one of the hardest and most realistic benchmarks for AI agents because it requires the full ML engineering loop: understanding the problem, processing data, training models, and iterating on results.

Our Results

When MLE-bench launched, the best reported result was o1-preview with AIDE scaffolding, which medaled on about 17% of competitions. Thesis medaled on 48.44% ± 3.64% of all competitions.

Difficulty	Medal Rate
Low / Lite	65.15% ± 1.52%
Medium	45.61% ± 7.18%
High	31.11% ± 2.22%
All	48.44% ± 3.64%
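The overall figure can be sanity-checked against the per-difficulty rows. The sketch below assumes MLE-bench's split of 22 low, 38 medium, and 15 high complexity competitions. That split is not stated in this post, so treat it as an assumption. Under it, the "All" rate matches a competition-count-weighted average of the three tiers. The function name `overall_rate` is hypothetical.

```python
# Medal rates (percent) taken from the table above.
tier_rates = {"Low": 65.15, "Medium": 45.61, "High": 31.11}

# Assumed number of competitions per tier (sums to 75); not stated in this post.
tier_counts = {"Low": 22, "Medium": 38, "High": 15}


def overall_rate(rates: dict[str, float], counts: dict[str, int]) -> float:
    """Average the tier rates, weighting each tier by its competition count."""
    total = sum(counts.values())
    return sum(rates[tier] * counts[tier] for tier in rates) / total


overall = overall_rate(tier_rates, tier_counts)
print(f"overall medal rate: {overall:.2f}%")
```

With these weights the result rounds to 48.44%, in line with the "All" row. The check covers only the point estimates. It says nothing about how the ± intervals were derived.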

We have just begun.

MLE-bench paper: arxiv.org/abs/2410.07095