Thesis is State-of-the-Art on MLE-bench
We're happy to share that Thesis has reached state-of-the-art performance on MLE-bench, OpenAI's benchmark for evaluating AI agents on machine learning engineering.
What is MLE-bench?
MLE-bench consists of 75 real Kaggle competitions spanning image classification, tabular data, time series, and more. Agents are given a task description, a dataset, and 24 hours to produce a submission. Performance is measured by how often an agent earns a medal, using Kaggle's own medal thresholds.
It's one of the hardest and most realistic benchmarks for AI agents because it requires the full ML engineering loop: understanding the problem, processing data, training models, and iterating on results.
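To make the metric concrete, here is a minimal sketch of how a medal rate could be computed from per-competition outcomes. The outcome lists are hypothetical, the helper names `medal_rate` and `summarize` are ours, and reporting a standard error across repeated runs is an assumption about one common way to summarize such results, not a description of the official MLE-bench grading code.

```python
from math import sqrt
from statistics import mean, stdev


def medal_rate(medaled: list[bool]) -> float:
    """Fraction of competitions in one run where the agent earned any medal."""
    return sum(medaled) / len(medaled)


def summarize(runs: list[list[bool]]) -> tuple[float, float]:
    """Mean medal rate across runs and its standard error (0.0 for a single run)."""
    rates = [medal_rate(run) for run in runs]
    sem = stdev(rates) / sqrt(len(rates)) if len(rates) > 1 else 0.0
    return mean(rates), sem


# Hypothetical outcomes: 3 independent runs over the same 4 competitions.
runs = [
    [True, False, True, False],
    [True, True, True, False],
    [False, False, True, False],
]

avg, sem = summarize(runs)
print(f"medal rate: {avg:.2%} ± {sem:.2%}")
```

Each inner list is one full pass over the benchmark, so the spread between runs reflects how much the agent's results vary from one attempt to the next.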
Our Results
When MLE-bench launched, the best reported result was o1-preview with AIDE scaffolding, which medaled on about 17% of competitions. Thesis medaled on 48.44% ± 3.64% of all competitions.
| Difficulty | Medal Rate |
|---|---|
| Low / Lite | 65.15% ± 1.52% |
| Medium | 45.61% ± 7.18% |
| High | 31.11% ± 2.22% |
| All | 48.44% ± 3.64% |
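As a quick sanity check, the overall figure can be reproduced as a competition-weighted average of the per-split rates. The split sizes used below (22 low, 38 medium, 15 high) are an assumption based on the public MLE-bench split definitions and are not stated in this post; they do sum to the 75 competitions mentioned earlier.

```python
# Medal rates (%) copied from the table above.
split_medal_rate = {"low": 65.15, "medium": 45.61, "high": 31.11}

# ASSUMPTION: number of competitions per difficulty split (not given in this post).
split_size = {"low": 22, "medium": 38, "high": 15}

total = sum(split_size.values())

# Weight each split's medal rate by how many competitions it contains.
overall = sum(split_medal_rate[k] * split_size[k] for k in split_medal_rate) / total

print(f"{total} competitions, overall medal rate {overall:.2f}%")
```

Under that assumption the weighted average agrees with the "All" row to two decimal places.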
We have just begun.
MLE-bench paper: arxiv.org/abs/2410.07095