Measuring AI Ability to Complete Long Tasks

Analysis code available on GitHub
Raw data available here

This is our most up-to-date measurement of the task-completion time horizons for public language models. We intend to update this graph as new measurements become available. For methodological details, including a definition of the task-completion time horizon, see the blog post below and the associated paper.