🌟 Today's Headline
Anthropic's Mythos Model Breaks Long-Horizon Agent Benchmarks
METR, a nonprofit that measures AI capabilities, released research showing Anthropic's Mythos model significantly exceeds existing benchmarks for long-horizon AI agent reliability. At 50% success rate, Mythos breaks through METR's 16-hour measurement ceiling—the longest duration their current task suite can reliably test. More critically, at 80% reliability (the standard for practical deployment), Mythos handles tasks requiring humans over three hours, substantially surpassing Gemini 3.1 Pro, the closest competitor measured. The breakthrough reveals a key misconception: the 16-hour metric measures task complexity/difficulty, not actual runtime. As base models become more capable, agents can sustain focus on complex goals for longer before context degradation becomes fatal. This suggests autonomous AI systems capable of independent work over extended periods are arriving faster than many expected, with major implications for developers building agent-based products.
💬 Editor's Note
Reliability at scale—not speed—is the real inflection point. When agents can trustworthy execute multi-hour workflows without degradation, they graduate from experimental toys to production infrastructure.
New Product
The llm command-line tool releases version 0.32a2 alpha. A major update involves OpenAI's reasoning models: they now use the /v1/responses endpoint instead of the previous /v1/chat/completions endpoint. This change enables interleaved reasoning capabilities, allowing the tool to better support advanced reasoning workflows in applications.
Datasette releases version 1.0a29 as the project approaches the 1.0 production release milestone. Key updates include a new TokenRestrictions.abbreviated(datasette) utility method that simplifies creation of '_r' permission dictionaries, making permission management more accessible to developers.