OpenAI's new voice model brings GPT-5-level reasoning to real-time conversations
OpenAI launches three new voice models—GPT-Realtime-2, GPT-Realtime-Translate, GPT-Realtime-Whisper—enabling real-time reasoning at GPT-5 level, live translation across 70+ languages, and continuous speech transcription for conversational AI.
Anthropic announced a strategic partnership with SpaceX, giving the AI company exclusive access to the full computational capacity of SpaceX's Colossus 1 data center in Memphis, Tennessee. The facility draws more than 300 megawatts of power and houses more than 220,000 NVIDIA GPUs. Anthropic is expected to begin using the compute within the month.
At its 2026 Developer Conference, Anthropic unveiled three significant new features for Claude Managed Agents, its hosted AI agent platform. Multi-agent orchestration enables a coordinator agent to spawn multiple subagents in parallel, improving efficiency for complex multi-step tasks.
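The coordinator-and-subagents pattern described above can be illustrated with a minimal fan-out sketch. This is not the Claude Managed Agents API: `run_subagent` and `coordinator` are hypothetical names, and the subagent body is a stand-in that only simulates work.

```python
import asyncio


async def run_subagent(name, task):
    # Hypothetical stand-in for a real subagent call; sleeps to simulate work.
    await asyncio.sleep(0.01)
    return f"{name}: done {task}"


async def coordinator(tasks):
    # Start one subagent per task and wait for all of them concurrently.
    jobs = [run_subagent(f"subagent-{i}", task) for i, task in enumerate(tasks)]
    # asyncio.gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*jobs)


results = asyncio.run(coordinator(["search", "summarize", "verify"]))
for line in results:
    print(line)
```

Because the subagents run concurrently, total wall-clock time is close to that of the slowest subagent rather than the sum of all of them, which is the efficiency gain the announcement refers to.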
Today, we’re thrilled to announce that Gemini 3.1 Flash-Lite, our fastest and most cost-efficient Gemini 3 series model yet, is now generally available. Designed for ultra-low latency, high-volume tasks, and unmatched cost-efficiency, Flash-Lite is already transform
OpenAI expands Trusted Access program with GPT-5.5 and GPT-5.5-Cyber models, providing verified cybersecurity defenders with enhanced tools for vulnerability research and critical infrastructure protection at frontier model capability.
Comparative study of five frontier LLMs (Claude Sonnet 4.6, GPT 5.5, Gemini 3 Flash, DeepSeek V3.1, Qwen3.5 397B) examining whether reasoning mode changes moral judgments. Results show that moral verdict agreement is nearly identical in instant and thinking modes (Krippendorff's alpha: 0.78 vs 0.79).
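For readers unfamiliar with the agreement statistic cited above, here is a minimal sketch of Krippendorff's alpha for nominal labels. How the study grouped verdicts into raters is not stated in the summary, so this sketch assumes each inner list holds the verdicts assigned to one scenario; the function name is illustrative.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: one list of labels per rated item (missing ratings simply omitted)."""
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # an item with a single rating cannot be paired
        # Every ordered pair of ratings within an item contributes 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the total number of pairable ratings.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    # alpha = 1 - D_o / D_e, with D_o = observed / n and D_e = expected / (n * (n - 1)).
    return 1.0 - (n - 1) * observed / expected


# Two raters, four items, one disagreement (toy data, not from the study).
toy = [["ok", "ok"], ["ok", "wrong"], ["wrong", "wrong"], ["wrong", "wrong"]]
print(round(krippendorff_alpha_nominal(toy), 3))
```

An alpha of 1 means perfect agreement and 0 means agreement no better than chance, so values of 0.78 and 0.79 indicate similar, fairly high agreement in both modes.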
Benchmark comparing Gosset, a specialized pharmaceutical AI platform with curated drug-target annotations, against four frontier LLMs with web search (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) on oncology/immunology drug discovery tasks.
Bibliometric audit reveals a systematic flaw in the academic LLM evaluation literature: researchers evaluate older, cheaper models (e.g., GPT-4o-mini zero-shot) against frontier systems (GPT-5.5 Pro, Claude Opus 4.7) months or years later, which misrepresents model capabilities and leads to misleading conclusions.
Evaluation of four open-weight models (Gemma 3 4B, Llama 3.2 3B, Mistral 7B, OLMo 2 7B) and two domain-adapted models (AfroConfliBERT, AfroConfliLLAMA) on conflict-event classification in Nigeria and Cameroon against ACLED gold-standard benchmark, revealing systematic performance bifurcation.
Researchers introduce AuditRepairBench, a substantial dataset containing 576,000 paired execution traces specifically designed to evaluate stability and reliability in AI agent repair leaderboards. The work identifies and addresses a critical evaluation problem: leaderboard rankings fluctuate significantly when evaluator configurations change, suggesting that many top-ranked repair methods are actually overfitting to evaluator-specific signals rather than achieving genuine, transferable improvements. By operationalizing this "evaluator-channel-blocking" problem, the dataset provides tools for building more trustworthy and interpretable evaluation systems for AI agent repair methods.
Researchers present Lookahead Drifting Model, a refined approach that enhances the drifting model framework for high-quality image generation. The key innovation involves computing a forward-looking drift direction during each training iteration, which allows the model to optimize its generation trajectory more effectively. The method achieves state-of-the-art performance on ImageNet while requiring only a single neural function evaluation (one-step generation). This represents a significant computational efficiency gain over traditional multi-step generative approaches, making high-quality image synthesis more practical for resource-constrained deployments and real-world applications where speed matters.
This research addresses a core challenge in automated bail decision systems: when bail is denied, the counterfactual outcome—whether the defendant would have appeared in court—remains unobserved. This structural label indeterminacy in historical bail data creates a fundamental problem for building fair systems, as automated decision-making trained on such biased data risks perpetuating and amplifying existing inequities in criminal justice.
NVIDIA's GeForce NOW cloud gaming platform has integrated Gaijin single sign-on authentication to streamline the user login experience. By reducing authentication friction, the feature enables gamers to reach their gaming library and start playing with minimal steps.
OpenAI has released Codex version 0.130.0-alpha.1, continuing the rapid iteration cycle on its code generation platform. While the official announcement provides minimal changelog information, this version represents ongoing refinement and incremental improvements to Codex's capabilities.
OpenAI has released version 0.129.0-alpha.16 of its Rust SDK, continuing the incremental development of language-specific bindings for OpenAI's APIs. The official announcement provides minimal changelog details, which is typical for alpha releases that move quickly through iteration cycles.
This research proposes a new perspective on structural hallucinations in diffusion models—anomalies like hands with more than five fingers despite matching training data statistics. Using local intrinsic dimension analysis, the paper offers complementary insights beyond existing mode interpolation theories, advancing understanding of why generative models produce structurally invalid samples.
This paper revisits instruction-guided navigation, questioning how much of the performance improvement actually comes from LLMs rather than from simple geometric engineering. Through controlled experiments, the authors introduce geometry-only baselines that match or exceed LLM performance, suggesting that careful engineering and algorithmic design often matter more than leveraging large language models.
Research from Anthropic's Fellows Program demonstrates that training language models on texts explaining the rationale behind intended values—before teaching specific behaviors—leads to significantly better value adherence, even in novel situations. This approach proves more effective than behavioral training alone for achieving reliable AI alignment.
A developer has published four years of San Francisco criminal court data to Hugging Face, containing 77,000 detailed case records. This comprehensive dataset covers the entire judicial process from initial arrest through final sentencing, making it freely accessible for researchers, legal technologists, and policy advocates.
Real-world clinical evaluation of four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, MedGemma-4B-Instruct) and commercial GPT-4.1 across three public dermatology datasets. Study quantifies the benchmark-to-bedside performance gap in actual clinical dermatology decision-making scenarios.
Paper introduces the first physics-informed DLinear time-series model for forecasting GPU power demand in AI data centers. Addresses rapid power fluctuations from heterogeneous computational tasks, particularly distinct power profiles between LLM inference and training workloads that impact grid stability.
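As background for the item above, the following is a minimal sketch of the plain DLinear idea: split the input window into a moving-average trend and a remainder, apply one linear map to each, and sum the results. The physics-informed component of the paper is not described in the summary and is not modeled here; the function names, the weight matrices, and the omission of bias terms are assumptions made for illustration.

```python
import numpy as np


def moving_average_trend(x, kernel):
    """Moving-average trend with edge replication so the output keeps len(x)."""
    pad_left = (kernel - 1) // 2
    pad_right = kernel - 1 - pad_left
    padded = np.concatenate([np.full(pad_left, x[0]), x, np.full(pad_right, x[-1])])
    return np.convolve(padded, np.ones(kernel) / kernel, mode="valid")


def dlinear_forecast(x, w_trend, w_seasonal, kernel=25):
    """Forecast H steps from a length-L window; both weight matrices have shape (H, L)."""
    trend = moving_average_trend(x, kernel)
    seasonal = x - trend  # remainder after removing the trend
    return w_trend @ trend + w_seasonal @ seasonal


# Toy usage with random, untrained weights (illustrative only).
rng = np.random.default_rng(0)
window = rng.normal(size=96)            # e.g. 96 past power readings
w_t = rng.normal(size=(24, 96)) / 96    # hypothetical trend weights
w_s = rng.normal(size=(24, 96)) / 96    # hypothetical seasonal weights
print(dlinear_forecast(window, w_t, w_s).shape)
```

In practice the two weight matrices would be learned by minimizing forecast error on historical power traces; the sketch only shows the forward pass.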
DeepMind announced EVE Online, the massively multiplayer online role-playing game, as its next benchmark environment for advancing multi-agent artificial intelligence research. EVE's complex in-game economy, persistent world with thousands of concurrent players, and emergent gameplay dynamics create an unprecedented testbed for studying AI agents operating in competitive, cooperative, and mixed-ince…
Anthropic has released three official free certification courses on anthropic.skilljar.com, authored by Claude's creators. The three courses total 6 hours: (1) Claude 101 (1 hour) covers how Claude works and effective prompt patterns; (2) AI Fluency, Framework and Foundations (3 hours) teaches mental models for genuine AI collaboration rather than one-off queries; (3) Intro to Cowork (2 hours) cov…
This paper introduces Dream-MPC, a hybrid reinforcement learning approach combining Model Predictive Control with learned models and policy priors. It addresses limitations of current methods by using gradient-based optimization for planning, effectively leveraging the advantages of both planning-based and policy-based paradigms to improve sample efficiency.
This research introduces SemGrad, the first gradient-based uncertainty quantification method for free-form LLM generation. Unlike existing sampling-heavy approaches that are computationally expensive, SemGrad is sampling-free and computationally efficient.