AI Dev and Latest Releases

[Agentic AI] Google AI Introduces Gemini 2.5 ‘Computer Use’ (Preview): A Browser-Control Model to Power AI Agents to Interact with User Interfaces. Which of your browser workflows would you delegate today if an agent could plan and execute predefined UI actions? Google AI introduces Gemini 2.5 Computer Use, a specialized variant of Gemini 2.5 that plans and executes real UI actions in a live browser via a constrained action API. It’s available in public preview through Google AI Studio and Vertex AI. The model targets web automation and UI testing, with documented, human-judged gains on standard web/mobile control benchmarks and a safety layer that can require human confirmation for risky steps.

[Anthropic Open Source] Anthropic AI Releases Petri: An Open-Source Framework for Automated Auditing by Using AI Agents to Test the Behaviors of Target Models on Diverse Scenarios. Anthropic’s Petri (Parallel Exploration Tool for Risky Interactions) is an MIT-licensed, open-source framework that automates alignment audits by orchestrating an auditor–target–judge loop over realistic, tool-augmented, multi-turn scenarios and scoring transcripts across 36 safety dimensions. In pilot runs on 14 models with 111 seed instructions, Petri surfaced behaviors including deception, whistleblowing, and cooperation with misuse; Claude Sonnet 4.5 and GPT-5 roughly tie on aggregate safety profiles (relative signals, not guarantees). Petri runs via AISI Inspect with a CLI and transcript viewer; docs and token-usage examples are provided..

[Science and AI] Microsoft Research Releases Skala: a Deep-Learning Exchange–Correlation Functional Targeting Hybrid-Level Accuracy at Semi-Local Cost. Skala is a deep-learning exchange–correlation functional for Kohn–Sham Density Functional Theory (DFT) that targets hybrid-level accuracy at semi-local cost, reporting MAE ≈ 1.06 kcal/mol on W4-17 (0.85 on the single-reference subset) and WTMAD-2 ≈ 3.89 kcal/mol on GMTKN55; evaluations use a fixed D3(BJ) dispersion correction. It is positioned for main-group molecular chemistry today, with transition metals and periodic systems slated as future extensions. Azure AI Foundry The model and tooling are available now via Azure AI Foundry Labs and the open-source microsoft/skala repository.

[Upcoming- AI Live Webinar] Scaling AI with Haystack Enterprise: A Developer’s Guide. You’ll learn how to: Extend your expertise with direct access to the Haystack engineering team through private support and consultation hours. Deploy with confidence using Helm charts and best-practice guides for secure, scalable Kubernetes setups across cloud (e.g., AWS, Azure, GCP) or on-prem. Accelerate iteration with pre-built templates for everything from simple RAG pipelines to agents and multimodal workflows, complete with Hayhooks and Open WebUI. Stay ahead of threats with early access to enterprise-grade, security-focused features like prompt injection countermeasures. sponsored

[Explainer] What are ‘Computer-Use Agents’? From Web to OS—A Technical Explainer. TL;DR: Computer-use agents are VLM-driven UI agents that act like users on unmodified software. Baselines on OSWorld started at 12.24% (human 72.36%); Claude Sonnet 4.5 now reports 61.4%. Gemini 2.5 Computer Use leads several web benchmarks (Online-Mind2Web 69.0%, WebVoyager 88.9%) but is not yet OS-optimized. Next steps center on OS-level robustness, sub-second action loops, and hardened safety policies, with transparent training/evaluation recipes emerging from the open community.

Editor’s Pick

You should not miss this one

[Small Language Model] Tiny Recursive Model (TRM): A Tiny 7M Model that Surpass DeepSeek-R1, Gemini 2.5 pro, and o3-mini at Reasoning on both ARG-AGI 1 and ARC-AGI 2. Can an iterative draft–revise solver that repeatedly updates a latent scratchpad outperform far larger autoregressive LLMs on ARC-AGI? Samsung SAIT (Montreal) has released Tiny Recursive Model (TRM)—a two-layer, ~7M-parameter recursive reasoner that reports 44.6–45% test accuracy on ARC-AGI-1 and 7.8–8% on ARC-AGI-2, surpassing results reported for substantially larger language models such as DeepSeek-R1, o3-mini-high, and Gemini 2.5 Pro on the same public evaluations. TRM also improves puzzle benchmarks Sudoku-Extreme (87.4%) and Maze-Hard (85.3%) over the prior Hierarchical Reasoning Model (HRM, 27M params), while using far fewer parameters and a simpler training recipe.

How was today’s email?

Awesome  |   Decent    |  Not Great

Keep Reading

No posts found