Hey folks!

Let's dive into today's newsletter. You can reach out to me directly with any suggestions or comments [[email protected]].

-asIF

AI Dev and Latest Releases

[Multi-Agent Test-Time Scaling] Google Proposes TUMIX: Multi-Agent Test-Time Scaling With Tool-Use Mixture: What if, instead of re-sampling one agent, you could push Gemini-2.5 Pro to 34.1% on HLE by mixing 12–15 tool-using agents that share notes and stop early? Google Cloud AI Research, with collaborators from MIT, Harvard, and Google DeepMind, introduced TUMIX (Tool-Use Mixture)—a test-time framework that ensembles heterogeneous agent styles (text-only, code, search, guided variants) and lets them share intermediate answers over a few refinement rounds, then stop early via an LLM-based judge. The result: higher accuracy at lower cost on hard reasoning benchmarks such as HLE, GPQA-Diamond, and AIME (2024/2025).
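The share-notes-refine-stop loop reads roughly like this. A minimal Python sketch with stub agents, a stand-in judge, and simple majority voting as the final aggregation; in TUMIX itself the agents are tool-using LLM runs and the stopping decision is made by an LLM judge, so everything below is illustrative:

```python
from collections import Counter
from typing import Callable

# An agent sees the question plus everyone's answers from the previous round.
Agent = Callable[[str, list[str]], str]

def tumix_loop(question: str, agents: list[Agent], judge, max_rounds: int = 3) -> str:
    """Run heterogeneous agents over refinement rounds with shared notes.

    judge(question, answers) -> bool is a stand-in for the LLM-based judge:
    return True to stop early. Final answer is a majority vote (illustrative
    aggregation, not necessarily TUMIX's exact selection rule).
    """
    shared_notes: list[str] = []
    answers: list[str] = []
    for _ in range(max_rounds):
        # Each agent answers, conditioned on last round's shared answers.
        answers = [agent(question, shared_notes) for agent in agents]
        shared_notes = answers  # next round sees everyone's current answers
        if judge(question, answers):
            break  # early stop saves the remaining rounds' cost
    return Counter(answers).most_common(1)[0][0]
```

The cost saving comes from the early stop: once the judge sees the answers have converged, the remaining refinement rounds (and their tool calls) are skipped.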

[Small Model] Can a Small Language Model Predict Kernel Latency, Memory, and Model Accuracy from Code? A New Regression Language Model (RLM) Says Yes: Researchers from Cornell and Google introduce a unified Regression Language Model (RLM) that predicts numeric outcomes directly from code strings—covering GPU kernel latency, program memory usage, and even neural network accuracy and latency—without hand-engineered features. A 300M-parameter encoder–decoder initialized from T5-Gemma achieves strong rank correlations across heterogeneous tasks and languages, using a single text-to-number decoder that emits digits with constrained decoding.
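The text-to-number decoding idea can be shown with a toy greedy decoder: at each step, mask the logits so only numeric tokens are eligible, then parse the emitted digits into a float. The digit vocabulary and per-step scores below are made up for illustration and are not the T5-Gemma tokenizer:

```python
# Hypothetical numeric vocabulary; the real tokenizer and constrained-decoding
# machinery in the RLM differ.
NUMBER_VOCAB = set("0123456789") | {".", "-", "<eos>"}

def constrained_decode_number(step_logits: list[dict]) -> float:
    """Greedy decode restricted to number tokens.

    step_logits: one {token: score} dict per decoding step. Non-numeric
    tokens are masked out, so the output is always a parseable number.
    """
    emitted = []
    for logits in step_logits:
        # Constrained decoding: only numeric tokens may be selected.
        allowed = {t: s for t, s in logits.items() if t in NUMBER_VOCAB}
        token = max(allowed, key=allowed.get)
        if token == "<eos>":
            break
        emitted.append(token)
    return float("".join(emitted))
```

Masking guarantees well-formed numeric output even when a non-numeric token would otherwise have the highest score, which is the point of decoding into digits rather than free text.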

[Open and Code Model] Salesforce AI Research Releases CoDA-1.7B: a Discrete-Diffusion Code Model with Bidirectional, Parallel Token Generation. CoDA-1.7B is a discrete-diffusion code LLM that denoises masked sequences with bidirectional context and updates multiple tokens per step (non-autoregressive). The team provides Base and Instruct checkpoints, a reproducible pipeline (TPU pre-training, post-training/SFT, evaluation), and a FastAPI server exposing OpenAI-compatible endpoints with a CLI; decoding is controlled via parameters such as STEPS, ALG="entropy", BLOCK_LENGTH, etc. Reported pass@1 for CoDA-1.7B-Instruct: HumanEval 54.3%, HumanEval+ 47.6%, MBPP 47.2%, MBPP+ 63.2%, EvalPlus aggregate 55.4%; the model card compares to diffusion baselines (e.g., Dream-7B-Instruct 57.9% HumanEval). Checkpoints are released on Hugging Face under CC BY-NC 4.0.
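Since the server speaks the OpenAI-compatible protocol, a request is just a standard chat-completions payload with the diffusion knobs attached. In this hedged sketch the model identifier and the lowercase field names (mirroring the STEPS / ALG / BLOCK_LENGTH controls above) are assumptions, not confirmed API fields; check the repo's server docs for the real names:

```python
import json

def build_coda_request(prompt: str, steps: int = 128,
                       alg: str = "entropy", block_length: int = 32) -> dict:
    """Build an OpenAI-style chat payload for a local CoDA server.

    The diffusion decoding controls are passed as extra fields; their names
    here are hypothetical, echoing the STEPS/ALG/BLOCK_LENGTH parameters.
    """
    return {
        "model": "CoDA-1.7B-Instruct",  # assumed identifier
        "messages": [{"role": "user", "content": prompt}],
        # Diffusion decoding controls (assumed field names):
        "extra_body": {
            "steps": steps,          # number of denoising steps
            "alg": alg,              # token-selection strategy
            "block_length": block_length,
        },
    }

payload = build_coda_request("Write a Python function to reverse a string.")
print(json.dumps(payload, indent=2))
# POST this to the server's /v1/chat/completions endpoint, e.g. with requests.
```

The interesting trade-off lives in `steps`: fewer denoising steps means faster, more parallel generation at some cost in quality, which is the lever autoregressive decoders don't have.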

[Agentic AI] Do curated, tool-grounded demonstrations build stronger software agents than broad piles of generic instruction data? A team of researchers from Shanghai Jiao Tong University and SII Generative AI Research Lab (GAIR) proposes LIMI (“Less Is More for Agency”), a supervised fine-tuning method that turns a base model into a capable software/research agent using 78 samples. LIMI scores 73.5% average on AgencyBench (FTFC 71.7, RC@3 74.2, SR@3 74.6), beating strong baselines (GLM-4.5 45.1, Qwen3-235B-A22B 27.5, Kimi-K2 24.1, DeepSeek-V3.1 11.9) and even surpassing variants trained on 10,000 samples—with 128× less data.

Google DeepMind Introduces CodeMender: A New AI Agent that Uses Gemini Deep Think to Automatically Patch Critical Software Vulnerabilities. CodeMender is Google DeepMind's AI agent for code security. It combines Gemini "Deep Think" planning with static/dynamic analysis, fuzzing, differential testing, and SMT solvers to localize root causes, synthesize patches, and auto-validate them before human review. In six months it has upstreamed 72 security fixes across open-source projects, including repositories of up to ~4.5M lines, and it also performs proactive hardening, such as inserting Clang -fbounds-safety annotations (e.g., in libwebp) to reduce entire classes of memory-safety bugs.

Editor’s Pick

You should not miss this one

OpenAI launched AgentKit, combining a visual Agent Builder, embeddable ChatKit UI, and expanded Evals on the Responses API to ship production agents faster. Agent Builder provides versioned, node-graph workflows with built-in tool calls and guardrails; ChatKit supplies a production chat surface with streaming, threading, and theming; Evals adds datasets and trace grading for end-to-end assessment and prompt optimization. Availability is GA for ChatKit and Evals, with Agent Builder in beta; pricing follows standard API usage.

3 Important OpenAI Updates:

  • Agent Builder: a visual canvas for creating and versioning multi-agent workflows

  • Connector Registry: a central place for admins to manage how data and tools connect across OpenAI products

  • ChatKit: a toolkit for embedding customizable chat-based agent experiences in your product

How was today’s email?

Awesome | Decent | Not Great
