Inside: Xiaomi pushes 1T model past 1,000 tok/s → NVIDIA ships 550B agent MoE → Microsoft transcribes an hour of audio in 15 seconds → Google's RAG loop raises factuality 34%

👋 Hello. You’re reading the AI Dev Brief by MarkTechPost — the daily signal for AI engineers and researchers who build with AI, not just talk about it. No hype. No filler. Just the research, releases, and infrastructure moves that actually matter.

Want to promote your AI Product, GitHub repo, HuggingFace model, product release, or webinar in front of 1,000,000+ AI practitioners? Connect with us

🔥 TODAY’S BRIEFING — STORIES WORTH 5 MINUTES

1. Xiaomi MiMo + TileRT Push a 1-Trillion-Parameter Model Past 1,000 Tokens Per Second on Commodity GPUs — Xiaomi's MiMo team has introduced MiMo-V2.5-Pro-UltraSpeed in collaboration with TileRT, pushing its 1.02-trillion-parameter MIT-licensed MoE past 1,000 tokens per second on commodity GPU hardware. The model has 42B active parameters and a 1M-token context window, and was trained in FP8 on 27 trillion tokens. 1T-parameter inference on commodity GPUs at 1K tok/s is no longer just a research demo; it is now a realistic deployment target.

2. NVIDIA AI Releases Nemotron 3 Ultra: Open 550B Hybrid Mamba-Transformer MoE — Built for Long-Running Agents — NVIDIA has released Nemotron 3 Ultra, a 550B total / 55B active parameter MoE with hybrid Mamba-Transformer layers, LatentMoE expert routing, and NVFP4 quantization, delivering up to 5x higher throughput across GPU architectures. It was trained via multi-teacher on-policy distillation from 10+ domain-specific teacher models, has a 1M-token context window, and is designed specifically for complex, long-running agentic workflows.

3. Microsoft AI Introduces MAI-Transcribe-1.5: 2.4% WER, 43 Languages, Transcribes an Hour of Audio in Under 15 Seconds — Microsoft AI has introduced MAI-Transcribe-1.5, a speech-to-text model that is very fast for its accuracy tier: 2.4% WER, coverage of 43 languages, and up to 5x faster than Gemini 3.1, Scribe v2, and GPT-4o-Transcribe on long audio. It transcribes a full hour of audio in under 15 seconds, is best-in-class on FLEURS accuracy, and leads the accuracy-speed Pareto frontier. Available on Azure AI Foundry now.

4. NVIDIA Releases Nemotron 3.5 ASR: 600M-Parameter Streaming Speech Model — 40 Language-Locales, Fully Local — NVIDIA has released Nemotron 3.5 ASR, a single 600M-parameter cache-aware streaming checkpoint that transcribes 40 language-locale pairs in real time, including English, Spanish, German, French, Italian, Arabic, Japanese, and Korean. It offers native punctuation, capitalization, and runtime-configurable language switching, and runs fully local via ONNX. Available on NGC, Hugging Face, and the Together AI API. One model. 40 language-locales.
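For a sense of scale, the headline figures in the briefing above can be turned into two quick ratios: the share of an MoE's parameters that is active per token, and the transcription speed relative to real time. This is only a back-of-envelope sketch using the numbers quoted in the stories; the function names are illustrative and not part of any vendor's tooling.

```python
# Back-of-envelope ratios from the figures quoted in the briefing items.
# Function names are illustrative, not from any vendor library.

def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of an MoE model's parameters that is active per token."""
    return active_params_b / total_params_b


def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_seconds / wall_seconds


mimo = active_fraction(42, 1020)    # 42B active of 1.02T total
nemotron = active_fraction(55, 550)  # 55B active of 550B total
rtf = real_time_factor(3600, 15)     # one hour of audio in 15 seconds

print(f"MiMo active share: {mimo:.1%}")
print(f"Nemotron 3 Ultra active share: {nemotron:.1%}")
print(f"MAI-Transcribe-1.5: at least {rtf:.0f}x real time")
```

Because the claim is "under 15 seconds", 240x is a lower bound on the speed-up over real time, and it says nothing about the hardware used to reach it.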

Fast browsing. Faster thinking.

Your browser gets you to a page. Norton Neo gets you to the answer. The first safe AI-native browser built by Norton moves with you from idea to action without slowing you down. Magic Box understands your intent before you finish typing. AI that works inside your flow, not beside it. No prompting. No copy-pasting. No switching apps.

Built-in AI, instantly and for free. Privacy handled by Norton. Built-in VPN and ad blocking protect you by default. No configuration. No extra apps. Nothing to think about.

Fast. Safe. Intelligent. That's Neo.

Download Norton Neo

📰 Secondary News

Meet Harness-1: A 20B Retrieval Subagent Trained With RL Inside a Stateful Search Harness — Harness-1 is a 20B search agent built on GPT-OSS-20B and trained with reinforcement learning inside a stateful search harness, reaching 0.730 average curated recall. The model decides how and when to search, not just what to retrieve. Weights and harness code are fully open.

Google's New Colab CLI Lets Developers and AI Agents Run Python on Remote GPUs and TPUs From the Terminal — Google has released the Colab CLI, a lightweight tool that bridges local terminals and remote Colab runtimes — provisioning GPUs and TPUs instantly, running scripts remotely, and recovering artifacts seamlessly. Ships with a prepackaged Colab skill file so AI agents have built-in context on how to use it. Antigravity can now fine-tune Gemma 3-1B via QLoRA on a Colab GPU in a single terminal command.

Google Research Adds Agentic RAG to Gemini Enterprise: Multi-Hop Queries, +34% Factuality, 90.1% Cross-Corpus Accuracy — Google Research has shipped an Agentic RAG framework into Gemini Enterprise — a multi-agent workflow that plans, rewrites, and re-searches until context is complete before answering. Compared to standard RAG, it raises factuality accuracy by up to 34% and hits 90.1% accuracy on cross-corpus retrieval tasks. The "Sufficient Context Agent" simply won't answer until it has enough information. No more hallucinated citations from incomplete retrieval.
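The "search until context is sufficient, otherwise do not answer" idea can be illustrated with a toy loop. The sketch below is not Google's implementation: the corpus, the substring-based `search`, the keyword-based sufficiency check, and every function name are assumptions made up for illustration, and a single loop stands in for what the article describes as a multi-agent workflow.

```python
from typing import Optional

# Toy corpus standing in for separate enterprise document stores (made-up data).
CORPUS = {
    "hr": "Parental leave is 16 weeks for full-time employees.",
    "finance": "Leave longer than 12 weeks requires director approval.",
    "it": "Laptops are refreshed every three years.",
}


def search(query: str) -> list[str]:
    """Hypothetical retriever: return documents containing the query string."""
    return [text for text in CORPUS.values() if query.lower() in text.lower()]


def missing_facts(context: list[str], required: list[str]) -> list[str]:
    """Hypothetical sufficiency check: which required terms are not yet covered."""
    joined = " ".join(context).lower()
    return [term for term in required if term.lower() not in joined]


def gather_context(required: list[str], max_rounds: int = 3) -> Optional[list[str]]:
    """Search, check sufficiency, rewrite the query, and repeat.

    Returns the gathered context once every required term is covered,
    or None (abstain) if the round budget runs out first.
    """
    context: list[str] = []
    query = required[0]
    for _ in range(max_rounds):
        for doc in search(query):
            if doc not in context:
                context.append(doc)
        missing = missing_facts(context, required)
        if not missing:
            return context      # context is sufficient, safe to answer
        query = missing[0]      # rewrite the query around the first gap
    return None                 # still incomplete, so do not answer


print(gather_context(["parental", "approval"]))    # spans two corpora
print(gather_context(["parental", "sabbatical"]))  # abstains
```

The first call needs two rounds because the two facts live in different documents; the second never finds "sabbatical" and returns `None` rather than answering from partial context.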

🛠️ More Releases/Updates for AI Devs

A. Google: Merged Gemma 4 MTP (multi-token prediction) into llama.cpp. Developers can now pair MTP with Gemma 4 QAT for a fast, lightweight setup.

B. Kimi.ai: Released Kimi Work, a local AI agent that runs on your desktop and carries out tasks for you. It features a native agent swarm capable of running up to 300 AI agents in parallel on a local machine and pairs with a WebBridge extension to autonomously navigate websites, search, and scroll.

C. RightNow AI: Built Autokernel, an AI agent that writes a single CUDA kernel and self-tunes it to outperform NVIDIA's own libraries. It reportedly reached a 14x speedup on some internal kernels and is reportedly in use at large tech companies.

D. Ideogram: Highlighted Ideogram 4.0 as the Qwen3.6-27B of local text-to-image generation. It has 9.3 billion parameters and uses a Qwen3-VL-8B FP8 text encoder, requiring a 15–18 GB total download.

E. Perplexity: Published new research in collaboration with Harvard on the shift from chat interfaces to autonomous agents like Computer. Findings from a 3-month period show that workers using Computer finish tasks in 87% less time and at 94% lower cost than with Search alone, with higher satisfaction.
