6 Major AI Model Drops: Local Agents, Open Speech, and Cosmos 3

Welcome back. The past four days were unusually loud on the release front — six launches stood out, spanning open-weight speech, on-device agents, encoder-free multimodal models, physical-AI world models, and a focused coding model. Here is the short, signal-only version, with a link to the full breakdown on each.

In this issue

MisoTTS — an 8B open-weights emotive text-to-speech model from Miso Labs
OpenJarvis — a local-first framework for on-device personal AI agents (Stanford)
Gemma 4 12B — Google DeepMind’s encoder-free multimodal model that runs on a 16 GB laptop
Cosmos 3 — NVIDIA’s open world model unifying physical reasoning, world generation, and action
Qwen3.7-Plus — Alibaba’s multimodal agent model lands on the Bailian platform
Mellum2 — JetBrains’ 12B MoE “focal model” built for fast, specialized coding tasks

1. MisoTTS — Emotive speech, open on day one

Miso Labs open-sourced an 8B text-to-speech model that conditions on both text and prior audio, so the output tracks the speaker’s tone instead of reading flat. Its trick is residual vector quantization — 32 codebooks over 2,048-way vectors — which scales the sonic range without growing the model. Weights ship under a modified MIT license.

Why it matters: open weights on day one, with one-shot voice cloning from a ~10s clip
Architecture: a 7.7B backbone over time plus a 300M decoder over depth
Claimed latency: ~110ms (vs. Miso’s cited 700ms for ElevenLabs, 300ms for Sesame)
Caveat: half-duplex and single-turn today; API access still pending
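
For readers new to residual vector quantization, here is a minimal sketch of the general technique, not MisoTTS's actual code. It assumes that "32 codebooks over 2,048-way vectors" means 32 quantizer stages with 2,048 entries each; the toy codebooks, sizes, and function names are hypothetical. Each stage codes whatever the previous stages left over, so expressive capacity grows with the number of stages rather than with the size of the backbone.

```python
import math
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize stage by stage; each stage codes the previous residual."""
    residual = list(vec)
    codes = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the chosen entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

def bits_per_frame(num_codebooks, codebook_size):
    """Bits needed to store one frame's codes."""
    return num_codebooks * math.log2(codebook_size)

# Toy setup (hypothetical sizes): 3 stages, 8 entries each, 4-dim frames.
rng = random.Random(0)
DIM, STAGES, SIZE = 4, 3, 8
toy_codebooks = [
    [[rng.gauss(0, 1.0 / (2 ** s)) for _ in range(DIM)] for _ in range(SIZE)]
    for s in range(STAGES)
]
frame = [rng.gauss(0, 1) for _ in range(DIM)]
codes = rvq_encode(frame, toy_codebooks)
recon = rvq_decode(codes, toy_codebooks)
print(codes, bits_per_frame(32, 2048))
```

Under the assumed reading, one frame costs 32 × 11 = 352 bits, while the number of distinct frames it can represent is 2,048 raised to the 32nd power.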

Read the full breakdown →

2. OpenJarvis — Personal AI that runs on your hardware

Researchers at Stanford and Lambda Labs released OpenJarvis, an open-source framework that runs inference, agents, memory, and learning entirely on-device. Configured well, local open-weight models land within 3.2 percentage points of the best cloud model — at roughly 800x lower marginal API cost and about 4x lower latency under the paper’s protocol.

The idea: a typed “spec” decomposes a personal AI into five swappable primitives — Intelligence, Engine, Agents, Tools & Memory, and Learning
Key method: LLM-guided spec search uses a cloud model as a search-time teacher, then runs locally with zero cloud calls
Reach: 25+ data connectors, 32+ messaging channels, and one-command install (~3 minutes)
License: Apache 2.0
Caveat: the remaining gap to cloud models is concentrated on reasoning- and research-heavy tasks
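
To make the "typed spec" idea concrete, here is a rough sketch, not OpenJarvis's real API. The class name, field names, option strings, and the scoring function are all hypothetical. It also swaps the paper's LLM-guided search for a plain exhaustive search over two of the five primitives, with a stub standing in for the search-time teacher.

```python
from dataclasses import dataclass, replace
from itertools import product
from typing import Callable, Sequence, Tuple

@dataclass(frozen=True)
class PersonalAISpec:
    """Hypothetical spec with one field per primitive named in the summary."""
    intelligence: str                  # which local model to use
    engine: str                        # which inference runtime serves it
    agents: Tuple[str, ...]            # agent roles
    tools_and_memory: Tuple[str, ...]  # connectors and memory stores
    learning: str                      # adaptation strategy

def search_specs(
    base: PersonalAISpec,
    intelligence_options: Sequence[str],
    engine_options: Sequence[str],
    score: Callable[[PersonalAISpec], float],
) -> PersonalAISpec:
    """Try every model/engine pairing and keep the best-scoring spec."""
    candidates = [
        replace(base, intelligence=i, engine=e)
        for i, e in product(intelligence_options, engine_options)
    ]
    return max(candidates, key=score)

def toy_score(spec: PersonalAISpec) -> float:
    """Stub for the search-time teacher; the numbers are made up."""
    quality = {"model-small": 0.6, "model-large": 0.8}[spec.intelligence]
    bonus = 0.05 if spec.engine == "engine-fast" else 0.0
    return quality + bonus

base = PersonalAISpec("model-small", "engine-basic", ("planner",), ("notes",), "none")
best = search_specs(
    base,
    ["model-small", "model-large"],
    ["engine-basic", "engine-fast"],
    toy_score,
)
print(best)
```

Because the spec is an immutable value, swapping one primitive yields a new candidate without touching the others, which is what makes a search over configurations straightforward. Once a spec is chosen, nothing in it needs the teacher at run time.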

Read the full breakdown →

3. Gemma 4 12B — Multimodal, encoder-free, and laptop-ready

Google DeepMind’s new 12B dense model drops separate vision and audio encoders entirely — images and audio feed straight into the LLM backbone. The payoff is a multimodal model that runs agentic workflows locally on a 16 GB laptop, shipped under Apache 2.0.

Design: a 35M vision embedder (single matmul + X/Y position lookup) replaces a 550M encoder; raw 16 kHz audio is projected directly
Modalities: the first mid-sized Gemma with native audio, and it adds video too
Performance: Google reports it nears the 26B MoE at under half the memory footprint
Ecosystem: weights on Hugging Face and Kaggle; works with llama.cpp, MLX, vLLM, Ollama, and LM Studio
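
The "single matmul + X/Y position lookup" design can be sketched in a few lines. This illustrates the general pattern only, not Gemma's implementation: the function names and toy numbers are hypothetical, and adding the position vectors (rather than combining them some other way) is an assumption.

```python
def embed_patch(pixels, weight, x_table, y_table, x, y):
    """Embed one flattened image patch.

    pixels:  flattened patch values, length patch_dim
    weight:  patch_dim x d_model projection matrix (the single matmul)
    x_table: learned embeddings indexed by patch column
    y_table: learned embeddings indexed by patch row
    """
    d_model = len(weight[0])
    projected = [
        sum(p * weight[i][j] for i, p in enumerate(pixels))
        for j in range(d_model)
    ]
    # Assumption: position information is added to the projection.
    return [projected[j] + x_table[x][j] + y_table[y][j] for j in range(d_model)]

def embedder_params(patch_dim, d_model, grid_w, grid_h):
    """Parameter count: one projection matrix plus two lookup tables."""
    return patch_dim * d_model + (grid_w + grid_h) * d_model

# Toy example: 2-value patches, 3-dim embeddings, a 2x1 grid of patches.
weight = [[1.0, 0.0, 1.0],
          [0.0, 1.0, 1.0]]
x_table = [[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]]
y_table = [[100.0, 0.0, 0.0]]
token = embed_patch([1.0, 2.0], weight, x_table, y_table, x=1, y=0)
print(token, embedder_params(2, 3, 2, 1))
```

There are no attention layers or stacked blocks here, which is why an embedder of this shape can be far smaller than a full vision encoder; the LLM backbone does the rest of the work.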

Read the full breakdown →

4. NVIDIA Cosmos 3 — One model for reasoning, world, and action

NVIDIA’s open family of “world models” for physical AI unifies three jobs earlier Cosmos releases kept separate: physical reasoning, world generation, and action generation. A two-tower Mixture-of-Transformers pairs an autoregressive VLM “reasoner” with a diffusion “generator,” conditioned one-way from reasoner to generator.

Who it’s for: robotics, autonomous vehicles, and warehouse monitoring teams
Two checkpoints ship now: Cosmos3-Nano (16B) for workstations and Cosmos3-Super (64B) for datacenters
Fully open: checkpoints, six SDG datasets, training recipes, and the HUE benchmark under OpenMDW-1.1
Claims: open-source SOTA on R-Bench and leading Artificial Analysis text-to-image and image-to-video results

Read the full breakdown →

5. Qwen3.7-Plus — Alibaba’s multimodal agent goes live

Alibaba’s Qwen team made Qwen3.7-Plus available via API on the Bailian platform (Model Studio internationally). It’s the multimodal half of the 3.7 family — it understands images and video (input, not generation) — and leans hard into agentic behavior.

New abilities: deep reasoning, self-programming, tool invocation, verification and testing, and autonomous iteration
Signal: the preview ranked #16 in Vision Arena, placing Alibaba as the #5 lab in vision
Platform extras: an Agentic RL loop and built-in safety guardrails for autonomous tool use

Read the full breakdown →

6. JetBrains Mellum2 — A fast “focal model” for coding pipelines

JetBrains open-sourced Mellum2 under Apache 2.0 — a 12B Mixture-of-Experts model with just 2.5B active parameters per token. It’s deliberately not a frontier replacement; JetBrains frames it as a “focal model”: a fast, specialized component inside larger AI systems.

Efficiency: 8 of 64 experts fire per token, so per-token compute matches a 2.5B dense model
Built for: routing, low-latency RAG, and sub-agents in multi-model pipelines
Specs: 131K context, an MTP head for built-in speculative decoding, and six released checkpoints (Instruct and Thinking)
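
The "8 of 64 experts fire per token" claim describes top-k routing. The sketch below shows the general mechanism under stated assumptions, not Mellum2's code: the toy experts are plain scalar functions, the router logits are invented, and renormalizing the gate weights with a softmax over the selected experts is one common choice among several.

```python
import math

def top_k_route(logits, k):
    """Pick the k highest-scoring experts and softmax over just those."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_route(router_logits, k)
    return sum(g * experts[i](x) for i, g in gates.items())

# Toy layer: 64 scalar "experts", of which 8 run for this token.
NUM_EXPERTS, TOP_K = 64, 8
experts = [lambda x, s=i: s * x for i in range(NUM_EXPERTS)]
router_logits = [float(i) for i in range(NUM_EXPERTS)]  # made-up scores
gates = top_k_route(router_logits, TOP_K)
output = moe_forward(1.0, experts, router_logits, TOP_K)
print(sorted(gates), TOP_K / NUM_EXPERTS)
```

Only 8 of the 64 expert functions are ever called, so expert compute scales with k rather than with the total expert count. The 2.5B-of-12B active ratio is higher than 8/64, which would be consistent with always-on shared layers such as attention and embeddings, though the summary does not break that down.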

Read the full breakdown →

The throughline

Two themes tie this week together: open weights are now the default for serious releases, and capable models keep shrinking onto local hardware. From an 8B speech model to an encoder-free multimodal model on a 16 GB laptop to a full on-device agent framework, the center of gravity is moving toward what you can run yourself.

Forward this to a teammate who’d want the signal without the scroll. See you in the next issue.

— The MarkTechPost team
