Welcome back. The past four days were unusually loud on the release front — six launches stood out, spanning open-weight speech, on-device agents, encoder-free multimodal models, physical-AI world models, and a focused coding model. Here is the short, signal-only version, with a link to the full breakdown on each.

In this issue

  • MisoTTS — an 8B open-weights emotive text-to-speech model from Miso Labs

  • OpenJarvis — a local-first framework for on-device personal AI agents (Stanford)

  • Gemma 4 12B — Google DeepMind’s encoder-free multimodal model that runs on a 16 GB laptop

  • Cosmos 3 — NVIDIA’s open world model unifying physical reasoning, world generation, and action

  • Qwen3.7-Plus — Alibaba’s multimodal agent model lands on the Bailian platform

  • Mellum2 — JetBrains’ 12B MoE “focal model” built for fast, specialized coding tasks

1. MisoTTS — Emotive speech, open on day one

Miso Labs open-sourced an 8B text-to-speech model that conditions on both text and prior audio, so the output tracks the speaker’s tone instead of reading flat. Its trick is residual vector quantization (32 codebooks over 2,048-way vectors): each codebook encodes what the previous ones missed, which widens the range of sounds the model can represent without growing the model itself. Weights ship under a modified MIT license.

  • Why it matters: open weights on day one, with one-shot voice cloning from a ~10s clip

  • Architecture: a 7.7B backbone over time plus a 300M decoder over depth

  • Claimed latency: ~110 ms (vs. the 700 ms for ElevenLabs and 300 ms for Sesame that Miso cites)

  • Caveat: half-duplex and single-turn today; API access still pending
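
To make the residual vector quantization idea concrete, here is a minimal sketch. It is not Miso's implementation: the codebooks are toy-sized and hand-written, the function names are hypothetical, and "2,048-way" is read here as 2,048 entries per codebook. Each stage quantizes whatever the previous stages left over, so adding codebooks adds detail without touching the model that predicts the codes.

```python
import math


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codebooks, codes):
    """Reconstruct the vector by summing the chosen entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def bits_per_frame(num_codebooks, codebook_size):
    """Bits needed to store one code per codebook for a single frame."""
    return num_codebooks * math.log2(codebook_size)


# Toy example: two stages, two entries each (illustrative values only).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # coarse stage
    [[0.0, 0.0], [0.5, 0.0]],   # fine stage, applied to the residual
]
toy_codes = rvq_encode(toy_codebooks, [1.4, 1.0])
toy_recon = rvq_decode(toy_codebooks, toy_codes)
```

Under that reading of the numbers, `bits_per_frame(32, 2048)` works out to 32 × 11 = 352 bits per frame.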

2. OpenJarvis — Personal AI that runs on your hardware

Researchers at Stanford and Lambda Labs released OpenJarvis, an open-source framework that runs inference, agents, memory, and learning entirely on-device. Configured well, local open-weight models land within 3.2 percentage points of the best cloud model — at roughly 800x lower marginal API cost and about 4x lower latency under the paper’s protocol.

  • The idea: a typed “spec” decomposes a personal AI into five swappable primitives — Intelligence, Engine, Agents, Tools & Memory, and Learning

  • Key method: LLM-guided spec search uses a cloud model as a search-time teacher, then runs locally with zero cloud calls

  • Reach: 25+ data connectors, 32+ messaging channels, and one-command install (~3 minutes)

  • License: Apache 2.0

  • Where it falls short: the remaining gap to cloud models is concentrated on reasoning- and research-heavy tasks
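
One way to picture a typed spec with five swappable primitives is a frozen dataclass. This is an illustrative sketch only, not OpenJarvis's actual API: the class name, field types, and every value below are hypothetical placeholders.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AssistantSpec:
    """Hypothetical spec; fields mirror the five primitives named above."""
    intelligence: str          # which local model answers
    engine: str                # which runtime serves that model
    agents: tuple              # agent loops layered on top
    tools_and_memory: tuple    # connectors and stores the agents can use
    learning: str              # how the setup adapts over time


# Placeholder values, not real component names.
baseline = AssistantSpec(
    intelligence="example-local-model",
    engine="example-runtime",
    agents=("example-agent",),
    tools_and_memory=("example-connector", "example-store"),
    learning="none",
)

# Swapping one primitive yields a new candidate; a spec search would
# score many such candidates and keep the best one.
candidate = replace(baseline, engine="another-runtime")
```

Because the spec is immutable, each candidate is a distinct value, which keeps a search over configurations easy to log and compare.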

3. Gemma 4 12B — Multimodal, encoder-free, and laptop-ready

Google DeepMind’s new 12B dense model drops separate vision and audio encoders entirely — images and audio feed straight into the LLM backbone. The payoff is a multimodal model that runs agentic workflows locally on a 16 GB laptop, shipped under Apache 2.0.

  • Design: a 35M vision embedder (single matmul + X/Y position lookup) replaces a 550M encoder; raw 16 kHz audio is projected directly

  • First mid-sized Gemma with native audio, and it adds video too

  • Performance: Google reports it nears the 26B MoE at under half the memory footprint

  • Ecosystem: weights on Hugging Face and Kaggle; works with llama.cpp, MLX, vLLM, Ollama, and LM Studio
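
The "single matmul + X/Y position lookup" design can be sketched in a few lines. This is one plausible reading of that bullet, not Google's code: the function name, the tiny dimensions, and all values are assumptions chosen so the arithmetic is easy to follow.

```python
def embed_patch(patch, weight, x_table, y_table, col, row):
    """Project one flattened image patch, then add X and Y position vectors.

    patch:   flattened pixel values, length P
    weight:  P x D projection matrix (the single matmul)
    x_table: one D-length vector per patch column
    y_table: one D-length vector per patch row
    """
    dim = len(weight[0])
    projected = [
        sum(patch[i] * weight[i][j] for i in range(len(patch)))
        for j in range(dim)
    ]
    return [p + x + y for p, x, y in zip(projected, x_table[col], y_table[row])]


# Toy sizes: 2 pixel values per patch, 3-dimensional embeddings.
toy_weight = [[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]]
toy_x_table = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
toy_y_table = [[0.25, 0.0, 0.0], [0.0, 0.0, 0.0]]

token = embed_patch([1.0, 2.0], toy_weight, toy_x_table, toy_y_table, col=1, row=0)
```

The point of the sketch is the cost profile: one projection and two table lookups per patch, with no stack of encoder layers in front of the LLM.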

4. NVIDIA Cosmos 3 — One model for reasoning, world, and action

NVIDIA’s open family of “world models” for physical AI unifies three jobs earlier Cosmos releases kept separate: physical reasoning, world generation, and action generation. A two-tower Mixture-of-Transformers pairs an autoregressive VLM “reasoner” with a diffusion “generator,” conditioned one-way from reasoner to generator.

  • Who it’s for: robotics, autonomous vehicles, and warehouse monitoring teams

  • Two checkpoints ship now: Cosmos3-Nano (16B) for workstations and Cosmos3-Super (64B) for datacenters

  • Fully open: checkpoints, six SDG datasets, training recipes, and the HUE benchmark under OpenMDW-1.1

  • Claims: open-source SOTA on R-Bench and leading Artificial Analysis text-to-image and image-to-video results

5. Qwen3.7-Plus — Alibaba’s multimodal agent goes live

Alibaba’s Qwen team made Qwen3.7-Plus available via API on the Bailian platform (Model Studio internationally). It’s the multimodal half of the 3.7 family — it understands images and video (input, not generation) — and leans hard into agentic behavior.

  • New abilities: deep reasoning, self-programming, tool invocation, verification and testing, and autonomous iteration

  • Signal: the preview ranked #16 in Vision Arena, placing Alibaba as the #5 lab in vision

  • Platform extras: an Agentic RL loop and built-in safety guardrails for autonomous tool use

6. JetBrains Mellum2 — A fast “focal model” for coding pipelines

JetBrains open-sourced Mellum2 under Apache 2.0 — a 12B Mixture-of-Experts model with just 2.5B active parameters per token. It’s deliberately not a frontier replacement; JetBrains frames it as a “focal model”: a fast, specialized component inside larger AI systems.

  • Efficiency: 8 of 64 experts fire per token, so per-token compute is roughly that of a 2.5B dense model

  • Built for: routing, low-latency RAG, and sub-agents in multi-model pipelines

  • Specs: 131K context, an MTP head for built-in speculative decoding, and six released checkpoints (Instruct and Thinking)
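
The "8 of 64 experts" figure refers to top-k routing, sketched below in generic form. Whether Mellum2 normalizes the selected scores with a softmax is an assumption here, and `top_k_route` and the example logits are hypothetical.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    chosen = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:k]
    peak = max(router_logits[i] for i in chosen)   # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


# 64 made-up router scores for one token; only 8 experts are run.
logits = [((i * 37) % 64) / 8.0 for i in range(64)]
routes = top_k_route(logits, k=8)

expert_fraction = 8 / 64        # share of experts used per token
param_fraction = 2.5 / 12       # share of parameters active per token
```

The two fractions differ (12.5% of experts versus about 21% of parameters), presumably because non-expert layers such as attention and embeddings run for every token.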

The throughline

Two themes tie these releases together: openness is becoming the default for serious launches (five of the six here ship open weights or open source, with Qwen3.7-Plus the API-only exception), and capable models keep shrinking onto local hardware. From an 8B speech model to an encoder-free multimodal model on a 16 GB laptop to a full on-device agent framework, the center of gravity is moving toward what you can run yourself.

Forward this to a teammate who’d want the signal without the scroll. See you in the next issue.

— The MarkTechPost team
