Here is today’s AI Dev Brief from Marktechpost, covering core research, models, infrastructure tools, and applied updates for AI developers and researchers.
FlashLabs Researchers Release Chroma 1.0: A 4B Real Time Speech Dialogue Model With Personalized Voice Cloning
Chroma 1.0 is a 4B parameter real time speech to speech dialogue model that takes audio as input and outputs audio while preserving speaker identity over multi turn conversations. The system removes the usual ASR plus LLM plus TTS cascade and operates directly on discrete codec tokens. A frozen Qwen based Reasoner handles multimodal understanding and text generation, then a 1B LLaMA style Backbone, a 100M Chroma Decoder, and a Mimi based codec reconstruct personalized speech using 8 RVQ codebooks and an interleaved 1 to 2 text to audio token schedule. Chroma reaches a Speaker Similarity score of 0.81 on SEED TTS EVAL at 24 kHz, about 11 percent better than the human baseline, and runs with a Real Time Factor of 0.43, more than 2 times faster than real time, while remaining competitive on URO Bench dialogue tasks. Read the full analysis/article here.
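To make the interleaved generation schedule concrete, here is a minimal Python sketch of a 1 to 2 text to audio interleaving over 8 codebook frames. The function names, token ids, and exact ordering are illustrative assumptions, not the released Chroma implementation.

```python
# Hypothetical sketch of a 1:2 text-to-audio interleaving schedule, as described
# for Chroma 1.0: each text token is followed by two audio steps, and each audio
# step carries one token per RVQ codebook (8 codebooks here). Names and ordering
# are assumptions for illustration, not the official implementation.
from typing import List, Sequence, Tuple

NUM_CODEBOOKS = 8    # Mimi-style RVQ codebooks reported for Chroma
AUDIO_PER_TEXT = 2   # the 1-to-2 text-to-audio ratio from the article

def interleave(text_tokens: Sequence[int],
               audio_frames: Sequence[Sequence[int]]) -> List[Tuple]:
    """Merge a text stream and an audio-frame stream into one decoding order.

    Each audio frame is a list of NUM_CODEBOOKS codec token ids.
    Returns a flat schedule of ("text", id) and ("audio", frame) entries.
    """
    schedule, a = [], 0
    for t in text_tokens:
        schedule.append(("text", t))
        for _ in range(AUDIO_PER_TEXT):
            if a < len(audio_frames):
                frame = audio_frames[a]
                assert len(frame) == NUM_CODEBOOKS
                schedule.append(("audio", tuple(frame)))
                a += 1
    # Flush any remaining audio frames after the text stream ends.
    schedule.extend(("audio", tuple(f)) for f in audio_frames[a:])
    return schedule

if __name__ == "__main__":
    text = [101, 102, 103]                           # toy text token ids
    audio = [[i] * NUM_CODEBOOKS for i in range(6)]  # toy codec frames
    for entry in interleave(text, audio):
        print(entry)
```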
Microsoft Releases VibeVoice-ASR: A Unified Speech-to-Text Model Designed to Handle 60-Minute Long-Form Audio in a Single Pass
Microsoft VibeVoice ASR is a unified speech to text model for 60 minute audio that runs in a single pass within a 64K token context window. It jointly performs ASR, diarization, and timestamping and returns structured transcripts that specify who spoke, when they spoke, and what they said. The model supports Customized Hotwords so you can inject product names, technical terms, or organization specific phrases at inference time to improve recognition without retraining. VibeVoice ASR targets meeting style and conversational scenarios and is evaluated with metrics such as DER, cpWER, and tcpWER. The model is released under the MIT license as microsoft/VibeVoice-ASR with official weights, fine tuning scripts, and an online Playground. This provides a single component for long context speech understanding that integrates cleanly into meeting assistants, analytics tools, and transcription pipelines. Read the full analysis/article here.
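As a concrete picture of the structured output described above, here is a small Python sketch of a diarized transcript with speaker labels, timestamps, and text, plus a hotword list passed alongside the audio. The classes and field names are assumptions for illustration, not the model's actual schema or API.

```python
# Illustrative sketch of the kind of structured, diarized output VibeVoice-ASR is
# described as returning (who spoke, when, and what), plus a hotword list that
# could be supplied at inference time. All names here are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    speaker: str    # diarization label, e.g. "SPEAKER_00"
    start_s: float  # segment start time in seconds
    end_s: float    # segment end time in seconds
    text: str       # recognized transcript for this segment

@dataclass
class TranscriptRequest:
    audio_path: str
    hotwords: List[str]  # e.g. product names or org-specific terms

def render(segments: List[Segment]) -> str:
    """Format segments into a simple 'who said what, when' transcript."""
    return "\n".join(
        f"[{s.start_s:8.2f}-{s.end_s:8.2f}] {s.speaker}: {s.text}"
        for s in segments
    )

if __name__ == "__main__":
    req = TranscriptRequest("meeting.wav", hotwords=["VibeVoice", "cpWER"])
    demo = [
        Segment("SPEAKER_00", 0.00, 4.20, "Let's review the quarterly metrics."),
        Segment("SPEAKER_01", 4.20, 9.75, "Diarization error rate is down this sprint."),
    ]
    print(render(demo))
```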
Inworld AI Releases TTS-1.5 For Realtime, Production Grade Voice Agents
Inworld AI releases Inworld TTS 1.5, a production grade text to speech system for realtime voice agents. The model targets strict latency, quality, and cost constraints. P90 time to first audio is under 250 ms for the Max variant and under 130 ms for the Mini variant, about 4 times faster than the prior generation. Expression improves by about 30 percent, and stability improves with a word error rate about 40 percent lower. Pricing is about 5 dollars per 1 million characters for Mini and 10 dollars per 1 million characters for Max, which is significantly below many competing systems. TTS 1.5 supports 15 languages, provides instant and professional voice cloning, and is available as a cloud API. Read the full analysis/article here.
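For a rough sense of what the quoted pricing means for a deployed agent, here is a short arithmetic sketch using the per character prices above; the workload figures are made up for illustration.

```python
# Back-of-the-envelope cost sketch using the per-character pricing quoted in the
# article ($5 per 1M characters for TTS-1.5 Mini, $10 per 1M for Max). The
# workload numbers below are assumed for illustration only.
PRICE_PER_MILLION_CHARS = {"mini": 5.00, "max": 10.00}  # USD, from the article

def monthly_cost(variant: str, responses_per_day: int, avg_chars: int,
                 days: int = 30) -> float:
    """Estimate monthly synthesis cost for a voice-agent workload."""
    total_chars = responses_per_day * avg_chars * days
    return total_chars / 1_000_000 * PRICE_PER_MILLION_CHARS[variant]

if __name__ == "__main__":
    # Hypothetical agent: 10,000 responses/day, ~200 characters per response.
    for variant in ("mini", "max"):
        cost = monthly_cost(variant, responses_per_day=10_000, avg_chars=200)
        print(f"TTS-1.5 {variant}: ~${cost:,.2f} per month")
```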
Project Notebooks/Tutorials
▶ [Open Source] Rogue: An Open-Source AI Agent Evaluator worth trying Codes & Examples
▶ How AutoGluon Enables Modern AutoML Pipelines for Production-Grade Tabular Models with Ensembling and Distillation Codes Tutorial
▶ How to Design an Autonomous Multi-Agent Data and Infrastructure Strategy System Using Lightweight Qwen Models for Efficient Pipeline Intelligence? Codes Tutorial
▶ How to Build a Fully Functional Computer-Use Agent that Thinks, Plans, and Executes Virtual Actions Using Local AI Models Codes Tutorial
▶ A Coding Implementation of a Comprehensive Enterprise AI Benchmarking Framework to Evaluate Rule-Based LLM, and Hybrid Agentic AI Systems Across Real-World Tasks Codes Tutorial