Here is today’s AI Dev Brief from Marktechpost, covering core research, models, infrastructure tools, and applied updates for AI developers and researchers.
NVIDIA AI Open-Sourced KVzap: A SOTA KV Cache Pruning Method that Delivers Near-Lossless 2x-4x Compression
KVzap is a learned KV cache pruning module designed for long-context LLMs that operate at sequence lengths in the 100k-token range. KVzap trains small surrogate models on hidden states to approximate KVzip+ oracle scores, using data derived from Nemotron pretraining prompts to learn per-head importance estimates for each token. At inference, KVzap applies a global score threshold together with a fixed 128-token sliding window, so the most recent tokens are kept untouched while low-impact entries are pruned from the KV cache. This yields roughly 2x to 4x compression on models such as Qwen3 8B, Llama 3.1 8B Instruct, and Qwen3 32B with minimal accuracy loss on RULER, LongBench, and AIME25, while adding at most around 1.1 percent FLOPs per layer and integrating cleanly into the open-source KVpress framework. Read the full analysis/article here.
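To make the eviction step concrete, here is a minimal sketch of threshold-plus-recency-window pruning. The function name, tensor shapes, and the pre-computed per-token scores are illustrative assumptions, not the KVpress API; a real implementation prunes each head independently inside a paged cache.

```python
import torch

def prune_kv_cache(keys, values, scores, threshold, window=128):
    """Illustrative sketch of score-threshold KV pruning with a recency window.

    keys, values: [batch, heads, seq_len, head_dim] cache tensors
    scores:       [batch, heads, seq_len] per-token importance estimates
                  (in KVzap these would come from the learned surrogate scorer)
    threshold:    global score cutoff; entries below it are candidates for eviction
    window:       number of most recent tokens that are always kept
    """
    seq_len = keys.shape[2]
    keep = scores >= threshold                    # low-impact entries are dropped
    keep[..., max(seq_len - window, 0):] = True   # recent tokens stay untouched

    # For readability, apply head 0's mask to the whole cache; the actual
    # method keeps a different set of positions per head.
    idx = keep[0, 0].nonzero(as_tuple=True)[0]
    return keys[:, :, idx], values[:, :, idx]
```

The recency window is the stabilizer here: the learned scores only decide the fate of older entries, so decoding quality on the immediate context is never at risk.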

Google AI Releases TranslateGemma: A New Family of Open Translation Models Built on Gemma 3 with Support for 55 Languages
TranslateGemma is Google AI’s new family of open translation models built on Gemma 3, released in 4B, 12B, and 27B sizes and covering 55 languages. The models specialize Gemma 3 for translation using supervised fine-tuning on Gemini-generated synthetic parallel data combined with human corpora, followed by reinforcement learning driven by translation-specific reward models. Benchmarks on WMT24++ show consistent gains over the corresponding Gemma 3 baselines, with the 12B TranslateGemma surpassing the 27B Gemma 3 model and the 4B variant reaching quality similar to the 12B baseline. The models retain Gemma 3’s multimodal capabilities and are designed to run on resource-constrained hardware such as laptops and modest cloud setups. TranslateGemma is available as open weights on Hugging Face and Vertex AI. Read the full analysis/article here.
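Since the weights ship as standard open checkpoints, a plain Hugging Face transformers workflow should be enough to try them. The repo id and the prompt wording below are assumptions for illustration, not confirmed names from the release; check the official model card for the exact ids and recommended prompt format.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/translategemma-4b-it"   # hypothetical repo id, verify on the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Ask for a translation via the chat template (prompt phrasing is an assumption)
messages = [{"role": "user",
             "content": "Translate to German: The weather is lovely today."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```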

DeepSeek AI Researchers Introduce Engram: A Conditional Memory Axis For Sparse LLMs
Engram is a conditional memory module that adds a second sparsity axis alongside Mixture of Experts in large language models. Engram uses hashed N-gram embeddings with deterministic lookup, so frequent phrases and entities are retrieved from a memory table while the Transformer backbone focuses on reasoning. Under a fixed parameter and FLOPs budget, reallocating around 20 to 25 percent of sparse capacity from experts into Engram memory improves validation loss and downstream benchmarks. Engram 27B and Engram 40B outperform a MoE 27B baseline on language modeling, knowledge, reasoning, code, and math with the same 3.8B activated parameters. Long-context extension to 32,768 tokens shows clear gains on RULER and retrieval-style tasks. A nano vLLM prototype also shows that a 100B-parameter Engram table in host memory adds only a small throughput cost. Read the full analysis/article here.
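As a rough picture of the memory axis, the sketch below hashes each n-gram of token ids into a fixed-size embedding table with a deterministic, router-free lookup. The bucket count, n-gram order, and rolling hash are assumptions; Engram's actual table construction and how its outputs are fused into the backbone's hidden states may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashedNGramMemory(nn.Module):
    """Sketch of a hashed n-gram memory table with deterministic lookup,
    in the spirit of Engram's conditional memory axis (details assumed)."""

    def __init__(self, num_buckets: int, d_model: int, n: int = 2):
        super().__init__()
        self.n = n
        self.num_buckets = num_buckets
        self.table = nn.Embedding(num_buckets, d_model)

    def bucket_ids(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Hash the trailing n-gram ending at each position into a bucket.
        # A simple polynomial rolling hash stands in for the real scheme.
        ids = token_ids.long()
        padded = F.pad(ids, (self.n - 1, 0))            # left-pad so early positions still hash
        h = torch.zeros_like(ids)
        for k in range(self.n):
            h = (h * 1000003 + padded[..., k:k + ids.shape[-1]]) % self.num_buckets
        return h

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # The retrieved vectors would be added to the backbone's hidden states,
        # letting the Transformer layers concentrate on reasoning.
        return self.table(self.bucket_ids(token_ids))
```

Because the lookup is a pure function of the token ids, there is no learned router on this axis: capacity scales with table size while the per-token compute stays flat, which is what makes a host-memory table of this kind cheap to serve.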
Project Notebooks/Tutorials
▶ [Open Source] Rogue: An Open-Source AI Agent Evaluator worth trying Codes & Examples
▶ How to Build a Stateless, Secure, and Asynchronous MCP-Style Protocol for Scalable Agent Workflows Codes Tutorial
▶ How to Design a Gemini-Powered Self-Correcting Multi-Agent AI System with Semantic Routing, Symbolic Guardrails, and Reflexive Orchestration Codes Tutorial
▶ How to Design a Fully Local Agentic Storytelling Pipeline Using Griptape Workflows, Hugging Face Models, and Modular Creative Task Orchestration Codes Tutorial
▶ A Coding Guide to Build a Procedural Memory Agent That Learns, Stores, Retrieves, and Reuses Skills as Neural Modules Over Time Codes Tutorial
▶ How to Build an Adaptive Meta-Reasoning Agent That Dynamically Chooses Between Fast, Deep, and Tool-Based Thinking Strategies Codes Tutorial