
Google Open Source Blog

The latest news from Google on open source releases, major projects, events, and outreach programs for early career developers.


This Week in Open Source #12

Friday, January 9, 2026

by Daryl Ducharme & amanda casari, Google Open Source



This Week in Open Source for January 9, 2026

A look around the world of open source


Here we are at the beginning of a new year. What will it bring to the open source world? What new projects will be started? What should we be focusing on? What is your open source resolution for 2026? One of ours is to better connect with various open source communities on social media. We've gotten off to a big start by launching an official Google Open Source account on Bluesky. Already, we are enjoying the community there.


Upcoming Events

January 21 - 23: Everything Open 2026 is happening in Canberra, Australia. Everything Open is a conference focused on open technologies, including Linux, open source software, open hardware and open data, and the communities that surround them. The conference provides technical deep-dives as well as updates from industry leaders and experts on a wide array of topics from these areas.

January 29: CHAOSScon Europe 2026 is co-located with FOSDEM in Brussels, Belgium. This conference revolves around discussing open source project health, CHAOSS updates, use cases, and hands-on workshops for developers, community managers, project managers, and anyone interested in measuring open source project health. It also shares insights from the CHAOSS context working groups including OSPOs, University Open Source, and Open Source in Science and Research.

January 31 - February 1: FOSDEM 2026 is happening at the Université Libre de Bruxelles in Brussels, Belgium. It is a free event for software developers to meet, share ideas and collaborate. Every year, thousands of developers of free and open source software from all over the world gather at the event in Brussels.

February 24 - 25: The Linux Foundation Member Summit is happening in Napa, California. It is the annual gathering for Linux Foundation members that fosters collaboration, innovation, and partnerships among the leading projects and organizations working to drive digital transformation with open source technologies.

Open Source Reads and Links

[Talk] State of the Source at ATO 2025: State of the "Open" AI - At the end of last year, the Open Source Initiative published a summary of Gabriel Toscano's talk at All Things Open. In the talk he discusses how AI models are labeled "open" while often lacking the legal or technical freedoms that true open source requires. An analysis of ~20,000 Hugging Face models found that Apache 2.0 and MIT licenses are common, but many models ship with no license at all or under restrictive custom terms. The study warns that inconsistent labeling and mutable restrictions muddy the meaning of openness, and urges clearer licensing and platform-level checks.

[Article] The Reality of Open Source: More Puppies, Less Beer - Bitnami's removal of popular containers last year shows that open source can suddenly change and disrupt users. Organizations must evaluate who funds and maintains each open source component, not just the code. Plan for business continuity, supply-chain visibility, and the ability to fork or replace critical components.

[Blog] The Open Source Community and U.S. Public Policy - The Open Source Initiative is increasing its U.S. policy work to ensure open source developers are part of technology and AI rulemaking. Since policymakers often lack deep knowledge of open source, the community must explain how shared code differs from deployed systems. Joining groups like the Open Policy Alliance helps nonprofits engage and influence policy.

[Article] Pebble, the e-ink smartwatch that refuses to die, just went fully open source - Pebble, the e-ink smartwatch with a tumultuous history, is making a move sure to please the DIY enthusiasts who make up the bulk of its fans: its entire software stack is now fully open source, and key hardware design files are available too.

[Article] Forget Predictions: Tech Leaders' Actual 2026 Resolutions - We want to know your open source resolutions, and perhaps these resolutions from tech leaders (open source and otherwise) can point you in a direction. Their plans run the gamut from securing and managing AI responsibly to reducing noise in security data and building healthier tech habits. The common theme is intentional, measurable change over speculation.

[Paper] Everything is Context: Agentic File System Abstraction for Context Engineering - GenAI systems may produce inaccurate or misleading outputs due to limited contextual awareness and evolving data sources. Mechanisms are therefore needed to govern how persistent knowledge transitions into bounded context in a traceable, verifiable, and human-aware manner, ensuring that human judgment and knowledge remain embedded in the system's evolving context for reasoning and evaluation.


The paper proposes using a file-system abstraction based on the open-source AIGNE framework to manage all types of context for generative AI agents. This unified infrastructure makes context persistent, traceable, and governed so agents can read, write, and version memory, tools, and human input.

What exciting open source events and news are you hearing about? Let us know on our @GoogleOSS X account or our new @opensource.google Bluesky account.


Training Marin 32B: What an open lab can build with TPUs, JAX, and a little persistence

Thursday, December 18, 2025

by David Hall, Open Athena



Last summer, we partnered with Google to share how Marin trained a fully open 8B foundation model using JAX and TPUs. Since then, our process hasn't changed much, but the scale has. Over the summer, we trained a 32B model entirely in the open, and most days there was just one person keeping the run moving.


Large-scale training is usually associated with big teams and bigger infrastructure. Large model releases typically have hundreds of authors. Marin tests a different hypothesis: using open source software and data, small teams can train serious foundation models if the tooling is good, the platform is stable, and the process is transparent. The Marin 32B run was our strongest validation yet.


A model built with one hand on the helm

Marin was started at Stanford University's Center for Research on Foundation Models with the goal of building radically open foundation models. In May, we released Marin 8B Base, which bested the popular Llama 3.1 8B Base on 14 of 19 benchmarks. Marin 8B was trained using Google TPU v4 and TPU v5e from the TPU Research Cloud.


Building on that success, we set out to build a 32B model starting in June. Our 32B training run followed Marin's usual "Tootsie Roll" style: start with a solid recipe, instrument heavily, and adapt mid-flight when necessary. That flexibility matters, because the first time you train at a larger scale, issues inevitably arise.


The timing, however, was less than ideal, as universities tend to empty out over the summer. Students graduate, take internships, go home, or travel the world. Marin was no different: by June, our team was down to one full-time research engineer, with a few PhD students providing guidance when they weren't busy with their dissertations. Nevertheless, we pushed forward.


To spoil the ending, the model turned out quite well. On release, Marin 32B Base was the best fully open-source base model, and it outperformed comparable open-weights models like Google's Gemma 3 27B PT on 24 of 42 base-model evaluations.


There were many bumps along the way, resulting in multiple mid-run corrections, but through it all Google's TPU infrastructure stayed rock-solid, and JAX's predictable performance let us iterate quickly. This meant that even with a tiny team, we could diagnose, patch, and continue training without losing momentum.


To be blunt: one researcher kept the 32B run alive all summer, juggling preemptible slices, rebuilding optimizer state, switching architectures, and generally shepherding ~6.4 trillion tokens across v5p and v4 pods—while mostly working on other Marin projects. The fact that this was possible speaks to the stability of the TPU platform and the maturity of the JAX/Marin stack.


The short version of a long summer

Our retrospective goes into much more detail about every spike, switch and cooldown. Here's the condensed version.


We began with a Llama-3-style 32B backbone and our best 8B data mix, running on preemptible TPU v5p pods. Preemptions were predictable, and recovery was nearly automatic. As availability tightened, however, we moved to dedicated TPU v4 capacity. After a slight tweak to gradient checkpointing to accommodate the older hardware (made easy by JAX's built-in support), we were back up and running and performance stayed excellent.
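To make the checkpointing tweak concrete, here is a minimal JAX sketch of the mechanism. The jax.checkpoint API and its policies are real JAX features; the toy block, shapes, and policy choice are illustrative assumptions, not Marin's or Levanter's actual configuration.

```python
import jax
import jax.numpy as jnp

# Toy stand-in for a transformer layer.
def block(params, x):
    h = jnp.tanh(x @ params["w1"])
    return x + h @ params["w2"]

# jax.checkpoint (remat) recomputes activations during the backward pass
# instead of storing them, trading extra FLOPs for lower HBM use; the policy
# picks which intermediates are still worth saving. This is the knob to turn
# when moving to hardware with less memory per chip.
block_remat = jax.checkpoint(
    block, policy=jax.checkpoint_policies.dots_with_no_batch_dims_saveable
)

def loss_fn(params, x):
    return jnp.mean(block_remat(params, x) ** 2)

params = {"w1": 0.01 * jnp.ones((16, 64)), "w2": 0.01 * jnp.ones((64, 16))}
grads = jax.grad(loss_fn)(params, jnp.ones((8, 16)))  # activations recomputed here
```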


Around step 70k, persistent loss spikes appeared. We tried clipping, update-norm guards, skip-step heuristics, "necromancy" (rebuilding optimizer state), and swapping in optimizers like Muon. Nothing helped. The model needed architectural support.


So, we warm-started the run onto a Qwen3-style architecture, which is the same as the Llama 3 architecture, except that it adds QK-Norm to attention. After a brief loss bump, the spikes vanished. The model recovered to its expected trajectory within ~10 billion tokens and remained stable.
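For readers who want to see the change, here is a minimal JAX sketch of QK-Norm; the function names are ours, and the learned scale parameters that real implementations include are omitted, so treat this as an illustration rather than Marin's actual attention code.

```python
import jax
import jax.numpy as jnp

def rms_norm(x, eps=1e-6):
    # RMSNorm over the head dimension; learned scale omitted for brevity.
    return x * jax.lax.rsqrt(jnp.mean(x * x, axis=-1, keepdims=True) + eps)

def attention_with_qk_norm(q, k, v):
    # q, k, v: [batch, heads, seq, head_dim]. Normalizing q and k bounds the
    # attention logits, which is what suppresses the late-run loss spikes.
    q, k = rms_norm(q), rms_norm(k)
    logits = jnp.einsum("bhqd,bhkd->bhqk", q, k) / jnp.sqrt(q.shape[-1])
    return jnp.einsum("bhqk,bhkd->bhqd", jax.nn.softmax(logits, axis=-1), v)

q = k = v = jnp.ones((1, 2, 4, 8))
out = attention_with_qk_norm(q, k, v)  # shape [1, 2, 4, 8]
```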


Towards the end of training, it was time for a cooldown. When training LLMs, one "cools down" the model by lowering the learning rate and shifting the data mix toward higher-quality data. Our first cooldown surfaced two issues: contamination from a cached math dataset, and a training-loss phase shift caused by our linear-congruential shuffle. Switching to a Feistel-based shuffle fixed the latter completely. After we cleaned the data, the second cooldown ran smoothly and produced the final model.
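Why does a Feistel network make a good shuffle? It is a deterministic pseudorandom bijection over example indices, so any example's shuffled position can be computed on the fly without materializing the permutation, and without the low-dimensional structure of a linear-congruential generator. The toy Python sketch below illustrates the idea; Levanter's actual implementation differs in its details.

```python
import hashlib

def feistel_index(i, key, domain_bits=20, rounds=4):
    # Balanced Feistel network: a pseudorandom bijection on [0, 2**domain_bits),
    # so every example appears exactly once per pass of the permutation.
    half = domain_bits // 2
    mask = (1 << half) - 1
    left, right = i >> half, i & mask
    for r in range(rounds):
        digest = hashlib.sha256(f"{key}:{r}:{right}".encode()).digest()
        left, right = right, left ^ (int.from_bytes(digest[:8], "big") & mask)
    return (left << half) | right

# Shuffled positions of the first eight examples (cycle-walking for dataset
# sizes that are not a power of two is omitted from this sketch).
print([feistel_index(i, key="cooldown-2") for i in range(8)])
```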


The result: a strong, open 32B base model

Marin 32B Base is a competitive open-source base model. It outperformed Olmo 2 32B Base—the previous best fully open-source base model—on 32 of 42 tasks, and it performs especially well on knowledge-heavy evaluations like ARC, BoolQ, and PIQA.


Head-to-head, Marin 32B Base also beat Gemma 3 27B PT on 24 of 42 tasks, and its overall average rank places it alongside Qwen 2.5 32B and the newer Olmo 3 32B models. On our evaluation suite, Marin 32B Base actually ties Olmo 3 32B Base in win rate, despite Olmo 3 being trained by a much larger team and arriving a month later.



Mean rank across our evaluation suite (lower is better). Marin 32B Base lands in the top cluster of open(-weight) models, alongside Qwen 2.5 and Olmo 3, and ahead of Gemma 3 27B PT and Olmo 2 32B. Gray bars indicate open weight models, while blue bars indicate open source models.


While Olmo 3 32B Base now comfortably leads on math and coding benchmarks, Marin 32B Base holds its own and still leads on many knowledge QA evaluations. For a model trained with a fraction of the team size typically expected for a 30B-scale run, we're proud of where it landed.


Because Marin 32B Base (like Olmo 3 32B) is open source, the weights, code, data recipes, and every experimental detour are public. Anyone can reproduce, audit, or build on the work.


The stack that made it possible

TPU stability across large slices

During the run, we moved across preemptible v5p-512 slices coordinated with Cloud TPU Multislice and a v4-2048 slice for the long middle, with several mid-run architectural transitions along the way. Throughout, TPUs were completely reliable for us: no mysterious hangs, no collective-op debugging. Preemptions were predictable and easy to recover from.


JAX + Levanter = predictable performance

Levanter builds on JAX's XLA compilation. In practice, what mattered for us was deterministic restarts, stable MFU at scale without custom kernels, and JAX's activation checkpointing, which made the v5p to v4 migration easy.


Marin's experiment system

Marin logs every step of the experimental pipeline: hyperparameters, code versions, datasets, metrics, and artifacts. Even with architectural switches and restarts, the run never devolved into a tangle of scripts. And because it's all open, anyone can retrace or reproduce the training.


What's next

Marin 32B Base is a strong base model, but we're not done. Here's what's coming next:


A reasoning-optimized Marin 32B

Hardened multislice TPU support for smoother preemptible training

Exploring MoE variants for the next scale

Continuing to release everything, including successes and failures, openly

Closing thought

Training a 32B model with a small team isn't about heroics but about using the right tools and infrastructure. TPUs' reliability, JAX's clarity and performance, and Marin's open, reproducible process provided the leverage we needed. If the 8B run showed that open labs can build credible models, the 32B run showed they can do it at scale: quietly, steadily, and with far fewer people than you might expect.


SpatialReasoner: Teaching VLMs to "see" structure — Accelerated with Tunix on TPUs

Wednesday, December 17, 2025

by Yifan Shen & Ismini Lourentzou, University of Illinois Urbana-Champaign, and Srikanth Kilaru, Google ML Frameworks



Introduction


We are seeing increasing interest in Tunix among researchers focusing on the post-training phase of model development. As a native JAX library, Tunix offers the flexibility needed to refine foundation models beyond LLMs, including Vision-Language Models (VLMs), helping them significantly improve their spatial reasoning capabilities.


Today, we are highlighting the work of the PLAN Lab (Perception and LANguage Lab) at the University of Illinois Urbana-Champaign (UIUC). To address the critical lack of spatial awareness in VLMs, they built SpatialReasoner-R1, a model capable of fine-grained spatial logic. They utilized Tunix and leveraged the Google TPU Research Cloud (TRC) to scale their experiments.


In this blog, Professor Ismini Lourentzou and her team explain how they used Tunix's modular design to implement novel alignment algorithms and improve spatial reasoning in VLMs.


The "Where" Problem in VLMs


Modern Vision-Language Models (VLMs) can describe images and answer basic visual questions with impressive fluency. However, they often struggle with fine-grained spatial understanding. If you ask a VLM to estimate distances, directions, or the precise relative positions of objects, it frequently "hallucinates" coordinates or produces inconsistent reasoning with vague answers.


These capabilities are critical for real-world applications, such as robotics, where precise spatial reasoning enables safe and intelligent interaction with physical environments.


To bridge this gap, we developed SpatialReasoner-R1 (in 4B and 8B versions), a model trained to perform step-by-step, visually grounded spatial reasoning. Our 8B fDPO model achieves 95.59 on qualitative accuracy and 77.3 on quantitative accuracy, outperforming the strongest baseline by ~9% in average accuracy on SpatialRGPT-Bench while preserving strong general vision-language abilities.


The Method: Fine-Grained Direct Preference Optimization (fDPO)


The secret sauce behind SpatialReasoner-R1 is a new technique called Fine-Grained Direct Preference Optimization (fDPO).


Standard alignment methods (like DPO) usually give a model a simple "thumbs up" or "thumbs down" for an entire response. But spatial reasoning is complex; for example, a model might correctly identify an object yet make a flawed logical inference about its location.


fDPO introduces segment-specific preference granularity. We optimize separate loss components (a minimal sketch follows the list below) for:


Descriptive Grounding: Does the model correctly perceive and describe the objects in the image?

Logical Reasoning: Is the step-by-step deduction sound, and does it follow coherent spatial logic?
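As a rough illustration of segment-level preference optimization (not the paper's exact objective), the JAX sketch below applies a standard DPO term to each segment and sums the weighted results; the segment names, weights, and log-probability bookkeeping are assumptions made for exposition.

```python
import jax

def dpo_term(logp_w, logp_l, ref_w, ref_l, beta):
    # Standard DPO on one segment: reward the margin by which the policy
    # prefers the chosen text over the rejected text, relative to a frozen
    # reference model.
    return -jax.nn.log_sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))

def fdpo_loss(policy_logps, ref_logps, beta=0.1, weights=(1.0, 1.0)):
    # policy_logps[name] = (chosen, rejected) log-probs summed over that
    # segment's tokens. Scoring segments separately lets a response earn
    # credit for sound grounding even when its reasoning segment loses.
    loss = 0.0
    for w, name in zip(weights, ("description", "reasoning")):
        loss += w * dpo_term(*policy_logps[name], *ref_logps[name], beta)
    return loss

# Toy usage with scalar per-segment log-probs:
policy = {"description": (-1.0, -2.0), "reasoning": (-3.0, -2.5)}
ref = {"description": (-1.5, -1.5), "reasoning": (-3.0, -3.0)}
print(fdpo_loss(policy, ref))
```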

To generate high-quality training signals, we built a Multi-Model Monte Carlo Tree Search (M3CTS) data generation pipeline, which constructs diverse reasoning trajectories that guide the model toward reliable spatial understanding.


Tunix: Modularity for Novel Research


Implementing a custom objective like fDPO can be difficult in rigid frameworks. Tunix addresses this by providing a well-structured and extensible DPOTrainer that makes it possible to introduce new alignment objectives without reengineering the training pipeline.


This modularity meant we could reuse the entire underlying training stack—sharding, data loading, and loop management—while injecting our novel research logic with just a small amount of well-contained code.


While our backbone model (Sa2VA) required specific architectural handling, the core fDPO algorithm is model-agnostic. We found the Tunix experience smooth and well-documented, making it easy to prototype and iterate on fine-tuning workflows without reinventing the wheel.


Google TRC & TPUs: Reliability at Scale


Training a model to reason over long horizons requires significant compute. The Google TPU Research Cloud (TRC) provided the infrastructure we needed to make large-scale training practical.


Scalability: Tunix's integration with TPUs allowed us to scale our experiments seamlessly.

Reliability: The system performed reliably across multiple TPU runs, which was essential for conducting large-scale spatial reasoning benchmarks.

Support: The Google Tunix and TRC teams assisted with infrastructure setup and experiment design, helping us refine our multi-model exploration strategy.

Looking Ahead: Open Source Contributions


We believe that open-source, extensible tools like Tunix are vital for fostering innovation. They lower the barrier for researchers to experiment with new training objectives without rebuilding core infrastructure.


In that spirit, we contributed our fDPO implementation back to the Tunix ecosystem. We have open-sourced the core fDPO components, enabling the community to apply segment-specific preference optimization to their own models.


Get Started


You can explore our research and the tools we used below:


SpatialReasoner Project Page

Tunix Documentation

Tunix GitHub Repository

Google TRC

JAX documentation

JAX AI Stack documentation

GRL: Turning verifiable games into a post-training suite for LLM agents with Tunix on TPUs

Tuesday, December 16, 2025

by The GRL Team, UC San Diego and Lin Chai & Srikanth Kilaru, Google ML Frameworks



Introduction


JAX is widely recognized for its power in training large-scale AI models. However, a primary bottleneck in the next phase of AI development—LLM post-training with Reinforcement Learning (RL)—is the scarcity of environments with verifiable rewards.


Today, we are highlighting the work of the GRL (Game Reinforcement Learning) team at UC San Diego. To solve the data bottleneck, they have built a pipeline that turns video games into rigorous reasoning benchmarks. They utilized Tunix, a JAX-native, research-friendly RL framework with multi-host and multi-turn support, and leveraged the Google TPU Research Cloud (TRC) to scale their experiments. The results are promising: this approach has yielded significant improvements in model quality, particularly on planning and reasoning tasks, proving that games can be a viable substrate for serious AI capability training.


In this blog, the GRL team explains how they are combining game environments, the modular Tunix library for RL post-training, and TPU compute to train the next generation of agents.


Why Verifiable Games for LLM Post-Training?

Current RL post-training has shown strong gains in domains like math and coding because success can be auto-checked. However, these settings are often narrow and short-term. We are effectively overfitting RL to clean problems, while the next generation of agents must operate in messy, multi-step worlds.


To unlock RL as a systematic method for improving reasoning, we need a diverse pool of environments where rewards are verifiable.
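As a toy illustration of what "verifiable" means here (this is our sketch, not one of GRL's actual environments), a game reward can be computed by replaying the agent's proposed actions against the rules, so it never depends on a learned judge:

```python
def verify_episode(grid, start, goal, moves):
    # Replay the agent's move string against the game rules and return 1.0
    # only if it reaches the goal legally: the reward is checked, not judged.
    deltas = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
    r, c = start
    for m in moves:
        dr, dc = deltas.get(m, (0, 0))
        nr, nc = r + dr, c + dc
        if not (0 <= nr < len(grid) and 0 <= nc < len(grid[0])) or grid[nr][nc] == "#":
            return 0.0  # out of bounds or hit a wall: verifiably wrong
        r, c = nr, nc
    return 1.0 if (r, c) == goal else 0.0

# "#" marks walls; the proposed plan below earns reward 1.0.
grid = ["..#", "...", "#.."]
print(verify_episode(grid, (0, 0), (2, 2), "DRRD"))
```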




