The field of artificial intelligence (AI) has seen immense progress in recent years, largely driven by advances in deep learning and natural language processing (NLP). At the forefront of these advances are large language models (LLMs) – AI systems trained on massive amounts of text data that can generate human-like text and engage in conversational tasks.
LLMs like Google’s PaLM, Anthropic’s Claude, and DeepMind’s Gopher have demonstrated remarkable capabilities, from coding to common sense reasoning. However, most of these models have not been openly released, limiting their access for research, development, and beneficial applications.
This changed with the recent open sourcing of Gemma – a family of LLMs from Google’s DeepMind based on their powerful proprietary Gemini models. In this blog post, we’ll dive into Gemma, analyzing its architecture, training process, performance, and responsible release.
Overview of Gemma
In February 2023, DeepMind open sourced two sizes of Gemma models – a 2 billion parameter version optimized for on-device deployment, and a larger 7 billion parameter version designed for GPU/TPU usage.
Gemma leverages a similar transformer-based architecture and training methodology to DeepMind’s leading Gemini models. It was trained on up to 6 trillion tokens of text from web documents, math, and code.
DeepMind released both raw pretrained checkpoints of Gemma, as well as versions fine-tuned with supervised learning and human feedback for enhanced capabilities in areas like dialogue, instruction following, and coding.
Getting Started with Gemma
Gemma’s open release makes its advanced AI capabilities accessible to developers, researchers, and enthusiasts. Here’s a quick guide to getting started:
Platform Agnostic Deployment
A key strength of Gemma is its flexibility – you can run it on CPUs, GPUs, or TPUs. For CPU, leverage TensorFlow Lite or HuggingFace Transformers. For accelerated performance on GPU/TPU, use TensorFlow. Cloud services like Google Cloud’s Vertex AI also provide seamless scaling.
Access Pre-trained Models
Gemma comes in different pre-trained variants depending on your needs. The 2B and 7B models offer strong generative abilities out-of-the-box. For custom fine-tuning, the 2B-FT and 7B-FT models are ideal starting points.
Build Exciting Applications
You can build a diverse range of applications with Gemma, like story generation, language translation, question answering, and creative content production. The key is leveraging Gemma’s strengths through fine-tuning on your own datasets.
Architecture
Gemma utilizes a decoder-only transformer architecture, building on advances like multi-query attention and rotary positional embeddings:
- Transformers: Introduced in 2017, the transformer architecture based solely on attention mechanisms has become ubiquitous in NLP. Gemma inherits the transformer’s ability to model long-range dependencies in text.
- Decoder-only: Gemma only uses a transformer decoder stack, unlike encoder-decoder models like BART or T5. This provides strong generative capabilities for tasks like text generation.
- Multi-query attention: Gemma employs multi-query attention in its larger model, allowing each attention head to process multiple queries in parallel for faster inference.
- Rotary positional embeddings: Gemma represents positional information using rotary embeddings instead of absolute position encodings. This technique reduces model size while retaining position information.
The use of techniques like multi-query attention and rotary positional embeddings enable Gemma models to reach an optimal tradeoff between performance, inference speed, and model size.
Data and Training Process
Gemma was trained on up to 6 trillion tokens of text data, primarily in English. This included web documents, mathematical text, and source code. DeepMind invested significant efforts into data filtering, removing toxic or harmful content using classifiers and heuristics.
Training was performed using Google’s TPUv5 infrastructure, with up to 4096 TPUs used to train Gemma-7B. Efficient model and data parallelism techniques enabled training the massive models with commodity hardware.
Staged training was utilized, continuously adjusting the data distribution to focus on high-quality, relevant text. The final fine-tuning stages used a mixture of human-generated and synthetic instruction-following examples to enhance capabilities.
Model Performance
DeepMind rigorously evaluated Gemma models on a broad set of over 25 benchmarks spanning question answering, reasoning, mathematics, coding, common sense, and dialogue capabilities.
Gemma achieves state-of-the-art results compared to similarly sized open source models across the majority of benchmarks. Some highlights:
- Mathematics: Gemma excels on mathematical reasoning tests like GSM8K and MATH, outperforming models like Codex and Anthropic’s Claude by over 10 points.
- Coding: Gemma matches or exceeds the performance of Codex on programming benchmarks like MBPP, despite not being specifically trained on code.
- Dialogue: Gemma demonstrates strong conversational ability with 51.7% win rate over Anthropic’s Mistral-7B on human preference tests.
- Reasoning: On tasks requiring inference like ARC and Winogrande, Gemma outperforms other 7B models by 5-10 points.
Gemma’s versatility across disciplines demonstrates its strong general intelligence capabilities. While gaps to human-level performance remain, Gemma represents a leap forward in open source NLP.
Safety and Responsibility
Releasing open source weights of large models introduces challenges around intentional misuse and inherent model biases. DeepMind took steps to mitigate risks:
- Data filtering: Potentially toxic, illegal, or biased text was removed from the training data using classifiers and heuristics.
- Evaluations: Gemma was tested on 30+ benchmarks curated to assess safety, fairness, and robustness. It matched or exceeded other models.
- Fine-tuning: Model fine-tuning focused on improving safety capabilities like information filtering and appropriate hedging/refusal behaviors.
- Terms of use: Usage terms prohibit offensive, illegal, or unethical applications of Gemma models. However, enforcement remains challenging.
- Model cards: Cards detailing model capabilities, limitations, and biases were released to promote transparency.
While risks from open sourcing exist, DeepMind determined Gemma’s release provides net societal benefits based on its safety profile and enablement of research. However, vigilant monitoring of potential harms will remain critical.
Enabling the Next Wave of AI Innovation
Releasing Gemma as an open source model family stands to unlock progress across the AI community:
- Accessibility: Gemma reduces barriers for organizations to build with cutting-edge NLP, who previously faced high compute/data costs for training their own LLMs.
- New applications: By open sourcing pretrained and tuned checkpoints, DeepMind enables easier development of beneficial apps in areas like education, science, and accessibility.
- Customization: Developers can further customize Gemma for industry or domain-specific applications through continued training on proprietary data.
- Research: Open models like Gemma foster greater transparency and auditing of current NLP systems, illuminating future research directions.
- Innovation: Availability of strong baseline models like Gemma will accelerate progress on areas like bias mitigation, factuality, and AI safety.
By providing Gemma’s capabilities to all through open sourcing, DeepMind hopes to spur responsible development of AI for social good.
The Road Ahead
With each leap in AI, we inch closer towards models that rival or exceed human intelligence across all domains. Systems like Gemma underscore how rapid advances in self-supervised models are unlocking increasingly advanced cognitive capabilities.
However, work remains to improve reliability, interpretability, and controllability of AI – areas where human intelligence still reigns supreme. Domains like mathematics highlight these persistent gaps, with Gemma scoring 64% on MMLU compared to estimated 89% human performance.
Closing these gaps while ensuring the safety and ethics of ever-more-capable AI systems will be the central challenges in the years ahead. Striking the right balance between openness and caution will be critical, as DeepMind aims to democratize access to benefits of AI while managing emerging risks.
Initiatives to promote AI safety – like Dario Amodei’s ANC, DeepMind’s Ethics & Society team, and Anthropic’s Constitutional AI – signal growing recognition of this need for nuance. Meaningful progress will require open, evidence-based dialogue between researchers, developers, policymakers and the public.
If navigated responsibly, Gemma represents not the summit of AI, but a basecamp for the next generation of AI researchers following in DeepMind’s footsteps towards fair, beneficial artificial general intelligence.
Conclusion
DeepMind’s release of Gemma models signifies a new era for open source AI – one that transcends narrow benchmarks into generalized intelligence capabilities. Tested extensively for safety and broadly accessible, Gemma sets a new standard for responsible open sourcing in AI.
Driven by a competitive spirit tempered with cooperative values, sharing breakthroughs like Gemma raises all boats in the AI ecosystem. The entire community now has access to a versatile LLM family to drive or support their initiatives.
While risks remain, DeepMind’s technical and ethical diligence provides confidence that Gemma’s benefits outweigh its potential harms. As AI capabilities grow ever more advanced, maintaining this nuance between openness and caution will be critical.
Gemma takes us one step closer to AI that benefits all of humanity. But many grand challenges still await along the path to benevolent artificial general intelligence. If AI researchers, developers and society at large can maintain collaborative progress, Gemma may one day be seen as a historic basecamp, rather than the final summit.

 
  
 


