Model Information for Gemma 3

Model Page: Gemma

Resources and Technical Documentation

Authors: Google DeepMind


Model Information

Description

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.

Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants.

  • Context Window: 128K tokens (4B, 12B, 27B), 32K tokens (1B)
  • Multilingual Support: Over 140 languages
  • Deployability: Designed for laptops, desktops, and private cloud infrastructure

Applications:
Question answering, summarization, reasoning, image understanding, and more.


Inputs and Outputs

Input:

  • Text string: questions, prompts, documents
  • Images: normalized to 896x896 resolution, encoded to 256 tokens each
  • Total input context: 128K or 32K tokens (depending on model size)

Output:

  • Generated text
  • Total output context: 8192 tokens
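The limits above imply a simple token budget: every image costs a fixed 256 tokens of the input context, and whatever remains is available for text. The following is a minimal sketch of that arithmetic, not part of any Gemma tooling. The function name remaining_text_tokens and the reading of "K" as 1024 tokens are assumptions made for illustration.

```python
# Figures from the model card; "K" is ASSUMED to mean 1024 tokens.
IMAGE_TOKENS = 256  # each image is encoded to 256 tokens
CONTEXT_TOKENS = {
    "1b": 32 * 1024,
    "4b": 128 * 1024,
    "12b": 128 * 1024,
    "27b": 128 * 1024,
}


def remaining_text_tokens(model_size: str, num_images: int) -> int:
    """Return the input tokens left for text after reserving image tokens.

    Hypothetical helper for illustration only.
    """
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    budget = CONTEXT_TOKENS[model_size]
    used = num_images * IMAGE_TOKENS
    if used > budget:
        raise ValueError("images alone exceed the input context")
    return budget - used


if __name__ == "__main__":
    # Four images reserve 4 * 256 = 1024 tokens of the 27B context.
    print(remaining_text_tokens("27b", 4))
```

Under these assumptions, four images leave 130048 input tokens for text on the 4B, 12B, and 27B models. Actual prompts also spend tokens on chat-template markup, so treat the result as an upper bound.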

Usage

Installation

    pip install -U transformers

Running with the Pipeline API

    from transformers import pipeline
    import torch

    # Load the instruction-tuned 27B model in bfloat16 on a CUDA GPU.
    pipe = pipeline(
        "image-text-to-text",
        model="google/gemma-3-27b-it",
        device="cuda",
        torch_dtype=torch.bfloat16
    )

    # Each message carries a list of typed content parts (text or image).
    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}]
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
                {"type": "text", "text": "What animal is on the candy?"}
            ]
        }
    ]

    output = pipe(text=messages, max_new_tokens=200)
    # The last entry of generated_text is the assistant's reply.
    print(output[0]["generated_text"][-1]["content"])

Example Output:

Based on the image, the animal on the candy is a turtle.
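The pipeline example passes a list of messages in which each message holds a list of typed content parts. A small helper can make that structure easier to reuse for text-only and image prompts. This is a sketch for illustration: build_messages is a hypothetical name, not a transformers function, and it only assembles plain Python dictionaries in the same shape as the example above.

```python
from typing import Optional


def build_messages(
    user_text: str,
    image_url: Optional[str] = None,
    system_text: str = "You are a helpful assistant.",
) -> list:
    """Build a chat-message list shaped like the pipeline example.

    Hypothetical helper: it returns plain dicts and calls no library code.
    """
    user_content = []
    if image_url is not None:
        # Image parts come before the text part, as in the example.
        user_content.append({"type": "image", "url": image_url})
    user_content.append({"type": "text", "text": user_text})
    return [
        {"role": "system", "content": [{"type": "text", "text": system_text}]},
        {"role": "user", "content": user_content},
    ]


if __name__ == "__main__":
    msgs = build_messages("What animal is on the candy?")
    print(msgs[1]["content"])
```

The result can be passed to the pipeline in place of the hand-written list, for example pipe(text=build_messages("Summarize this."), max_new_tokens=200).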


Running on Single/Multi-GPU

    # pip install accelerate
    from transformers import AutoProcessor, Gemma3ForConditionalGeneration
    import torch

    model_id = "google/gemma-3-27b-it"

    # device_map="auto" spreads the weights across the available devices.
    # Loading in bfloat16 matches the dtype the inputs are cast to below.
    model = Gemma3ForConditionalGeneration.from_pretrained(
        model_id, device_map="auto", torch_dtype=torch.bfloat16
    ).eval()

    processor = AutoProcessor.from_pretrained(model_id)

    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}]
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
                {"type": "text", "text": "Describe this image in detail."}
            ]
        }
    ]

    # Apply the chat template and tokenize text and image in one step.
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt"
    ).to(model.device, dtype=torch.bfloat16)

    input_len = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
        # Drop the prompt tokens so only the newly generated tokens remain.
        generation = generation[0][input_len:]

    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
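Whether a model fits on one GPU or needs several depends largely on the memory its weights occupy. The sketch below estimates that footprint from a parameter count and a dtype. It is a rough lower bound under stated assumptions: the nominal parameter counts (1e9, 4e9, 12e9, 27e9) are read off the model names rather than taken from the model card, and activations, the KV cache, and framework overhead are ignored. The name weight_memory_gib is hypothetical.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}

# ASSUMPTION: nominal parameter counts inferred from the model names.
NOMINAL_PARAMS = {"1b": 1e9, "4b": 4e9, "12b": 12e9, "27b": 27e9}


def weight_memory_gib(num_params: float, dtype: str = "bfloat16") -> float:
    """Estimate the memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for name, count in NOMINAL_PARAMS.items():
        gib = weight_memory_gib(count)
        print(f"{name}: ~{gib:.1f} GiB of weights in bfloat16")
```

With these nominal counts, the 27B weights alone come to roughly 50 GiB in bfloat16, which is why the example relies on device_map="auto" to place layers across whatever devices are available.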

Citation

    @article{gemma_2025,
        title={Gemma 3},
        url={https://goo.gle/Gemma3Report},
        publisher={Kaggle},
        author={Gemma Team},
        year={2025}
    }

Model Data

Training Dataset

  • Text: Diverse web documents in over 140 languages
  • Code: Source code, to expose the model to the syntax and patterns of programming languages
  • Mathematics: Mathematical text, to support logical reasoning and symbolic representation
  • Images: Broad image data for vision tasks

Model Size    Tokens Used for Training
27B           14 trillion
12B           12 trillion
4B            4 trillion
1B            2 trillion
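The training-token figures above can be put on a common scale by dividing by model size. The sketch below computes training tokens per parameter. It assumes nominal parameter counts read off the model names (for example, 27e9 for 27B), which are not stated in the model card, so the ratios are approximate. The name tokens_per_param is hypothetical.

```python
# Training tokens from the table above.
TRAINING_TOKENS = {"27b": 14e12, "12b": 12e12, "4b": 4e12, "1b": 2e12}

# ASSUMPTION: nominal parameter counts inferred from the model names.
NOMINAL_PARAMS = {"27b": 27e9, "12b": 12e9, "4b": 4e9, "1b": 1e9}


def tokens_per_param(size: str) -> float:
    """Return training tokens divided by the nominal parameter count."""
    return TRAINING_TOKENS[size] / NOMINAL_PARAMS[size]


if __name__ == "__main__":
    for size in TRAINING_TOKENS:
        print(f"{size}: ~{tokens_per_param(size):.0f} tokens per parameter")
```

Under these assumptions, the smaller models see more training tokens per parameter than the 27B model does.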

Data Preprocessing

  • CSAM Filtering: Child sexual abuse material (CSAM) is filtered out to exclude harmful and illegal content
  • Sensitive Data Filtering: Removal of certain personal information
  • Content Quality Filtering: Following Google policies

Implementation Information

Hardware

  • Trained on TPUv4p, TPUv5p, TPUv5e
  • Advantages:
    • High performance for matrix operations
    • Large high-bandwidth memory
    • Scalability via TPU Pods
    • Cost-effective training at large scales

Software

  • Frameworks: JAX + ML Pathways
  • Benefits:
    • Single-controller programming model
    • Efficient orchestration of large-scale training

Evaluation

Reasoning and Factuality

Benchmark            Metric      1B      4B      12B     27B
HellaSwag            10-shot     62.3    77.2    84.2    85.6
BoolQ                0-shot      63.2    72.3    78.8    82.4
PIQA                 0-shot      73.8    79.6    81.8    83.3
SocialIQA            0-shot      48.9    51.9    53.4    54.9
TriviaQA             5-shot      39.8    65.8    78.2    85.5
Natural Questions    5-shot      9.48    20.0    31.4    36.1
ARC-c                25-shot     38.4    56.2    68.9    70.6
ARC-e                0-shot      73.0    82.4    88.3    89.0
WinoGrande           5-shot      58.2    64.7    74.3    78.8
BIG-Bench Hard       few-shot    28.4    50.9    72.6    77.7
DROP                 1-shot      42.4    60.1    72.2    77.2

STEM and Code

Benchmark    Metric      4B      12B     27B
MMLU         5-shot      59.6    74.5    78.6
AGIEval      3-5-shot    42.1    57.4    66.2
GSM8K        8-shot      38.4    71.0    82.6
HumanEval    0-shot      36.0    45.7    48.8

Multilingual

Benchmark           1B      4B      12B     27B
MGSM                2.04    34.7    64.3    74.3
Global-MMLU-Lite    24.9    57.0    69.4    75.7
FloRes              29.5    39.2    46.0    48.8

Multimodal

Benchmark        4B      12B     27B
COCOcap          102     111     116
DocVQA (val)     72.8    82.3    85.6
TextVQA (val)    58.9    66.5    68.6
RealWorldQA      45.5    52.2    53.9

Ethics and Safety

Evaluation Approach

  • Structured evaluation and internal red-teaming
  • Focus areas:
    • Child Safety
    • Content Safety
    • Representational Harms

Results

  • Major improvements over earlier Gemma models
  • Minimal policy violations observed
  • All testing was conducted without additional safety filters

Usage and Limitations

Intended Usage

  • Content creation
  • Chatbots and conversational AI
  • Text summarization
  • Image data extraction
  • Research and education

Limitations

  • Potential biases from training data
  • Challenges with open-ended or ambiguous tasks
  • Occasional factual inaccuracies
  • Limited common sense reasoning
  • English-centric safety evaluation