Model Information for Gemma 3

Model Page: Gemma

Resources and Technical Documentation

Authors: Google DeepMind

Model Information

Description

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models.

Gemma 3 models are multimodal, handling text and image input and generating text output, with open weights for both pre-trained variants and instruction-tuned variants.

Context Window: 128K tokens (4B, 12B, 27B), 32K tokens (1B)
Multilingual Support: Over 140 languages
Deployability: Designed for laptops, desktops, and private cloud infrastructure

Applications:
Question answering, summarization, reasoning, image understanding, and more.

Inputs and Outputs

Input:

Text string: questions, prompts, documents
Images: normalized to 896x896 resolution, encoded to 256 tokens each
Total input context: 128K tokens for the 4B, 12B, and 27B sizes; 32K tokens for the 1B size

Output:

Generated text
Total output context: 8192 tokens
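
As a rough illustration of how images eat into the input context, the sketch below reserves 256 tokens per image and reports what is left for text. The helper name is hypothetical, and two things are assumed rather than taken from this card: that "128K" and "32K" mean 128 × 1024 and 32 × 1024 tokens, and that image tokens count directly against the input context. Chat-template overhead is ignored.

```python
# Hypothetical helper: estimates the text-token budget left after images.
TOKENS_PER_IMAGE = 256  # from the card: each image is encoded to 256 tokens

# Assumption: "128K" and "32K" mean 128 * 1024 and 32 * 1024 tokens.
CONTEXT_WINDOW = {
    "1b": 32 * 1024,
    "4b": 128 * 1024,
    "12b": 128 * 1024,
    "27b": 128 * 1024,
}


def remaining_text_tokens(model_size: str, num_images: int) -> int:
    """Return input tokens left for text after reserving space for images."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    budget = CONTEXT_WINDOW[model_size] - num_images * TOKENS_PER_IMAGE
    if budget < 0:
        raise ValueError("images alone exceed the input context")
    return budget


print(remaining_text_tokens("27b", 4))  # 131072 - 4 * 256 = 130048
```

Under these assumptions, even a few dozen images leave most of a 128K window available for text.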

Usage

Installation

pip install -U transformers

Running with the Pipeline API

from transformers import pipeline
import torch

# Load the instruction-tuned 27B checkpoint in bfloat16 on a CUDA device.
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-27b-it",
    device="cuda",
    torch_dtype=torch.bfloat16
)

# Chat-style input: a system prompt, then a user turn with an image and a question.
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
# The last message in the returned conversation is the model's reply.
print(output[0]["generated_text"][-1]["content"])

Example Output:

Based on the image, the animal on the candy is a turtle.
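
The message structure in the pipeline example can be assembled by a small helper when many prompts share the same shape. This is a minimal sketch: `build_messages` is a hypothetical name, not part of transformers, and it mirrors only the keys the example above uses.

```python
from typing import Optional


def build_messages(question: str, image_url: Optional[str] = None,
                   system_prompt: str = "You are a helpful assistant.") -> list:
    """Build a chat message list in the structure used by the pipeline example."""
    user_content = []
    if image_url is not None:
        # The pipeline example passes the image location under the "url" key.
        user_content.append({"type": "image", "url": image_url})
    user_content.append({"type": "text", "text": question})
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    "What animal is on the candy?",
    image_url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
)
print(messages[1]["content"][0]["type"])  # image
```

Omitting `image_url` yields a text-only user turn with the same system prompt.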

Running on Single/Multi-GPU

# pip install accelerate
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/gemma-3-27b-it"

# device_map="auto" lets accelerate place the weights across the available GPUs.
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Apply the chat template, tokenize, and move the tensors to the model's device.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

# Remember the prompt length so the prompt can be sliced off the output.
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
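
When deciding whether a checkpoint fits on one GPU or needs several, a back-of-the-envelope estimate of weight memory can help. The sketch below is an approximation, not a measured figure: it assumes the nominal size in the model name equals the parameter count, and it ignores activations and the KV cache. The names are illustrative.

```python
# Rough weight-memory estimate; ignores activations and the KV cache.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Assumption: the nominal size in the model name is the parameter count.
NOMINAL_PARAMS = {"1b": 1e9, "4b": 4e9, "12b": 12e9, "27b": 27e9}


def weight_memory_gb(model_size: str, dtype: str = "bfloat16") -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return NOMINAL_PARAMS[model_size] * BYTES_PER_PARAM[dtype] / 1e9


for size in NOMINAL_PARAMS:
    print(f"{size}: ~{weight_memory_gb(size):.0f} GB in bfloat16")
```

By this estimate the 27B weights alone need roughly 54 GB in bfloat16, which is why a multi-GPU placement may be necessary on common hardware.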

Citation

@article{gemma_2025,
    title={Gemma 3},
    url={https://goo.gle/Gemma3Report},
    publisher={Kaggle},
    author={Gemma Team},
    year={2025}
}

Model Data

Training Dataset

Text: Diverse web documents in over 140 languages
Code: Exposure to programming languages
Mathematics: Logical reasoning and symbolic representation
Images: Broad image data for vision tasks

Model Size	Tokens Used for Training
27B	14 trillion
12B	12 trillion
4B	4 trillion
1B	2 trillion
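
To put the token counts in the table above in proportion to model size, the following sketch computes training tokens per parameter. It assumes the nominal size label (for example 27B) equals the parameter count, which is an approximation not stated in this card.

```python
# Training tokens per model size, copied from the table above.
TRAINING_TOKENS = {"27B": 14e12, "12B": 12e12, "4B": 4e12, "1B": 2e12}

# Assumption: the nominal size label equals the parameter count.
NOMINAL_PARAMS = {"27B": 27e9, "12B": 12e9, "4B": 4e9, "1B": 1e9}


def tokens_per_parameter(size: str) -> float:
    """Return training tokens divided by the nominal parameter count."""
    return TRAINING_TOKENS[size] / NOMINAL_PARAMS[size]


for size in TRAINING_TOKENS:
    print(f"{size}: ~{tokens_per_parameter(size):.0f} training tokens per parameter")
```

Under this assumption the smaller models see more tokens per parameter than the larger ones: about 2000 for 1B against about 519 for 27B.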

Data Preprocessing

CSAM Filtering: To exclude harmful content
Sensitive Data Filtering: Removal of certain personal information
Content Quality Filtering: Following Google policies

Implementation Information

Hardware

Trained on TPUv4p, TPUv5p, TPUv5e
Advantages:
- High performance for matrix operations
- Large high-bandwidth memory
- Scalability via TPU Pods
- Cost-effective training at large scales

Software

Frameworks: JAX + ML Pathways
Benefits:
- Single-controller programming model
- Efficient orchestration of large-scale training

Evaluation

Reasoning and Factuality

Benchmark	Metric	1B	4B	12B	27B
HellaSwag	10-shot	62.3	77.2	84.2	85.6
BoolQ	0-shot	63.2	72.3	78.8	82.4
PIQA	0-shot	73.8	79.6	81.8	83.3
SocialIQA	0-shot	48.9	51.9	53.4	54.9
TriviaQA	5-shot	39.8	65.8	78.2	85.5
Natural Questions	5-shot	9.48	20.0	31.4	36.1
ARC-c	25-shot	38.4	56.2	68.9	70.6
ARC-e	0-shot	73.0	82.4	88.3	89.0
WinoGrande	5-shot	58.2	64.7	74.3	78.8
BIG-Bench Hard	few-shot	28.4	50.9	72.6	77.7
DROP	1-shot	42.4	60.1	72.2	77.2

STEM and Code

Benchmark	Metric	4B	12B	27B
MMLU	5-shot	59.6	74.5	78.6
AGIEval	3-5-shot	42.1	57.4	66.2
GSM8K	8-shot	38.4	71.0	82.6
HumanEval	0-shot	36.0	45.7	48.8

Multilingual

Benchmark	1B	4B	12B	27B
MGSM	2.04	34.7	64.3	74.3
Global-MMLU-Lite	24.9	57.0	69.4	75.7
FloRes	29.5	39.2	46.0	48.8

Multimodal

Benchmark	4B	12B	27B
COCOcap	102	111	116
DocVQA (val)	72.8	82.3	85.6
TextVQA (val)	58.9	66.5	68.6
RealWorldQA	45.5	52.2	53.9
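
The benchmark tables in this section are tab-separated, so they can be loaded with the standard library for quick comparisons. The sketch below parses the Multimodal table and prints the score difference between the 4B and 27B models. The function name is illustrative, and the differences are on each benchmark's own scale, so they are not comparable across rows.

```python
import csv
import io

# Multimodal results, copied from the tab-separated table above.
TABLE = (
    "Benchmark\t4B\t12B\t27B\n"
    "COCOcap\t102\t111\t116\n"
    "DocVQA (val)\t72.8\t82.3\t85.6\n"
    "TextVQA (val)\t58.9\t66.5\t68.6\n"
    "RealWorldQA\t45.5\t52.2\t53.9\n"
)


def load_scores(table_text: str) -> dict:
    """Parse a tab-separated benchmark table into {benchmark: {size: score}}."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return {
        row["Benchmark"]: {
            size: float(score) for size, score in row.items() if size != "Benchmark"
        }
        for row in reader
    }


scores = load_scores(TABLE)
for name, by_size in scores.items():
    gain = by_size["27B"] - by_size["4B"]
    print(f"{name}: {gain:+.1f} from 4B to 27B")
```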

Ethics and Safety

Evaluation Approach

Structured evaluation and internal red-teaming
Focus areas:
- Child Safety
- Content Safety
- Representational Harms

Results

Major improvements over earlier Gemma models
Minimal policy violations observed
All testing was conducted without additional safety filters

Usage and Limitations

Intended Usage

Content creation
Chatbots and conversational AI
Text summarization
Image data extraction
Research and education

Limitations

Potential biases from training data
Challenges with open-ended or ambiguous tasks
Occasional factual inaccuracies
Limited common sense reasoning
English-centric safety evaluation