Directory Structure
```
.
└── dataset_dir/
    ├── train.jsonl
    └── reward_models.py
```
The directory should be archived as a `.zip` file and stored in object storage. Example zip command:

```bash
cd path/to/dataset_dir && zip -r dataset_dir.zip ./*
```
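If you prefer to build the archive programmatically, here is a minimal Python sketch using the standard library. It assumes the directory layout above; like the `zip` command, it places `train.jsonl` and `reward_models.py` at the archive root.

```python
import shutil

# Create dataset_dir.zip with train.jsonl and reward_models.py at the archive root,
# equivalent to: cd path/to/dataset_dir && zip -r dataset_dir.zip ./*
shutil.make_archive("dataset_dir", "zip", root_dir="path/to/dataset_dir")
```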
Dataset Structure
The `train.jsonl` file should contain a list of JSON objects, serialised in JSONL format (one object per line), where each object has exactly two keys:
```python
train_dataset = [
    {
        "prompt": "...",
        "answer": "..."
    },
]
```
Example JSONL File
{"prompt": "...","answer": "..."}
{"prompt": "...","answer": "..."}
{"prompt": "...","answer": "..."}
{"prompt": "...","answer": "..."}
Field Definitions
| Key | Type | Description |
| --- | --- | --- |
| `prompt` | `List[Dict]` | Chat-style prompt: a list of role-tagged messages. |
| `answer` | `str` | Ground-truth response. |
Each prompt entry is a list of one or more messages. Minimal single-turn example:
```json
[
    {
        "role": "user",
        "content": "When a spring does work on an object, we cannot find the work by simply multiplying the spring force by the object's displacement. … Provide your reasoning between <think> and </think> and then your final answer between <answer> and (put a float here) </answer>."
    }
]
```
Message fields
- `role`: usually `user` (extend with `assistant` for multi-turn data; see the multi-turn sketch below).
- `content`: `"<question/text> + <control tags>"`, i.e. the text message from the user or assistant.

The `answer` field holds the expected answer to the question, as shown in the complete example below.
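For multi-turn data, a hedged sketch of what a prompt with an added `assistant` turn could look like (the question text, tag instructions, and values here are placeholders, not taken from a real record):

```python
# Hypothetical multi-turn prompt: earlier assistant turns are included as context.
multi_turn_prompt = [
    {"role": "user", "content": "First question ... <think>/<answer> tag instructions ..."},
    {"role": "assistant", "content": "<think>...</think><answer>0.5</answer>"},
    {"role": "user", "content": "Follow-up question ... same tag instructions ..."},
]
```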
Complete Example
```python
train_dataset = [
    {
        "prompt": [
            {
                "role": "user",
                "content": "When a spring does work on an object, … provide your reasoning between <think> and </think> and then your final answer between <answer> and (put a float here) </answer>"
            }
        ],
        "answer": "1.2",
    },
]
```
Reward functions
Depending on the dataset structure and task objectives, you may need to define reward functions for model training. These reward functions are accepted by the trainer through a special `reward_models.py` file. This section outlines (with examples) the standard method for providing custom reward functions.
Users should define their reward functions in a file named `reward_models.py`, which must expose a list named `reward_functions` containing callable functions. The reward functions list is then passed directly to GRPOTrainer like:

```python
GRPOTrainer(reward_functions=reward_functions, **kwargs)
```
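Concretely, a minimal sketch of the contract `reward_models.py` must satisfy (the function here is a toy placeholder; a full example follows in the next section):

```python
# reward_models.py: minimal sketch of the required interface.
from typing import Callable, List

def constant_reward_func(completions, **kwargs) -> List[float]:
    """Toy reward: returns 0.0 for every completion."""
    return [0.0 for _ in completions]

# The trainer looks for this module-level list of callables.
reward_functions: List[Callable] = [constant_reward_func]
```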
Example implementation
```python
import re
from typing import Callable, List

# Tags expected in the model's output, matching the prompt instructions above.
# Replace these with your own tag definitions if they differ.
reasoning_start, reasoning_end = "<think>", "</think>"
solution_start, solution_end = "<answer>", "</answer>"


def formatting_reward_func(completions, **kwargs) -> List[float]:
    """
    Rewards the presence of both reasoning and answer tags in the model's output.
    """
    thinking_pattern = f'{reasoning_start}(.*?){reasoning_end}'
    answer_pattern = f'{solution_start}(.*?){solution_end}'
    scores = []
    for completion in completions:
        text = completion[0]['content']
        score = 0.0
        # Award one point per tag pair that appears exactly once.
        if len(re.findall(thinking_pattern, text, re.DOTALL)) == 1:
            score += 1.0
        if len(re.findall(answer_pattern, text, re.DOTALL)) == 1:
            score += 1.0
        scores.append(score)
    return scores


def correctness_reward_func(prompts, completions, answer, **kwargs) -> List[float]:
    """
    Rewards exact match of the numeric answer within the <answer> tags.
    """
    answer_pattern = f'{solution_start}(.*?){solution_end}'
    responses = [
        re.findall(answer_pattern, comp[0]['content'], re.DOTALL)
        for comp in completions
    ]
    # Example debug print for the first sample in the batch
    q = prompts[0][-1]['content']
    resp0 = completions[0][0]['content']
    print("-" * 20, f"Q: {q}\nA: {answer[0]}\nR: {resp0}")
    # Compare each extracted answer against its corresponding ground truth.
    return [
        2.0 if len(r) == 1 and r[0].strip() == str(a) else 0.0
        for r, a in zip(responses, answer)
    ]


reward_functions: List[Callable] = [formatting_reward_func, correctness_reward_func]
```
- `formatting_reward_func` checks that `<think>…</think>` and `<answer>…</answer>` each appear exactly once.
- `correctness_reward_func` validates that the extracted answer matches the ground truth.
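To sanity-check the two functions, here is a hedged sketch of a mock invocation; the question, completion text, and answer are made up for illustration:

```python
# Mock batch with a single prompt/completion pair, mimicking the chat format above.
mock_prompts = [[{"role": "user", "content": "Example spring-work question ..."}]]
mock_completions = [[{"role": "assistant",
                      "content": "<think>reasoning goes here</think><answer>1.2</answer>"}]]
mock_answers = ["1.2"]

print(formatting_reward_func(completions=mock_completions))
# -> [2.0]  (both tag pairs appear exactly once)
print(correctness_reward_func(prompts=mock_prompts,
                              completions=mock_completions,
                              answer=mock_answers))
# -> [2.0]  (extracted answer "1.2" matches the ground truth)
```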