Directory Structure
```
.
└── dataset_dir/
    ├── train.jsonl
    └── reward_models.py
```
The directory should be archived as a `.zip` file and stored in object storage. Example zip command:

```bash
cd path/to/dataset_dir && zip -r dataset_dir.zip ./*
```
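If you prefer to build the archive programmatically, here is a minimal Python sketch using the standard library. It assumes the directory layout above; like the `zip` command, it places `train.jsonl` and `reward_models.py` at the archive root.

```python
import shutil

# Create dataset_dir.zip with train.jsonl and reward_models.py at the archive root,
# equivalent to: cd path/to/dataset_dir && zip -r dataset_dir.zip ./*
shutil.make_archive("dataset_dir", "zip", root_dir="path/to/dataset_dir")
```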
Dataset Structure
The `train.jsonl` file should contain a list of JSON objects, serialised in JSONL format (one object per line), where each object has exactly two keys:
```python
train_dataset = [
    {
        "prompt": "...",
        "answer": "..."
    },
]
```
Example JSONL File
{"prompt": "...","answer": "..."}
{"prompt": "...","answer": "..."}
{"prompt": "...","answer": "..."}
{"prompt": "...","answer": "..."}
Field Definitions
| Key | Type | Description |
| --- | --- | --- |
| `prompt` | `List[Dict]` | Chat-style prompt: a list of role-tagged messages. |
| `answer` | `str` | Ground-truth response. |
Each prompt entry is a list of one or more messages. Minimal single-turn example:
```json
[
    {
        "role": "user",
        "content": "When a spring does work on an object, we cannot find the work by simply multiplying the spring force by the object's displacement. … Provide your reasoning between <think> and </think> and then your final answer between <answer> and (put a float here) </answer>."
    }
]
```
Message fields
- `role`: usually `user` (extend with `assistant` for multi-turn data; see the multi-turn sketch below).
- `content`: `"<question/text> + <control tags>"`, i.e. the text message from the user or assistant.

The `answer` field holds the expected answer to the question, as shown in the complete example below.
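For multi-turn data, a hedged sketch of what a prompt with an added `assistant` turn could look like (the question text, tag instructions, and values here are placeholders, not taken from a real record):

```python
# Hypothetical multi-turn prompt: earlier assistant turns are included as context.
multi_turn_prompt = [
    {"role": "user", "content": "First question ... <think>/<answer> tag instructions ..."},
    {"role": "assistant", "content": "<think>...</think><answer>0.5</answer>"},
    {"role": "user", "content": "Follow-up question ... same tag instructions ..."},
]
```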
Complete Example
```python
train_dataset = [
    {
        "prompt": [
            {
                "role": "user",
                "content": "When a spring does work on an object, … provide your reasoning between <think> and </think> and then your final answer between <answer> and (put a float here) </answer>"
            }
        ],
        "answer": "1.2",
    },
]
```
Reward functions
Depending on the dataset structure and task objectives, you may need to define reward functions for model training. These reward functions are accepted by the trainer through a special `reward_models.py` file. This section outlines (with examples) the standard method for providing custom reward functions.
Users should define their reward functions in a file named `reward_models.py`, which must expose a list named `reward_functions` containing callable functions. The reward functions list is then passed directly to GRPOTrainer like:

```python
GRPOTrainer(reward_functions=reward_functions, **kwargs)
```
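Concretely, a minimal sketch of the contract `reward_models.py` must satisfy (the function here is a toy placeholder; a full example follows in the next section):

```python
# reward_models.py: minimal sketch of the required interface.
from typing import Callable, List

def constant_reward_func(completions, **kwargs) -> List[float]:
    """Toy reward: returns 0.0 for every completion."""
    return [0.0 for _ in completions]

# The trainer looks for this module-level list of callables.
reward_functions: List[Callable] = [constant_reward_func]
```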
Example implementation
```python
import re
from typing import Callable, List

# Tags expected in the model's output, matching the prompt instructions above.
# Replace these with your own tag definitions if they differ.
reasoning_start, reasoning_end = "<think>", "</think>"
solution_start, solution_end = "<answer>", "</answer>"


def formatting_reward_func(completions, **kwargs) -> List[float]:
    """
    Rewards the presence of both reasoning and answer tags in the model's output.
    """
    thinking_pattern = f'{reasoning_start}(.*?){reasoning_end}'
    answer_pattern = f'{solution_start}(.*?){solution_end}'
    scores = []
    for completion in completions:
        text = completion[0]['content']
        score = 0.0
        # Award one point per tag pair that appears exactly once.
        if len(re.findall(thinking_pattern, text, re.DOTALL)) == 1:
            score += 1.0
        if len(re.findall(answer_pattern, text, re.DOTALL)) == 1:
            score += 1.0
        scores.append(score)
    return scores


def correctness_reward_func(prompts, completions, answer, **kwargs) -> List[float]:
    """
    Rewards exact match of the numeric answer within the <answer> tags.
    """
    answer_pattern = f'{solution_start}(.*?){solution_end}'
    responses = [
        re.findall(answer_pattern, comp[0]['content'], re.DOTALL)
        for comp in completions
    ]
    # Example debug print for the first sample in the batch
    q = prompts[0][-1]['content']
    resp0 = completions[0][0]['content']
    print("-" * 20, f"Q: {q}\nA: {answer[0]}\nR: {resp0}")
    # Compare each extracted answer against its corresponding ground truth.
    return [
        2.0 if len(r) == 1 and r[0].strip() == str(a) else 0.0
        for r, a in zip(responses, answer)
    ]


reward_functions: List[Callable] = [formatting_reward_func, correctness_reward_func]
```
- `formatting_reward_func` checks that `<think>…</think>` and `<answer>…</answer>` each appear exactly once.
- `correctness_reward_func` validates that the extracted answer matches the ground truth.
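To sanity-check the two functions, here is a hedged sketch of a mock invocation; the question, completion text, and answer are made up for illustration:

```python
# Mock batch with a single prompt/completion pair, mimicking the chat format above.
mock_prompts = [[{"role": "user", "content": "Example spring-work question ..."}]]
mock_completions = [[{"role": "assistant",
                      "content": "<think>reasoning goes here</think><answer>1.2</answer>"}]]
mock_answers = ["1.2"]

print(formatting_reward_func(completions=mock_completions))
# -> [2.0]  (both tag pairs appear exactly once)
print(correctness_reward_func(prompts=mock_prompts,
                              completions=mock_completions,
                              answer=mock_answers))
# -> [2.0]  (extracted answer "1.2" matches the ground truth)
```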