Directory Structure

.
└── dataset_dir/
    ├── images/
    │   ├── img_1
    │   ├── ...
    │   └── img_i
    ├── train.jsonl
    └── reward_models.py
The directory should be archived as a .zip file and stored in object storage. Example zip command:
cd path/to/dataset_dir && zip -r dataset_dir.zip ./*
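
If you prefer to build the archive from Python rather than the shell, a minimal sketch using the standard library (the path path/to/dataset_dir below is a placeholder for your actual dataset directory):

import shutil

# Creates dataset_dir.zip with images/, train.jsonl and reward_models.py at the
# archive root, equivalent to running the zip command above from inside the directory.
shutil.make_archive("dataset_dir", format="zip", root_dir="path/to/dataset_dir")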

Dataset Structure

The train.jsonl file is a list of JSON-like objects serialised in JSONL format (one object per line), where each object has exactly three keys:
train_dataset = [
  {
    "prompt": "...",
    "image": "...",
    "answer": "..."
  },
]

Example JSONL File

{"prompt": "...","image": "...","answer": "..."}
{"prompt": "...","image": "...","answer": "..."}
{"prompt": "...","image": "...","answer": "..."}
{"prompt": "...","image": "...","answer": "..."}

Field Definitions

Key      Type         Description
prompt   List[Dict]   Chat-style prompt: a list of role-tagged messages.
image    str          Relative path of the image.
answer   str          Ground-truth response.
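
If you want a typed view of a single record while building the dataset, a small sketch (TrainRecord is a hypothetical helper for your own tooling, not something the trainer requires):

from typing import Dict, List, TypedDict

class TrainRecord(TypedDict):
    prompt: List[Dict]  # chat-style, role-tagged messages
    image: str          # path relative to the archive root, e.g. "images/..."
    answer: str         # ground-truth response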

Prompt Format

Each prompt entry is a list of one or more messages. Minimal single-turn example:
[
  {
    "role": "user",
    "content": [
      { "type": "image" },
      {
        "type": "text",
        "text": "When a spring does work on an object, we cannot find the work by simply multiplying the spring force by the object's displacement. … Provide your reasoning between <think> and </think> and then your final answer between <answer> and (put a float here) </answer>."
      }
    ]
  }
]

Message fields

  • role: usually user (extend with assistant for multi-turn data).
  • content: ordered list of content parts for the turn (see the validation sketch after this list). Each part has one of two types:
    • { "type": "image" } - placeholder indicating that an image input accompanies this turn.
    • { "type": "text", "text": "<question/text> + <control tags>" } - text message from the user or assistant.

Image Format

The path to the image file, relative to the root of the dataset archive. Example:
images/cumin_canister_spring.jpg

Answer Format

The expected answer to the question. Example:
1.2

Complete Example

train_dataset = [
  {
    "prompt": [
      {
        "role": "user",
        "content": [
          { "type": "image" },
          {
            "type": "text",
            "text": (
              "When a spring does work on an object, … provide your reasoning between <think> and </think> "
              "and then your final answer between <answer> and (put a float here) </answer>"
            )
          }
        ]
      }
    ],
    "image": "images/cumin_canister_spring.jpg", 
    "answer": "1.2",
  },
]
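
Before zipping, it can help to confirm that every record in train.jsonl points at an image that actually exists under the dataset root. A minimal sketch, assumed to be run from inside dataset_dir/:

import json
from pathlib import Path

dataset_root = Path(".")  # run from inside dataset_dir/
with open(dataset_root / "train.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        record = json.loads(line)
        image_path = dataset_root / record["image"]
        assert image_path.is_file(), f"line {line_no}: missing image {image_path}"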

Reward functions

Depending on the dataset structure and task objectives, you may need to define reward functions for model training. These reward functions are accepted by the trainer through a special reward_models.py file. This section outlines (with examples) the standard method for providing custom reward functions. Users should define their reward functions in a file named reward_models.py, which must expose a list named reward_functions containing the callables.
The reward_functions list is then passed directly to GRPOTrainer like so:

GRPOTrainer(reward_functions=reward_functions, **kwargs)
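
For context, one way a trainer could load this file is via importlib; the exact loading mechanism is an implementation detail of the training service, so treat the sketch below as illustrative only:

import importlib.util

spec = importlib.util.spec_from_file_location("reward_models", "reward_models.py")
reward_models = importlib.util.module_from_spec(spec)
spec.loader.exec_module(reward_models)

# The module is expected to expose a list named `reward_functions`.
reward_functions = reward_models.reward_functions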

Example implementation

import re
from typing import Callable, List

# The prompts instruct the model to wrap its reasoning and answer in these tags
# (see the prompt examples above).
reasoning_start, reasoning_end = "<think>", "</think>"
solution_start, solution_end = "<answer>", "</answer>"

def formatting_reward_func(completions, **kwargs) -> List[float]:
    """
    Rewards the presence of both reasoning and answer tags in the model's output.
    """
    thinking_pattern = f'{reasoning_start}(.*?){reasoning_end}'
    answer_pattern   = f'{solution_start}(.*?){solution_end}'
    scores = []

    for completion in completions:
        text    = completion[0]['content']
        score   = 0.0
        if len(re.findall(thinking_pattern, text, re.DOTALL)) == 1:
            score += 1.0
        if len(re.findall(answer_pattern, text, re.DOTALL)) == 1:
            score += 1.0
        scores.append(score)

    return scores


def correctness_reward_func(prompts, completions, answer, **kwargs) -> List[float]:
    """
    Rewards exact match of the numeric answer within the <answer> tags.
    """
    answer_pattern = f'{solution_start}(.*?){solution_end}'
    responses = [
        re.findall(answer_pattern, comp[0]['content'], re.DOTALL)
        for comp in completions
    ]

    # Example debug print
    q     = prompts[0][-1]['content']
    a     = answer[0]
    resp0 = completions[0][0]['content']
    print("-"*20, f"Q: {q}\nA: {a}\nR: {resp0}")

    # Compare each extracted answer against its corresponding ground truth.
    return [
        2.0 if len(r) == 1 and r[0].strip() == str(gt) else 0.0
        for r, gt in zip(responses, answer)
    ]

reward_functions: List[Callable] = [formatting_reward_func, correctness_reward_func]

  • formatting_reward_func checks that <think>…</think> and <answer>…</answer> appear exactly once.
  • correctness_reward_func validates that the extracted answer matches its corresponding ground-truth answer exactly.
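
For a quick local smoke test, the two functions can be called directly with dummy inputs shaped like what the trainer passes at runtime (lists of chat-style prompts and completions plus the answer column). The values below are purely illustrative and assume formatting_reward_func and correctness_reward_func from reward_models.py above are in scope:

dummy_prompts = [[{
    "role": "user",
    "content": [{"type": "image"},
                {"type": "text", "text": "How much work does the spring do? ..."}]
}]]
dummy_completions = [[{
    "role": "assistant",
    "content": "<think>W = 1/2 k x^2 ...</think><answer>1.2</answer>"
}]]
dummy_answers = ["1.2"]

print(formatting_reward_func(dummy_completions))                                  # [2.0]
print(correctness_reward_func(dummy_prompts, dummy_completions, dummy_answers))   # [2.0]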