Guidelines for preparing train_dataset entries to fine-tune a VLM with both images and text prompts.
The dataset directory should be compressed into a `.zip` file and stored in object storage. Example zip command: `cd path/to/dataset_dir && zip -r dataset_dir.zip ./*`
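Assuming image paths in `train.jsonl` are resolved relative to the dataset root, the archive contents might look like the sketch below (the `images/` subdirectory and file names are illustrative, not required):

```
dataset_dir/
├── train.jsonl
└── images/
    ├── 0001.jpg
    └── 0002.jpg
```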
The `train.jsonl` file should contain JSON objects serialised in JSONL format (one object per line), where each object has exactly three keys:
| Key | Type | Description |
|---|---|---|
| `prompt` | `List[Dict]` | Chat-style prompt: a list of role-tagged messages. |
| `image` | `str` | Relative path of the image. |
| `answer` | `str` | Ground-truth response. |
Each message in `prompt` has two fields:

- `role`: usually `user` (extend with `assistant` for multi-turn data).
- `content`: ordered list of content items in a turn. Content items can have two subtypes:
  - `{ "type": "image" }`: placeholder indicating that an image input accompanies this turn.
  - `{ "type": "text", "text": "<question/text> + <control tags>" }`: text message from the user or assistant.

A complete example entry is shown below.
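To make the schema concrete, here is a sketch of a single `train.jsonl` entry (pretty-printed for readability; in the actual file each entry occupies one line). The image path, question text, and answer are illustrative placeholders:

```json
{
  "prompt": [
    {
      "role": "user",
      "content": [
        { "type": "image" },
        { "type": "text", "text": "How many apples are on the table? Put your final answer inside <answer></answer> tags." }
      ]
    }
  ],
  "image": "images/0001.jpg",
  "answer": "3"
}
```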
Custom reward functions are supplied via a `reward_models.py` file. This section outlines (with examples) the standard method for providing custom reward functions.
Users should define their reward functions in a file named `reward_models.py`, which must expose a list named `reward_functions` containing callable functions.
This list is passed to the trainer, e.g. `GRPOTrainer(reward_functions=reward_functions, **kwargs)`.
Two example reward functions:

- `formatting_reward_func` checks that `<think>…</think>` and `<answer>…</answer>` appear exactly once.
- `correctness_reward_func` validates that the extracted answer matches the ground truth.
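A minimal sketch of such a `reward_models.py` is shown below, assuming each reward function receives the generated completions plus the ground-truth answers as keyword arguments and returns one score per completion; the argument names (`completions`, `answer`) and the 0/1 scoring are illustrative assumptions, so check the exact signature `GRPOTrainer` expects:

```python
# reward_models.py -- illustrative sketch; the (completions, answer, **kwargs)
# signature is an assumption, not a documented interface.
import re
from typing import List

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def formatting_reward_func(completions: List[str], **kwargs) -> List[float]:
    """Reward 1.0 when <think>...</think> and <answer>...</answer> each appear exactly once."""
    rewards = []
    for text in completions:
        ok = len(THINK_RE.findall(text)) == 1 and len(ANSWER_RE.findall(text)) == 1
        rewards.append(1.0 if ok else 0.0)
    return rewards


def correctness_reward_func(completions: List[str], answer: List[str], **kwargs) -> List[float]:
    """Reward 1.0 when the text inside <answer>...</answer> matches the ground truth."""
    rewards = []
    for text, gt in zip(completions, answer):
        match = ANSWER_RE.search(text)
        extracted = match.group(1).strip() if match else ""
        rewards.append(1.0 if extracted == gt.strip() else 0.0)
    return rewards


# The file must expose this list; the trainer consumes it as reward_functions.
reward_functions = [formatting_reward_func, correctness_reward_func]
```

Keeping formatting and correctness as separate callables lets each contribute its own reward term, mirroring the two example functions described above.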