
# GRPO (VLM)

> Guidelines for preparing train_dataset entries to fine-tune a VLM with both images and text prompts.

## Directory Structure

```
.
└── dataset_dir/
    ├── images/
    │   ├── /path/to/img_1
    │   ├── ...
    │   └── /path/to/img_i
    ├── train.jsonl
    └── reward_models.py
```

<Note>
  The directory should be archived as a `.zip` file and uploaded to object storage. Example zip command:

  `cd path/to/dataset_dir && zip -r dataset_dir.zip ./*`
</Note>
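
If you prefer to build the archive from Python, the following is a minimal sketch using the standard library's `shutil.make_archive`. The helper name `zip_dataset` and its arguments are hypothetical; the point it illustrates is that the *contents* of the dataset directory end up at the root of the archive, as with the command above.

```python
import shutil
from pathlib import Path


def zip_dataset(dataset_dir: str, out_base: str) -> str:
    """Archive the contents of dataset_dir into <out_base>.zip.

    Hypothetical helper: train.jsonl, images/ and reward_models.py land
    at the archive root, mirroring `cd dataset_dir && zip -r ... ./*`.
    """
    root = Path(dataset_dir)
    if not (root / "train.jsonl").is_file():
        raise FileNotFoundError(f"train.jsonl not found in {root}")
    # make_archive appends ".zip" to out_base and returns the archive path.
    return shutil.make_archive(out_base, "zip", root_dir=str(root))
```

Keep `out_base` outside `dataset_dir` so the archive does not try to include itself.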

***

## Dataset Structure

The `train.jsonl` file stores one JSON object per line (`jsonl` format). Conceptually, it is a **list** of records in which each record has exactly three keys:

```python theme={null}
train_dataset = [
  {
    "prompt": "...",
    "image": "...",
    "answer": "..."
  },
]
```

***

## Example JSONL File

```json theme={null}
{"prompt": "...","image": "...","answer": "..."}
{"prompt": "...","image": "...","answer": "..."}
{"prompt": "...","image": "...","answer": "..."}
{"prompt": "...","image": "...","answer": "..."}
```
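
A file in this shape can be produced by serialising each record with `json.dumps` and writing it on its own line. The sketch below assumes the records are already in memory as Python dicts; `write_jsonl` and `read_jsonl` are illustrative names, not part of any platform API.

```python
import json
from typing import Any, Dict, Iterable, List


def write_jsonl(records: Iterable[Dict[str, Any]], path: str) -> int:
    """Write one JSON object per line; return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps emits no raw newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Reading the file back with `read_jsonl` is a quick way to confirm that every line parses on its own.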

***

## **Field Definitions**

| **Key** | **Type**    | **Description**                                    |
| ------- | ----------- | -------------------------------------------------- |
| prompt  | List\[Dict] | Chat-style prompt: a list of role-tagged messages. |
| image   | str         | Relative path of the image.                        |
| answer  | str         | Ground-truth response.                             |
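
The table above can be turned into a lightweight pre-upload check. This is only a sketch of one possible validator (the name `validate_record` is hypothetical); it checks the key set and the top-level types listed in the table, nothing more.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = {"prompt", "image", "answer"}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    if set(record) != REQUIRED_KEYS:
        return [f"expected keys {sorted(REQUIRED_KEYS)}, got {sorted(record)}"]
    problems = []
    prompt = record["prompt"]
    if not (isinstance(prompt, list) and prompt
            and all(isinstance(m, dict) for m in prompt)):
        problems.append("prompt must be a non-empty list of message dicts")
    if not isinstance(record["image"], str):
        problems.append("image must be a string path")
    if not isinstance(record["answer"], str):
        problems.append("answer must be a string")
    return problems
```

A common slip this catches is storing `answer` as a number (`1.2`) instead of a string (`"1.2"`).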

***

## **Prompt Format**

Each prompt entry is a list of one or more messages. Minimal single-turn example:

```json theme={null}
[
  {
    "role": "user",
    "content": [
      { "type": "image" },
      {
        "type": "text",
        "text": "When a spring does work on an object, we cannot find the work by simply multiplying the spring force by the object's displacement. … Provide your reasoning between <think> and </think> and then your final answer between <answer> and (put a float here) </answer>."
      }
    ]
  }
]
```

***

## Message fields

* `role`: usually `user` (extend with `assistant` for multi-turn data).
* `content`: ordered list of content parts within a turn. Each part is one of two types:
  * `{ "type": "image" }` - placeholder indicating an image input accompanies this turn
  * `{ "type": "text", "text": "<question/text> + <control tags>" }` - Text message from the user or assistant.
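
The message layout described above can be built programmatically. The helpers below are a sketch with hypothetical names (`build_prompt`, `count_image_placeholders`). Because each record carries a single `image` path, counting placeholders is one plausible sanity check; whether the trainer enforces exactly one placeholder is not stated here.

```python
from typing import Any, Dict, List


def build_prompt(question: str) -> List[Dict[str, Any]]:
    """Single-turn prompt: an image placeholder followed by the text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question},
            ],
        }
    ]


def count_image_placeholders(prompt: List[Dict[str, Any]]) -> int:
    """Count {"type": "image"} parts across all messages in the prompt."""
    return sum(
        1
        for message in prompt
        for part in message.get("content", [])
        if isinstance(part, dict) and part.get("type") == "image"
    )
```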

***

## **Image Format**

Path to the image file, relative to the root of the dataset archive. Example:

```
images/cumin_canister_spring.jpg
```
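
Since paths are resolved against the archive root, it can help to confirm before zipping that every `image` value points at a real file. A minimal sketch, assuming the dataset directory is still unpacked locally (`missing_images` is a hypothetical helper name):

```python
import json
from pathlib import Path
from typing import List


def missing_images(dataset_dir: str) -> List[str]:
    """Return image paths from train.jsonl that do not exist under dataset_dir."""
    root = Path(dataset_dir)
    missing = []
    with open(root / "train.jsonl", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            image = json.loads(line)["image"]
            # Resolve against the dataset root, which becomes the archive root.
            if not (root / image).is_file():
                missing.append(image)
    return missing
```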

***

## **Answer Format**

The expected answer to the question. Example:

```
1.2
```

***

## **Complete Example**

```python theme={null}
train_dataset = [
  {
    "prompt": [
      {
        "role": "user",
        "content": [
          { "type": "image" },
          {
            "type": "text",
            "text": (
              "When a spring does work on an object, … provide your reasoning between <think> and </think> "
              "and then your final answer between <answer> and (put a float here) </answer>"
            )
          }
        ]
      }
    ],
    "image": "images/cumin_canister_spring.jpg", 
    "answer": "1.2",
  },
]
```

***

## Reward functions

Depending on the dataset structure and task objectives, you may need to define reward functions for model training. The trainer accepts these reward functions through a special `reward_models.py` file. This section outlines, with examples, the standard way to provide custom reward functions.

Users should define their reward functions in a file named `reward_models.py`, which must expose a list named `reward_functions` containing callable functions.

<Note>
  The `reward_functions` list is then passed directly to `GRPOTrainer` like so: \
  \
  `GRPOTrainer(reward_functions=reward_functions, **kwargs)`
</Note>

***

## Example implementation

```python theme={null}
from your_tags import reasoning_start, reasoning_end, solution_start, solution_end
import re
from typing import Callable, List

def formatting_reward_func(completions, **kwargs) -> List[float]:
    """
    Rewards the presence of both reasoning and answer tags in the model's output.
    """
    thinking_pattern = f'{reasoning_start}(.*?){reasoning_end}'
    answer_pattern   = f'{solution_start}(.*?){solution_end}'
    scores = []

    for completion in completions:
        text    = completion[0]['content']
        score   = 0.0
        if len(re.findall(thinking_pattern, text, re.DOTALL)) == 1:
            score += 1.0
        if len(re.findall(answer_pattern, text, re.DOTALL)) == 1:
            score += 1.0
        scores.append(score)

    return scores


def correctness_reward_func(prompts, completions, answer, **kwargs) -> List[float]:
    """
    Rewards exact match of the numeric answer within the <answer> tags.
    """
    answer_pattern = f'{solution_start}(.*?){solution_end}'
    responses = [
        re.findall(answer_pattern, comp[0]['content'], re.DOTALL)
        for comp in completions
    ]

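    # Example debug print for the first sample in the batch
    q     = prompts[0][-1]['content']
    a     = answer[0]
    resp0 = completions[0][0]['content']
    print("-"*20, f"Q: {q}\nA: {a}\nR: {resp0}")

    # Compare each completion with its own ground truth, not only answer[0]
    return [
        2.0 if len(r) == 1 and r[0].strip() == str(truth).strip() else 0.0
        for r, truth in zip(responses, answer)
    ]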

reward_functions: List[Callable] = [formatting_reward_func, correctness_reward_func]
```

***

* `formatting_reward_func` checks that `<think>…</think>` and `<answer>…</answer>` each appear exactly once, awarding 1.0 for each.
* `correctness_reward_func` checks that the extracted answer exactly matches the ground truth, awarding 2.0 on a match.
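
Exact string matching treats `1.20` and `1.2` as different answers. If your answers are floats, a tolerance-based comparison may be a better fit. The variant below is a sketch, not part of the example above: the function name, the tag values `<answer>` and `</answer>`, and the tolerances are all assumptions to adapt to your task.

```python
import math
import re
from typing import List

# Assumed tag values; the example above imports them from a placeholder module.
SOLUTION_START = "<answer>"
SOLUTION_END = "</answer>"
ANSWER_RE = re.compile(
    re.escape(SOLUTION_START) + r"(.*?)" + re.escape(SOLUTION_END), re.DOTALL
)


def numeric_correctness_reward_func(completions, answer, **kwargs) -> List[float]:
    """2.0 if the single tagged answer parses as a float close to the ground truth."""
    scores = []
    for completion, truth in zip(completions, answer):
        matches = ANSWER_RE.findall(completion[0]["content"])
        score = 0.0
        if len(matches) == 1:
            try:
                # float() ignores surrounding whitespace, so " 1.20 " parses fine.
                if math.isclose(float(matches[0]), float(truth),
                                rel_tol=1e-6, abs_tol=1e-9):
                    score = 2.0
            except ValueError:
                pass  # non-numeric prediction or ground truth earns no reward
        scores.append(score)
    return scores
```

To use it, list it in `reward_functions` in place of, or alongside, `correctness_reward_func`.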
