Dataset Format

Choose the file type for your dataset. Currently supported types are:
  • jsonl (JSON Lines)
  • zip
    The directory should be archived in a .zip file and stored in object storage.
    Example zip command: cd path/to/dataset_dir && zip -r dataset_dir.zip ./*

Each line in a .jsonl file should represent a complete training example. The supported format styles are:

ShareGPT Format

{
  "system": "<system>",
  "conversation": [
    {"human": "<query1>", "assistant": "<response1>"},
    {"human": "<query2>", "assistant": "<response2>"}
  ]
}

Example JSONL File

{"system": "...", "conversation": [{"human": "...", "assistant": "..."}]}
{"system": "...", "conversation": [{"human": "...", "assistant": "..."}]}
{"system": "...", "conversation": [{"human": "...", "assistant": "..."}]}

Message fields

  • system: The initial system instruction that sets the behavior or tone for the assistant.
  • conversation: A list of human-assistant message pairs forming the dialogue history.
    • human: A user query or input in the conversation.
    • assistant: The assistant’s response to the corresponding human input.
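The field layout above can be checked before upload. The following is a minimal sketch, not an official validator: the helper names to_jsonl and validate_sharegpt_line are hypothetical, and the rules it enforces (a non-empty conversation list, string values for human and assistant, system treated as optional) are assumptions inferred from the example rather than documented requirements.

```python
import json


def to_jsonl(records):
    """Serialize each record as one JSON object per line (JSON Lines)."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


def validate_sharegpt_line(line):
    """Return a list of problems found in one JSONL line (empty if none).

    Assumed rules (not confirmed by the documentation): 'system' is an
    optional string, 'conversation' is a non-empty list of objects that
    each carry string 'human' and 'assistant' values.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]

    problems = []
    if "system" in record and not isinstance(record["system"], str):
        problems.append("'system' must be a string")

    conversation = record.get("conversation")
    if not isinstance(conversation, list) or not conversation:
        problems.append("'conversation' must be a non-empty list")
        return problems

    for i, turn in enumerate(conversation):
        if not isinstance(turn, dict):
            problems.append(f"turn {i} is not an object")
            continue
        for key in ("human", "assistant"):
            if not isinstance(turn.get(key), str):
                problems.append(f"turn {i}: '{key}' must be a string")
    return problems


# Illustrative record; the text is made up for this sketch.
example = {
    "system": "You are a helpful assistant.",
    "conversation": [{"human": "Hi", "assistant": "Hello!"}],
}
jsonl_text = to_jsonl([example, example])
for line in jsonl_text.splitlines():
    print(validate_sharegpt_line(line))
```

Because json.dumps escapes newlines inside strings, each record stays on a single line even when a message contains line breaks.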

OpenAI SFT Format

{
  "messages": [
    {"role": "system", "content": "<system>"},
    {"role": "user", "content": "<query1>"},
    {"role": "assistant", "content": "<response1>"},
    {"role": "user", "content": "<query2>"},
    {"role": "assistant", "content": "<response2>"}
  ]
}

Example JSONL File

{"messages": [{"role": "...", "content": "..."}]}
{"messages": [{"role": "...", "content": "..."}]}
{"messages": [{"role": "...", "content": "..."}]}

Message fields

  • messages: A sequential list of role-based messages representing a full conversation.
    • role: The identity of the message sender (e.g., system, user, assistant).
    • content: The actual text of the message corresponding to the role.
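The two formats carry the same information, so a record in one can be rewritten in the other. The sketch below shows one possible conversion from the ShareGPT layout to the OpenAI SFT layout. The function name sharegpt_to_messages is hypothetical, and the mapping of human to the user role is an assumption based on the parallel examples in this section, not a documented rule.

```python
import json


def sharegpt_to_messages(record):
    """Convert one ShareGPT-style record into an OpenAI SFT-style record.

    Assumption: 'human' maps to role 'user', and an empty or missing
    'system' value produces no system message.
    """
    messages = []
    system = record.get("system")
    if system:
        messages.append({"role": "system", "content": system})
    for turn in record["conversation"]:
        messages.append({"role": "user", "content": turn["human"]})
        messages.append({"role": "assistant", "content": turn["assistant"]})
    return {"messages": messages}


# Illustrative record; the text is made up for this sketch.
record = {
    "system": "You are a helpful assistant.",
    "conversation": [
        {"human": "What is JSONL?", "assistant": "One JSON object per line."},
        {"human": "Thanks.", "assistant": "You're welcome."},
    ],
}

converted = sharegpt_to_messages(record)
# One line of the resulting .jsonl file:
print(json.dumps(converted, ensure_ascii=False))
```

Each human-assistant pair becomes two consecutive messages, so the turn order of the original conversation is preserved.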