Supervised Fine-Tuning (SFT) Guide

This updated guide provides an overview of our enhanced UI for fine-tuning large language models using both full-model fine-tuning (SFT) and parameter-efficient methods like LoRA. While full-model SFT is fully supported, we recommend using LoRA for most use cases due to its faster convergence, reduced GPU memory requirements, and simplified checkpointing. LoRA enables more efficient experimentation while maintaining strong model performance.
Note: This guide refers to the latest version (v2). If you’re using the older interface, please refer to the Legacy v1 Guide.

Starting a Training Experiment

  • Experiment Name: A unique identifier for each training job within your organization.
  • Base Model: Select a supported model from the list below.
  • Source Type: Currently supports models from Hugging Face.
  • Model Type: Auto-filled based on the selected base model.
Supported Models:
  • meta-llama/Llama-3.1-8B-Instruct
  • meta-llama/Llama-3.2-1B-Instruct
  • meta-llama/Llama-3.2-3B-Instruct
  • meta-llama/Llama-3.2-11B-Vision-Instruct
  • Qwen/Qwen2.5-3B-Instruct
  • Qwen/Qwen2.5-14B-Instruct
  • Qwen/Qwen2.5-VL-7B-Instruct
  • tiiuae/falcon-7b-instruct

Dataset Selection

Configure your dataset for training using the following fields:
  • Source Options: Select the source of your dataset. Supported options include:
    • Hugging Face (public Hub)
    • AWS S3
    • GCP Storage (GCS)
  • Dataset Name
    This should be unique within your organization to help with organizing and reusing datasets.
  • Dataset Path
    Specify the dataset location. For AWS S3, use the full object path,
    e.g., s3://your-bucket/your-file.jsonl
  • Dataset Description (Optional)
    Provide a brief description of the dataset’s contents or purpose. Optional but useful for reference.
  • Secret (Required for AWS S3 or GCP GCS)
    Provide your cloud credentials to enable secure access to private storage buckets.
  • Region (Required for AWS S3 or GCP GCS)
    Select the region where your storage bucket is located.
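
If your dataset lives in a private S3 bucket, it can help to confirm the object exists at the exact path you will enter in the Dataset Path field before creating the job. A minimal sketch using boto3; the bucket, key, and local file names are placeholders:

    import boto3

    BUCKET = "your-bucket"        # placeholder bucket name
    KEY = "your-file.jsonl"       # placeholder object key

    # Uses credentials from your environment / standard AWS credential chain.
    s3 = boto3.client("s3")

    # Upload the local JSONL file to the path the training job will read.
    s3.upload_file("train.jsonl", BUCKET, KEY)

    # Confirm the object is reachable and report its size.
    head = s3.head_object(Bucket=BUCKET, Key=KEY)
    print(f"s3://{BUCKET}/{KEY} exists, size = {head['ContentLength']} bytes")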

Dataset Format

Choose the file type for your dataset. Currently supported types are:
  • jsonl (JSON Lines)
  • zip
    The directory should be archived in a .zip file and stored in object storage.
    Example zip command: cd path/to/dataset_dir && zip -r dataset_dir.zip ./*
Each line in a .jsonl file should represent a complete training example. The supported format styles are:
  1. ShareGPT Format
    {
      "system": "<system>",
      "conversation": [
        {"human": "<query1>", "assistant": "<response1>"},
        {"human": "<query2>", "assistant": "<response2>"}
      ]
    }
    
  2. OpenAI SFT Format
    {
      "messages": [
        {"role": "system", "content": "<system>"},
        {"role": "user", "content": "<query1>"},
        {"role": "assistant", "content": "<response1>"},
        {"role": "user", "content": "<query2>"},
        {"role": "assistant", "content": "<response2>"}
      ]
    }
    
  3. OpenAI DPO Format (for preference training)
    {
      "messages": [
        {"role": "system", "content": "You are a useful and harmless assistant"},
        {"role": "user", "content": "Tell me tomorrow's weather"},
        {"role": "assistant", "content": "Tomorrow's weather will be sunny"}
      ],
      "rejected_response": "I don't know"
    }    
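
Before uploading, it is worth checking that every line of your file parses and matches one of the formats above. A minimal sketch that writes and lightly validates OpenAI-SFT-style records; the validation rules here are simple assumptions, not the platform's exact checks:

    import json

    # Two tiny OpenAI SFT-format examples, one JSON object per line.
    examples = [
        {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is LoRA?"},
                {"role": "assistant", "content": "LoRA adds small low-rank adapters to a frozen base model."},
            ]
        },
    ]

    with open("train.jsonl", "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")

    # Light validation: each line is valid JSON and ends with an assistant turn.
    with open("train.jsonl", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            record = json.loads(line)
            roles = [message["role"] for message in record["messages"]]
            assert roles[-1] == "assistant", f"line {lineno}: should end with an assistant message"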
    


Dataset Configuration

  • Lazy Tokenize: Delay tokenization until needed. Speeds up dataset loading for large files.
  • Streaming: Enable only for public HF Datasets to load records on-the-fly, reducing local storage needs.
  • Prompt Max Length: Maximum token length for prompt. Longer sequences will be truncated.
    Recommended: 2048
    Note: Must be a multiple of 1024
  • System Prompt: (Optional) A global prefix to every example, e.g., You are a helpful assistant.
  • Prompt Template: (Optional) If your data needs wrapping in a custom template, e.g., <system> {system_prompt} <user> {prompt}.
  • Train/Validation Split: Percentage or absolute count for splitting your .jsonl into training and validation sets.
    • Split Type
      Currently, only random split is supported. The dataset will be randomly divided into training and validation sets.
    • Train Split Ratio
      Enter the ratio of data to be used for training (e.g., 0.9 for 90%).
    • Validation Split Ratio
      Enter the ratio of data to be used for validation (e.g., 0.1 for 10%).
      Note: The Train Split Ratio should be greater than 0.8.
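
The platform performs the random split for you; if you want to reproduce a 0.9/0.1 split locally to inspect both sets, a small sketch (file names are placeholders):

    import json
    import random

    random.seed(42)  # fixed seed so the split is reproducible

    with open("train.jsonl", encoding="utf-8") as f:
        records = [json.loads(line) for line in f]

    random.shuffle(records)
    cut = int(0.9 * len(records))  # Train Split Ratio = 0.9, Validation Split Ratio = 0.1

    splits = {"train_split.jsonl": records[:cut], "val_split.jsonl": records[cut:]}
    for path, subset in splits.items():
        with open(path, "w", encoding="utf-8") as f:
            for record in subset:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")

    print(f"{cut} training examples, {len(records) - cut} validation examples")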

Training Configuration

  1. Core Options
| Parameter | Description | Example |
| --- | --- | --- |
| Train Type | Select the tuning algorithm | SFT |
| Adapter Type | Choose the adapter method | LoRA, Full |
| Torch DType | Precision setting for training | bfloat16 |
Adapter Type
  • Full – Use this option for full-model fine-tuning, where all model parameters are updated.
  • LoRA – Use this for parameter-efficient fine-tuning using Low-Rank Adapters (LoRA), which updates a small subset of weights for faster training and lower resource usage.
Note: LoRA is generally recommended for efficiency and ease of deployment.
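
For intuition, the Torch DType and Adapter Type fields correspond to how the base model would be loaded and wrapped if you were training outside the UI. A rough sketch with Hugging Face Transformers; the service's internal loading code is not shown here and may differ:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"  # any supported base model

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Torch DType = bfloat16: weights are loaded in bf16 precision.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

    # Adapter Type = Full updates every parameter of `model`;
    # Adapter Type = LoRA instead wraps it with low-rank adapters
    # (see the LoRA Adapter Configuration section below).
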
  2. Optimization Hyperparameters
| Parameter | Description | Default | Recommended Values | Permissible Range |
| --- | --- | --- | --- | --- |
| Num Epochs | Number of full passes through the dataset | 1 | 2-5 | 50 |
| Gradient Accumulation Steps | Steps to accumulate gradients before an optimizer step | 1 | 1-2 | < 256 |
| Train Batch Size | Samples per device for training | 8 | 8 | 16 |
| Eval Batch Size | Samples per device for evaluation | 1 | 8 | 16 |
| Max Steps | Total training steps; overrides epochs if specified (applicable only when Num Epochs = 1) | 100 | 100 | 1000 |
| Learning Rate | Initial learning rate for the optimizer | 0.0001 | 1×10⁻⁵ to 2×10⁻⁵ | < 5×10⁻⁵ |
| Dataloader Num Workers | Parallel data-loading threads per device | 1 | 4 | < 10 |
| Gradient Checkpointing | Saves memory by checkpointing activations | Disabled | Disabled | NA |
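
To see how these settings interact, the effective global batch size is Train Batch Size × Gradient Accumulation Steps × GPU Count, and the number of optimizer steps per epoch follows from the dataset size. A quick arithmetic sketch; the dataset size and GPU count below are placeholders:

    # Placeholder values -- substitute your own dataset size and infrastructure settings.
    num_examples = 10_000
    train_batch_size = 8             # samples per device
    gradient_accumulation_steps = 1
    num_gpus = 4
    num_epochs = 1

    effective_batch_size = train_batch_size * gradient_accumulation_steps * num_gpus
    steps_per_epoch = num_examples // effective_batch_size
    total_steps = steps_per_epoch * num_epochs

    print(f"effective global batch size: {effective_batch_size}")   # 32
    print(f"optimizer steps per epoch:   {steps_per_epoch}")         # 312
    print(f"total optimizer steps:       {total_steps}")             # 312
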
  3. Checkpointing & Monitoring
| Parameter | Description | Default | Recommended Values | Permissible Range |
| --- | --- | --- | --- | --- |
| Save Steps | Interval (in steps) between saving model checkpoints | 100 | 100 | <= Max Steps |
| Save Total Limit | Max number of checkpoints to keep locally | 2 | 2-5 | < 10 |
| Eval Steps | Interval (in steps) between running the evaluation loop | 100 | 100 | 100-200 |
| Logging Steps | Interval (in steps) between logging metrics to the dashboard | 5 | 5 | < 20 |
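
As a rough guide to how these intervals play out over a run, a short sketch that estimates how many checkpoints, evaluation passes, and log points a job produces; total_steps is a placeholder:

    total_steps = 1000       # placeholder: from Max Steps, or epochs x steps per epoch
    save_steps = 100
    save_total_limit = 2
    eval_steps = 100
    logging_steps = 5

    checkpoints_written = total_steps // save_steps                    # 10 checkpoints written
    checkpoints_retained = min(save_total_limit, checkpoints_written)  # only the newest 2 are kept
    eval_runs = total_steps // eval_steps                              # 10 evaluation passes
    log_points = total_steps // logging_steps                          # 200 metric log points

    print(checkpoints_written, checkpoints_retained, eval_runs, log_points)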

LoRA Adapter Configuration

| Parameter | Description | Default | Recommended Value | Permissible Range |
| --- | --- | --- | --- | --- |
| Rank (r) | Dimensionality of the low-rank decomposition | 16 | 16 | 64 |
| Alpha | Scaling factor for the adapter output | 16 | 32 | 64 |
| Dropout | Dropout probability for adapter layers | 0.1 | 0.1 | 1 |
| Targets | Which modules to apply adapters to (e.g., all-linear) | all-linear | all-linear | NA |
These settings control the LoRA injection into your base model. Higher rank increases capacity but uses more memory.
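
For reference, these fields map onto a standard LoRA configuration. A sketch of roughly equivalent settings using the Hugging Face PEFT library (target_modules="all-linear" requires a recent PEFT release; the platform's internal implementation may differ):

    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct", torch_dtype=torch.bfloat16
    )

    # Mirrors the recommended UI values: Rank 16, Alpha 32, Dropout 0.1, Targets all-linear.
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.1,
        target_modules="all-linear",
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()  # shows how small the trainable fraction is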

Distributed Training Configuration

| Parameter | Description | Default | Recommended Value | Available Options |
| --- | --- | --- | --- | --- |
| Type | Choose your distributed backend | DeepSpeed | DeepSpeed | DeepSpeed, DDP |
| Strategy | Only available for DeepSpeed | zero3_offload | zero3_offload | zero1, zero2, zero2_offload, zero3, zero3_offload |
Set Type to DeepSpeed to enable ZeRO optimizations, or DDP for native PyTorch distributed training.
When using DeepSpeed, select the zero3_offload strategy to maximize memory savings by offloading optimizer states and parameters to CPU memory.
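
For reference, the zero3_offload strategy corresponds roughly to a DeepSpeed ZeRO stage-3 configuration with CPU offload. A sketch of such a configuration expressed as a Python dict; the platform generates its own config, so treat the exact values as illustrative:

    # Approximate shape of a ZeRO stage-3 + CPU offload DeepSpeed configuration.
    deepspeed_zero3_offload = {
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "offload_param": {"device": "cpu", "pin_memory": True},
        },
        "gradient_accumulation_steps": 1,
        "train_micro_batch_size_per_gpu": 8,
    }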

Infrastructure Configuration

  • GPU Type: Select the instance GPU, e.g., H100, L40S.
  • GPU Count: Number of GPUs to allocate for this job.
Adjust based on model size and dataset scale. More GPUs reduce wall-clock time but increase cost.
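
As a rough sizing aid, a back-of-the-envelope memory estimate; the per-parameter byte counts are common approximations for mixed-precision Adam training, not exact figures for this platform:

    params_billions = 8.0   # e.g., meta-llama/Llama-3.1-8B-Instruct

    # Common rules of thumb (approximate, excluding activations and framework overhead):
    #   full fine-tuning with Adam in mixed precision: ~16 bytes per parameter
    #   LoRA: ~2 bytes per parameter for frozen bf16 weights, plus a small adapter overhead
    full_finetune_gb = params_billions * 16
    lora_gb = params_billions * 2

    print(f"full fine-tuning: ~{full_finetune_gb:.0f} GB of training state")   # ~128 GB
    print(f"LoRA:             ~{lora_gb:.0f} GB plus adapter overhead")        # ~16 GB
    # ~128 GB does not fit on a single 80 GB GPU, which is why ZeRO sharding/offload,
    # LoRA, or additional GPUs are recommended for the larger base models.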

Launching Your Job

  1. Review all settings.
  2. Click Create Job.
  3. Monitor progress under My Trainings > Your Training Job > Metrics.
  4. Compile the model and deploy when training completes.