meta-llama/Llama-3.1-8B-Instruct
meta-llama/Llama-3.2-1B-Instruct
meta-llama/Llama-3.2-3B-Instruct
meta-llama/Llama-3.2-11B-Vision-Instruct
Qwen/Qwen2.5-3B-Instruct
Qwen/Qwen2.5-14B-Instruct
Qwen/Qwen2.5-VL-7B-Instruct
tiiuae/falcon-7b-instruct
The dataset must be stored in object storage (e.g., `s3://your-bucket/your-file.jsonl`). Two input formats are supported:

- `.jsonl` (JSON Lines): each line in the `.jsonl` file should represent a complete training example.
- `.zip`: the dataset directory should be archived into a `.zip` file and stored in object storage, e.g. `cd path/to/dataset_dir && zip -r dataset_dir.zip ./*`

The supported format styles are:
- Max sequence length: recommended 2048. Note: it must be a multiple of 1024.
- System prompt: defaults to `You are a helpful assistant.`
- Prompt template: `<system> {system_prompt} <user> {prompt}`
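As an illustration, a single training example rendered with this template can be stored as one JSON line. The field names below (`system_prompt`, `prompt`, `completion`) are hypothetical; follow the exact schema of the format style you choose.

```python
import json

# Hypothetical example record; the field names ("system_prompt", "prompt",
# "completion") are illustrative, not the platform's required schema.
records = [
    {
        "system_prompt": "You are a helpful assistant.",
        "prompt": "Summarize the benefits of LoRA fine-tuning.",
        "completion": "LoRA trains small low-rank adapter matrices instead of "
                      "all model weights, reducing memory use and simplifying deployment.",
    },
]

# Each line in the .jsonl file is one complete training example.
with open("your-file.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```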
The `.jsonl` dataset is split into training and validation sets. Set the training split ratio (e.g., `0.9` for 90%) and the validation split ratio (e.g., `0.1` for 10%).
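If you want to reproduce or inspect such a split locally, a minimal sketch follows; the file names are placeholders and this is not the platform's own splitting logic.

```python
import random

# Shuffle the examples and split 90/10 into train/validation files.
random.seed(42)

with open("your-file.jsonl", encoding="utf-8") as f:
    lines = f.readlines()

random.shuffle(lines)
cut = int(len(lines) * 0.9)  # train split ratio 0.9, validation split ratio 0.1

with open("train.jsonl", "w", encoding="utf-8") as f:
    f.writelines(lines[:cut])
with open("val.jsonl", "w", encoding="utf-8") as f:
    f.writelines(lines[cut:])
```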
Parameter | Description | Example |
---|---|---|
Train Type | Select the tuning algorithm | SFT
Adapter Type | Choose the adapter method | LoRA, Full
Torch DType | Precision setting for training | bfloat16
Note: LoRA is generally recommended for efficiency and ease of deployment.
Parameter | Description | Default Values | Recommended Values | Permissible Range |
---|---|---|---|---|
Num Epochs | Number of full passes through the dataset | 1 | 2-5 | 50 |
Gradient Accumulation Steps | Steps to accumulate gradients before an optimizer step | 1 | 1-2 | <256 |
Train Batch Size | Samples per device for training | 8 | 8 | 16 |
Eval Batch Size | Samples per device for evaluation | 1 | 8 | 16 |
Max Steps | Total training steps; overrides epochs if specified (applicable only when num_epochs = 1) | 100 | 100 | 1000
Learning Rate | Initial learning rate for optimizer | 0.0001 | 1×10⁻⁵ to 2×10⁻⁵ | < 5×10⁻⁵ |
Dataloader Num Workers | Parallel data-loading threads per device | 1 | 4 | <10 |
Gradient Checkpointing | Saves memory by checkpointing activations | Disabled | Disabled | NA |
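For orientation, these parameters map roughly onto Hugging Face `transformers.TrainingArguments`; the sketch below is an illustrative reference point, not the platform's own API.

```python
from transformers import TrainingArguments

# Rough, illustrative mapping of the training parameters above onto
# Hugging Face TrainingArguments.
args = TrainingArguments(
    output_dir="outputs",
    num_train_epochs=3,             # Num Epochs (recommended 2-5)
    gradient_accumulation_steps=2,  # Gradient Accumulation Steps
    per_device_train_batch_size=8,  # Train Batch Size
    per_device_eval_batch_size=8,   # Eval Batch Size
    max_steps=100,                  # Max Steps (overrides epochs when set)
    learning_rate=2e-5,             # Learning Rate
    dataloader_num_workers=4,       # Dataloader Num Workers
    gradient_checkpointing=False,   # Gradient Checkpointing
)
```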
Parameter | Description | Default | Recommended Values | Permissible Range |
---|---|---|---|---|
Save Steps | Interval (in steps) between saving model checkpoints. | 100 | 100 | <= Max Steps |
Save Total Limit | Max number of checkpoints to keep locally. | 2 | 2-5 | <10 |
Eval Steps | Interval (in steps) between evaluation runs. | 100 | 100 | 100 - 200
Logging Steps | Interval (in steps) between logging metrics to the dashboard. | 5 | 5 | < 20 |
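Continuing the same illustrative mapping, the checkpointing and logging intervals correspond to these `TrainingArguments` fields:

```python
from transformers import TrainingArguments

# Illustrative mapping of the checkpointing/logging parameters above;
# a reference sketch, not the platform's actual configuration object.
args = TrainingArguments(
    output_dir="outputs",
    save_steps=100,         # Save Steps
    save_total_limit=2,     # Save Total Limit
    eval_strategy="steps",  # enables step-based evaluation ("evaluation_strategy" in older transformers releases)
    eval_steps=100,         # Eval Steps
    logging_steps=5,        # Logging Steps
)
```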
Parameter | Description | Default | Recommended Value | Permissible Range |
---|---|---|---|---|
Rank (r) | Dimensionality of the low-rank decomposition. | 16 | 16 | 64 |
Alpha | Scaling factor for the adapter output. | 16 | 32 | 64 |
Dropout | Dropout probability for adapter layers. | 0.1 | 0.1 | 1 |
Targets | Which modules to apply adapters to (e.g., all-linear). | all-linear | all-linear | NA |
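These adapter settings correspond closely to a `peft.LoraConfig`; a minimal sketch under that assumption (not necessarily how the platform builds the adapter internally):

```python
from peft import LoraConfig

# Minimal sketch of the LoRA settings above expressed as a peft LoraConfig.
lora_config = LoraConfig(
    r=16,                         # Rank (r)
    lora_alpha=32,                # Alpha (recommended value)
    lora_dropout=0.1,             # Dropout
    target_modules="all-linear",  # Targets
    task_type="CAUSAL_LM",
)
```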
Parameter | Description | Default | Recommended Value | Available Options |
---|---|---|---|---|
Type | Choose the distributed backend | DeepSpeed | DeepSpeed | DeepSpeed, DDP
Strategy | Only available for DeepSpeed | zero3_offload | zero3_offload | zero1, zero2, zero2_offload, zero3, zero3_offload
Choose `DeepSpeed` to enable ZeRO optimizations, or `DDP` for native PyTorch distributed training. Use the `zero3_offload` strategy to maximize memory savings by offloading optimizer states to CPU/GPU. Available GPU types include H100 and L40S.
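As a point of reference, a `zero3_offload`-style setup roughly corresponds to a DeepSpeed configuration like the sketch below; the exact config the platform generates may differ.

```python
# Illustrative DeepSpeed config for ZeRO stage 3 with optimizer/parameter
# offload (roughly what a "zero3_offload" strategy implies).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
}
```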