Prerequisites
Before starting, ensure you have:
- A Simplismart account
- A dataset formatted according to the Encoder training requirements
Supported Model Architectures
Simplismart supports training for the following encoder-only transformer architectures:
- BERT (Bidirectional Encoder Representations from Transformers)
- RoBERTa (Robustly Optimized BERT Pretraining Approach)
- DeBERTa (Decoding-enhanced BERT with Disentangled Attention)
- DistilBERT (Distilled version of BERT)
Dataset Preparation
Your dataset must be in JSONL format, where each line contains the input text and its corresponding label for sequence classification. Each JSONL entry uses the following fields:
- `messages` – Array containing the input text for classification. Each message has a `role` and `content`.
- `role` – Indicates the message source. Use `"user"` for encoder model training.
- `content` – The text sequence to be classified (e.g., product review, customer query, email content).
- `label` – Integer representing the target class (e.g., `0` for negative sentiment, `1` for positive sentiment).

Example JSONL Entry:
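The entry below is illustrative; its structure follows the field schema above.

```json
{"messages": [{"role": "user", "content": "The battery life on this phone is fantastic, I use it all day without charging."}], "label": 1}
```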
Example JSONL File
Here's a complete example for sentiment analysis (binary classification with 2 labels):
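Each line below is one independent training example following the schema above; the texts and labels are illustrative.

```jsonl
{"messages": [{"role": "user", "content": "The battery life on this phone is fantastic, I use it all day without charging."}], "label": 1}
{"messages": [{"role": "user", "content": "The delivery was late and the packaging arrived damaged."}], "label": 0}
{"messages": [{"role": "user", "content": "Great customer support, my issue was resolved in minutes."}], "label": 1}
{"messages": [{"role": "user", "content": "The app keeps crashing every time I open it."}], "label": 0}
{"messages": [{"role": "user", "content": "Exactly as described, would definitely buy again."}], "label": 1}
{"messages": [{"role": "user", "content": "Terrible build quality, it broke after two days."}], "label": 0}
```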
Label Guidelines:
- Labels must be integers starting from `0`
- For binary classification: use `0` and `1`
- For multi-class: use `0`, `1`, `2`, etc. (e.g., 5 classes = labels 0-4)
- Ensure all labels in your dataset are represented in the training data
Creating a Training Job
To create a new training job, navigate to My Trainings > LLM/VLM Model > Add a Training Job.
1. Configure Basic Settings

- Experiment Name: Enter a descriptive name for your training experiment
- Model Details:
  - Base Model – Select the base model you want to fine-tune. Supported models (e.g., `FacebookAI/roberta-base`) are available in the dropdown.
  - Source Type – Automatically filled based on the selected model source (e.g., Hugging Face).
  - Model Type – Defines the architecture type for training (here, `Encoder`).
When a base model is selected, the remaining parameters are automatically filled in with recommended defaults for that model and training type.
2. Dataset Details

- Source – Choose the dataset source (e.g., AWS S3, GCP).
- Dataset Name – Provide a friendly name for your dataset.
- Dataset Path – Specify the full path to your dataset (e.g., `s3://bucket/file.jsonl`).
- Dataset Description – Optional field for describing your dataset.
- Secret – For AWS/GCP sources, select the credential secret required to access private buckets. Learn how to configure cloud credentials.
- Region – For AWS/GCP sources, choose the region where your bucket is located.
- Dataset Type – Specify the data format, such as JSONL.
- Alternatively, to reuse a dataset that is already registered, select Use Existing Dataset in the Dataset Details section.
- A dropdown will appear listing all datasets available under your organization.
- Choose the dataset you want to attach to this training job.
- Once selected, key information such as Dataset Name, Source, Path, and Region will auto-populate based on the saved configuration.
- Review the prefilled values to ensure the dataset is still valid and accessible.
- After selection, proceed to configure Dataset Configuration parameters.
3. Dataset Configuration

- Lazy Tokenize – Tokenizes text during training rather than upfront, reducing memory usage and initial load time.
- System Prompt – Optional instruction prepended to each input sequence (e.g., “Classify the sentiment of the following text:”).
- Prompt Template – Template for formatting inputs consistently (supports variables like `{content}`; see the example after this list).
- Split Type – Method for dividing data into train/validation sets. Currently supports `random` splitting.
- Train Split Ratio – Proportion of data used for training (default: `0.9`, or 90%).
- Validation Split Ratio – Proportion reserved for validation to monitor overfitting (default: `0.1`, or 10%).
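As an illustration, assuming the platform simply prepends the system prompt to the rendered `{content}` template, an entry from the example dataset would reach the model roughly as:

```text
Classify the sentiment of the following text:
The battery life on this phone is fantastic, I use it all day without charging.
```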
4. Infrastructure Configuration

- Infrastructure Type – Choose where to run training:
  - Simplismart Cloud – Fully managed infrastructure
  - Bring Your Own Compute – Use your own cloud resources
  - Imported Cluster – Use a pre-configured standalone cluster
- GPU Type – Select GPU hardware based on your performance needs
- Node Count – Number of machines to use
- GPU Count per Node – GPUs per machine
5. Set Training Parameters


Basic Training Configuration
| Parameter | Description | Default Value |
|---|---|---|
| Training Type | Training methodology. Auto-selected as SFT (Supervised Fine-Tuning) for encoder models. | SFT |
| Torch Dtype | Numerical precision for model weights and activations. Options: `bfloat16` or `float32`. | bfloat16 |
| Adapter Type | Parameter-efficient fine-tuning method. Options: LoRA or Full (full fine-tuning). | LoRA |
Tuner Configuration
| Parameter | Description | Default Value |
|---|---|---|
| Tuner Backend | Framework for parameter-efficient fine-tuning. PEFT (Parameter-Efficient Fine-Tuning) is recommended. | PEFT |
| Task Type | Defines the model’s objective. For encoder training, use Sequence Classification. | Sequence Classification |
| Number of Labels | Total number of classes in your dataset (e.g., 2 for binary classification, 5 for 5-class). | Required |
Hyperparameters
| Parameter | Description | Default Value |
|---|---|---|
| Num Epochs | Number of complete passes through the training dataset. | 1 |
| Train Batch Size | Number of samples processed together per GPU during training. | 8 |
| Eval Batch Size | Batch size during validation. | 8 |
| Save Steps | Checkpoint frequency. Model is saved every N training steps for recovery and evaluation. | 100 |
| Save Total Limit | Maximum checkpoints to keep. Older checkpoints are deleted to save storage. | 2 |
| Eval Steps | Validation frequency. Model performance is evaluated on validation set every N steps. | 100 |
| Logging Steps | How often metrics (loss, accuracy) are recorded to tracking systems. | 5 |
| Learning Rate | Initial learning rate for the optimizer. | 0.00001 |
| Dataloader Num Workers | Parallel data-loading worker processes per device. | 1 |
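As a rough worked example: with 10,000 training examples, a train batch size of 8, and a single GPU, one epoch is 10,000 / 8 = 1,250 training steps. With Save Steps and Eval Steps both at 100, the model is checkpointed and evaluated about 12 times per epoch, and a Save Total Limit of 2 keeps only the two most recent checkpoints.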
Adapter Configuration
Configure fine-tuning parameters based on your selected Adapter Type (LoRA or Full). Different parameters apply depending on your choice. Learn more about adapter configuration. A reference sketch of these values in PEFT follows the table.

| Parameter | Description | Default Value | Applies To |
|---|---|---|---|
| Rank (r) | Adapter rank determines capacity. Higher rank = more expressive but slower. 16-64 works for most tasks. | 16 | LoRA only |
| Alpha | Scaling factor for adapter updates. Typically set equal to rank. Higher alpha = stronger influence. | 16 | LoRA only |
| Dropout | Regularization to prevent overfitting. Randomly drops adapter weights during training. | 0.1 | LoRA & Full |
| Targets | Which model layers to fine-tune. all-linear targets all linear/attention layers for maximum adaptation. | all-linear | LoRA & Full |
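For reference, the values above map onto a standalone PEFT `LoraConfig` roughly as in the sketch below. Simplismart applies the equivalent settings for you, so this is illustrative only and not the platform's internal configuration.

```python
from peft import LoraConfig, TaskType

# Illustrative mapping of the adapter parameters above onto PEFT's LoraConfig.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,   # Task Type: Sequence Classification
    r=16,                         # Rank (r)
    lora_alpha=16,                # Alpha, typically set equal to the rank
    lora_dropout=0.1,             # Dropout
    target_modules="all-linear",  # Targets: all linear/attention layers
)
```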
Distributed Configuration
Configures multi-GPU or multi-node execution for large-scale training. An illustrative DeepSpeed configuration follows the table.

| Parameter | Description | Default Value |
|---|---|---|
| Type | Distributed training framework. DeepSpeed enables memory-efficient training across GPUs. | DeepSpeed |
| Strategy | Memory optimization strategy. zero3_offload splits model states across GPUs and CPU for large models. | zero3_offload |
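For reference, the `zero3_offload` strategy corresponds roughly to a DeepSpeed ZeRO Stage 3 configuration with parameter and optimizer offloading to CPU, along the lines of the sketch below. Simplismart generates the actual configuration for you; this is illustrative only.

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "cpu", "pin_memory": true },
    "offload_optimizer": { "device": "cpu", "pin_memory": true }
  }
}
```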
6. Create and Monitor Training
- Review all settings carefully
- Click `Create Job` to start training
- Monitor training progress in the My Trainings > Your Training Job > Metrics tab
Deployment and Inference
Once training completes successfully, you can compile and deploy your Encoder model for inference. Once deployed, you can run inference using the payload structure shown below. Refer to the Hugging Face page of the respective model for more information about the `<MASK>` token.
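The exact payload schema is defined by your deployment, so the sketch below is hypothetical (the `inputs` field name is an assumption, not the confirmed API); consult your deployment's API reference for the actual structure. The mask token itself is model-specific (e.g., `<mask>` for RoBERTa, `[MASK]` for BERT), which is why the model's Hugging Face page is the reference.

```jsonc
// Hypothetical payload shape; check your deployment's API reference for the real schema
{
  "inputs": "The battery life on this phone is fantastic, I use it all day without charging."
}
```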