Train encoder models on Simplismart for sequence classification tasks, enabling efficient feature extraction and accurate prediction from textual data. Encoder models are ideal for tasks like sentiment analysis, spam detection, intent classification, and other text categorization problems.

Prerequisites

Before starting, ensure you have:

Supported Model Architectures

Simplismart supports training for the following encoder-only transformer architectures:
  • BERT (Bidirectional Encoder Representations from Transformers)
  • RoBERTa (Robustly Optimized BERT Pretraining Approach)
  • DeBERTa (Decoding-enhanced BERT with Disentangled Attention)
  • DistilBERT (Distilled version of BERT)

Dataset Preparation

Your dataset must be in JSONL format, where each line contains the input text and its corresponding label for sequence classification. The example entries below are pretty-printed for readability; in the actual file, each entry must occupy a single line, as in the example file further down. Example JSONL Entries:
{
  "messages": [
    {
      "role": "user",
      "content": "The weather is really nice today"
    }
  ],
  "label": 1
}
{
  "messages": [
    {
      "role": "user",
      "content": "Today is really unlucky"
    }
  ],
  "label": 0
}
Field Descriptions:
  • messages – Array containing the input text for classification. Each message has a role and content.
  • role – Indicates the message source. Use "user" for encoder model training.
  • content – The text sequence to be classified (e.g., product review, customer query, email content).
  • label – Integer representing the target class (e.g., 0 for negative sentiment, 1 for positive sentiment).

Example JSONL File

Here’s a complete example for sentiment analysis (binary classification with 2 labels):
{"messages": [{"role": "user", "content": "The weather is really nice today"}], "label": 1}
{"messages": [{"role": "user", "content": "Today is really unlucky"}], "label": 0}
{"messages": [{"role": "user", "content": "So happy"}], "label": 1}
{"messages": [{"role": "user", "content": "This product is terrible"}], "label": 0}
{"messages": [{"role": "user", "content": "Excellent service and quality"}], "label": 1}
Label Guidelines:
  • Labels must be integers starting from 0
  • For binary classification: use 0 and 1
  • For multi-class: use 0, 1, 2, etc. (e.g., 5 classes = labels 0-4)
  • Ensure every class is represented in the training data (a quick validation sketch follows below)
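If you want to sanity-check a file before uploading it, a short script along the lines of the sketch below can verify the structure described above. The script and its check_dataset helper are illustrative only and are not part of Simplismart's tooling.
import json

def check_dataset(path):
    """Validate a sequence-classification JSONL file against the format described above."""
    labels = set()
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)
            messages = record["messages"]
            assert isinstance(messages, list) and messages, f"line {line_no}: 'messages' must be a non-empty array"
            for message in messages:
                assert message["role"] == "user", f"line {line_no}: role must be 'user'"
                assert isinstance(message["content"], str), f"line {line_no}: content must be a string"
            assert isinstance(record["label"], int), f"line {line_no}: label must be an integer"
            labels.add(record["label"])
    # Labels should be contiguous integers starting from 0 (e.g., 0-4 for 5 classes).
    assert labels == set(range(len(labels))), f"labels are not contiguous from 0: {sorted(labels)}"
    print(f"OK: {len(labels)} classes found: {sorted(labels)}")

check_dataset("train.jsonl")  # hypothetical local copy of your dataset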

Creating a Training Job

To create a new training job, navigate to My Trainings > LLM/VLM Model > Add a Training Job and follow the steps below.

Step 1: Configure Basic Settings

Provide the following details:
  1. Experiment Name: Enter a descriptive name for your training experiment
  2. Model Details:
    • Base Model – Select the base model you want to fine-tune. Supported models (e.g., FacebookAI/roberta-base) are available in the dropdown.
    • Source Type – Automatically filled based on the selected model source (e.g., Hugging Face).
    • Model Type – Defines the architecture type for training (select Encoder here).
Once a base model is selected, the remaining parameters are automatically populated with recommended defaults for that model and training type.

Step 2: Dataset Details

You can either create a new dataset or select an existing one.

Create New Dataset
  • Source – Choose the dataset source (e.g., AWS S3, GCP).
  • Dataset Name – Provide a friendly name for your dataset.
  • Dataset Path – Specify the full path to your dataset (e.g., s3://bucket/file.jsonl).
  • Dataset Description – Optional field for describing your dataset.
  • Secret – For AWS/GCP sources, select the credential secret required to access private buckets. Learn how to configure cloud credentials.
  • Region – For AWS/GCP sources, choose the region where your bucket is located.
  • Dataset Type – Specify the data format, such as JSONL.

Select Existing Dataset

You can reuse a previously uploaded dataset instead of creating a new one.
  1. In the Dataset Details section, select Use Existing Dataset.
  2. A dropdown will appear listing all datasets available under your organization.
  3. Choose the dataset you want to attach to this training job.
  4. Once selected, key information such as Dataset Name, Source, Path, and Region will auto-populate based on the saved configuration.
  5. Review the prefilled values to ensure the dataset is still valid and accessible.
  6. After selection, proceed to configure Dataset Configuration parameters.

Step 3: Dataset Configuration

Configure how your dataset will be processed and split for training:
  • Lazy Tokenize – Tokenizes text during training rather than upfront, reducing memory usage and initial load time.
  • System Prompt – Optional instruction prepended to each input sequence (e.g., “Classify the sentiment of the following text:”).
  • Prompt Template – Template for formatting inputs consistently (supports variables like {content}); see the sketch after this list.
  • Split Type – Method for dividing data into train/validation sets. Currently supports random splitting.
  • Train Split Ratio – Proportion of data used for training (default: 0.9 or 90%).
  • Validation Split Ratio – Proportion reserved for validation to monitor overfitting (default: 0.1 or 10%).
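Conceptually, the system prompt, prompt template, and split ratios combine roughly as in the sketch below. This is an illustrative approximation only; the format_example and random_split helpers are hypothetical and not part of Simplismart's preprocessing pipeline.
import random

system_prompt = "Classify the sentiment of the following text:"  # optional instruction
prompt_template = "{content}"  # placeholders such as {content} are filled per example

def format_example(record):
    # Fill the template with the user message, then prepend the system prompt if set.
    text = prompt_template.format(content=record["messages"][0]["content"])
    return f"{system_prompt}\n{text}" if system_prompt else text

def random_split(records, train_ratio=0.9, seed=42):
    # Random split: 90% train / 10% validation by default.
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]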

Step 4: Infrastructure Configuration

Select the compute resources for your training job:
  • Infrastructure Type – Choose where to run training:
    • Simplismart Cloud – Fully managed infrastructure
    • Bring Your Own Compute – Use your own cloud resources
    • Imported Cluster – Use a pre-configured standalone cluster
  • GPU Type – Select GPU hardware based on your performance needs
  • Node Count – Number of machines to use
  • GPU Count per Node – GPUs per machine

Step 5: Set Training Parameters

Configure your training parameters based on your use case. The configuration is organized into the following sections:

Basic Training Configuration

  • Training Type – Training methodology. Auto-selected as SFT (Supervised Fine-Tuning) for encoder models. Default: SFT
  • Torch Dtype – Numerical precision for model weights and activations; bfloat16 or float32. Default: bfloat16
  • Adapter Type – Parameter-efficient fine-tuning method; LoRA or Full (full fine-tuning). Default: LoRA

Tuner Configuration

  • Tuner Backend – Framework for parameter-efficient fine-tuning. PEFT (Parameter-Efficient Fine-Tuning) is recommended. Default: PEFT
  • Task Type – Defines the model's objective. For encoder training, use Sequence Classification. Default: Sequence Classification
  • Number of Labels – Total number of classes in your dataset (e.g., 2 for binary classification, 5 for 5-class). Required

Hyperparameters

  • Num Epochs – Number of complete passes through the training dataset. Default: 1
  • Train Batch Size – Number of samples processed together per GPU during training. Default: 8
  • Eval Batch Size – Batch size during validation. Default: 8
  • Save Steps – Checkpoint frequency. The model is saved every N training steps for recovery and evaluation. Default: 100
  • Save Total Limit – Maximum checkpoints to keep. Older checkpoints are deleted to save storage. Default: 2
  • Eval Steps – Validation frequency. Model performance is evaluated on the validation set every N steps. Default: 100
  • Logging Steps – How often metrics (loss, accuracy) are recorded to tracking systems. Default: 5
  • Learning Rate – Initial learning rate for the optimizer. Default: 0.00001
  • Dataloader Num Workers – Parallel data-loading threads per device. Default: 1
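For orientation, these defaults map roughly onto the Hugging Face transformers.TrainingArguments shown below. This is an illustrative approximation of the settings, not the platform's actual training code; the output_dir value is a placeholder.
from transformers import TrainingArguments

# Approximate equivalent of the default hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="encoder-sft-checkpoints",  # placeholder output directory
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    save_steps=100,
    save_total_limit=2,
    eval_strategy="steps",                 # evaluate every eval_steps (recent transformers versions)
    eval_steps=100,
    logging_steps=5,
    learning_rate=1e-5,
    dataloader_num_workers=1,
)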

Adapter Configuration

Configure fine-tuning parameters based on your selected Adapter Type (LoRA or Full). Different parameters apply depending on your choice. Learn more about adapter configuration.
  • Rank (r) – Adapter rank determines capacity. Higher rank = more expressive but slower; 16-64 works for most tasks. Default: 16 (LoRA only)
  • Alpha – Scaling factor for adapter updates. Typically set equal to rank; higher alpha = stronger influence. Default: 16 (LoRA only)
  • Dropout – Regularization to prevent overfitting. Randomly drops adapter weights during training. Default: 0.1 (LoRA & Full)
  • Targets – Which model layers to fine-tune. all-linear targets all linear/attention layers for maximum adaptation. Default: all-linear (LoRA & Full)
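As a point of reference, the LoRA defaults above correspond roughly to the PEFT configuration sketched below for a sequence-classification head. This is not Simplismart's internal implementation; the base model name and num_labels are example values.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "FacebookAI/roberta-base",  # example base model from the dropdown
    num_labels=2,               # must match the Number of Labels setting
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,   # sequence classification objective
    r=16,                         # Rank (r)
    lora_alpha=16,                # Alpha
    lora_dropout=0.1,             # Dropout
    target_modules="all-linear",  # Targets
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()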

Distributed Configuration

Configures multi-GPU or multi-node setups for large-scale training. A minimal illustration of this configuration follows the list.
  • Type – Distributed training framework. DeepSpeed enables memory-efficient training across GPUs. Default: DeepSpeed
  • Strategy – Memory optimization strategy. zero3_offload splits model states across GPUs and CPU for large models. Default: zero3_offload
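For context, the zero3_offload strategy corresponds to a DeepSpeed ZeRO stage-3 configuration that offloads parameters and optimizer states to CPU memory, along the lines of the minimal dictionary below. This is an illustrative sketch; the platform generates its own configuration.
# Minimal DeepSpeed-style config illustrating ZeRO stage 3 with CPU offload.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},
        "offload_optimizer": {"device": "cpu"},
    },
}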

Step 6: Create and Monitor Training

  1. Review all settings carefully.
  2. Click Create Job to start training.
  3. Monitor training progress in the My Trainings > Your Training Job > Metrics tab.

Deployment and Inference

Once training completes successfully, you can compile and deploy your encoder model for inference. Once deployed, you can run inference against it using the payload structure shown below. Refer to the Hugging Face page of the respective model for details about its mask token.
import requests

# Endpoint URL of your deployed encoder model.
url = "YOUR_MODEL_ENDPOINT"

# Input payload; the mask token depends on the base model (see the note below).
data = {
    "text": "The capital of France is [MASK]."
}

headers = {
    "Authorization": "Bearer <api-key>"  # your API key
}

response = requests.post(url, json=data, headers=headers)
print(response.json())
For RoBERTa-based models, replace [MASK] with <mask>.