Train CLIP (Contrastive Language-Image Pre-training) models on Simplismart to align visual and textual representations for your specific use cases.

Prerequisites

Before starting, ensure you have:

Supported Model Architectures

Simplismart currently supports the following CLIP model configurations:

| Component | Model | HuggingFace Link |
| --- | --- | --- |
| Vision Encoder | openai/clip-vit-base-patch32 | [clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) |
| Language Encoder | FacebookAI/roberta-base | [roberta-base](https://huggingface.co/FacebookAI/roberta-base) |
CLIP training on Simplismart uses full fine-tuning and runs on a single GPU. Distributed training and LoRA are not currently supported.
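
For reference, the pairing above can be instantiated locally as a CLIP-style dual encoder with Hugging Face Transformers, as in the minimal sketch below. This is only an illustration of the supported architecture combination, not Simplismart's internal training code.

```python
# Illustration only: combine the supported checkpoints into a CLIP-style dual encoder.
# Simplismart handles this pairing internally during training.
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    VisionTextDualEncoderModel,
    VisionTextDualEncoderProcessor,
)

model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "openai/clip-vit-base-patch32",  # vision encoder
    "FacebookAI/roberta-base",       # language encoder
)
processor = VisionTextDualEncoderProcessor(
    AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32"),
    AutoTokenizer.from_pretrained("FacebookAI/roberta-base"),
)
```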

Dataset Preparation

Your dataset must be in JSONL format where each line contains an image path and corresponding captions. Example JSONL Entry:
{
  "image": "images/airport_1.jpg",
  "captions": [
    "Many aircraft are parked next to a long building in an airport.",
    "many planes are parked next to a long building in an airport."
  ]
}
Field Descriptions
  • image: Relative path to the image file within your ZIP archive
  • captions: Array of text descriptions for the image (2-5 captions recommended)

Example JSONL File

{"image": "images/airport_1.jpg", "captions": ["Many aircraft are parked next to a long building in an airport.", "Planes parked at the airport terminal."]}
{"image": "images/beach_scene.jpg", "captions": ["A sandy beach with blue ocean waves.", "People enjoying a sunny day at the seaside."]}
{"image": "images/city_street.jpg", "captions": ["Busy urban street with cars and pedestrians.", "Downtown city traffic during rush hour."]}

Dataset Format

  1. Organize your files in a directory:
    dataset/
    ├── images/
    │   ├── airport_1.jpg
    │   ├── beach_scene.jpg
    │   └── city_street.jpg
    └── metadata.jsonl
    
  2. Create a ZIP archive:
    cd /path/to/dataset
    zip -r dataset.zip .
    
Ensure all image paths in your JSONL file match the relative paths within the ZIP archive.
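
Before uploading, a quick local check such as the sketch below (which assumes the archive is named dataset.zip and the metadata file metadata.jsonl) confirms that every image referenced in the JSONL actually exists inside the archive:

```python
# Sanity check: every "image" path in metadata.jsonl must exist inside dataset.zip.
import json
import zipfile

with zipfile.ZipFile("dataset.zip") as zf:
    archive_files = set(zf.namelist())
    with zf.open("metadata.jsonl") as f:
        for line_no, raw in enumerate(f, start=1):
            if not raw.strip():
                continue  # skip blank lines
            entry = json.loads(raw)
            if entry["image"] not in archive_files:
                print(f"Line {line_no}: missing image {entry['image']}")
```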

Creating a Training Job

To create a new training job, navigate to My Trainings > LLM/VLM Model > Add a Training Job.

Step 1: Configure Basic Settings

Provide the following details:
  1. Experiment Name: Enter a descriptive name for your training experiment
  2. Model Details:
  • Base Model – Select the base model you want to fine-tune. Supported models (e.g., FacebookAI/roberta-base) are available in the dropdown.
  • Source Type – Automatically filled based on the selected model source (e.g., Hugging Face).
  • Model Type – Defines the architecture type for training (CLIP in this case).
  • Vision Encoder – Select the vision encoder used for CLIP-based training (e.g. openai/clip-vit-base-patch32).
When a base model is selected, the remaining parameters are automatically populated with recommended defaults for that model and training type.

Step 2: Dataset Details

You can either create a new dataset or select an existing one.

Create New Dataset
  • Source – Choose the dataset source (e.g., AWS S3, GCP).
  • Dataset Name – Provide a friendly name for your dataset.
  • Dataset Path – Specify the full path to your dataset (e.g., s3://bucket/file.zip).
  • Dataset Description – Optional field for describing your dataset.
  • Secret – For AWS/GCP sources, select the credential secret required to access private buckets.
  • Region – For AWS/GCP sources, choose the region where your bucket is located.
  • Dataset Type – Specify the data format, such as JSONL.
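
For example, if your dataset lives in a private S3 bucket, you might upload the archive like this before entering its s3:// path as the Dataset Path (bucket name, key, and region below are placeholders):

```python
# Hypothetical upload: push dataset.zip to a private S3 bucket so it can be
# referenced as s3://my-training-datasets/clip/dataset.zip in Dataset Path.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # match the Region you select
s3.upload_file("dataset.zip", "my-training-datasets", "clip/dataset.zip")
```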
Select Existing Dataset

You can reuse a previously uploaded dataset instead of creating a new one.
  1. In the Dataset Details section, select Use Existing Dataset.
  2. A dropdown will appear listing all datasets available under your organization.
  3. Choose the dataset you want to attach to this training job.
  4. Once selected, key information such as Dataset Name, Source, Path, and Region will auto-populate based on the saved configuration.
  5. Review the prefilled values to ensure the dataset is still valid and accessible.
  6. After selection, proceed to configure Dataset Configuration parameters.

Step 3: Dataset Configuration

  • Lazy Tokenize – If enabled, tokenization happens during training to reduce load time.
  • Prompt Max Length – Sets the maximum token length per sample (default: 128 for CLIP).
  • System Prompt – Optional static prompt prepended to every sample.
  • Prompt Template – Optional templating format for structured prompt creation.
  • Split Type – Defines how the dataset is split. Currently, only “random” is supported.
  • Train Split Ratio – Specifies how much data to use for training (default: 0.9).
  • Validation Split Ratio – Remaining portion used for validation (default: 0.1).
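
Simplismart performs the split for you; purely for intuition, the sketch below reproduces a comparable random 90/10 split locally with the Hugging Face datasets library:

```python
# Illustration only: a random 90/10 split matching the default train/validation ratios.
from datasets import load_dataset

dataset = load_dataset("json", data_files="metadata.jsonl", split="train")
splits = dataset.train_test_split(test_size=0.1, seed=42)  # 0.9 train / 0.1 validation
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))
```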

Step 4: Infrastructure Configuration

  • Infrastructure Type – Choose where the training job runs: Simplismart-managed infrastructure, your own compute, or a standalone cluster.
  • GPU Type – Select the GPU type (e.g., H100, A100, L40).
  • Node Count – Specify the number of nodes to allocate (default: 1 for CLIP training).
  • GPU Count per Node – Define the number of GPUs per node (default: 1 for CLIP training).

Step 5: Set Training Parameters

Configure your training parameters based on your use case:

| Parameter | Description | Default Value / Example |
| --- | --- | --- |
| Training Type | Auto-selected based on the chosen model. | CLIP |
| Torch Dtype | Precision type used during training. | bfloat16 |
| Num Epochs | Number of epochs to train for. | |
| Train Batch Size | Batch size per device during training. | 8 |
| Eval Batch Size | Batch size per device during evaluation. | 1 |
| Save Steps | Defines how often model checkpoints are saved. | 100 |
| Save Total Limit | Sets how many checkpoints to retain. | 2 |
| Eval Steps | Determines how frequently evaluations are run. | 100 |
| Logging Steps | Frequency at which logs are recorded. | 5 |
| Learning Rate | Sets the learning rate for the optimizer. | 0.00001 |
| Dataloader Num Workers | Number of parallel workers for loading data. | 1 |
| Distributed Configuration Type | Defines the training mode. | Single (for CLIP training) |
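
All of these values are set in the Simplismart UI. For readers who want to reproduce a comparable run outside the platform, the sketch below shows how the same defaults might map onto Hugging Face TrainingArguments; it is an illustration under that assumption, not Simplismart's internal configuration.

```python
# Illustration only: the table's defaults expressed as Hugging Face TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="clip-finetune",      # hypothetical local output directory
    num_train_epochs=3,              # Num Epochs has no fixed default; 3 is a placeholder
    per_device_train_batch_size=8,   # Train Batch Size
    per_device_eval_batch_size=1,    # Eval Batch Size
    save_steps=100,                  # Save Steps
    save_total_limit=2,              # Save Total Limit
    eval_strategy="steps",           # named evaluation_strategy on transformers < 4.41
    eval_steps=100,                  # Eval Steps
    logging_steps=5,                 # Logging Steps
    learning_rate=1e-5,              # Learning Rate (0.00001)
    dataloader_num_workers=1,        # Dataloader Num Workers
    bf16=True,                       # Torch Dtype: bfloat16
)
```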

Step 6: Create and Monitor Training

  1. Review all settings carefully.
  2. Click Create Job to start training.
  3. Monitor training progress in the My Trainings > Your Training Job > Metrics tab.
Metrics are updated in real time as your training progresses. Use these metrics to evaluate whether your model is learning effectively.

Next Steps

Once training completes successfully, you can compile and deploy your CLIP model for inference.