
Required Infrastructure Configuration

| Parameter | Value | Notes |
| --- | --- | --- |
| Cluster | Use the Simplismart Infrastructure, or select your own cluster where you want the model to be compiled. If you haven’t created a cluster yet, proceed with the Simplismart Infrastructure option. | Always use the Compilation cluster for optimization tasks. |
| Region | If you selected Simplismart Infrastructure as your cluster option, set the region to Global. If you are using your own cluster, select a region where you have sufficient quota for the machine that will compile the model. | Ensures access to globally available compute resources. |
| Instance Type | Model-dependent | Select a type based on model size (see below). ✅ Always choose the (any) variant for better availability (when using Simplismart Infrastructure). |
| Machine Type | Default | Always choose Default (when using Simplismart Infrastructure). |

What is Tensor Parallelism (TP)?

Tensor Parallelism allows a model’s computation to be split across multiple GPUs or devices, enabling:
  • Faster inference for large models
  • Support for models too large to fit on a single device
tensor_parallel_size controls how many devices the computation is split across:
  • 1 = No tensor parallelism (single GPU)
  • 2, 4, 8, etc. = Enable TP across multiple GPUs
You must have at least as many GPUs as the value set in tensor_parallel_size.
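As a quick sanity check, here is a minimal sketch (assuming PyTorch is available; only the parameter name tensor_parallel_size comes from this guide, the rest is illustrative) that verifies the machine exposes enough GPUs for the chosen value:

```python
# Minimal sketch (assumes PyTorch is installed): check that the machine
# exposes at least as many GPUs as the tensor_parallel_size you plan to use.
import torch

tensor_parallel_size = 2  # example value; 1 disables tensor parallelism
available_gpus = torch.cuda.device_count()

if available_gpus < tensor_parallel_size:
    raise RuntimeError(
        f"tensor_parallel_size={tensor_parallel_size} needs at least "
        f"{tensor_parallel_size} GPUs, but only {available_gpus} are visible"
    )
print(f"OK: computation can be split across {tensor_parallel_size} GPUs")
```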

Model Mode Types and Compatibility

Different model versions require specific modes to function correctly. When selecting models from Hugging Face, you’ll typically find two variants:
  • Base models (e.g., meta-llama/Meta-Llama-3-8B) - designed for completion tasks.
  • Instruct models (e.g., meta-llama/Meta-Llama-3-8B-Instruct) - optimized for conversational/chat interactions.
Configuration Rule: The pipeline mode must match the model type, otherwise compilation will fail. For instance, using a base completion model like meta-llama/Meta-Llama-3-8B with chat mode will cause errors; you must set the mode to completion instead.
Pipeline Mode Settings: Configure the mode in your pipeline based on your intended use case:
  • Set mode to embedding when compiling embedding models
  • Set mode to chat when compiling conversational/instruct models
  • Set mode to completion when compiling base/completion models
The key is ensuring alignment between your model choice and the corresponding pipeline mode configuration.
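As an illustration only (the exact Simplismart pipeline config schema is not reproduced here), the following sketch shows the intended pairing of model variants with pipeline modes; the embedding model name is a placeholder:

```python
# Hypothetical illustration: pairing model variants with the matching pipeline
# mode. The real pipeline config format may differ; only the mode names
# (embedding, chat, completion) come from this guide.
MODEL_TO_MODE = {
    "meta-llama/Meta-Llama-3-8B": "completion",        # base model
    "meta-llama/Meta-Llama-3-8B-Instruct": "chat",     # conversational/instruct model
    "your-org/your-embedding-model": "embedding",      # placeholder embedding model
}

def pipeline_mode(model_id: str) -> str:
    """Return the pipeline mode that matches the chosen model."""
    return MODEL_TO_MODE[model_id]

assert pipeline_mode("meta-llama/Meta-Llama-3-8B") == "completion"
```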

Instance Type Selection Guide

Use the table below to guide instance and TP configuration based on your model size:
| Model | Recommended Instance Type | Suggested TP | Notes |
| --- | --- | --- | --- |
| Gemma 2B | L40S (any) | 1 | Lightweight model; fits on a single GPU |
| LLaMA 3B | L40S (any) | 1 | Also fits on a single GPU with headroom |
| Gemma 7B / LLaMA 8B | H100 (any) | 2–4 | Benefits from a multi-GPU setup |
| LLaMA 70B | H100 (any) | 2–4+ | Requires high TP and multi-GPU infrastructure |
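The helper below is a hypothetical sketch, not part of the Simplismart API; it simply mirrors the table above by mapping an approximate parameter count to an instance type and TP value:

```python
# Hypothetical helper reflecting the guidance table above; adjust to your
# actual models and quotas. Not an official Simplismart function.
def suggest_instance(param_count_billion: float) -> dict:
    if param_count_billion <= 3:
        return {"instance_type": "L40S (any)", "tensor_parallel_size": 1}
    if param_count_billion <= 8:
        return {"instance_type": "H100 (any)", "tensor_parallel_size": 2}
    # 70B-class models need high TP across multiple H100s
    return {"instance_type": "H100 (any)", "tensor_parallel_size": 4}

print(suggest_instance(8))  # {'instance_type': 'H100 (any)', 'tensor_parallel_size': 2}
```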

Common Issues & Fixes

| Issue | Cause | Fix |
| --- | --- | --- |
| 🚫 Job stuck in queue or not scheduled (machine not available) | Instance type too specific or unavailable | ✅ Use the (any) variant of the instance type |
| 🚫 Out of memory / crashes | Model too large for a single GPU | ✅ Increase tensor_parallel_size or upgrade the instance type |
| 🚫 TP value ignored or job fails to start | TP set higher than the number of available GPUs | ✅ Ensure the instance has at least as many GPUs as tensor_parallel_size |
| 🐢 Slow inference | Underutilised hardware or no parallelism | ✅ Tune tensor_parallel_size and use multi-GPU instances |
| 🚫 Unsupported model type (error message: The given model path is invalid) | The model you are trying to compile is not currently supported on the Simplismart platform | ✅ Contact support@simplismart.ai; we will check feasibility and add support for the model |
| 🚫 Model compilation fails even though the correct model and other parameters are selected | The mode selected in the pipeline config may be incorrect | ✅ Set the pipeline mode to match your use case: embedding for embedding models, chat for conversational/instruct models, completion for base/completion models |
| 🚫 Machine clean-up failed (error message: Cleanup Failure: Exception occured while cleaning up : Error cleaning up Azure resource group) | See error message | ✅ Contact support@simplismart.ai; we will check the reason for the failure |

FAQs

  1. Can I edit the model name later?
No, renaming a model after it’s been added and compiled is not supported. The model name must be set during the initial setup.
  2. What options are available for model sources?
You can choose HuggingFace, where the base model is downloaded directly, or use AWS S3, GCS, or DockerHub by providing the appropriate path (S3 URL, GCS URL, or DockerHub registry link) along with the required credentials so we can retrieve your custom model.
  3. Do I need authentication keys for external sources?
Yes, for S3, GCS, and DockerHub, you must add your authentication keys on the Secrets page in the Simplismart platform and use those credentials during the compilation process.
  4. Why am I getting a “The given model path is invalid” error? How do I verify if my model path is valid?
While the platform supports most LLM models, this error can occur if a particular model type isn’t supported yet. If you encounter this issue, please contact us at support@simplismart.ai, and we’ll work on enabling support for your model.
  5. How do I link my AWS/GCP/Azure account?
You will have to add your cloud account details in the Integrations section. Refer to this doc on BYOC.
  6. Can I use multiple cloud accounts?
Yes, the Simplismart platform supports adding and managing multiple cloud accounts.

  7. Does region selection affect latency or costs?
    For model compilation jobs, the selected region typically doesn’t have a major impact on latency or cost. If you’re using your own cloud account, you can run the compilation in any region where you have available resource quotas.

  8. How do I choose the right accelerator for my model?

    The appropriate accelerator depends largely on the size of your model. Larger models require higher-spec machines for optimization and deployment.

  9. Why is my selected accelerator not available in the region I picked?

    Some high-end GPUs like A100s or H100s are not available in all regions across major cloud providers such as AWS, Azure, or GCP. As a result, those accelerator options may not appear in the region’s dropdown list.

  10. What happens if I run out of quota for GPUs in my cloud account?

    If your cloud account lacks sufficient GPU quota, the optimization job may fail to start or could get stuck partway through, leading to a failed optimization process.

  11. What does accelerator count mean?

    Accelerator count refers to the number of GPU instances allocated for a job. For example, if you select H100 as the accelerator and set the count to 2, two H100 machines will be provisioned. This is especially important when the tensor parallelism (TP) value is greater than 1.

  12. How do I know which machine type to select?

    Even within the same instance family (e.g., AWS EC2 G5), all instances typically share the same underlying GPU (for example, the NVIDIA A10G), but they differ in the number of vCPUs and the amount of RAM. It’s generally recommended to choose a medium or large instance size to ensure sufficient CPU and memory resources for your ML workload.

  13. What is quantization and why should I use it?
Quantization is the process of reducing the precision of the numbers used to represent a language model’s parameters (e.g., from 32-bit floating point to 8-bit integers). This makes the model smaller and faster with minimal loss in accuracy, which is helpful for running large language models (LLMs) efficiently; see the sketch after this FAQ list for a minimal illustration.
  14. Which quantization levels are supported?

    We support FP16, FP8, and AWQ quantization.

    Note that FP8 is not supported on Ampere architecture GPUs like A100 and A10G, as these devices do not natively support FP8 precision.

  15. Does quantization affect accuracy?

    Yes, quantization can result in a slight reduction in model accuracy.
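For intuition, here is a minimal, hypothetical sketch of symmetric int8 quantization in NumPy. It only illustrates the idea of trading precision for size; it is not the scheme the platform uses (which supports FP16, FP8, and AWQ as noted above):

```python
# Minimal sketch (assumes NumPy): naive symmetric int8 quantization of a
# weight matrix. Illustration only; not the platform's quantization scheme.
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)    # original fp32 weights

scale = np.abs(weights).max() / 127.0                  # map the largest weight to 127
quantized = np.round(weights / scale).astype(np.int8)  # 4x smaller than fp32
dequantized = quantized.astype(np.float32) * scale     # approximate reconstruction

print("max reconstruction error:", np.abs(weights - dequantized).max())
```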
