Required Infrastructure Configuration
Parameter | Value | Notes |
---|---|---|
Cluster | Use the Simplismart Infrastructure, or select the cluster where you want the model to be compiled. If you haven’t created a cluster yet, proceed with the Simplismart Infrastructure option. | Always use the Compilation cluster for optimization tasks |
Region | If you selected Simplismart Infrastructure as your cluster option, set the region to Global. If you are using your own cluster, select a region with sufficient quota for the machine that will compile the model. | Ensures access to globally available compute resources |
Instance Type | Model-dependent | Select a type based on model size (see the Instance Type Selection Guide below). ✅ If you selected Simplismart Infrastructure, always choose the (any) variant for better availability |
Machine Type | Default | If you selected Simplismart Infrastructure, always choose Default |
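Expressed as a config object, the defaults above look roughly like this (a sketch only: the field names and value strings are illustrative, not the platform’s actual API or schema):

```python
# Illustrative only: field names mirror the table above, not Simplismart's
# actual configuration schema.
compilation_config = {
    "cluster": "simplismart-infrastructure",  # or the name of your own cluster
    "region": "global",             # "global" with Simplismart Infrastructure;
                                    # otherwise a region with available quota
    "instance_type": "H100 (any)",  # "(any)" variant for better availability
    "machine_type": "default",
}
```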
What is Tensor Parallelism (TP)?
Tensor Parallelism allows a model’s computation to be split across multiple GPUs or devices, enabling:
- Faster inference for large models
- Support for models too large to fit on a single device

The tensor_parallel_size parameter controls the degree of parallelism:
- 1 = No tensor parallelism (single GPU)
- 2, 4, 8, etc. = Enable TP across that many GPUs
You must have at least as many GPUs as the value set in tensor_parallel_size.
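As a sanity check before launching a job, here is a minimal sketch (assuming PyTorch is installed on the target machine) that verifies the instance exposes enough GPUs for the chosen tensor_parallel_size:

```python
import torch

# Desired tensor-parallel degree; powers of two are the safe choice, since
# most serving stacks require TP to divide the model's attention head count.
tensor_parallel_size = 2

available_gpus = torch.cuda.device_count()
if available_gpus < tensor_parallel_size:
    raise RuntimeError(
        f"tensor_parallel_size={tensor_parallel_size} requires at least "
        f"{tensor_parallel_size} GPUs, but only {available_gpus} are visible."
    )
print(f"OK: sharding across {tensor_parallel_size} of {available_gpus} GPU(s)")
```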
Model Mode Types and Compatibility
Different model versions require specific modes to function correctly. When selecting models from Hugging Face, you’ll typically find two variants:
- Base models (e.g., meta-llama/Meta-Llama-3-8B), designed for completion tasks.
- Instruct models (e.g., meta-llama/Meta-Llama-3-8B-Instruct), optimized for conversational/chat interactions.
Set the pipeline mode to match the model variant (a sketch follows this list):
- Set mode to embedding when compiling _embedding_ models
- Set mode to chat when compiling _conversational/instruct_ models
- Set mode to completion when compiling base/completion models
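A minimal sketch of that pairing. The dictionary layout is hypothetical, not Simplismart’s actual config schema, and the embedding model named is just an example:

```python
# Illustrative mapping of Hugging Face model variants to pipeline modes.
MODEL_MODES = {
    "meta-llama/Meta-Llama-3-8B": "completion",     # base model
    "meta-llama/Meta-Llama-3-8B-Instruct": "chat",  # instruct model
    "BAAI/bge-large-en-v1.5": "embedding",          # example embedding model
}

def pipeline_mode(model_id: str) -> str:
    """Look up the mode to set in the pipeline config for a given model."""
    try:
        return MODEL_MODES[model_id]
    except KeyError:
        raise ValueError(f"No mode registered for {model_id}") from None
```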
Instance Type Selection Guide
Use the table below to guide instance and TP configuration based on your model size (a worked memory sketch follows the table):
Model | Recommended Instance Type | Suggested TP | Notes |
---|---|---|---|
Gemma 2B | L40s (any) | 1 | Lightweight model; fits on a single GPU |
LLaMA 3B | L40s (any) | 1 | Also fits on single GPU with headroom |
Gemma 7B / LLaMA 8B | H100 (any) | 2–4 | Benefits from multi-GPU setup |
LLaMA 70B | H100 (any) | 2–4+ | Requires high TP and multi-GPU infrastructure |
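The arithmetic behind these recommendations is mostly about fitting the weights in VRAM. A rough sketch, counting weights only; real jobs also need room for KV cache and activations, so treat the result as a lower bound:

```python
# Back-of-the-envelope VRAM check behind the table above.
def min_tp_for_weights(params_billion: float, bytes_per_param: float,
                       gpu_mem_gb: float, headroom: float = 0.7) -> int:
    """Smallest power-of-two TP whose per-GPU weight share fits in memory."""
    weights_gb = params_billion * bytes_per_param  # e.g. FP16 = 2 bytes/param
    tp = 1
    while weights_gb / tp > gpu_mem_gb * headroom:
        tp *= 2
    return tp

# LLaMA 70B in FP16 is ~140 GB of weights; on 80 GB H100s, with ~70% of
# memory budgeted for weights, that needs TP >= 4.
print(min_tp_for_weights(70, 2.0, 80))  # -> 4
print(min_tp_for_weights(8, 2.0, 80))   # -> 1
```

For the 7B/8B-class models, the table suggests TP 2–4 on H100s mainly for throughput: the weights alone already fit on a single GPU.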
Common Issues & Fixes
Issue | Cause | Fix |
---|---|---|
🚫 Job stuck in queue or not scheduled (machine not available) | Instance type too specific or unavailable. | ✅ Use (any) variant of instance type |
🚫 Out of memory / crashes | Model too large for single GPU. | ✅ Increase tensor_parallel_size or upgrade instance type |
🚫 TP value ignored or job fails to start | TP set higher than available GPUs. | ✅ Ensure the instance has at least as many GPUs as tensor_parallel_size |
🐢 Slow inference | Underutilized hardware or no parallelism. | ✅ Tune tensor_parallel_size and use multi-GPU instances |
🚫 Unsupported model type. Error message: The given model path is invalid | The model you are trying to compile is not currently supported on the Simplismart Platform. | ✅ Contact support@simplismart.ai; we will check the feasibility and add support for the model. |
🚫 Model compilation fails even though the correct model and other parameters are selected | The selected model mode in the pipeline config may be incorrect. | ✅ Set the pipeline mode to match your use case: embedding for embedding models, chat for conversational/instruct models, completion for base/completion models. The key is ensuring alignment between your model choice and the corresponding pipeline mode configuration. |
🚫 Machine clean-up failed. Error message: Cleanup Failure: Exception occured while cleaning up : Error cleaning up Azure resource group | The Azure resource group could not be cleaned up automatically. | ✅ Contact support@simplismart.ai; we will check the reason for the failure. |
FAQs
- Can I edit the model name later?
- What options are available for model sources?
- Do I need authentication keys for external sources?
- Why am I getting a “The given model path is invalid” error? How do I verify if my model path is valid?
This usually means the model you’re trying to compile is not currently supported on the platform (see Common Issues & Fixes above). Check that the path exactly matches the model’s Hugging Face repository ID, and contact support@simplismart.ai if the error persists.
- How do I link my AWS/GCP/Azure account?
- Can I use multiple cloud accounts?
Yes, the Simplismart platform supports adding and managing multiple cloud accounts.
- Does region selection affect latency or costs?
For model compilation jobs, the selected region typically doesn’t have a major impact on latency or cost. If you’re using your own cloud account, you can run the compilation in any region where you have available resource quotas.
- How do I choose the right accelerator for my model?
The appropriate accelerator depends largely on the size of your model. Larger models require higher-spec machines for optimization and deployment.
- Why is my selected accelerator not available in the region I picked?
Some high-end GPUs like A100s or H100s are not available in all regions across major cloud providers such as AWS, Azure, or GCP. As a result, those accelerator options may not appear in the region’s dropdown list.
- What happens if I run out of quota for GPUs in my cloud account?
If your cloud account lacks sufficient GPU quota, the optimization job may fail to start or could get stuck partway through, leading to a failed optimization process.
- What does _accelerator count_ mean?
Accelerator count refers to the number of GPU instances allocated for a job. For example, if you select H100 as the accelerator and set the count to 2, two H100 machines will be provisioned. This is especially important when the tensor parallelism (TP) value is greater than 1.
- How do I know which machine type to select?
Even within the same instance family (e.g., AWS EC2 G5), all instances typically share the same underlying GPU (for example, the NVIDIA A10G); they differ in the number of vCPUs and the amount of RAM. It’s generally recommended to choose a medium or large instance size to ensure sufficient CPU and memory resources for your ML workload.
- What is quantization and why should I use it?
Quantization stores model weights at a lower numerical precision (e.g., FP8 instead of FP16), which shrinks the memory footprint and can speed up inference, at the cost of a small potential drop in accuracy.
- Which quantization levels are supported?
We support FP16, FP8, and AWQ quantization.
Note that FP8 is not supported on Ampere architecture GPUs like A100 and A10G, as these devices do not natively support FP8 precision.
- Does quantization affect accuracy?
Yes, quantization can result in a slight reduction in model accuracy.
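As a rough illustration of why the quantization level matters for memory, the sketch below computes approximate weight footprints for an 8B-parameter model. It counts weights only; AWQ is modeled here as 4-bit weights, and real footprints add quantization scales, KV cache, and runtime overhead:

```python
# Approximate weight memory for an 8B-parameter model at each supported level.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "AWQ (4-bit)": 0.5}

params = 8e9  # e.g., LLaMA 8B
for level, bytes_per in BYTES_PER_PARAM.items():
    print(f"{level:>12}: ~{params * bytes_per / 1e9:.0f} GB of weights")
# FP16 -> ~16 GB, FP8 -> ~8 GB, AWQ -> ~4 GB
```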