Deployment Naming
Deployment Name Requirements
- Uniqueness: Each deployment name must be unique within the cluster
- Naming Convention: Use descriptive names like llama-8b-chat-v1 or gemma-7b-prod
- Allowed Characters: Use alphanumeric characters and hyphens only; the deployment name must not start with a number.
If you use a duplicate name, you’ll receive an error message and cannot proceed with deployment.
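These rules can be checked client-side before you submit. Below is a minimal sketch in Python, assuming the constraints listed above (letters, digits, and hyphens, with a leading letter — a leading hyphen is assumed to be disallowed as well); the platform's actual server-side validation may enforce additional limits.

```python
import re

# Hedged sketch of the naming rules above: alphanumeric characters and
# hyphens only, not starting with a digit (assumed here to start with a
# letter). The platform's real validation may be stricter, e.g. on length.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9-]*$")

def is_valid_deployment_name(name: str) -> bool:
    """Return True if the name satisfies the documented constraints."""
    return bool(NAME_PATTERN.match(name))

assert is_valid_deployment_name("llama-8b-chat-v1")
assert not is_valid_deployment_name("8b-llama")    # starts with a number
assert not is_valid_deployment_name("prod.model")  # special character
```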
Best Practices
✅ Good Examples:
- llama-3b-dev-v2
- gemma-7b-production
- mistral-8b-api-staging
❌ Avoid:
- model1, model2 (not descriptive)
- test (too generic)
- prod.model (special characters)
Cluster Selection
- Cluster: Always select Simplismart Cloud
Accelerator Type Selection
Accelerator Type Decision Matrix
| Model Type | Model Size | Recommended Accelerator Type | Auto-Selected Instance |
|---|---|---|---|
| Small Models | Llama 3B, Gemma 2B | T4 | Based on model TP value |
| Medium Models | Llama 8B, Gemma 7B | L40S | Single GPU or based on model TP value |
| Large Models | Llama 70B, Qwen 32B | H100 | Multi-GPU configuration |
Automatic GPU Instance Selection
The system automatically determines GPU instance count based on your model’s Tensor Parallel (TP) configuration:
- Model TP = 1 → 1x GPU instance
- Model TP = 2 → 2x GPU instance
- Model TP = 4 → 4x GPU instance
- Model TP = 8 → 8x GPU instance
Example: If you select a model with TP=4 and choose an H100 node group, the system deploys 4xH100 automatically.
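As a rough illustration, the sketch below encodes this mapping; the helper and its return format are illustrative only, not part of the platform API.

```python
# Illustrative only: the model's TP value directly determines the GPU
# instance configuration, as described above. Neither this function nor
# its return format is part of the Simplismart API.
def auto_selected_instance(tp: int, accelerator: str) -> str:
    if tp not in (1, 2, 4, 8):
        raise ValueError(f"Unsupported TP value: {tp}")
    return f"{tp}x{accelerator}"

print(auto_selected_instance(4, "H100"))  # -> "4xH100"
```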
Scaling Configuration
Scaling Parameters
Pod Scaling Settings
- Min Pods: Minimum number of replicas to keep running at all times.
- Max Pods: Maximum number of replicas the deployment can scale out to.
| Metric | Use Case | Recommended Threshold | Notes |
|---|---|---|---|
| GPU Utilization % | GPU-intensive inference | 80% | Best for ML model scaling |
| Memory Usage | Memory-bound applications | 80% | Prevents OOM errors |
| GPU Memory Usage | Large model deployments | 85% | Critical for model performance |
| Latency | Response-time-sensitive apps | 500 ms | User-experience focused |
| Throughput | High-volume applications | 100 req/sec | Capacity-based scaling |
Scaling Strategy Examples
```
# Aggressive Scaling (Variable Traffic)
Min Pods: 1
Max Pods: 20
Metric: GPU Utilization
Threshold: 60%
```

```
# Latency-Sensitive Scaling
Min Pods: 3
Max Pods: 15
Metric: Latency
Threshold: 300ms
```
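For intuition on how these thresholds drive scaling, here is a hedged sketch assuming Kubernetes-HPA-style behavior (desired replicas proportional to the ratio of the observed metric to the threshold, clamped to the min/max pod range); Simplismart's actual controller may behave differently.

```python
import math

# Hedged sketch, assuming Kubernetes-HPA-style scaling behavior; the
# platform's actual autoscaling algorithm may differ.
def desired_replicas(current_pods: int, current_metric: float,
                     threshold: float, min_pods: int, max_pods: int) -> int:
    """Scale the pod count by the ratio of metric to threshold."""
    desired = math.ceil(current_pods * current_metric / threshold)
    return max(min_pods, min(max_pods, desired))

# With the "Aggressive Scaling" example above: 4 pods at 90% GPU
# utilization against a 60% threshold -> scale out to 6 pods.
print(desired_replicas(current_pods=4, current_metric=90,
                       threshold=60, min_pods=1, max_pods=20))  # 6
```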
Rapid Autoscaling (Simplismart Cloud)
Feature Overview
- Activation: Toggle the switch at the bottom of the deployment screen.
- Benefit: Pods spin up in seconds to minutes instead of standard deployment times.
How It Works
- Pre-cached Images: Model images are cached in the pre-puller system.
- Instant Scaling: No image download time during scale-up events.
- Resource Optimization: Faster response to traffic spikes.
When to Enable
✅ Enable for:
- Production workloads with variable traffic
- Applications requiring rapid scaling response
❌ Skip for:
- Stable workloads with predictable traffic
- Development/testing environments
Deployment Execution
Deployment Process
1. Click Deploy Button: Initiates deployment process
2. Monitor Progress: Watch deployment status in real-time
3. Health Check Monitoring: Check the health check status bar on the right side
4. Wait for “Healthy” Status: Indicates model is fully loaded and ready
Health Check States
| Status | Meaning | Action Required |
|---|---|---|
| 🟡 Pending | Deployment in progress | Wait for completion |
| 🟢 Healthy | Ready for inference | Proceed to testing |
| 🔴 Unhealthy | Deployment failed | Check logs |
API Integration & Testing
Getting API Credentials
- Navigate to the API tab on the deployment page
- Copy the provided Python script or cURL command
- Replace placeholder values with your actual parameters
Sample Integration Code
Python Example
```python
import requests

# Endpoint and API key are copied from the API tab of your deployment.
endpoint = "https://your-deployment-endpoint.com/v1/chat/completions"
headers = {
    "Authorization": "Bearer <your-api-key>",
    "Content-Type": "application/json",
}
payload = {
    "model": "your-deployed-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
}

response = requests.post(endpoint, headers=headers, json=payload)
print(response.json())
```
cURL Example
```bash
curl -X POST "https://your-deployment-endpoint.com/v1/chat/completions" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-deployed-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
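The samples above omit error handling. The following hedged sketch wraps the same request with a timeout and simple retries, using only the standard requests library; the endpoint, key, and model name remain placeholders from the API tab.

```python
import time

import requests

# Placeholders: replace with the values from your deployment's API tab.
ENDPOINT = "https://your-deployment-endpoint.com/v1/chat/completions"
HEADERS = {
    "Authorization": "Bearer <your-api-key>",
    "Content-Type": "application/json",
}

def chat(payload: dict, retries: int = 3, timeout: float = 30.0) -> dict:
    """POST the payload, retrying on network errors and HTTP failures."""
    last_error = None
    for attempt in range(retries):
        try:
            resp = requests.post(ENDPOINT, headers=HEADERS,
                                 json=payload, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.exceptions.RequestException as exc:
            last_error = exc
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("request failed after retries") from last_error

print(chat({"model": "your-deployed-model",
            "messages": [{"role": "user", "content": "Hello!"}],
            "max_tokens": 100}))
```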
Monitoring & Metrics
Monitoring Dashboard
- Real-time Data: Live metrics and performance indicators
- Historical Data: Trends and usage patterns over time
Key Metrics Available
Infrastructure Metrics
- Active Pods: Current number of running instances
- Pod Health: Health status of each replica
- Resource Usage: CPU, GPU, memory consumption per pod
Request Metrics
- Request Count: Total API calls received
- Success Rate: Percentage of 2XX responses
- Error Distribution: Breakdown of 4XX/5XX errors by type
- Response Times: Latency percentiles (P50, P95, P99)
Monitoring Best Practices
📊 Daily Monitoring:
- Check overall health status
- Review error rates and types
- Monitor resource utilization trends
📈 Weekly Analysis:
- Analyze traffic patterns
- Review scaling effectiveness
- Plan capacity adjustments
Common Deployment Troubleshooting
Deployment Stuck in Pending
Possible Causes:
- Insufficient resources in selected node group
- Image pull failures
Solutions:
- Check node group capacity
- Verify model image availability
Health Check Failing
Possible Causes:
- Model loading timeout
- Insufficient memory allocation
- Network connectivity issues
Solutions:
- Increase resource allocation
- Check deployment logs for specific errors
- Verify endpoint configuration
Poor Performance
Possible Causes:
- Suboptimal scaling configuration
- Wrong accelerator type chosen for model size
- Network latency issues
Solutions:
- Adjust scaling metrics and thresholds
- Switch to higher-spec accelerator
- Enable rapid autoscaling
Deployment Success Checklist
- ✅ Unique deployment name configured
- ✅ Simplismart Cloud cluster selected
- ✅ Appropriate node group chosen for model size
- ✅ Scaling parameters configured based on expected traffic
- ✅ Node affinity strategy selected
- ✅ Rapid autoscaling enabled (if using Simplismart Cloud)
- ✅ Health check shows “Healthy” status
- ✅ API integration tested successfully
- ✅ Monitoring dashboard configured and reviewed
FAQs
- Can I deploy the same model multiple times?
Yes, you can deploy the same model more than once. However, please note that each deployment may spin up a new machine, which could lead to increased costs.
- Can I change the slug after deployment?
No, currently we do not support renaming the deployment slug or name once it has been created.
- Why don’t I see any clusters?
When a new organization is created on the Simplismart Platform, clusters are not visible by default. Initially, you can deploy using Simplismart Cloud (our managed cluster). To deploy on your own infrastructure (such as your VPC), you’ll need to link your cloud account and create a custom cluster. Refer to the documentation for detailed steps.
- Can a deployment span multiple clusters?
No, we currently do not support deployments across multiple clusters. Each deployment is limited to a single cluster. If you need to deploy the same model in another cluster, you will need to create a separate deployment there.
- Why is my deployment stuck in “Pending” state?
For large models like LLaMA 70B, the container image size can be substantial. This may cause the deployment to remain in the Pending state temporarily as the model downloads and initializes.
If the deployment stays in Pending for more than 30 minutes, please reach out to support@simplismart.ai for assistance.
- How do I monitor logs of my model container?
Each deployment has a Logs tab on its deployment page, where you can view the deployment logs.
- Can I pause a deployment to save costs?
Yes, you can pause any active deployment. To do so, go to the Deployments section, select the deployment you want to pause, and click the Pause button at the top right. While paused, you won’t incur charges for that deployment.
- What monitoring/metrics are available for deployed models?
The platform provides real-time metrics including:
- Data throughput
- Resource usage
- Pod health
- Number of active pods
- Response times
- Success rates
- Request counts
These metrics can be used for monitoring model performance and system health.
- How do I troubleshoot failed deployments?
You can check the logs via the status bar on the deployment page of the failed deployment.
Please reach out to support@simplismart.ai for further assistance.
- Does region selection affect latency or costs?
Yes, selecting a region can impact both latency and cost. Latency is primarily influenced by the geographic location of your end users relative to the deployment: for example, if a model is deployed in the India region and requests come from the US, network latency may increase, which can affect performance and costs.
- Why am I getting an OOM error? What changes do I need to make next?
An Out of Memory (OOM) error usually indicates that the selected GPU doesn’t have sufficient VRAM to load your model. In such cases, it is recommended to switch to a GPU with a higher memory configuration.
- How do I know how much memory my model needs?
The GPU memory requirements for running a model primarily depend on two factors: model size and quantization.
- For FP16 precision (about 2 bytes per parameter), the required GPU memory is approximately 2× the parameter count. For example, a 70B parameter model would need a minimum of 140 GB of GPU memory just to load the weights, with additional memory needed for inference or serving workloads.
- For FP8 precision (about 1 byte per parameter), the memory requirement is roughly equal to the parameter count, so a 70B model would require about 70 GB of GPU memory to load.
These are general guidelines and actual requirements may vary based on implementation and additional runtime overhead.
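As a rough rule of thumb, the sketch below turns these guidelines into a quick estimate; it covers only the memory needed to load the weights and ignores runtime overhead such as the KV cache and activations, so treat the result as a lower bound.

```python
# Rough lower-bound estimate of GPU memory needed just to load model
# weights: parameter count (in billions) times bytes per parameter.
# Runtime overhead (KV cache, activations, CUDA context) is not
# included, so actual requirements will be higher.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    return params_billions * BYTES_PER_PARAM[precision.lower()]

print(weight_memory_gb(70, "fp16"))  # 140.0 GB -> multi-GPU territory
print(weight_memory_gb(70, "fp8"))   # 70.0 GB
print(weight_memory_gb(8, "fp16"))   # 16.0 GB
```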