Deployment Naming
Deployment Name Requirements
- Uniqueness: Each deployment name must be unique within the cluster
- Naming Convention: Use descriptive names like llama-8b-chat-v1 or gemma-7b-prod
- Allowed Characters: Use alphanumeric characters and hyphens only; the deployment name must not start with a number.
If you use a duplicate name, you’ll receive an error message and cannot proceed with deployment.
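These rules can be checked client-side before you submit. Below is a minimal sketch in Python, assuming the constraints listed above (letters, digits, and hyphens, with a leading letter — a leading hyphen is assumed to be disallowed as well); the platform's actual server-side validation may enforce additional limits.

```python
import re

# Hedged sketch of the naming rules above: alphanumeric characters and
# hyphens only, not starting with a digit (assumed here to start with a
# letter). The platform's real validation may be stricter, e.g. on length.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9-]*$")

def is_valid_deployment_name(name: str) -> bool:
    """Return True if the name satisfies the documented constraints."""
    return bool(NAME_PATTERN.match(name))

assert is_valid_deployment_name("llama-8b-chat-v1")
assert not is_valid_deployment_name("8b-llama")    # starts with a number
assert not is_valid_deployment_name("prod.model")  # special character
```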
Best Practices
✅ Good Examples:
- llama-3b-dev-v2
- gemma-7b-production
- mistral-8b-api-staging
❌ Avoid:
- model1, model2 (not descriptive)
- test (too generic)
- prod.model (special characters)
Cluster Selection
- Cluster: Always select Simplismart Cloud
Accelerator Type Selection
Accelerator Type Decision Matrix
| Model Type | Model Size | Recommended Accelerator Type | Auto-Selected Instance |
|---|---|---|---|
| Small Models | Llama 3B, Gemma 2B | T4 | Based on model TP value |
| Medium Models | Llama 8B, Gemma 7B | L40S | Single GPU or based on model TP value |
| Large Models | Llama 70B, Qwen 32B | H100 | Multi-GPU configuration |
Automatic GPU Instance Selection
The system automatically determines GPU instance count based on your model’s Tensor Parallel (TP) configuration:
- Model TP = 1 → 1x GPU instance
- Model TP = 2 → 2x GPU instance
- Model TP = 4 → 4x GPU instance
- Model TP = 8 → 8x GPU instance
Example: If you select a model with TP=4 and choose an H100 node group, the system deploys 4xH100 automatically.
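As a rough illustration, the sketch below encodes this mapping; the helper and its return format are illustrative only, not part of the platform API.

```python
# Illustrative only: the model's TP value directly determines the GPU
# instance configuration, as described above. Neither this function nor
# its return format is part of the Simplismart API.
def auto_selected_instance(tp: int, accelerator: str) -> str:
    if tp not in (1, 2, 4, 8):
        raise ValueError(f"Unsupported TP value: {tp}")
    return f"{tp}x{accelerator}"

print(auto_selected_instance(4, "H100"))  # -> "4xH100"
```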
Scaling Configuration
Scaling Parameters
Pod Scaling Settings
- Min Pods: Minimum number of replicas to keep running at all times.
- Max Pods: Maximum number of replicas the deployment can scale out to.
| Metric | Use Case | Recommended Threshold | Notes |
|---|---|---|---|
| GPU Utilization % | GPU-intensive inference | 80% | Best for ML model scaling |
| Memory Usage | Memory-bound applications | 80% | Prevents OOM errors |
| GPU Memory Usage | Large model deployments | 85% | Critical for model performance |
| Latency | Response-time-sensitive apps | 500 ms | User-experience focused |
| Throughput | High-volume applications | 100 req/sec | Capacity-based scaling |
Scaling Strategy Examples
```
# Aggressive Scaling (Variable Traffic)
Min Pods: 1
Max Pods: 20
Metric: GPU Utilization
Threshold: 60%
```

```
# Latency-Sensitive Scaling
Min Pods: 3
Max Pods: 15
Metric: Latency
Threshold: 300ms
```
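For intuition on how these thresholds drive scaling, here is a hedged sketch assuming Kubernetes-HPA-style behavior (desired replicas proportional to the ratio of the observed metric to the threshold, clamped to the min/max pod range); Simplismart's actual controller may behave differently.

```python
import math

# Hedged sketch, assuming Kubernetes-HPA-style scaling behavior; the
# platform's actual autoscaling algorithm may differ.
def desired_replicas(current_pods: int, current_metric: float,
                     threshold: float, min_pods: int, max_pods: int) -> int:
    """Scale the pod count by the ratio of metric to threshold."""
    desired = math.ceil(current_pods * current_metric / threshold)
    return max(min_pods, min(max_pods, desired))

# With the "Aggressive Scaling" example above: 4 pods at 90% GPU
# utilization against a 60% threshold -> scale out to 6 pods.
print(desired_replicas(current_pods=4, current_metric=90,
                       threshold=60, min_pods=1, max_pods=20))  # 6
```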
Rapid Autoscaling (Simplismart Cloud)
Feature Overview
- Activation: Toggle the switch at the bottom of the deployment screen.
- Benefit: Pods spin up in seconds to minutes instead of standard deployment times.
How It Works
- Pre-cached Images: Model images are cached in the pre-puller system.
- Instant Scaling: No image download time during scale-up events.
- Resource Optimization: Faster response to traffic spikes.
When to Enable
✅ Enable for:
- Production workloads with variable traffic
- Applications requiring rapid scaling response
❌ Skip for:
- Stable workloads with predictable traffic
- Development/testing environments
Deployment Execution
Deployment Process
1. Click Deploy Button: Initiates deployment process
2. Monitor Progress: Watch deployment status in real-time
3. Health Check Monitoring: Check the health check status bar on the right side
4. Wait for “Healthy” Status: Indicates model is fully loaded and ready
Health Check States
| Status | Meaning | Action Required |
|---|---|---|
| 🟡 Pending | Deployment in progress | Wait for completion |
| 🟢 Healthy | Ready for inference | Proceed to testing |
| 🔴 Unhealthy | Deployment failed | Check logs |
API Integration & Testing
Getting API Credentials
- Navigate to the API tab on the deployment page
- Copy the provided Python script or cURL command
- Replace placeholder values with your actual parameters
Sample Integration Code
Python Example
```python
import requests

# Endpoint and API key are copied from the API tab of your deployment.
endpoint = "https://your-deployment-endpoint.com/v1/chat/completions"
headers = {
    "Authorization": "Bearer <your-api-key>",
    "Content-Type": "application/json",
}
payload = {
    "model": "your-deployed-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
}

response = requests.post(endpoint, headers=headers, json=payload)
print(response.json())
```
cURL Example
```bash
curl -X POST "https://your-deployment-endpoint.com/v1/chat/completions" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-deployed-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```
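The samples above omit error handling. The following hedged sketch wraps the same request with a timeout and simple retries, using only the standard requests library; the endpoint, key, and model name remain placeholders from the API tab.

```python
import time

import requests

# Placeholders: replace with the values from your deployment's API tab.
ENDPOINT = "https://your-deployment-endpoint.com/v1/chat/completions"
HEADERS = {
    "Authorization": "Bearer <your-api-key>",
    "Content-Type": "application/json",
}

def chat(payload: dict, retries: int = 3, timeout: float = 30.0) -> dict:
    """POST the payload, retrying on network errors and HTTP failures."""
    last_error = None
    for attempt in range(retries):
        try:
            resp = requests.post(ENDPOINT, headers=HEADERS,
                                 json=payload, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.exceptions.RequestException as exc:
            last_error = exc
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("request failed after retries") from last_error

print(chat({"model": "your-deployed-model",
            "messages": [{"role": "user", "content": "Hello!"}],
            "max_tokens": 100}))
```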
Monitoring & Metrics
Monitoring Dashboard
- Real-time Data: Live metrics and performance indicators
- Historical Data: Trends and usage patterns over time
Key Metrics Available
Infrastructure Metrics
- Active Pods: Current number of running instances
- Pod Health: Health status of each replica
- Resource Usage: CPU, GPU, memory consumption per pod
Request Metrics
- Request Count: Total API calls received
- Success Rate: Percentage of 2XX responses
- Error Distribution: Breakdown of 4XX/5XX errors by type
- Response Times: Latency percentiles (P50, P95, P99)
Monitoring Best Practices
📊 Daily Monitoring:
- Check overall health status
- Review error rates and types
- Monitor resource utilization trends
📈 Weekly Analysis:
- Analyze traffic patterns
- Review scaling effectiveness
- Plan capacity adjustments
Common Deployment Troubleshooting
Deployment Stuck in Pending
Possible Causes:
- Insufficient resources in selected node group
- Image pull failures
Solutions:
- Check node group capacity
- Verify model image availability
Health Check Failing
Possible Causes:
- Model loading timeout
- Insufficient memory allocation
- Network connectivity issues
Solutions:
- Increase resource allocation
- Check deployment logs for specific errors
- Verify endpoint configuration
Poor Performance
Possible Causes:
- Suboptimal scaling configuration
- Wrong accelerator type chosen for model size
- Network latency issues
Solutions:
- Adjust scaling metrics and thresholds
- Switch to higher-spec accelerator
- Enable rapid autoscaling
Deployment Success Checklist
- ✅ Unique deployment name configured
- ✅ Simplismart Cloud cluster selected
- ✅ Appropriate node group chosen for model size
- ✅ Scaling parameters configured based on expected traffic
- ✅ Node affinity strategy selected
- ✅ Rapid autoscaling enabled (if using Simplismart Cloud)
- ✅ Health check shows “Healthy” status
- ✅ API integration tested successfully
- ✅ Monitoring dashboard configured and reviewed
FAQs
- Can I deploy the same model multiple times?
Yes, you can deploy the same model more than once. However, please note that each deployment may spin up a new machine, which could lead to increased costs.
- Can I change the slug after deployment?
No, currently we do not support renaming the deployment slug or name once it has been created.
- Why don’t I see any clusters?
When a new organization is created on the Simplismart Platform, clusters are not visible by default. Initially, you can deploy using Simplismart Cloud (our managed cluster). To deploy on your own infrastructure (such as your VPC), you’ll need to link your cloud account and create a custom cluster. Refer to the documentation for detailed steps.
- Can a deployment span multiple clusters?
No, we currently do not support deployments across multiple clusters. Each deployment is limited to a single cluster. If you need to deploy the same model in another cluster, you will need to create a separate deployment there.
- Why is my deployment stuck in “Pending” state?
For large models like LLaMA 70B, the container image size can be substantial. This may cause the deployment to remain in the Pending state temporarily as the model downloads and initializes.
If the deployment stays in Pending for more than 30 minutes, please reach out to support@simplismart.ai for assistance.
- How do I monitor logs of my model container?
Each deployment has a Logs tab on its deployment page, where you can view the deployment logs.
- Can I pause a deployment to save costs?
Yes, you can pause any active deployment. To do so, go to the Deployments section, select the deployment you want to pause, and click the Pause button at the top right. While paused, you won’t incur charges for that deployment.
- What monitoring/metrics are available for deployed models?
The platform provides real-time metrics including:
- Data throughput
- Resource usage
- Pod health
- Number of active pods
- Response times
- Success rates
- Request counts
These metrics can be used for monitoring model performance and system health.
- How do I troubleshoot failed deployments?
You can check the logs via the status bar on the deployment page of the failed deployment.
Please reach out to support@simplismart.ai for further assistance.
- Does region selection affect latency or costs?
Yes, selecting a region can impact both latency and cost. Latency is primarily influenced by the geographic location of your end users relative to the deployment: for example, if a model is deployed in the India region and requests come from the US, network latency may increase, which can affect performance and costs.
- Why am I getting an OOM error? What changes do I need to make next?
An Out of Memory (OOM) error usually indicates that the selected GPU doesn’t have sufficient VRAM to load your model. In such cases, it is recommended to switch to a GPU with a higher memory configuration.
- How do I know how much memory my model needs?
The GPU memory requirements for running a model primarily depend on two factors: model size and quantization.
- For FP16 precision (about 2 bytes per parameter), the required GPU memory is approximately 2× the parameter count. For example, a 70B parameter model would need a minimum of 140 GB of GPU memory just to load the weights, with additional memory needed for inference or serving workloads.
- For FP8 precision (about 1 byte per parameter), the memory requirement is roughly equal to the parameter count, so a 70B model would require about 70 GB of GPU memory to load.
These are general guidelines and actual requirements may vary based on implementation and additional runtime overhead.
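As a rough rule of thumb, the sketch below turns these guidelines into a quick estimate; it covers only the memory needed to load the weights and ignores runtime overhead such as the KV cache and activations, so treat the result as a lower bound.

```python
# Rough lower-bound estimate of GPU memory needed just to load model
# weights: parameter count (in billions) times bytes per parameter.
# Runtime overhead (KV cache, activations, CUDA context) is not
# included, so actual requirements will be higher.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    return params_billions * BYTES_PER_PARAM[precision.lower()]

print(weight_memory_gb(70, "fp16"))  # 140.0 GB -> multi-GPU territory
print(weight_memory_gb(70, "fp8"))   # 70.0 GB
print(weight_memory_gb(8, "fp16"))   # 16.0 GB
```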