Deployment Naming

Deployment Name Requirements

  • Uniqueness: Each deployment name must be unique within the cluster
  • Naming Convention: Use descriptive names like llama-8b-chat-v1 or gemma-7b-prod
  • Allowed Characters: Use alphanumeric characters and hyphens only; the name must not start with a number.
If you use a duplicate name, you’ll receive an error message and cannot proceed with deployment.
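
A minimal sketch of these rules as a client-side check, assuming names are alphanumeric plus hyphens and must start with a letter (the regex is illustrative, not the platform's exact validation):

import re

# Illustrative check of the naming rules above; the platform's actual
# validation may be stricter.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9-]*$")

def is_valid_deployment_name(name: str) -> bool:
    """Alphanumeric characters and hyphens only; must not start with a number."""
    return bool(NAME_PATTERN.match(name))

assert is_valid_deployment_name("llama-8b-chat-v1")
assert not is_valid_deployment_name("7b-model")    # starts with a number
assert not is_valid_deployment_name("prod.model")  # special character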

Best Practices

✅ Good Examples:
- llama-3b-dev-v2
- gemma-7b-production
- mistral-8b-api-staging 

❌ Avoid:
- model1, model2 (not descriptive)
- test (too generic)
- prod.model (special characters)

Cluster Selection

  • Cluster: Always select Simplismart Cloud

Accelerator Type Selection

Accelerator Type Decision Matrix

| Model Type | Model Size | Recommended Accelerator Type | Auto-Selected Instance |
| --- | --- | --- | --- |
| Small Models | Llama 3B, Gemma 2B | T4 | Based on model TP value |
| Medium Models | Llama 8B, Gemma 7B | L40S | Single GPU or based on model TP value |
| Large Models | Llama 70B, Qwen 32B | H100 | Multi-GPU configuration |

Automatic GPU Instance Selection

The system automatically determines GPU instance count based on your model’s Tensor Parallel (TP) configuration:
Model TP = 1  →  1x GPU instance
Model TP = 2  →  2x GPU instance  
Model TP = 4  →  4x GPU instance
Model TP = 8  →  8x GPU instance
Example: If you select a model with TP=4 and choose H100 node group → System deploys 4xH100 automatically.
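
In other words, the instance count simply follows the model's TP value. A hedged sketch of this selection (simplified; the platform's scheduler may consider additional constraints):

def auto_selected_instance(tp: int, accelerator: str) -> str:
    """Return the GPU configuration the system would deploy, e.g. '4xH100'.
    Simplified view: one GPU instance per tensor-parallel shard."""
    assert tp in (1, 2, 4, 8), "TP values covered in this guide"
    return f"{tp}x{accelerator}"

print(auto_selected_instance(4, "H100"))  # -> 4xH100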

Scaling Configuration

Pod Scaling Settings

  • Min Pods: Minimum number of replicas.
  • Max Pods: Maximum number of replicas.

Scaling Metrics

| Metric | Use Case | Recommended Threshold | Notes |
| --- | --- | --- | --- |
| GPU Utilization % | GPU-intensive inference | 80% | Best for ML model scaling |
| Memory Usage | Memory-bound applications | 80% | Prevents OOM errors |
| GPU Memory Usage | Large model deployments | 85% | Critical for model performance |
| Latency | Response-time-sensitive apps | 500 ms | User-experience focused |
| Throughput | High-volume applications | 100 req/sec | Capacity-based scaling |
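
Conceptually, threshold-based autoscaling compares the live metric against the configured threshold and adjusts the replica count within the Min/Max bounds. A simplified sketch of that logic (illustrative only, not the platform's actual controller):

def desired_replicas(current: int, metric_value: float, threshold: float,
                     min_pods: int, max_pods: int) -> int:
    """Threshold-based scaling sketch: add a replica when the metric is over
    the threshold, remove one when it is well under (illustrative hysteresis)."""
    if metric_value > threshold:
        current += 1
    elif metric_value < 0.5 * threshold:
        current -= 1
    return max(min_pods, min(max_pods, current))

# E.g. GPU utilization at 85% against an 80% threshold triggers a scale-up:
print(desired_replicas(current=3, metric_value=85, threshold=80,
                       min_pods=1, max_pods=20))  # -> 4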

Scaling Strategy Examples

# Aggressive Scaling (Variable Traffic)
Min Pods: 1  
Max Pods: 20
Metric: GPU Utilization
Threshold: 60%   

# Latency-Sensitive Scaling
Min Pods: 3
Max Pods: 15
Metric: Latency
Threshold: 300ms

Rapid Autoscaling (Simplismart Cloud)

Feature Overview
  • Activation: Toggle switch at bottom of deployment screen.
  • Benefit: Pods spin up in seconds to minutes instead of standard deployment times
How It Works
  • Pre-cached Images: Model images are cached in pre-puller system.
  • Instant Scaling: No image download time during scale-up events.
  • Resource Optimization: Faster response to traffic spikes.
When to Enable

Enable for:
  • Production workloads with variable traffic
  • Applications requiring rapid scaling response
Skip for:
  • Stable workloads with predictable traffic
  • Development/testing environments

Deployment Execution

Deployment Process
  1. Click Deploy Button: Initiates deployment process
  2. Monitor Progress: Watch deployment status in real-time
  3. Health Check Monitoring: Check the health check status bar on the right side
  4. Wait for “Healthy” Status: Indicates model is fully loaded and ready
Health Check States

| Status | Meaning | Action Required |
| --- | --- | --- |
| 🟡 Pending | Deployment in progress | Wait for completion |
| 🟢 Healthy | Ready for inference | Proceed to testing |
| 🔴 Unhealthy | Deployment failed | Check logs |
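
If you would rather wait for the Healthy state from a script than watch the UI, a polling loop like the sketch below works. Note that the status endpoint and response shape here are hypothetical placeholders, not a documented platform API:

import time
import requests

# Hypothetical status endpoint and auth header; adjust to your deployment.
STATUS_URL = "https://your-deployment-endpoint.com/status"
HEADERS = {"Authorization": "Bearer <your-api-key>"}

def wait_until_healthy(timeout_s: int = 1800, interval_s: int = 30) -> bool:
    """Poll the (hypothetical) status endpoint until Healthy, Unhealthy, or timeout."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(STATUS_URL, headers=HEADERS, timeout=10).json().get("status")
        if status == "Healthy":
            return True
        if status == "Unhealthy":
            raise RuntimeError("Deployment failed; check the deployment logs")
        time.sleep(interval_s)  # still Pending; keep waiting
    return False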

API Integration & Testing

Getting API Credentials

  1. Navigate to the API tab on the deployment page
  2. Copy the provided Python script or cURL command
  3. Replace placeholder values with your actual parameters

Sample Integration Code

Python Example

import requests

# Endpoint and API key are copied from the API tab of your deployment
endpoint = "https://your-deployment-endpoint.com/v1/chat/completions"
headers = {
    "Authorization": "Bearer <your-api-key>",
    "Content-Type": "application/json",
}

# Chat completion request (OpenAI-style schema)
payload = {
    "model": "your-deployed-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
}

response = requests.post(endpoint, headers=headers, json=payload)
response.raise_for_status()  # surface 4XX/5XX errors early
print(response.json())

cURL Example

curl -X POST "https://your-deployment-endpoint.com/v1/chat/completions" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-deployed-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

Monitoring & Metrics

Monitoring Dashboard

  • Real-time Data: Live metrics and performance indicators
  • Historical Data: Trends and usage patterns over time

Key Metrics Available

Infrastructure Metrics

  • Active Pods: Current number of running instances
  • Pod Health: Health status of each replica
  • Resource Usage: CPU, GPU, memory consumption per pod

Request Metrics

  • Request Count: Total API calls received
  • Success Rate: Percentage of 2XX responses
  • Error Distribution: Breakdown of 4XX/5XX errors by type
  • Response Times: Latency percentiles (P50, P95, P99)
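
For reference, a latency percentile is read as "the latency below which that share of requests completes". A small sketch computing these figures from raw request latencies (the sample data is made up):

import math

latencies_ms = [120, 95, 300, 110, 480, 105, 150, 90, 220, 1250]  # made-up sample

def percentile(values, p):
    """Nearest-rank percentile: the value below which p% of requests fall."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")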

Monitoring Best Practices

📊 Daily Monitoring:
- Check overall health status
- Review error rates and types
- Monitor resource utilization trends

📈 Weekly Analysis:
- Analyze traffic patterns
- Review scaling effectiveness
- Plan capacity adjustments

Common Deployment Troubleshooting

Deployment Stuck in Pending

Possible Causes:
  • Insufficient resources in selected node group
  • Image pull failures
Solutions:
  1. Check node group capacity
  2. Verify model image availability

Health Check Failing

Possible Causes:
  • Model loading timeout
  • Insufficient memory allocation
  • Network connectivity issues
Solutions:
  1. Increase resource allocation
  2. Check deployment logs for specific errors
  3. Verify endpoint configuration

Poor Performance After Deployment

Possible Causes:
  • Suboptimal scaling configuration
  • Wrong accelerator type chosen for model size
  • Network latency issues
Solutions:
  1. Adjust scaling metrics and thresholds
  2. Switch to higher-spec accelerator
  3. Enable rapid autoscaling

Deployment Success Checklist

  • ✅ Unique deployment name configured
  • ✅ Simplismart Cloud cluster selected
  • ✅ Appropriate node group chosen for model size
  • ✅ Scaling parameters configured based on expected traffic
  • ✅ Node affinity strategy selected
  • ✅ Rapid autoscaling enabled (if using Simplismart Cloud)
  • ✅ Health check shows “Healthy” status
  • ✅ API integration tested successfully
  • ✅ Monitoring dashboard configured and reviewed

FAQs

  1. Can I deploy the same model multiple times?
Yes, you can deploy the same model more than once. However, please note that each deployment may spin up a new machine, which could lead to increased costs.
  2. Can I change the slug after deployment?

    No, currently we do not support renaming the deployment slug or name once it has been created.

  3. Why don’t I see any clusters?

    When a new organization is created on the Simplismart Platform, clusters are not visible by default. Initially, you can deploy using Simplismart Cloud (our managed cluster). To deploy on your own infrastructure (such as your VPC), you’ll need to link your cloud account and create a custom cluster. Refer to the documentation for detailed steps.

  4. Can a deployment span multiple clusters?

    No, we currently do not support deployments across multiple clusters. Each deployment is limited to a single cluster. If you need to deploy the same model in another cluster, you will need to create a separate deployment there.

  5. Why is my deployment stuck in “Pending” state?
    For large models like LLaMA 70B, the container image size can be substantial. This may cause the deployment to remain in the Pending state temporarily as the model downloads and initializes.

    If the deployment stays in Pending for more than 30 minutes, please reach out to support@simplismart.ai for assistance.

  6. How do I monitor logs of my model container?

    Each deployment page has a Logs tab where you can view the deployment logs.

  7. Can I pause a deployment to save costs?

    Yes, you can pause any active deployment. To do so, go to the Deployments section, select the deployment you want to pause, and click the Pause button at the top right. While paused, you won’t incur charges for that deployment.

  8. What monitoring/metrics are available for deployed models?

    The platform provides real-time metrics including:
    • Data throughput
    • Resource usage
    • Pod health
    • Number of active pods
    • Response times
    • Success rates
    • Request counts
      These metrics can be used for monitoring model performance and system health.

  9. How do I troubleshoot failed deployments?

    You can check the logs via the status bar on the deployment page of the failed deployment.

    Please reach out to support@simplismart.ai for further assistance.

  10. Does region selection affect latency or costs?

    Yes, selecting a region can impact both latency and cost. Latency is primarily influenced by the geographic location of your end users. For example, if a model is deployed in the India region and requests come from the US, network latency may increase, which can slightly affect performance and costs.

  11. Why am I getting an OOM error? What changes do I need to make next?

    An Out of Memory (OOM) error usually indicates that the selected GPU doesn’t have sufficient VRAM to load your model. In such cases, it is recommended to switch to a GPU with a higher memory configuration.

  12. How do I know how much memory my model needs?

    The GPU memory requirements for running a model primarily depend on two factors: Model Size and Quantization.
    • For FP16 precision, each parameter takes 2 bytes, so the required GPU memory is roughly twice the parameter count in gigabytes. For example, a 70B-parameter model needs a minimum of 140 GB of GPU memory just to load the weights, with additional memory needed for inference or serving workloads.
    • For FP8 precision, each parameter takes 1 byte, so the requirement is roughly equal to the parameter count in gigabytes. A 70B model would require about 70 GB of GPU memory.
    These are general guidelines and actual requirements may vary based on implementation and additional runtime overhead.
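
A back-of-the-envelope calculator for these guidelines, assuming 2 bytes per parameter at FP16 and 1 byte at FP8, plus a configurable overhead factor for inference (the 20% default is an assumption, not a platform figure):

def estimate_gpu_memory_gb(params_billion: float, precision: str = "fp16",
                           overhead: float = 0.2) -> float:
    """Rough estimate: weights = params x bytes-per-param, plus an assumed
    overhead factor for KV cache and activations (the 20% is a guess)."""
    bytes_per_param = {"fp16": 2, "fp8": 1}[precision]
    weights_gb = params_billion * bytes_per_param  # 1B params at 1 byte ~= 1 GB
    return weights_gb * (1 + overhead)

print(estimate_gpu_memory_gb(70, "fp16"))  # ~168 GB (140 GB weights + overhead)
print(estimate_gpu_memory_gb(70, "fp8"))   # ~84 GB (70 GB weights + overhead)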