Click on the Create button to start a new deployment.
Fill in the deployment details:
Deployment Name: Enter a unique name.
Cluster: Select the target cluster.
Node Group: Select the node group that matches the GPU type (e.g., A100, H100, T4) and compute specs your model requires. Matching the node group to the model ensures compatibility and efficient resource allocation during deployment.
Model: Select your compiled model from the list.
Processing Type: Choose how you want requests to be handled:
Sync: the client waits and receives the result in the same request.
Async: the request is queued and the result is retrieved later.
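Conceptually, a sync endpoint returns the result in the same call, while an async endpoint returns a request ID that the client uses to fetch the result later. The sketch below illustrates the two patterns; the function names, payloads, and job store are hypothetical, not the actual Simplismart API.

```python
import uuid

# Stand-in for the platform's job store (illustrative only).
JOBS = {}

def infer_sync(payload):
    """Sync: the caller blocks until the model returns a result."""
    return {"output": f"result for {payload}"}  # result in the same call

def submit_async(payload):
    """Async: the request is queued; the caller gets a request ID back."""
    request_id = str(uuid.uuid4())
    JOBS[request_id] = {"output": f"result for {payload}"}
    return request_id

def poll_async(request_id):
    """The caller retrieves the result later using the request ID."""
    return JOBS.get(request_id)

# Sync: one call, one result.
print(infer_sync("hello")["output"])
# Async: submit now, fetch when ready.
rid = submit_async("hello")
print(poll_async(rid)["output"])
```

Choose Sync for latency-sensitive, short-running requests, and Async for long-running or bursty workloads where the client can poll for results.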
Cluster Selection
If deploying on your own infrastructure (BYOC), select the cluster configured during the clusters setup stage.
For Private Endpoint deployments, choose Simplismart Datacenter-IN, which offers GPU options like H100 and L40S. Select based on your model’s performance requirements and deployment preferences.
Set the threshold values at which the deployment should scale up or down.
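As an illustration of how scale thresholds typically behave, the sketch below adds a pod when the scaling metric crosses an upper threshold and removes one below a lower threshold. The metric, parameter names, and exact behavior here are assumptions for illustration, not Simplismart's implementation.

```python
def desired_replicas(current, utilization, scale_up_at=0.8, scale_down_at=0.3,
                     min_replicas=1, max_replicas=10):
    """Illustrative threshold rule: add a pod when average utilization
    exceeds the upper threshold, remove one below the lower threshold,
    and keep the count inside [min_replicas, max_replicas]."""
    if utilization > scale_up_at:
        current += 1
    elif utilization < scale_down_at:
        current -= 1
    return max(min_replicas, min(max_replicas, current))

print(desired_replicas(2, 0.9))  # → 3 (above the upper threshold)
print(desired_replicas(2, 0.2))  # → 1 (below the lower threshold)
```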
Node Affinity (Optional)
Node affinity controls how pods are distributed across nodes within a cluster. This helps manage resource distribution and avoid overloading a single node.
No Affinity: Pods are scheduled wherever resources are available. No placement preference is applied.
Preferred: The scheduler will try to place pods on different nodes to improve distribution, but it’s not enforced. If not possible, pods may still be placed on the same node.
Required: Pods are strictly placed on different nodes. If a separate node isn’t available, the pod will remain in a pending state until the condition is met.
Example: With two nodes (8 GPUs each) and two pods (2 GPUs each), using Preferred will try to place each pod on a separate node. Using Required will force them onto separate nodes — even if one node could run both.
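The Preferred and Required options correspond to Kubernetes pod anti-affinity. A sketch of the equivalent pod spec is shown below; the label values are illustrative, and in practice you would use one of the two rules (the platform generates the actual manifest for you).

```yaml
affinity:
  podAntiAffinity:
    # "Preferred": best-effort spreading across nodes
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: my-model        # illustrative label
        topologyKey: kubernetes.io/hostname
    # "Required": hard constraint; the pod stays Pending
    # until a separate node is available
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: my-model
      topologyKey: kubernetes.io/hostname
```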
Rapid Autoscaling (Available in Simplismart Datacenter-IN)
Enable the toggle at the bottom of the screen for faster scaling; pods spin up in seconds to minutes, depending on the model.
The resource and scaling configuration fields are the same as for standard deployments.
An additional scaling parameter is available:
Queue Length:
Set the target number of queued messages (requests) per pod.
The deployment then scales with the depth of the request queue, so choose a value based on how many concurrent requests each pod of your model can handle.
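The queue-length rule is essentially target tracking: the desired pod count is roughly the total number of queued requests divided by the per-pod queue target, clamped to the replica limits. A minimal sketch (parameter names are illustrative):

```python
import math

def pods_for_queue(queued_requests, per_pod_queue_length,
                   min_replicas=1, max_replicas=20):
    """Size the deployment so each pod handles roughly
    `per_pod_queue_length` queued requests."""
    desired = math.ceil(queued_requests / per_pod_queue_length)
    return max(min_replicas, min(max_replicas, desired))

# 45 queued requests with a target of 10 per pod → 5 pods
print(pods_for_queue(45, 10))
```

A lower Queue Length scales out sooner and keeps per-request latency down; a higher value packs more requests per pod and uses fewer GPUs.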
After filling in all the fields, click Add Deployment to start the deployment process.