Model Deployment Guide
Before deploying any machine learning model, it is critical to perform a series of infrastructure checks to ensure optimal performance and cost-efficiency. Below are the key considerations to evaluate:
1. Model Specifications
Model Size: Determine the model's parameter count (e.g., 8B, 13B).
Precision Format: Know the numeric precision (e.g., FP16, INT8), as this directly affects memory requirements.
Tensor Parallelism: Decide whether the model will be sharded across multiple GPUs, since the tensor-parallel degree divides the per-GPU weight footprint (see the sketch after this list).
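To tie these three factors together, here is a minimal sketch of a weight-memory estimate. The function name, the bytes-per-parameter table, and the GiB rounding are illustrative; the result covers weights only, so activations and KV cache add further memory on top.

```python
# Rough per-GPU memory estimate for model weights only.
# Activations, KV cache, and framework overhead are NOT included,
# so treat the result as a lower bound.

BYTES_PER_PARAM = {
    "fp32": 4,
    "fp16": 2,
    "bf16": 2,
    "int8": 1,
}

def weight_memory_gb(num_params: float, precision: str, tp_degree: int = 1) -> float:
    """Approximate per-GPU weight memory in GiB for a given tensor-parallel degree."""
    total_bytes = num_params * BYTES_PER_PARAM[precision]
    return total_bytes / tp_degree / (1024 ** 3)

# An 8B-parameter model in FP16 needs ~15 GiB for weights alone on one GPU.
print(f"{weight_memory_gb(8e9, 'fp16'):.1f} GiB")                 # ~14.9
# Sharding the same model across 2 GPUs (tensor parallelism) halves that.
print(f"{weight_memory_gb(8e9, 'fp16', tp_degree=2):.1f} GiB")    # ~7.5
```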
2. GPU Memory Requirements
For large models (e.g., an 8B-parameter model in FP16), ensure a minimum of 16 GB of GPU memory to avoid Out-of-Memory (OOM) errors.
In such cases, opt for higher-spec GPUs:
NVIDIA L4: 24 GB VRAM
NVIDIA L40S: 48 GB VRAM
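As a rough selection rule, pick the smallest GPU whose VRAM covers the weight footprint plus headroom for activations and KV cache. The sketch below uses only the GPUs listed above; the 1.3x headroom factor and the helper name are assumptions to adjust for your workload.

```python
# Illustrative GPU selection based on VRAM alone. The VRAM figures come from
# the list above; the 1.3x headroom factor is an assumption.
GPU_VRAM_GB = {
    "NVIDIA L4": 24,
    "NVIDIA L40S": 48,
}

def pick_gpu(weight_memory_gb: float, headroom: float = 1.3) -> str:
    """Return the smallest listed GPU with enough VRAM, or raise if none fits."""
    required = weight_memory_gb * headroom
    candidates = [(vram, name) for name, vram in GPU_VRAM_GB.items() if vram >= required]
    if not candidates:
        raise ValueError(f"No single GPU has {required:.1f} GB VRAM; consider tensor parallelism.")
    return min(candidates)[1]

print(pick_gpu(15))   # ~19.5 GB required -> NVIDIA L4
print(pick_gpu(30))   # ~39 GB required  -> NVIDIA L40S
```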
3. GPU vs. CPU RAM Clarification
It’s important to distinguish between CPU RAM (displayed as system memory) and GPU VRAM.
For example, the g4dn.xlarge and g4dn.2xlarge instances offer:
g4dn.xlarge: 4 vCPUs, 16 GB CPU RAM
g4dn.2xlarge: 8 vCPUs, 32 GB CPU RAM
Note: Across a given instance family, the GPU VRAM typically remains constant, even though CPU resources scale up.
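The practical consequence: a bigger instance does not buy more VRAM. The sketch below encodes the two instances above; the 16 GB GPU VRAM figure assumes the single NVIDIA T4 that g4dn instances ship with, so verify it against your provider's current specifications.

```python
# CPU RAM and GPU VRAM are separate budgets: scaling the instance size
# increases vCPUs and CPU RAM, but (within g4dn) GPU VRAM stays the same.
G4DN_SPECS = {
    "g4dn.xlarge":  {"vcpus": 4, "cpu_ram_gb": 16, "gpu_vram_gb": 16},
    "g4dn.2xlarge": {"vcpus": 8, "cpu_ram_gb": 32, "gpu_vram_gb": 16},
}

def fits_on_gpu(instance: str, model_vram_gb: float) -> bool:
    """Check the model against GPU VRAM, not CPU RAM."""
    return model_vram_gb <= G4DN_SPECS[instance]["gpu_vram_gb"]

# A model needing ~15 GB of VRAM is equally tight on both instances,
# even though the 2xlarge has twice the CPU RAM.
print(fits_on_gpu("g4dn.xlarge", 15))   # True (barely)
print(fits_on_gpu("g4dn.2xlarge", 15))  # True (same GPU, same VRAM)
```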
4. Resource Allocation Best Practices
To ensure system stability and allow room for background processes:
Allocate only 80% of the available CPU and RAM to the model or service.
Example: On a g4dn.2xlarge (8 vCPUs, 32 GB RAM), limit allocation to 6–7 vCPUs and ~26 GB RAM (see the sketch below).
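A minimal helper for this 80% rule, assuming limits are expressed in whole vCPUs and gigabytes; the function name and the rounding behavior are illustrative choices.

```python
import math

def resource_limits(vcpus: int, ram_gb: float, fraction: float = 0.8) -> tuple[int, float]:
    """Reserve (1 - fraction) of CPU and RAM for the OS and background processes."""
    cpu_limit = math.floor(vcpus * fraction)   # whole vCPUs for the workload
    ram_limit = round(ram_gb * fraction, 1)    # GB for the workload
    return cpu_limit, ram_limit

# g4dn.2xlarge: 8 vCPUs, 32 GB RAM -> (6, 25.6), i.e. ~6 vCPUs and ~26 GB RAM.
print(resource_limits(8, 32))
```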
5. Deployment Considerations
Identify the deployment region and preferred instance family.
Define scaling ranges and metrics (e.g., CPU/GPU utilization, request latency) to enable autoscaling effectively (an illustrative policy sketch follows below).
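What scaling ranges and metrics might look like once written down, expressed here as a plain Python dictionary; the keys, thresholds, and replica counts are illustrative assumptions rather than any specific platform's schema.

```python
# Illustrative autoscaling policy: keep between 1 and 4 replicas, adding one
# when either the utilization target or the latency threshold is exceeded.
AUTOSCALING_POLICY = {
    "region": "us-east-1",          # example deployment region
    "instance_family": "g4dn",      # preferred instance family
    "min_replicas": 1,
    "max_replicas": 4,
    "scale_up_metrics": {
        "gpu_utilization_pct": 75,  # add a replica above 75% sustained GPU utilization
        "p95_latency_ms": 500,      # or when p95 request latency exceeds 500 ms
    },
    "scale_down_metrics": {
        "gpu_utilization_pct": 30,  # remove a replica below 30% sustained utilization
    },
}
```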