1. Model Specifications
- Model Size: Determine the model’s parameter count (e.g., 8B or 13B).
- Precision Format: Know the weight precision (e.g., FP16, INT8), as bytes per parameter directly drive memory requirements (a sizing sketch follows this list).
- Tensor Parallelism: Distribute model layers across multiple GPUs to handle models too large for single GPU memory.
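
To see how parameter count, precision, and tensor parallelism interact, the sketch below does the back-of-the-envelope math in Python. The bytes-per-parameter table and the 8B FP16 example are illustrative assumptions, and the estimate covers weights only (no KV cache or activations).

```python
# Back-of-the-envelope estimate of model weight memory and its per-GPU share
# under tensor parallelism. Ignores KV cache, activations, and framework overhead.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory (GB) needed just to hold the weights."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

def per_gpu_weight_memory_gb(num_params: float, precision: str, tensor_parallel: int = 1) -> float:
    """Tensor parallelism splits the weight footprint roughly evenly across GPUs."""
    return weight_memory_gb(num_params, precision) / tensor_parallel

if __name__ == "__main__":
    # Example: an 8B-parameter model served in FP16
    print(f"8B @ FP16, single GPU: ~{weight_memory_gb(8e9, 'fp16'):.0f} GB")                     # ~16 GB
    print(f"8B @ FP16, TP=2:       ~{per_gpu_weight_memory_gb(8e9, 'fp16', 2):.0f} GB per GPU")  # ~8 GB
```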
2. GPU Memory Requirements
- For large models, budget for more than the weights alone: an 8B model in FP16 needs roughly 16 GB just for its weights (8B parameters × 2 bytes each), so a 16 GB card leaves no headroom for the KV cache and activations and risks Out-of-Memory (OOM) errors.
- In such cases, opt for higher-spec GPUs (a selection sketch follows this list):
- NVIDIA L4: 24 GB VRAM
- NVIDIA L40S: 48 GB VRAM
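
Building on that weight estimate, GPU selection comes down to choosing the smallest card whose VRAM covers the weights plus serving headroom. The catalog below uses the VRAM figures above plus the 16 GB T4 found in g4dn instances; the 30% headroom factor is an illustrative assumption.

```python
# Pick the smallest GPU whose VRAM covers the weights plus serving headroom.
# VRAM figures are per card; the headroom factor is a rough placeholder for
# KV cache, activations, and framework overhead.

GPU_VRAM_GB = {
    "NVIDIA T4": 16,    # the GPU used by the g4dn family
    "NVIDIA L4": 24,
    "NVIDIA L40S": 48,
}

def pick_gpu(weight_gb, headroom=0.30):
    required = weight_gb * (1 + headroom)
    for name, vram in sorted(GPU_VRAM_GB.items(), key=lambda kv: kv[1]):
        if vram >= required:
            return name
    return None  # no single card fits; consider tensor parallelism or quantization

print(pick_gpu(16.0))  # 8B @ FP16 -> "NVIDIA L4" (24 GB); a 16 GB T4 would OOM
```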
3. GPU vs. CPU RAM Clarification
- It’s important to distinguish between CPU RAM (displayed as system memory) and GPU VRAM.
- For example, instances like g4dn.xlarge and g4dn.2xlarge offer:
- g4dn.xlarge: 4 vCPUs, 16 GB CPU RAM
- g4dn.2xlarge: 8 vCPUs, 32 GB CPU RAM
- Note: Across a given instance family, the GPU VRAM typically remains constant, even though CPU resources scale up.
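
To make the distinction concrete, the snippet below reports both numbers side by side (it assumes psutil and PyTorch are installed and a CUDA GPU is visible). On a g4dn.2xlarge you would see ~32 GB of system RAM but still only the single T4’s 16 GB of VRAM.

```python
# Report system (CPU) RAM and GPU VRAM separately to avoid confusing the two.
# Assumes psutil and PyTorch are installed and at least one CUDA device is visible.
import psutil
import torch

def print_memory_summary():
    system_ram_gb = psutil.virtual_memory().total / 1e9
    print(f"CPU RAM (system memory): {system_ram_gb:.1f} GB")

    if torch.cuda.is_available():
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            print(f"GPU {i} ({props.name}): {props.total_memory / 1e9:.1f} GB VRAM")
    else:
        print("No CUDA GPU visible")

print_memory_summary()
```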
4. Resource Allocation Best Practices
- To ensure system stability and allow room for background processes:
- Allocate only 80% of the available CPU and RAM to the model or service.
- Example: On a g4dn.2xlarge (8 vCPUs, 32 GB RAM), limit allocation to:
- 6–7 vCPUs
- ~26 GB RAM
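
Applying the 80% guideline is simple arithmetic; the sketch below derives the limits from the instance specs, using the g4dn.2xlarge figures from the example above.

```python
# Derive CPU and RAM limits from the 80% allocation guideline,
# leaving ~20% of the instance for the OS and background processes.
import math

def allocation_limits(vcpus, ram_gb, fraction=0.8):
    return math.floor(vcpus * fraction), ram_gb * fraction

cpu_limit, ram_limit = allocation_limits(vcpus=8, ram_gb=32)  # g4dn.2xlarge
print(f"Allocate up to {cpu_limit} vCPUs and ~{ram_limit:.0f} GB RAM")  # 6 vCPUs, ~26 GB
```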
5. Deployment Considerations
- Identify the deployment region and preferred instance family.
- Define scaling ranges and metrics (e.g., CPU/GPU utilization, request latency) to enable autoscaling effectively.
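
One way to pin these decisions down is to capture them in a small deployment spec that provisioning scripts can read. Every field name and threshold below is an illustrative placeholder rather than any particular platform’s API.

```python
# Illustrative deployment and autoscaling spec. Field names and thresholds are
# placeholders to be mapped onto whatever orchestration platform is in use.
deployment_spec = {
    "region": "us-east-1",        # example deployment region
    "instance_family": "g4dn",    # preferred instance family
    "scaling": {
        "min_replicas": 1,
        "max_replicas": 4,
        "metrics": {
            "gpu_utilization_target_pct": 70,  # scale out above this
            "cpu_utilization_target_pct": 70,
            "p95_request_latency_ms": 500,     # scale out when exceeded
        },
    },
}

print(deployment_spec["scaling"]["metrics"])
```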