Before deploying any machine learning model, it is critical to perform a series of infrastructure checks to ensure optimal performance and cost-efficiency. Below are the key considerations to evaluate:

1. Model Specifications

  • Model Size: Determine the model’s parameter count (e.g., 8B, 13B).
  • Precision Format: Know the numeric precision (e.g., FP16, INT8), as the bytes per parameter directly determine memory requirements.
  • Tensor Parallelism: Decide whether the model will be sharded across multiple GPUs; tensor parallelism divides the weight memory per GPU but adds inter-GPU communication overhead. A sketch estimating these numbers follows this list.
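
As a rough guide, weight memory equals parameter count × bytes per parameter, divided by the tensor-parallel degree. Below is a minimal Python sketch of that arithmetic, assuming a standard bytes-per-parameter table; note that KV cache and activations come on top of this figure:

```python
# Rough per-GPU memory estimate: weights = params x bytes/param, split
# across tensor-parallel ranks. Weights only; KV cache and activations
# add further memory on top.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1, "INT4": 0.5}

def weight_memory_gb(params_billions: float, precision: str, tp_degree: int = 1) -> float:
    """GB per GPU needed just to hold the model weights."""
    return params_billions * BYTES_PER_PARAM[precision] / tp_degree

print(weight_memory_gb(8, "FP16"))               # 16.0 -> 8B model, one GPU
print(weight_memory_gb(8, "FP16", tp_degree=2))  # 8.0  -> sharded across 2 GPUs
```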

2. GPU Memory Requirements

  • Size the GPU against weight memory plus headroom. For example, an 8B model in FP16 needs ~16 GB for the weights alone, so a 16 GB GPU will hit Out-of-Memory (OOM) errors once the KV cache and activations are added.
  • In such cases, opt for higher-spec GPUs, for example (a selection sketch follows this list):
    • NVIDIA L4: 24 GB VRAM
    • NVIDIA L40S: 48 GB VRAM
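
Given the estimate from the previous section, GPU selection can be made mechanical by screening a small catalog. A minimal sketch; the catalog reflects the VRAM figures above, and the 1.2× headroom factor is an assumed ballpark, not a fixed rule:

```python
# Pick the smallest GPU whose VRAM covers the weights plus assumed headroom
# for KV cache and activations. VRAM figures are GB per GPU.
GPU_CATALOG = {"T4": 16, "L4": 24, "L40S": 48, "A100-80GB": 80}

def pick_gpu(weight_gb: float, headroom: float = 1.2) -> str:
    needed = weight_gb * headroom
    for name, vram in sorted(GPU_CATALOG.items(), key=lambda kv: kv[1]):
        if vram >= needed:
            return name
    raise ValueError(f"No single GPU fits {needed:.1f} GB; consider tensor parallelism.")

print(pick_gpu(16.0))  # 16 GB x 1.2 = 19.2 GB -> "L4" (24 GB)
```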

3. GPU vs. CPU RAM Clarification

  • It’s important to distinguish between CPU RAM (displayed as system memory) and GPU VRAM.
  • For example, the g4dn family scales CPU resources while the GPU stays fixed:
    • g4dn.xlarge: 4 vCPUs, 16 GB CPU RAM, 1× NVIDIA T4 (16 GB VRAM)
    • g4dn.2xlarge: 8 vCPUs, 32 GB CPU RAM, 1× NVIDIA T4 (16 GB VRAM)
  • Note: Across a given instance family, the GPU VRAM typically remains constant even though CPU resources scale up, so moving to a larger instance size does not buy more GPU memory. The runtime check below reports both figures.
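
Rather than trusting the instance sheet, both figures can be verified at runtime. A minimal sketch using psutil for system RAM and NVML for VRAM (both are third-party packages: pip install psutil nvidia-ml-py; an NVIDIA driver must be present):

```python
# Report CPU RAM and GPU VRAM separately so the two are never confused.
import psutil
import pynvml

GIB = 1024 ** 3

# System (CPU) memory -- the "RAM" figure on the instance sheet.
print(f"CPU RAM: {psutil.virtual_memory().total / GIB:.1f} GiB")

# GPU memory -- queried per device through NVML.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    mem = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(i))
    print(f"GPU {i} VRAM: {mem.total / GIB:.1f} GiB")
pynvml.nvmlShutdown()
```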

4. Resource Allocation Best Practices

  • To ensure system stability and leave room for the OS and background processes:
    • Allocate at most 80% of the available CPU and RAM to the model or serving process.
    • Example: On a g4dn.2xlarge (8 vCPUs, 32 GB RAM), limit allocation to:
      • 6–7 vCPUs (80% of 8 = 6.4)
      • ~26 GB RAM (80% of 32 = 25.6)
    • A small helper applying this rule is sketched below.
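
As a concrete illustration of the rule, a tiny helper (the function name and the 0.8 default are ours, mirroring the guideline above):

```python
# Apply the 80% headroom rule to an instance's raw resources.
import math

def allocatable(total_vcpus: int, total_ram_gb: float, fraction: float = 0.8):
    """Return (vCPUs, RAM GB) safe to hand to the model or serving process."""
    return math.floor(total_vcpus * fraction), round(total_ram_gb * fraction, 1)

print(allocatable(8, 32))  # g4dn.2xlarge -> (6, 25.6)
```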

5. Deployment Considerations

  • Identify the deployment region and preferred instance family.
  • Define scaling ranges (minimum and maximum replicas) and the metrics that drive them (e.g., CPU/GPU utilization, request latency) so autoscaling responds to real load; a simple threshold sketch follows.
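
To make the scaling metrics concrete, here is a minimal threshold-based scaling decision of the kind an autoscaler evaluates each interval. The thresholds and replica bounds are illustrative assumptions; in practice this logic usually lives in the platform’s autoscaler (e.g., a Kubernetes HPA):

```python
# Illustrative threshold-based autoscaling decision; values are assumptions.
from dataclasses import dataclass

@dataclass
class ScalingPolicy:
    min_replicas: int = 1
    max_replicas: int = 8
    scale_up_gpu_util: float = 0.75    # add a replica above 75% GPU utilization
    scale_down_gpu_util: float = 0.30  # remove one below 30%

def desired_replicas(current: int, gpu_util: float, policy: ScalingPolicy) -> int:
    if gpu_util > policy.scale_up_gpu_util:
        return min(current + 1, policy.max_replicas)
    if gpu_util < policy.scale_down_gpu_util:
        return max(current - 1, policy.min_replicas)
    return current

print(desired_replicas(2, gpu_util=0.85, policy=ScalingPolicy()))  # -> 3
```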