Model Deployment Guide
Before deploying any machine learning model, it is critical to perform a series of infrastructure checks to ensure optimal performance and cost-efficiency. Below are the key considerations to evaluate:
1. Model Specifications
Model Size: Determine the model's parameter count (e.g., 8B, 13B).
Precision Format: Know the numeric precision (e.g., FP16, INT8), as this directly affects memory requirements.
Tensor Parallelism: Decide whether the model will be sharded across multiple GPUs, since the tensor-parallel degree divides the per-GPU weight footprint (see the sketch after this list).
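To tie these three factors together, here is a minimal sketch of a weight-memory estimate. The function name, the bytes-per-parameter table, and the GiB rounding are illustrative; the result covers weights only, so activations and KV cache add further memory on top.

```python
# Rough per-GPU memory estimate for model weights only.
# Activations, KV cache, and framework overhead are NOT included,
# so treat the result as a lower bound.

BYTES_PER_PARAM = {
    "fp32": 4,
    "fp16": 2,
    "bf16": 2,
    "int8": 1,
}

def weight_memory_gb(num_params: float, precision: str, tp_degree: int = 1) -> float:
    """Approximate per-GPU weight memory in GiB for a given tensor-parallel degree."""
    total_bytes = num_params * BYTES_PER_PARAM[precision]
    return total_bytes / tp_degree / (1024 ** 3)

# An 8B-parameter model in FP16 needs ~15 GiB for weights alone on one GPU.
print(f"{weight_memory_gb(8e9, 'fp16'):.1f} GiB")                 # ~14.9
# Sharding the same model across 2 GPUs (tensor parallelism) halves that.
print(f"{weight_memory_gb(8e9, 'fp16', tp_degree=2):.1f} GiB")    # ~7.5
```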
2. GPU Memory Requirements
For large models (e.g., an 8B-parameter model in FP16), ensure a minimum of 16 GB of GPU memory to avoid Out-of-Memory (OOM) errors.
In such cases, opt for higher-spec GPUs:
NVIDIA L4: 24 GB VRAM
NVIDIA L40S: 48 GB VRAM
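As a rough selection rule, pick the smallest GPU whose VRAM covers the weight footprint plus headroom for activations and KV cache. The sketch below uses only the GPUs listed above; the 1.3x headroom factor and the helper name are assumptions to adjust for your workload.

```python
# Illustrative GPU selection based on VRAM alone. The VRAM figures come from
# the list above; the 1.3x headroom factor is an assumption.
GPU_VRAM_GB = {
    "NVIDIA L4": 24,
    "NVIDIA L40S": 48,
}

def pick_gpu(weight_memory_gb: float, headroom: float = 1.3) -> str:
    """Return the smallest listed GPU with enough VRAM, or raise if none fits."""
    required = weight_memory_gb * headroom
    candidates = [(vram, name) for name, vram in GPU_VRAM_GB.items() if vram >= required]
    if not candidates:
        raise ValueError(f"No single GPU has {required:.1f} GB VRAM; consider tensor parallelism.")
    return min(candidates)[1]

print(pick_gpu(15))   # ~19.5 GB required -> NVIDIA L4
print(pick_gpu(30))   # ~39 GB required  -> NVIDIA L40S
```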
3. GPU vs. CPU RAM Clarification
It’s important to distinguish between CPU RAM (displayed as system memory) and GPU VRAM.
For example, the g4dn.xlarge and g4dn.2xlarge instances offer:
g4dn.xlarge: 4 vCPUs, 16 GB CPU RAM
g4dn.2xlarge: 8 vCPUs, 32 GB CPU RAM
Note: Across a given instance family, the GPU VRAM typically remains constant, even though CPU resources scale up.
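The practical consequence: a bigger instance does not buy more VRAM. The sketch below encodes the two instances above; the 16 GB GPU VRAM figure assumes the single NVIDIA T4 that g4dn instances ship with, so verify it against your provider's current specifications.

```python
# CPU RAM and GPU VRAM are separate budgets: scaling the instance size
# increases vCPUs and CPU RAM, but (within g4dn) GPU VRAM stays the same.
G4DN_SPECS = {
    "g4dn.xlarge":  {"vcpus": 4, "cpu_ram_gb": 16, "gpu_vram_gb": 16},
    "g4dn.2xlarge": {"vcpus": 8, "cpu_ram_gb": 32, "gpu_vram_gb": 16},
}

def fits_on_gpu(instance: str, model_vram_gb: float) -> bool:
    """Check the model against GPU VRAM, not CPU RAM."""
    return model_vram_gb <= G4DN_SPECS[instance]["gpu_vram_gb"]

# A model needing ~15 GB of VRAM is equally tight on both instances,
# even though the 2xlarge has twice the CPU RAM.
print(fits_on_gpu("g4dn.xlarge", 15))   # True (barely)
print(fits_on_gpu("g4dn.2xlarge", 15))  # True (same GPU, same VRAM)
```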
4. Resource Allocation Best Practices
To ensure system stability and allow room for background processes:
Allocate only 80% of the available CPU and RAM to the model or service.
Example: On a g4dn.2xlarge (8 vCPUs, 32 GB RAM), limit allocation to 6–7 vCPUs and ~26 GB RAM (see the sketch below).
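A minimal helper for this 80% rule, assuming limits are expressed in whole vCPUs and gigabytes; the function name and the rounding behavior are illustrative choices.

```python
import math

def resource_limits(vcpus: int, ram_gb: float, fraction: float = 0.8) -> tuple[int, float]:
    """Reserve (1 - fraction) of CPU and RAM for the OS and background processes."""
    cpu_limit = math.floor(vcpus * fraction)   # whole vCPUs for the workload
    ram_limit = round(ram_gb * fraction, 1)    # GB for the workload
    return cpu_limit, ram_limit

# g4dn.2xlarge: 8 vCPUs, 32 GB RAM -> (6, 25.6), i.e. ~6 vCPUs and ~26 GB RAM.
print(resource_limits(8, 32))
```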
5. Deployment Considerations
Identify the deployment region and preferred instance family.
Define scaling ranges and metrics (e.g., CPU/GPU utilization, request latency) to enable autoscaling effectively (an illustrative policy sketch follows below).
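What scaling ranges and metrics might look like once written down, expressed here as a plain Python dictionary; the keys, thresholds, and replica counts are illustrative assumptions rather than any specific platform's schema.

```python
# Illustrative autoscaling policy: keep between 1 and 4 replicas, adding one
# when either the utilization target or the latency threshold is exceeded.
AUTOSCALING_POLICY = {
    "region": "us-east-1",          # example deployment region
    "instance_family": "g4dn",      # preferred instance family
    "min_replicas": 1,
    "max_replicas": 4,
    "scale_up_metrics": {
        "gpu_utilization_pct": 75,  # add a replica above 75% sustained GPU utilization
        "p95_latency_ms": 500,      # or when p95 request latency exceeds 500 ms
    },
    "scale_down_metrics": {
        "gpu_utilization_pct": 30,  # remove a replica below 30% sustained utilization
    },
}
```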