`auto`, which chooses the best backend based on the configuration provided.

Parameter | Type | Default | Description |
---|---|---|---|
context_length | integer | null | Maximum context length for the model |
gpu_memory_utilization | float | null | Fraction of GPU memory to reserve (0.0-1.0) |
max_running_requests | integer | null | Maximum concurrent requests |
max_total_tokens | integer | null | Maximum total tokens across all requests |
chunked_prefill_size | integer | null | Chunk size for prefill processing |
max_prefill_tokens | integer | 16384 | Maximum tokens processed in prefill phase |
reasoning_parser | string | null | Parser for reasoning task outputs |
enable_torch_compile | bool | false | Enable PyTorch 2.0 compilation |
torch_compile_max_bs | integer | 32 | Max batch size for compilation |
cuda_graph_max_bs | integer | 32 | Max batch size for CUDA graphs |
tool_call_parser | string | null | Parser for function call parsing |
Supported `reasoning_parser` values: `deepseek-r1`, `qwen3`. Supported `tool_call_parser` values: `mistral`, `llama4`, `llama3`, `qwen25`.

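As a sketch, these settings might be collected into a plain mapping like the one below. The dict-based form and the specific values are illustrative only; the actual configuration surface is not shown in this document.

```python
# Illustrative settings object; a plain dict stands in for whatever
# configuration mechanism the deployment actually uses.
config = {
    "context_length": 8192,          # cap the model context at 8k tokens
    "gpu_memory_utilization": 0.9,   # reserve 90% of GPU memory
    "max_running_requests": 64,      # bound concurrent requests
    "max_total_tokens": 65536,       # bound tokens across all requests
    "chunked_prefill_size": 4096,    # process prefill in 4k-token chunks
    "max_prefill_tokens": 16384,     # table default
    "enable_torch_compile": True,    # opt in to PyTorch compilation
    "torch_compile_max_bs": 32,      # table default
    "cuda_graph_max_bs": 32,         # table default
    "reasoning_parser": "deepseek-r1",
    "tool_call_parser": "qwen25",
}

# Sanity checks mirroring the table's types and ranges.
assert 0.0 <= config["gpu_memory_utilization"] <= 1.0
assert isinstance(config["enable_torch_compile"], bool)
```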
Parameter | Type | Default | Description |
---|---|---|---|
context_length | integer | null | Maximum context length for the model |
gpu_memory_utilization | float | null | Fraction of GPU memory to reserve (0.0-1.0) |
reasoning_parser | string | null | Parser for reasoning task outputs |
tool_call_parser | string | null | Parser for function call parsing |
Supported `reasoning_parser` values: `deepseek_r1`, `qwen3`. Supported `tool_call_parser` values: `mistral`, `llama3_json`, `llama4_json`, `hermes`.
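Note that the parser identifiers differ between the two parameter sets above (e.g., `deepseek-r1` with a hyphen versus `deepseek_r1` with an underscore, and `llama3` versus `llama3_json`). A small guard, hypothetical in name and shape, can catch a value copied from the wrong set:

```python
# Supported parser names for this parameter set, per the list above.
REASONING_PARSERS = {"deepseek_r1", "qwen3"}
TOOL_CALL_PARSERS = {"mistral", "llama3_json", "llama4_json", "hermes"}

def check_tool_call_parser(name: str) -> str:
    # Hypothetical helper: rejects names from the other parameter set
    # (e.g., "llama3" instead of "llama3_json").
    if name not in TOOL_CALL_PARSERS:
        raise ValueError(f"unsupported tool_call_parser: {name!r}")
    return name

check_tool_call_parser("hermes")  # accepted
```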
Defaults to `float16` for no quantization. Supported values: `float16`, `float8`, `w4`. Use `float16` for standard precision, or select a quantized format for reduced memory usage.

Defaults to `"draft"`. Supported values: `"draft"`, `"n-gram"`, `"eagle"`, `"eagle3"`, `"nextn"`.
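To see why the quantized formats listed earlier reduce memory, a back-of-envelope estimate of weight storage helps. The bit-widths assumed here (16, 8, and 4 bits for `float16`, `float8`, and `w4`) are nominal; real quantized checkpoints also store scales and zero-points, so actual sizes vary.

```python
# Nominal bits per weight for each format (assumption: w4 = 4-bit weights).
BITS = {"float16": 16, "float8": 8, "w4": 4}

def weight_gib(num_params: float, fmt: str) -> float:
    """Approximate weight memory in GiB for a given format."""
    return num_params * BITS[fmt] / 8 / 2**30

params = 7e9  # e.g., a 7B-parameter model
print(round(weight_gib(params, "float16"), 1))  # ≈ 13.0 GiB
print(round(weight_gib(params, "w4"), 1))       # ≈ 3.3 GiB
```

By this estimate, `w4` cuts weight memory to a quarter of `float16`, which is the main lever for fitting larger models on the same GPU.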
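Of the speculative methods above, `"n-gram"` is the easiest to illustrate: draft tokens are proposed by matching the most recent n-gram of the context against an earlier occurrence and replaying what followed it. This is a self-contained toy sketch, not the engine's actual implementation:

```python
# Toy n-gram draft proposal: find the most recent earlier occurrence of
# the trailing n-gram and propose the k tokens that followed it.
def ngram_draft(tokens: list[int], n: int = 2, k: int = 3) -> list[int]:
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan backwards, excluding the tail's own position.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]
    return []

# Context "1 2 3 4 1 2": the tail (1, 2) matches at position 0,
# so the draft proposes the tokens that followed it there.
print(ngram_draft([1, 2, 3, 4, 1, 2]))  # prints [3, 4, 1]
```

The drafted tokens are then verified in a single forward pass of the target model; any mismatch truncates the draft, so quality is unchanged and only latency improves.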
`tensor_parallel_size` defaults to `1`. Use `1` for no tensor parallelism (single process), or set a higher value (e.g., `2`, `4`, `8`) to enable tensor parallelism across that many devices. Make sure your hardware supports the `tensor_parallel_size` you choose (e.g., you have at least as many GPUs as the value you set).

Defaults to `"chat"`.
Supported values: `"chat"`, `"completion"`, `"embedding"`.

- `"chat"`: For conversational/chat-based interactions.
- `"completion"`: For standard text completion tasks.
- `"embedding"`: For generating vector embeddings from input text.

`[]`
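Putting the last two settings together, a pre-flight check can enforce the hardware requirement stated above before the server starts. The function name and the way the GPU count is supplied are hypothetical:

```python
VALID_MODES = {"chat", "completion", "embedding"}

def preflight(tensor_parallel_size: int, mode: str, available_gpus: int) -> None:
    # Hypothetical check mirroring the documented constraint: you need at
    # least as many GPUs as the tensor_parallel_size you set.
    if tensor_parallel_size > available_gpus:
        raise ValueError(
            f"tensor_parallel_size={tensor_parallel_size} requires at least "
            f"{tensor_parallel_size} GPUs, but only {available_gpus} available"
        )
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")

preflight(tensor_parallel_size=2, mode="chat", available_gpus=4)  # passes
```

Failing fast here is cheaper than letting the backend crash mid-initialization after model weights have already begun loading.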