Overview
This document details all configuration parameters available for the backend server, a high-performance framework designed for serving Large Language Models. It covers optimization and pipeline settings, including backend selection, quantization, parallelism, and advanced features such as speculative decoding and LoRA integration, supporting over 140 configurable options for flexible and efficient model deployment.

Table of Contents
- Quick Start
- Default Optimization Configuration
- Default Pipeline Configuration
Optimization Configuration
Backend Settings
Name
- default: "auto"
- values: "auto", "v3", "v4"
- description: Name of the backend used to compile and serve your model. With the value "auto", the best backend is chosen automatically based on the rest of the configuration provided.
Extra Parameters
This takes a dictionary of parameters; these parameters configure the server started with the v3 or v4 backend.

V4 Backend
| Parameter | Type | Default | Description |
|---|---|---|---|
| context_length | integer | null | Maximum context length for the model |
| gpu_memory_utilization | float | null | Fraction of GPU memory to reserve (0.0-1.0) |
| max_running_requests | integer | null | Maximum concurrent requests |
| max_total_tokens | integer | null | Maximum total tokens across all requests |
| chunked_prefill_size | integer | null | Chunk size for prefill processing |
| max_prefill_tokens | integer | 16384 | Maximum tokens processed in prefill phase |
| reasoning_parser | string | null | Parser for reasoning task outputs |
| enable_torch_compile | bool | false | Enable PyTorch 2.0 compilation |
| torch_compile_max_bs | integer | 32 | Max batch size for compilation |
| cuda_graph_max_bs | integer | 32 | Max batch size for CUDA graphs |
| tool_call_parser | string | null | Parser for function call parsing |
Notes
- Possible values for reasoning_parser: deepseek-r1, qwen3
- Possible values for tool_call_parser: mistral, llama4, llama3, qwen25
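
To illustrate how the backend name and extra parameters fit together, here is a minimal sketch that assumes the configuration is written as a Python dict whose keys mirror the parameter names above; the surrounding `backend`/`extra_params` structure and all values are illustrative, not a confirmed schema.

```python
# Minimal sketch of a v4 backend configuration (structure and values are illustrative).
backend = {
    "name": "v4",  # "auto" lets the server pick the best backend
    "extra_params": {
        "context_length": 8192,
        "gpu_memory_utilization": 0.90,   # fraction of GPU memory to reserve (0.0-1.0)
        "max_running_requests": 64,
        "chunked_prefill_size": 4096,
        "enable_torch_compile": True,
        "torch_compile_max_bs": 32,
        "cuda_graph_max_bs": 32,
        "reasoning_parser": "deepseek-r1",  # or "qwen3"
        "tool_call_parser": "llama3",       # "mistral", "llama4", "llama3", or "qwen25"
    },
}
```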
V3 Backend
| Parameter | Type | Default | Description |
|---|---|---|---|
| context_length | integer | null | Maximum context length for the model |
| gpu_memory_utilization | float | null | Fraction of GPU memory to reserve (0.0-1.0) |
| reasoning_parser | string | null | Parser for reasoning task outputs |
| tool_call_parser | string | null | Parser for function call parsing |
Notes
- Possible values for reasoning_parser: deepseek_r1, qwen3
- Possible values for tool_call_parser: mistral, llama3_json, llama4_json, hermes
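
Under the same assumptions as the v4 sketch above, a v3 backend configuration looks similar but uses the v3 parser names:

```python
# Minimal sketch of a v3 backend configuration (structure and values are illustrative).
backend = {
    "name": "v3",
    "extra_params": {
        "context_length": 8192,
        "gpu_memory_utilization": 0.90,
        "reasoning_parser": "deepseek_r1",   # or "qwen3"
        "tool_call_parser": "llama3_json",   # "mistral", "llama3_json", "llama4_json", or "hermes"
    },
}
```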
Model Settings
Quantization
- default: float16 (no quantization)
- values: float16, float8, w4
- description: Data type or quantization format to use for model weights. Use float16 for standard precision, or select a quantized format for reduced memory usage.
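
As a sketch, assuming the optimization configuration is expressed as a Python dict and that the key is named `quantization` after this section (neither is confirmed by this document), a quantized deployment might be requested like this:

```python
# Illustrative only: serve weights in FP8 to reduce memory usage.
# "float16" (the default) means no quantization; "w4" selects 4-bit weights.
optimization_config = {
    "quantization": "float8",
}
```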
Optimizations
Speculative Decoding
- type: "auto"
- enabled: true
- extra_params:
  - algorithm:
    - type: string
    - default: "draft"
    - values: "draft", "n-gram", "eagle", "eagle3", "nextn"
    - description: Decoding algorithm to use for speculative decoding.
  - draft_model_path:
    - type: string or null
    - default: null
    - description: Path to the draft model used in speculative decoding. If not set, uses the main model.
  - num_draft_tokens:
    - type: integer
    - default: 64
    - description: Number of tokens to generate in each speculative draft step.
  - num_steps:
    - type: integer
    - default: 5
    - description: Number of speculative decoding steps to perform.
  - topk:
    - type: integer
    - default: 4
    - description: Top-k sampling parameter for candidate selection during decoding.
  - tp_size:
    - type: integer
    - default: 1
    - description: Tensor parallel size for distributed speculative decoding.
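
A minimal sketch of the speculative decoding block under the same dict-style assumption; key names follow the parameters above, while the overall layout and values are illustrative:

```python
# Illustrative speculative decoding configuration.
speculative_decoding = {
    "type": "auto",
    "enabled": True,
    "extra_params": {
        "algorithm": "eagle",        # "draft", "n-gram", "eagle", "eagle3", or "nextn"
        "draft_model_path": None,    # None falls back to the main model
        "num_draft_tokens": 64,
        "num_steps": 5,
        "topk": 4,
        "tp_size": 1,
    },
}
```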
Parallelism
Tensor Parallel Size
- Parameter: tensor_parallel_size
- Type: integer
- Default: 1
- Description: Specifies the number of tensor parallel processes to use for model inference. Increasing this value enables the model to split its computations across multiple GPUs or nodes, which can accelerate inference and allow larger models to be served. A configuration sketch follows this list.
  - Set to 1 for no tensor parallelism (single process).
  - Set to a higher integer (e.g., 2, 4, 8) to enable tensor parallelism across that many devices.
  - Ensure your hardware setup matches the specified tensor_parallel_size (e.g., you have at least as many GPUs as the value you set).
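
For example, a sketch of requesting tensor parallelism across two GPUs under the same dict-style assumption (the value of 2 is purely illustrative and requires at least two visible GPUs):

```python
# Illustrative only: shard the model across 2 GPUs.
optimization_config = {
    "tensor_parallel_size": 2,
}
```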
Pipeline Configuration
- mode:
  - type: string
  - default: "chat"
  - values: "chat", "completion", "embedding"
  - description: Specifies the pipeline mode.
    - "chat": For conversational/chat-based interactions.
    - "completion": For standard text completion tasks.
    - "embedding": For generating vector embeddings from input text.
- loras:
  - type: list of objects
  - default: []
  - description: List of LoRA (Low-Rank Adaptation) adapters to load with the model. Each object should specify the LoRA's unique ID and its source location. See the documentation for more details.
  - Note: If only one LoRA is provided, it is merged with the base model; otherwise, the adapters are loaded dynamically.
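
A minimal pipeline configuration sketch under the same dict-style assumption; the `id` and `source` field names inside each LoRA entry are hypothetical stand-ins for the unique ID and source location mentioned above:

```python
# Illustrative pipeline configuration; LoRA field names ("id", "source") are assumed.
pipeline_config = {
    "mode": "chat",  # "chat", "completion", or "embedding"
    "loras": [
        {"id": "my-adapter", "source": "/path/to/lora"},  # a single adapter is merged into the base model
        # Add a second entry and the adapters are loaded dynamically instead.
    ],
}
```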