Overview

This document details all configuration parameters available for the backend server, a high-performance framework designed for serving Large Language Models. It covers optimization and pipeline settings, including backend selection, quantization, parallelism, and advanced features such as speculative decoding and LoRA integration, supporting over 140 configurable options for flexible and efficient model deployment.

Quick Start

Default Optimization Configuration

{
  "backend": {
    "name": "auto",
    "version": "latest",
    "extra_params": {
      "enable_torch_compile": true
    }
  },
  "warmups": {
    "enabled": true,
    "iterations": 5,
    "sample_input_data": []
  },
  "model_type": "llm",
  "quantization": "float16",
  "optimisations": {
    "model_type": "llm",
    "attention_caching": {
      "type": "auto",
      "enabled": false,
      "extra_params": {}
    },
    "speculative_decoding": {
      "type": "auto",
      "enabled": false,
      "extra_params": {}
    }
  },
  "tensor_parallel_size": 1
}

Default Pipeline Configuration

{
  "mode": "chat",
  "type": "llm",
  "loras": [],
  "lora_repo": {
    "path": "",
    "type": "",
    "secret": {
      "type": ""
    },
    "ownership": ""
  },
  "load_lora_dynamic": false,
}

Optimization Configuration

Backend Settings

Name

  • default: “auto”
  • description: Name of the backend used to compile your model. With the value “auto”, the server selects the best backend automatically based on the provided configuration.
  • values: “auto”, “v3”, “v4”

Extra Parameters

Accepts a dictionary of parameters; these configure the server started with the v3 or v4 backend. See the illustrative examples after each backend’s parameter table below.

V4 Backend

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| context_length | integer | null | Maximum context length for the model |
| gpu_memory_utilization | float | null | Fraction of GPU memory to reserve (0.0–1.0) |
| max_running_requests | integer | null | Maximum concurrent requests |
| max_total_tokens | integer | null | Maximum total tokens across all requests |
| chunked_prefill_size | integer | null | Chunk size for prefill processing |
| max_prefill_tokens | integer | 16384 | Maximum tokens processed in prefill phase |
| reasoning_parser | string | null | Parser for reasoning task outputs |
| enable_torch_compile | bool | false | Enable PyTorch 2.0 compilation |
| torch_compile_max_bs | integer | 32 | Max batch size for compilation |
| cuda_graph_max_bs | integer | 32 | Max batch size for CUDA graphs |
| tool_call_parser | string | null | Parser for function call parsing |

Notes

  • Possible values for reasoning_parser: deepseek-r1, qwen3
  • Possible values for tool_call_parser: mistral, llama4, llama3, qwen25
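
For illustration, a backend block that pins the v4 backend and passes a few of the parameters above through extra_params might look like the following (all values are placeholders, not recommendations):

{
  "backend": {
    "name": "v4",
    "version": "latest",
    "extra_params": {
      "context_length": 8192,
      "gpu_memory_utilization": 0.9,
      "max_running_requests": 64,
      "enable_torch_compile": true,
      "reasoning_parser": "deepseek-r1",
      "tool_call_parser": "qwen25"
    }
  }
}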

V3 Backend

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| context_length | integer | null | Maximum context length for the model |
| gpu_memory_utilization | float | null | Fraction of GPU memory to reserve (0.0–1.0) |
| reasoning_parser | string | null | Parser for reasoning task outputs |
| tool_call_parser | string | null | Parser for function call parsing |

Notes

  • Possible values for reasoning_parser: deepseek_r1, qwen3
  • Possible values for tool_call_parser: mistral, llama3_json, llama4_json, hermes
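
A corresponding v3 sketch, again with placeholder values and the v3-specific parser names from the note above:

{
  "backend": {
    "name": "v3",
    "version": "latest",
    "extra_params": {
      "context_length": 8192,
      "gpu_memory_utilization": 0.9,
      "reasoning_parser": "deepseek_r1",
      "tool_call_parser": "hermes"
    }
  }
}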

Model Settings

Quantization

  • default: float16 (no quantization)
  • values: float16, float8, w4
  • description: Data type or quantization format to use for model weights. Use float16 for standard precision, or select a quantized format for reduced memory usage.
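
For example, to serve the model weights in 8-bit floating point, set the top-level quantization field of the optimization configuration (illustrative only):

{
  "quantization": "float8"
}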

Optimisations

Speculative Decoding

  • type: “auto”
  • enabled: true
  • extra_params:
    • algorithm:
      • type: string
      • default: "draft"
      • values: "draft", "n-gram", "eagle", "eagle3", "nextn"
      • description: Decoding algorithm to use for speculative decoding.
    • draft_model_path:
      • type: string or null
      • default: null
      • description: Path to the draft model used in speculative decoding. If not set, uses the main model.
    • num_draft_tokens:
      • type: integer
      • default: 64
      • description: Number of tokens to generate in each speculative draft step.
    • num_steps:
      • type: integer
      • default: 5
      • description: Number of speculative decoding steps to perform.
    • topk:
      • type: integer
      • default: 4
      • description: Top-k sampling parameter for candidate selection during decoding.
    • tp_size:
      • type: integer
      • default: 1
      • description: Tensor parallel size for distributed speculative decoding.
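
Putting these together, an optimisations block that enables speculative decoding with a separate draft model might look like this; the draft model path is a hypothetical placeholder and the remaining values simply restate the defaults listed above:

{
  "optimisations": {
    "model_type": "llm",
    "speculative_decoding": {
      "type": "auto",
      "enabled": true,
      "extra_params": {
        "algorithm": "eagle",
        "draft_model_path": "/models/example-draft-model",
        "num_draft_tokens": 64,
        "num_steps": 5,
        "topk": 4,
        "tp_size": 1
      }
    }
  }
}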

Parallelism

Tensor Parallel Size

  • Parameter: tensor_parallel_size
  • Type: integer
  • Default: 1
  • Description:
    Specifies the number of tensor parallel processes to use for model inference. Increasing this value enables the model to split its computations across multiple GPUs or nodes, which can accelerate inference and allow for larger models to be served.
    • Set to 1 for no tensor parallelism (single process).
    • Set to a higher integer (e.g., 2, 4, 8) to enable tensor parallelism across that many devices.
    • Ensure your hardware setup matches the specified tensor_parallel_size (e.g., you have at least as many GPUs as the value you set).
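
As a concrete sketch, serving a model across four GPUs (this assumes at least four GPUs are visible to the server):

{
  "tensor_parallel_size": 4
}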

Pipeline Configuration

  • mode:
    • type: string
    • default: "chat"
    • values: "chat", "completion", "embedding"
    • description: Specifies the pipeline mode.
      • "chat": For conversational/chat-based interactions.
      • "completion": For standard text completion tasks.
      • "embedding": For generating vector embeddings from input text.
  • loras:
    • type: list of objects
    • default: []
    • description: List of LoRA (Low-Rank Adaptation) adapters to load with the model. Each object should specify the LoRA’s unique ID and its source location. See the documentation for more details.
    • Note: If only one LoRA is provided, it is merged with the base model; otherwise the adapters are loaded dynamically (see the example below).
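
A minimal pipeline configuration with a single LoRA adapter might look like the following. The field names inside the LoRA object (id, path) are illustrative assumptions based on the description above; consult the LoRA documentation for the exact schema.

{
  "mode": "chat",
  "type": "llm",
  "loras": [
    {
      "id": "example-adapter",
      "path": "/adapters/example-adapter"
    }
  ],
  "load_lora_dynamic": false
}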