Overview
This document details all configuration parameters available for the backend server, a high-performance framework designed for serving Large Language Models. It covers optimization and pipeline settings, including backend selection, quantization, parallelism, and advanced features such as speculative decoding and LoRA integration, supporting over 140 configurable options for flexible and efficient model deployment.

Table of Contents
- Quick Start
- Default Optimization Configuration
- Default Pipeline Configuration
Optimization Configuration
Backend Settings
Name
- default: "auto"
- values: "auto", "v3", "v4"
- description: Name of the backend used to compile and serve your model. With the value "auto", the best backend is chosen automatically based on the rest of the configuration provided.
Extra Parameters
This takes a dictionary of parameters; these parameters configure the server started with the v3 or v4 backend.

V4 Backend
| Parameter | Type | Default | Description |
|---|---|---|---|
| context_length | integer | null | Maximum context length for the model |
| gpu_memory_utilization | float | null | Fraction of GPU memory to reserve (0.0-1.0) |
| max_running_requests | integer | null | Maximum concurrent requests |
| max_total_tokens | integer | null | Maximum total tokens across all requests |
| chunked_prefill_size | integer | null | Chunk size for prefill processing |
| max_prefill_tokens | integer | 16384 | Maximum tokens processed in prefill phase |
| reasoning_parser | string | null | Parser for reasoning task outputs |
| enable_torch_compile | bool | false | Enable PyTorch 2.0 compilation |
| torch_compile_max_bs | integer | 32 | Max batch size for compilation |
| cuda_graph_max_bs | integer | 32 | Max batch size for CUDA graphs |
| tool_call_parser | string | null | Parser for function call parsing |
Notes
- Possible values for reasoning_parser: deepseek-r1, qwen3
- Possible values for tool_call_parser: mistral, llama4, llama3, qwen25
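
To illustrate how the backend name and extra parameters fit together, here is a minimal sketch that assumes the configuration is written as a Python dict whose keys mirror the parameter names above; the surrounding `backend`/`extra_params` structure and all values are illustrative, not a confirmed schema.

```python
# Minimal sketch of a v4 backend configuration (structure and values are illustrative).
backend = {
    "name": "v4",  # "auto" lets the server pick the best backend
    "extra_params": {
        "context_length": 8192,
        "gpu_memory_utilization": 0.90,   # fraction of GPU memory to reserve (0.0-1.0)
        "max_running_requests": 64,
        "chunked_prefill_size": 4096,
        "enable_torch_compile": True,
        "torch_compile_max_bs": 32,
        "cuda_graph_max_bs": 32,
        "reasoning_parser": "deepseek-r1",  # or "qwen3"
        "tool_call_parser": "llama3",       # "mistral", "llama4", "llama3", or "qwen25"
    },
}
```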
V3 Backend
| Parameter | Type | Default | Description |
|---|---|---|---|
| context_length | integer | null | Maximum context length for the model |
| gpu_memory_utilization | float | null | Fraction of GPU memory to reserve (0.0-1.0) |
| reasoning_parser | string | null | Parser for reasoning task outputs |
| tool_call_parser | string | null | Parser for function call parsing |
Notes
- Possible values for reasoning_parser: deepseek_r1, qwen3
- Possible values for tool_call_parser: mistral, llama3_json, llama4_json, hermes
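
Under the same assumptions as the v4 sketch above, a v3 backend configuration looks similar but uses the v3 parser names:

```python
# Minimal sketch of a v3 backend configuration (structure and values are illustrative).
backend = {
    "name": "v3",
    "extra_params": {
        "context_length": 8192,
        "gpu_memory_utilization": 0.90,
        "reasoning_parser": "deepseek_r1",   # or "qwen3"
        "tool_call_parser": "llama3_json",   # "mistral", "llama3_json", "llama4_json", or "hermes"
    },
}
```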
Model Settings
Quantization
- default: float16 (no quantization)
- values: float16, float8, w4
- description: Data type or quantization format to use for model weights. Use float16 for standard precision, or select a quantized format for reduced memory usage.
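
As a sketch, assuming the optimization configuration is expressed as a Python dict and that the key is named `quantization` after this section (neither is confirmed by this document), a quantized deployment might be requested like this:

```python
# Illustrative only: serve weights in FP8 to reduce memory usage.
# "float16" (the default) means no quantization; "w4" selects 4-bit weights.
optimization_config = {
    "quantization": "float8",
}
```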
Optimizations
Speculative Decoding
- type: "auto"
- enabled: true
- extra_params:
  - algorithm:
    - type: string
    - default: "draft"
    - values: "draft", "n-gram", "eagle", "eagle3", "nextn"
    - description: Decoding algorithm to use for speculative decoding.
  - draft_model_path:
    - type: string or null
    - default: null
    - description: Path to the draft model used in speculative decoding. If not set, uses the main model.
  - num_draft_tokens:
    - type: integer
    - default: 64
    - description: Number of tokens to generate in each speculative draft step.
  - num_steps:
    - type: integer
    - default: 5
    - description: Number of speculative decoding steps to perform.
  - topk:
    - type: integer
    - default: 4
    - description: Top-k sampling parameter for candidate selection during decoding.
  - tp_size:
    - type: integer
    - default: 1
    - description: Tensor parallel size for distributed speculative decoding.
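
A minimal sketch of the speculative decoding block under the same dict-style assumption; key names follow the parameters above, while the overall layout and values are illustrative:

```python
# Illustrative speculative decoding configuration.
speculative_decoding = {
    "type": "auto",
    "enabled": True,
    "extra_params": {
        "algorithm": "eagle",        # "draft", "n-gram", "eagle", "eagle3", or "nextn"
        "draft_model_path": None,    # None falls back to the main model
        "num_draft_tokens": 64,
        "num_steps": 5,
        "topk": 4,
        "tp_size": 1,
    },
}
```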
Parallelism
Tensor Parallel Size
- Parameter: tensor_parallel_size
- Type: integer
- Default: 1
- Description: Specifies the number of tensor parallel processes to use for model inference. Increasing this value enables the model to split its computations across multiple GPUs or nodes, which can accelerate inference and allow larger models to be served. A configuration sketch follows this list.
  - Set to 1 for no tensor parallelism (single process).
  - Set to a higher integer (e.g., 2, 4, 8) to enable tensor parallelism across that many devices.
  - Ensure your hardware setup matches the specified tensor_parallel_size (e.g., you have at least as many GPUs as the value you set).
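
For example, a sketch of requesting tensor parallelism across two GPUs under the same dict-style assumption (the value of 2 is purely illustrative and requires at least two visible GPUs):

```python
# Illustrative only: shard the model across 2 GPUs.
optimization_config = {
    "tensor_parallel_size": 2,
}
```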
Pipeline Configuration
- mode:
  - type: string
  - default: "chat"
  - values: "chat", "completion", "embedding"
  - description: Specifies the pipeline mode.
    - "chat": For conversational/chat-based interactions.
    - "completion": For standard text completion tasks.
    - "embedding": For generating vector embeddings from input text.
- loras:
  - type: list of objects
  - default: []
  - description: List of LoRA (Low-Rank Adaptation) adapters to load with the model. Each object should specify the LoRA's unique ID and its source location. See the documentation for more details.
  - Note: If only one LoRA is provided, it is merged with the base model; otherwise, the adapters are loaded dynamically.
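
A minimal pipeline configuration sketch under the same dict-style assumption; the `id` and `source` field names inside each LoRA entry are hypothetical stand-ins for the unique ID and source location mentioned above:

```python
# Illustrative pipeline configuration; LoRA field names ("id", "source") are assumed.
pipeline_config = {
    "mode": "chat",  # "chat", "completion", or "embedding"
    "loras": [
        {"id": "my-adapter", "source": "/path/to/lora"},  # a single adapter is merged into the base model
        # Add a second entry and the adapters are loaded dynamically instead.
    ],
}
```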