Optimization Configuration

{
  "warmups": {
    "enabled": true,
    "iterations": 5,
    "sample_input_data": []
  },
  "backend": {
    "name": "auto",
    "version": "latest",
    "extra_params": {}
  },
  "optimisations": {
    "speculative_decoding": {
      "enabled": false,
      "type": "auto",
      "extra_params": {}
    },
    "attention_caching": {
      "enabled": false,
      "type": "auto",
      "extra_params": {}
    }
  },
  "tensor_parallel_size": 1,
  "quantization": "float16"
}
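
The sketch below is a minimal, hypothetical example of loading this configuration with Python's standard library and overriding a few fields before use. The file name optimization_config.json and the specific override values are assumptions for illustration, not defaults from the documentation.

import json

# Hypothetical file name; adjust to wherever the configuration is stored.
CONFIG_PATH = "optimization_config.json"

with open(CONFIG_PATH, "r", encoding="utf-8") as f:
    config = json.load(f)

# Example overrides (assumed values, not documented defaults):
config["warmups"]["iterations"] = 10           # run more warmup passes
config["tensor_parallel_size"] = 2             # shard the model across two GPUs
config["optimisations"]["speculative_decoding"]["enabled"] = True

with open(CONFIG_PATH, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)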

Quantization Types

  1. Float 32 (FP32)
    • Full precision.
    • Highest accuracy.
    • Maximum memory usage.
  2. Float 16 (FP16)
    • Reduced precision.
    • Minimal accuracy loss.
    • Recommended for most use cases.
    • Balances performance and accuracy.
  3. Float 8 (FP8)
    • Further reduced precision beyond FP16.
    • Hardware limitations:
      • Not supported on A100 (Ampere) GPUs.
      • Available only on H100 GPUs.
  4. INT4 Quantization
    • Extreme compression.
    • Substantial memory reduction.
    • Noticeable accuracy degradation.
  5. AWQ (Activation-aware Weight Quantization)
    • Low-bit weight-only compression guided by activation statistics.
    • Substantial memory savings with minimal accuracy loss.
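
As a rough sketch, the helper below checks a requested quantization value against the five types listed above before it is written into the configuration. Only "float16" appears in the sample configuration; the other string identifiers ("float32", "fp8", "int4", "awq") and the GPU-name check are assumptions used for illustration.

# Hypothetical validator; string identifiers other than "float16"
# (which appears in the sample configuration) are assumptions.
SUPPORTED_QUANTIZATION = {"float32", "float16", "fp8", "int4", "awq"}

def select_quantization(value: str, gpu_name: str = "A100") -> str:
    """Return a validated quantization setting for the config."""
    if value not in SUPPORTED_QUANTIZATION:
        raise ValueError(f"Unknown quantization type: {value!r}")
    # FP8 is documented as unavailable on A100 GPUs.
    if value == "fp8" and "H100" not in gpu_name:
        raise ValueError("FP8 quantization requires an H100 GPU.")
    return value

print(select_quantization("float16"))               # -> "float16"
print(select_quantization("fp8", gpu_name="H100"))  # -> "fp8" (assumed identifier)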

Model Configuration

{
  "type": "llm",
  "loras": [],
  "lora_repo": {
    "type": "",
    "path": "",
    "ownership": "",
    "secret": {
      "type": ""
    }
  },
  "quantized_model_path": {
    "type": "",
    "path": "",
    "ownership": "",
    "secret": {
      "type": ""
    }
  }
}
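
As an illustration only, the snippet below fills the model configuration with hypothetical values for serving an LLM with a single LoRA adapter pulled from a private repository. The adapter name, repository type, path, ownership, and secret type strings are placeholders, not documented values.

import json

# Hypothetical example values; the accepted strings for "type", "ownership",
# and "secret.type" are assumptions and may differ in the real schema.
model_config = {
    "type": "llm",
    "loras": ["my-task-adapter"],            # placeholder adapter name
    "lora_repo": {
        "type": "huggingface",               # assumed repository type
        "path": "example-org/example-lora",  # placeholder repository path
        "ownership": "private",              # assumed ownership value
        "secret": {"type": "hf_token"},      # assumed secret type
    },
    "quantized_model_path": {
        "type": "",
        "path": "",
        "ownership": "",
        "secret": {"type": ""},
    },
}

print(json.dumps(model_config, indent=2))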