Creating a Quality Benchmark

Start a New Benchmark

  1. Go to Benchmarking → Create.
  2. Choose Quality as the benchmark type.
  3. Select LLM as the model type.

General Information

  • Evaluation Name — A descriptive name that identifies this evaluation.
  • Select Deployments — Choose one or more deployments to evaluate.

Dataset Configuration

  • Select Datasets — Pick one or more datasets (e.g., gsm8k).

Generation Configuration

  • Max Tokens — Maximum tokens the model can generate per response.
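  • Temperature — Controls sampling randomness; lower values make output more deterministic, higher values make it more varied.
  • Top P — Nucleus sampling threshold; restricts sampling to the most likely tokens whose cumulative probability reaches this value (e.g., 0.9 keeps the top 90% of probability mass).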
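
To make the Temperature and Top P settings concrete, here is a minimal sketch of how the two are commonly combined when picking the next token. It is a generic illustration of temperature scaling and nucleus sampling, not the platform's implementation; the function names `top_p_filter` and `sample_token` are hypothetical.

```python
import math
import random


def top_p_filter(logits, temperature=1.0, top_p=0.9):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling: dividing by a small temperature sharpens the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens so their probabilities sum to 1.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Draw one token index from the filtered distribution."""
    dist = top_p_filter(logits, temperature, top_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
```

With the example logits `[2.0, 1.0, 0.0, -1.0]` and Top P of 0.9, a temperature of 1.0 leaves three candidate tokens, while a temperature of 0.5 leaves only two. This is why lower temperatures give more repeatable answers on benchmarks.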

Execution Configuration

  • Batch Size — Number of requests processed together.
  • Evaluation Limit — Maximum number of dataset samples to evaluate (e.g., 10).
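
The interaction between the two execution settings can be sketched as follows. This assumes the evaluation limit is applied to the sample stream before batching. That is a plausible reading of the descriptions above, but the platform's exact behavior (for example, whether the limit applies per dataset) is not stated here. The helper name `limited_batches` is hypothetical.

```python
from itertools import islice


def limited_batches(samples, batch_size, evaluation_limit=None):
    """Yield lists of at most batch_size samples, stopping after evaluation_limit samples."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # islice with None as the stop value lets every sample through.
    limited = islice(samples, evaluation_limit)
    while True:
        batch = list(islice(limited, batch_size))
        if not batch:
            return
        yield batch
```

Under this assumption, an evaluation limit of 10 with a batch size of 4 yields three batches of 4, 4, and 2 samples, so a small limit is a cheap way to smoke-test a configuration before a full run.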

Run the Evaluation

  • Click Create Benchmark to start.
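
The form fields described above can be summarized in a small data structure, which is handy for recording the settings used in a run. This is only an illustrative sketch: the class `QualityBenchmarkSettings` is hypothetical and does not correspond to any platform API, the default values are placeholders, and the ranges checked are common conventions for these parameters and not documented platform limits.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QualityBenchmarkSettings:
    # Hypothetical container: field names mirror the form labels, not a real API.
    evaluation_name: str
    deployments: List[str]
    datasets: List[str]
    max_tokens: int = 512          # placeholder default
    temperature: float = 0.0       # placeholder default
    top_p: float = 0.9             # placeholder default
    batch_size: int = 8            # placeholder default
    evaluation_limit: Optional[int] = None  # None means evaluate every sample

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; an empty list means the settings look sane."""
        issues = []
        if not self.evaluation_name.strip():
            issues.append("Evaluation Name is empty")
        if not self.deployments:
            issues.append("select at least one deployment")
        if not self.datasets:
            issues.append("select at least one dataset")
        if self.max_tokens < 1:
            issues.append("Max Tokens must be at least 1")
        if self.temperature < 0:
            issues.append("Temperature must not be negative")
        if not 0 < self.top_p <= 1:
            issues.append("Top P must be in (0, 1]")
        if self.batch_size < 1:
            issues.append("Batch Size must be at least 1")
        if self.evaluation_limit is not None and self.evaluation_limit < 1:
            issues.append("Evaluation Limit must be at least 1")
        return issues
```

For example, `QualityBenchmarkSettings("gsm8k-smoke-test", ["my-deployment"], ["gsm8k"], evaluation_limit=10).problems()` returns an empty list; the deployment name here is made up.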