Creating a Quality Benchmark
Start a New Benchmark
- Go to Benchmarking → Create.
- Choose Quality as the benchmark type.

- Select LLM as the model type.

General Information
- Evaluation Name — A descriptive name that identifies this evaluation run.
- Select Deployments — Choose one or more deployments to evaluate.
Dataset Configuration
- Select Datasets — Pick one or more datasets (e.g., gsm8k).

Generation Configuration
- Max Tokens — Maximum tokens the model can generate per response.
- Temperature — Controls randomness; lower values make output more focused and deterministic, higher values make it more varied.
- Top P — Nucleus sampling; restricts sampling to the smallest set of most likely tokens whose cumulative probability reaches this value (e.g., 0.9 keeps the top 90% of probability mass).
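
To make the Temperature and Top P settings concrete, here is a minimal sketch of how these two parameters are commonly applied when sampling a token. It is an illustration only, not the platform's implementation; the function names (`apply_temperature`, `top_p_filter`, `sample_token`) and the example logits are assumptions introduced for this sketch.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Divide logits by the temperature and return softmax probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index using temperature scaling and nucleus filtering."""
    probs = apply_temperature(logits, temperature)
    candidates = top_p_filter(probs, top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


# Hypothetical logits for a four-token vocabulary.
example_logits = [2.0, 1.0, 0.0, -1.0]
example_probs = apply_temperature(example_logits, temperature=1.0)
nucleus = top_p_filter(example_probs, top_p=0.9)
```

With these example logits, a Top P of 0.9 drops only the least likely token, while a lower temperature concentrates more probability on the most likely one.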

Execution Configuration
- Batch Size — Number of requests processed together in one batch.
- Evaluation Limit — Maximum number of dataset samples to evaluate (e.g., 10).
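
The interaction between Evaluation Limit and Batch Size can be sketched as follows. This is a hedged illustration of the general idea (truncate the dataset first, then split what remains into batches); the helper names `limit_samples` and `batched` are hypothetical, and the platform may schedule requests differently.

```python
from itertools import islice


def limit_samples(samples, evaluation_limit=None):
    """Return at most evaluation_limit samples; None means no limit."""
    if evaluation_limit is None:
        return list(samples)
    return list(islice(samples, evaluation_limit))


def batched(samples, batch_size):
    """Yield consecutive slices of samples, each with up to batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Hypothetical dataset of 25 sample IDs, limited to 10 and batched by 4.
dataset = list(range(25))
limited = limit_samples(dataset, evaluation_limit=10)
batches = list(batched(limited, batch_size=4))
batch_sizes = [len(b) for b in batches]
```

In this example the limit of 10 leaves three batches of 4, 4, and 2 samples, so a small Evaluation Limit is a quick way to smoke-test a configuration before running the full dataset.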
Run the Evaluation
- Click Create Benchmark to start.