Creating a Quality Benchmark

Start a New Benchmark

Go to Benchmarking → Create.
Choose Quality as the benchmark type.

Select LLM as the model type.

General Information

Evaluation Name — Name for this evaluation.
Select Deployments — Choose one or more deployments to evaluate.

Dataset Configuration

Select Datasets — Pick one or more datasets (e.g., gsm8k).

Generation Configuration

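Max Tokens — Maximum number of tokens the model can generate per response.
Temperature — Controls sampling randomness; lower values give more focused, deterministic output, higher values give more varied output.
Top P — Nucleus sampling; restricts sampling to the smallest set of most likely tokens whose cumulative probability reaches this value (e.g., 0.9 keeps the top 90% of probability mass).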
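
To make the effect of Temperature and Top P concrete, here is a minimal sketch on a toy four-token distribution. It is an illustration of the general sampling techniques, not the platform's implementation; the function names and the example logits are hypothetical.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]

sharp = apply_temperature(logits, 0.5)  # low temperature: mass concentrates on token 0
flat = apply_temperature(logits, 2.0)   # high temperature: mass spreads out
nucleus = top_p_filter(apply_temperature(logits, 1.0), 0.9)

print("T=0.5:", [round(p, 3) for p in sharp])
print("T=2.0:", [round(p, 3) for p in flat])
print("top_p=0.9 keeps token indices:", sorted(nucleus))
```

With these logits, the least likely token falls outside the 0.9 nucleus and can never be sampled, while the remaining three are renormalized to sum to 1.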

Execution Configuration

Batch Size — Number of requests processed together in one batch.
Evaluation Limit — Maximum number of dataset samples to evaluate (e.g., 10).
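
The interaction between these two settings can be sketched as follows. This assumes the limit is applied to the dataset before batching, which the page does not state; `iter_batches` and the sample names are hypothetical and only illustrate the idea.

```python
from itertools import islice

def iter_batches(samples, batch_size, limit=None):
    """Yield lists of up to batch_size samples, stopping once `limit`
    samples have been consumed (limit=None means no cap)."""
    capped = islice(iter(samples), limit)
    while True:
        batch = list(islice(capped, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical dataset of 25 samples.
samples = [f"question-{i}" for i in range(25)]

# Evaluation Limit = 10, Batch Size = 4.
batches = list(iter_batches(samples, batch_size=4, limit=10))
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {len(batch)} samples")
```

Under this assumption, a limit of 10 with a batch size of 4 produces three batches, the last one partially filled.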

Run the Evaluation

Click Create Benchmark to start.
