
Category:

Evaluation (LLM / Agent), Evaluation & Benchmarking

Definition

Measuring performance, reliability, and safety of LLMs and agents.

Explanation

Evaluation assesses how well LLMs and agents perform across accuracy, reasoning, tool use, safety, latency, and cost. Enterprise evaluation combines benchmarks, synthetic tests, human review, and production monitoring.
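
To make the "synthetic tests" part of the workflow concrete, a common pattern is to expand a few seed questions into many phrased variants. A minimal plain-Python sketch; the SEEDS, TEMPLATES, and synthesize names are invented for illustration:

    # Sketch only: seed data and templates are invented; real suites would
    # draw seeds from logged traffic or subject-matter experts.
    SEEDS = [
        {"question": "What is our refund window?", "expected": "30 days"},
        {"question": "Which plan includes SSO?", "expected": "Enterprise"},
    ]

    TEMPLATES = [
        "{q}",
        "Quick question: {q}",
        "A customer asks: {q} Answer briefly.",
    ]

    def synthesize(seeds, templates):
        """Cross seed questions with phrasing templates to widen coverage."""
        for seed in seeds:
            for template in templates:
                yield {
                    "input": template.format(q=seed["question"]),
                    "expected": seed["expected"],
                }

    test_cases = list(synthesize(SEEDS, TEMPLATES))
    print(f"{len(test_cases)} synthetic cases from {len(SEEDS)} seeds")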

Technical Architecture

Test Cases → LLM/Agent → Metrics → Scorecard
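
The arrow diagram maps directly onto a small harness. A hedged sketch, assuming a model_fn callable you supply and a toy exact_match metric; production harnesses would plug in richer metrics such as LLM-as-judge scoring:

    import time

    def exact_match(output: str, expected: str) -> float:
        """Toy metric: 1.0 if the expected answer appears in the output."""
        return 1.0 if expected.lower() in output.lower() else 0.0

    def run_eval(model_fn, test_cases):
        """Test Cases -> LLM/Agent -> Metrics -> Scorecard."""
        scores, latencies = [], []
        for case in test_cases:
            start = time.perf_counter()
            output = model_fn(case["input"])      # the LLM/agent under test
            latencies.append(time.perf_counter() - start)
            scores.append(exact_match(output, case["expected"]))
        return {                                  # the scorecard
            "accuracy": sum(scores) / len(scores),
            "p50_latency_s": sorted(latencies)[len(latencies) // 2],
            "n_cases": len(scores),
        }

    # Usage with a stub model; swap in a real API or agent call in practice.
    stub = lambda prompt: "Our refund window is 30 days."
    cases = [{"input": "What is our refund window?", "expected": "30 days"}]
    print(run_eval(stub, cases))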

Core Components

Benchmarks, metrics, evaluation harness, dashboards

Use Cases

Vendor selection, regression testing, governance

Pitfalls

Benchmarks that do not match real use cases, so scores fail to predict production behavior

LLM Keywords

LLM Evaluation, Agent Evaluation

Related Concepts

• Agent Benchmarks
• Synthetic Benchmarking
• Observability

Related Frameworks

• OpenAI Evals
• DeepEval (usage sketch below)
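
For the frameworks above, a usage sketch in the style of DeepEval's documented quickstart; exact class names, defaults, and the judge model vary across versions, so treat this as illustrative rather than definitive:

    # Based on DeepEval's quickstart pattern. AnswerRelevancyMetric scores
    # the case with an LLM judge (an OpenAI key by default), so this needs
    # credentials and is typically run via `deepeval test run <file>.py`.
    from deepeval import assert_test
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    def test_refund_answer():
        test_case = LLMTestCase(
            input="What if these shoes don't fit?",
            actual_output="We offer a 30-day full refund at no extra cost.",
        )
        metric = AnswerRelevancyMetric(threshold=0.7)  # threshold is illustrative
        assert_test(test_case, [metric])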
