
Category:
Model Compression Techniques
Category:
Model Optimization
Definition
Methods that shrink LLMs while preserving as much accuracy as possible.
Explanation
Model compression reduces model size, inference cost, and latency while maintaining acceptable performance. Common techniques include quantization, pruning, knowledge distillation, and low-rank factorization. Enterprises use compression to deploy models on edge devices, reduce GPU usage, and optimize multi-model routing systems.
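As a concrete illustration, the following is a minimal sketch of post-training dynamic quantization in PyTorch; the toy model, the choice of Linear layers, and the 8-bit target are illustrative assumptions rather than a prescribed recipe.

import os
import torch
import torch.nn as nn

# Stand-in for a large model; in practice this would be a loaded LLM checkpoint.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)

# Convert Linear weights to 8-bit integers; activations are quantized dynamically at runtime.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m: nn.Module) -> float:
    # Rough on-disk size of the serialized state dict, in megabytes.
    torch.save(m.state_dict(), "tmp.pt")
    mb = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return mb

print(f"original:  {size_mb(model):.1f} MB")
print(f"quantized: {size_mb(quantized):.1f} MB")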
Technical Architecture
Large Model → Compression Pipeline (Quantization / Pruning / Distillation) → Smaller Optimized Model
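One possible sketch of a single pipeline stage, here magnitude (L1) unstructured pruning with torch.nn.utils.prune; the toy layer size and 30% sparsity target are arbitrary assumptions.

import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)  # stand-in for one layer of a larger model

# Zero out the 30% of weights with the smallest absolute value (magnitude pruning).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Bake the mask into the weights and remove the reparameterization hooks.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.1%}")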
Core Components
Quantizer, pruner, teacher model, student model, evaluation harness
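A minimal sketch of how the teacher model, student model, and a distillation loss fit together, assuming toy PyTorch models and an arbitrary temperature; a real pipeline would follow this with the evaluation harness.

import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, vocab = 512, 1000  # toy dimensions (assumptions)
teacher = nn.Sequential(nn.Linear(hidden, 2048), nn.ReLU(), nn.Linear(2048, vocab))
student = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, vocab))

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
temperature = 2.0  # softens both output distributions

def distill_step(batch: torch.Tensor) -> float:
    # The frozen teacher provides soft targets; the student is trained to match them.
    with torch.no_grad():
        teacher_logits = teacher(batch)
    student_logits = student(batch)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(distill_step(torch.randn(8, hidden)))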
Use Cases
Edge deployment, enterprise cost reduction, real-time agents
Pitfalls
Accuracy loss; degraded reasoning; limited context size
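To catch these regressions, a compression pipeline typically compares the compressed model against the original on held-out data. A minimal sketch of such a check follows, using placeholder models and random data standing in for real checkpoints and an evaluation set.

import torch
import torch.nn as nn
import torch.nn.functional as F

def mean_loss(model: nn.Module, inputs: torch.Tensor, targets: torch.Tensor) -> float:
    # Average cross-entropy on a held-out batch; lower is better.
    with torch.no_grad():
        return F.cross_entropy(model(inputs), targets).item()

# Placeholder data and models (assumptions, not real checkpoints).
inputs = torch.randn(32, 512)
targets = torch.randint(0, 1000, (32,))
original = nn.Linear(512, 1000)
compressed = torch.quantization.quantize_dynamic(original, {nn.Linear}, dtype=torch.qint8)

print("original loss:  ", mean_loss(original, inputs, targets))
print("compressed loss:", mean_loss(compressed, inputs, targets))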
LLM Keywords
Model Compression, Quantization, Pruning, Small LLMs
Related Concepts
• Distillation
• MoE
• Routing Models
• Compression Workflow Diagram
