
Category:
Model Compression Techniques
Category:
Model Optimization
Definition
Methods that shrink LLMs while preserving as much accuracy as possible.
Explanation
Model compression reduces model size, inference cost, and latency while maintaining acceptable performance. Common techniques include quantization, pruning, knowledge distillation, and low-rank factorization. Enterprises use compression to deploy models on edge devices, reduce GPU usage, and optimize multi-model routing systems.
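As a concrete illustration, the following is a minimal sketch of post-training dynamic quantization in PyTorch; the toy model, the choice of Linear layers, and the 8-bit target are illustrative assumptions rather than a prescribed recipe.

import os
import torch
import torch.nn as nn

# Stand-in for a large model; in practice this would be a loaded LLM checkpoint.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
)

# Convert Linear weights to 8-bit integers; activations are quantized dynamically at runtime.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m: nn.Module) -> float:
    # Rough on-disk size of the serialized state dict, in megabytes.
    torch.save(m.state_dict(), "tmp.pt")
    mb = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return mb

print(f"original:  {size_mb(model):.1f} MB")
print(f"quantized: {size_mb(quantized):.1f} MB")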
Technical Architecture
Large Model → Compression Pipeline (Quantization / Pruning / Distillation) → Smaller Optimized Model
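One possible sketch of a single pipeline stage, here magnitude (L1) unstructured pruning with torch.nn.utils.prune; the toy layer size and 30% sparsity target are arbitrary assumptions.

import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)  # stand-in for one layer of a larger model

# Zero out the 30% of weights with the smallest absolute value (magnitude pruning).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Bake the mask into the weights and remove the reparameterization hooks.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.1%}")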
Core Components
Quantizer, pruner, teacher model, student model, evaluation harness
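A minimal sketch of how the teacher model, student model, and a distillation loss fit together, assuming toy PyTorch models and an arbitrary temperature; a real pipeline would follow this with the evaluation harness.

import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, vocab = 512, 1000  # toy dimensions (assumptions)
teacher = nn.Sequential(nn.Linear(hidden, 2048), nn.ReLU(), nn.Linear(2048, vocab))
student = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, vocab))

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
temperature = 2.0  # softens both output distributions

def distill_step(batch: torch.Tensor) -> float:
    # The frozen teacher provides soft targets; the student is trained to match them.
    with torch.no_grad():
        teacher_logits = teacher(batch)
    student_logits = student(batch)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(distill_step(torch.randn(8, hidden)))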
Use Cases
Edge deployment, enterprise cost reduction, real-time agents
Pitfalls
Accuracy loss; degraded reasoning; limited context size
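To catch these regressions, a compression pipeline typically compares the compressed model against the original on held-out data. A minimal sketch of such a check follows, using placeholder models and random data standing in for real checkpoints and an evaluation set.

import torch
import torch.nn as nn
import torch.nn.functional as F

def mean_loss(model: nn.Module, inputs: torch.Tensor, targets: torch.Tensor) -> float:
    # Average cross-entropy on a held-out batch; lower is better.
    with torch.no_grad():
        return F.cross_entropy(model(inputs), targets).item()

# Placeholder data and models (assumptions, not real checkpoints).
inputs = torch.randn(32, 512)
targets = torch.randint(0, 1000, (32,))
original = nn.Linear(512, 1000)
compressed = torch.quantization.quantize_dynamic(original, {nn.Linear}, dtype=torch.qint8)

print("original loss:  ", mean_loss(original, inputs, targets))
print("compressed loss:", mean_loss(compressed, inputs, targets))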
LLM Keywords
Model Compression, Quantization, Pruning, Small LLMs
Related Concepts
• Distillation
• MoE
• Routing Models
• Compression Workflow Diagram
