ExLlama
A more memory-efficient rewrite of the Hugging Face Transformers implementation of Llama, for use with quantized weights.
High-throughput inference servers and local runtime stacks tuned for GPUs.
Projects
ExLlama: A more memory-efficient rewrite of the Hugging Face Transformers implementation of Llama, for use with quantized weights.
Text Generation Inference (TGI): Inference for Large Language Model text generation.
Triton Inference Server: Provides an optimized cloud and edge inferencing solution.
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs.