Repository Details
A More Efficient LLM Inference and Serving Engine
Free · Apache-2.0
Stars: 33k
Chinese: No
Language: Python
Active: Yes
Contributors: 750
Issues: 2k
Organization: Yes
Latest: 0.6.6.post1
Forks: 5k
License: Apache-2.0
vLLM is a highly efficient and user-friendly large language model (LLM) inference engine, designed to address slow inference speeds and low resource utilization. Built on PyTorch and CUDA, it combines a memory-optimization algorithm (PagedAttention), computational-graph optimization, and model parallelism to significantly reduce GPU memory usage and make full use of multi-GPU resources for higher inference throughput. vLLM is also seamlessly compatible with Hugging Face (HF) models and runs efficiently on a variety of hardware platforms such as GPUs, CPUs, and TPUs, making it suitable for real-time question answering, text generation, and recommendation systems.
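To make the workflow concrete, below is a minimal sketch of offline batch inference with vLLM's Python API, assuming vLLM is installed (pip install vllm) and a GPU with enough memory is available; the model name is only illustrative and can be swapped for any HF-compatible model.

```python
# Minimal offline-inference sketch with vLLM (illustrative model name).
from vllm import LLM, SamplingParams

prompts = [
    "Explain PagedAttention in one sentence.",
    "Write a haiku about GPUs.",
]

# Sampling parameters control decoding (temperature, nucleus sampling, length).
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# LLM loads the weights and manages GPU memory via PagedAttention.
llm = LLM(model="facebook/opt-125m")

# generate() batches all prompts and returns one result per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The same engine can also be exposed as an OpenAI-compatible HTTP server for online serving, which is the usual deployment path for the real-time use cases mentioned above.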
Included in: Vol.105
Tags: AI, CUDA, LLM, Python