Releases · runpod-workers/worker-vllm
v1.1.0
- Major update: vLLM v0.4.2 -> v0.5.3.
- Adds support for Llama 3.1 models.
- Various improvements and bug fixes.
[Known Issue]: OpenAI Completions requests return an error.
1.0.1
Hotfix adding backwards compatibility for the deprecated `max_context_len_to_capture` engine argument.
1.0.0
Worker vLLM 1.0.0 - What's Changed
- vLLM version 0.3.3 -> 0.4.2
- Various improvements and bug fixes
0.3.2
Worker vLLM 0.3.2 - What's Changed
- vLLM version 0.3.2 -> 0.3.3
- StarCoder2 support
- Performance optimization for Gemma
- 2/3/8-bit GPTQ support
- Integrate Marlin Kernels for Int4 GPTQ inference
- Performance optimization for MoE kernel
- Updated and refactored base image, sampling parameters, etc.
- Various bug fixes
0.3.1
Bug Fixes
- Loading a previously downloaded model when Hugging Face is down
- Building the image without a GPU
- Model and tokenizer revision names
0.3.0
Worker vLLM 0.3.0: What's New since 0.2.3:
- 🚀 Full OpenAI Compatibility 🚀
  You may now use your deployment with any OpenAI codebase by changing only 3 lines in total. The supported routes are Chat Completions, Completions, and Models, with both streaming and non-streaming (see the sketch at the end of this section).
- Dynamic Batch Size: time-to-first-token as fast as with no batching, while maintaining the throughput of batched token streaming for the rest of the request.
- vLLM 0.2.7 -> 0.3.2
- Gemma, DeepSeek MoE and OLMo support.
- FP8 KV Cache support
- New supported parameters
- We're working on adding support for Multi-LoRA ⚙️
- Support for a wide range of new settings for your endpoint.
- Fixed Tensor Parallelism, baking the model into the image, and other bugs.
- Refactors and general improvements.
Full Changelog: 0.2.3...0.3.0
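As a sketch of the OpenAI compatibility above: the snippet below points the official `openai` Python client at a worker-vllm deployment by changing only the API key, base URL, and model name. The base URL pattern, endpoint ID, and model name are placeholders/assumptions, not values confirmed by this release note.

```python
# Minimal sketch: reuse existing OpenAI client code against a worker-vllm endpoint.
# The base_url pattern, endpoint ID, and model name below are assumptions/placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",                                    # change 1: your RunPod API key
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",   # change 2: assumed OpenAI-compatible route
)

# Chat Completions route (non-streaming)
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # change 3: placeholder model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)

# Streaming works the same way with stream=True
stream = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```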
0.2.3
Worker vLLM 0.2.3 - What's Changed
Various bug fixes
New Contributors
- @casper-hansen made their first contribution in #39
- @willsamu made their first contribution in #45
0.2.2
Worker vLLM 0.2.2 - What's New
- Custom Chat Templates: you may now specify a Jinja chat template with an environment variable (a sketch follows the fixes below).
- Custom Tokenizer
Fixes:
- Tensor Parallel/Multi-GPU Deployment
- Baking the model into the image. Previously, the worker would download the model every time, ignoring the baked-in model.
- Crashes due to `MAX_PARALLEL_LOADING_WORKERS`
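To illustrate the custom chat template feature: a Jinja chat template maps a list of chat messages to a single prompt string. The release note does not name the environment variable, so the sketch below simply renders an illustrative template with `jinja2` to show the expected shape; the template text and messages are assumptions, not the worker's defaults.

```python
# Minimal sketch of what a Jinja chat template does: it turns a message list
# into one prompt string. The template below is illustrative only; supply your
# own template to the worker via its environment variable.
from jinja2 import Template

CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "assistant:"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is vLLM?"},
]

# Roughly what happens to the messages before they reach the vLLM engine.
print(Template(CHAT_TEMPLATE).render(messages=messages))
```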
0.2.1
Worker vLLM 0.2.1 - What's New
- Added OpenAI Chat Completions-formatted output for non-streaming use (previously only supported for streaming).
0.2.0
Worker vLLM 0.2.0 - What's New
- You no longer need a Linux-based machine or NVIDIA GPUs to build the worker.
- Over 3x lighter Docker image size.
- OpenAI Chat Completion output format (optional to use).
- Fast image build time.
- Docker Secrets-protected Hugging Face token support for building the image with a model baked in without exposing your token.
- Support for `n` and `best_of` sampling parameters, which allow you to generate multiple responses from a single prompt (see the sketch below).
- New environment variables for various configuration options.
- vLLM Version: 0.2.7
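A hedged sketch of requesting multiple completions from one prompt with the `n` and `best_of` sampling parameters. The payload shape ("input" -> "prompt" / "sampling_params") and the /runsync URL are assumptions about the worker's serverless input format, not confirmed by this release note; adjust to match your deployment.

```python
# Sketch: ask the endpoint for multiple completions from a single prompt.
# Payload shape and URL are assumptions; endpoint ID and API key are placeholders.
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"   # placeholder
API_KEY = "YOUR_RUNPOD_API_KEY"    # placeholder

payload = {
    "input": {
        "prompt": "Write a one-line haiku about GPUs.",
        "sampling_params": {
            "n": 2,        # return two completions
            "best_of": 4,  # sample four candidates, keep the best two
            "max_tokens": 64,
            "temperature": 0.8,
        },
    }
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
print(resp.json())
```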