vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs


Tags (20)
v0.18.1rc0

[Hybrid] calling get_mamba_groups() once at MambaCopyBuffers.create() (#37318) Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com>

v0.18.0

[cherry-pick][Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing (#37591) (#37605) Signed-off-by: khluu <khluu000@gmail.com>

v0.18.0rc2

[Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding (#37442) Signed-off-by: Elvir Crncevic <elvircrn@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> (cherry picked from commit ef2c4f778df5aa07a44e663330e2dfdc16927d2a)

v0.18.0rc1

[cherry-pick][Bugfix] Fix EP weight filter breaking EPLB and NVFP4 accuracy (#37322) Signed-off-by: khluu <khluu000@gmail.com>

v0.17.2rc0

[ROCm] Fix AttributeError for torch.compiler.skip_all_guards_unsafe on older PyTorch (#37219) Signed-off-by: Andreas Karatzas <akaratza@amd.com>

v0.18.0rc0

[Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors (#36549)

v0.17.1

[NemotronH] Small fix reasoning parser (#36635) Signed-off-by: Roi Koren <roik@nvidia.com> (cherry picked from commit e661b9ee83d9d3c6c84c4e1acbe7e0280832e7c4)

v0.17.1rc0

[CI] Bump `mypy` version to 1.19.1 (#36104) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

v0.17.0rc1

Bound openai to under 2.25.0 Signed-off-by: khluu <khluu000@gmail.com>

v0.17.0

Bound openai to under 2.25.0 Signed-off-by: khluu <khluu000@gmail.com>

v0.17.0rc0

[Bugfix] Improve engine ready timeout error message (#35616) Signed-off-by: damaozi <1811866786@qq.com>

v0.16.1rc0

[Test] Add tests for n parameter in chat completions API (#35283) Signed-off-by: KrxGu <krishom70@gmail.com>

v0.16.0

[ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility (#34447) Signed-off-by: Andreas Karatzas <akaratza@amd.com> (cherry picked from commit 4c078fa546016eacab87f833ff625463421f7d29) (cherry picked from commit a976961fb77d38129abf69edd4952101731f2421)

v0.16.0rc3

[Bugfix] Fix MTP accuracy for GLM-5 (#34385) Signed-off-by: mgoin <mgoin64@gmail.com> (cherry picked from commit ec12d39d44739bee408ec1473acc09e75daf1a5d)

v0.16.0rc2

Patch protobuf for CVE-2026-0994 (#34253) Signed-off-by: Seiji Eicher <seiji@anyscale.com> Co-authored-by: Kevin H. Luu <khluu000@gmail.com> (cherry picked from commit 5045d5c9831a3a4a423a409ccea521d299a43a9a)

v0.16.0rc1

[Frontend][last/5] Make pooling entrypoints request schema consensus. (#31127) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>

v0.15.2rc0

[Bugfix] Disable TRTLLM attention when KV transfer is enabled (#33192) Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>

v0.15.1rc1

[BugFix][Spec Decoding] Fix negative accepted tokens metric crash (#33729) Signed-off-by: Nick Hill <nickhill123@gmail.com>

v0.15.1

[BugFix][Spec Decoding] Fix negative accepted tokens metric crash (#33729) Signed-off-by: Nick Hill <nickhill123@gmail.com>

v0.15.1rc0

[torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding (#33624) Signed-off-by: Richard Zou <zou3519@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> (cherry picked from commit 5eac9a1b341b93478d0d0d57239c92edd18ad19e)