A high-throughput and memory-efficient inference and serving engine for LLMs
TAGS
20 tags
[Hybrid] calling get_mamba_groups() once at MambaCopyBuffers.create() (#37318) Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com>
[cherry-pick][Bugfix] Disable monolithic TRTLLM MoE for Renormalize routing (#37591) #37605 Signed-off-by: khluu <khluu000@gmail.com>
[Bugfix] Zero-init MLA attention output buffers to prevent NaN from CUDA graph padding (#37442) Signed-off-by: Elvir Crncevic <elvircrn@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> (cherry picked from commit ef2c4f778df5aa07a44e663330e2dfdc16927d2a)
[cherry-pick][Bugfix] Fix EP weight filter breaking EPLB and NVFP4 accuracy #37322 Signed-off-by: khluu <khluu000@gmail.com>
[ROCm] Fix AttributeError for torch.compiler.skip_all_guards_unsafe on older PyTorch (#37219) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
[Bugfix][MultiConnector] Fix MultiConnector for SupportsHMA sub-connectors (#36549)
[NemotronH] Small fix reasoning parser (#36635) Signed-off-by: Roi Koren <roik@nvidia.com> (cherry picked from commit e661b9ee83d9d3c6c84c4e1acbe7e0280832e7c4)
[CI] Bump `mypy` version to 1.19.1 (#36104) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Bound openai to under 2.25.0 Signed-off-by: khluu <khluu000@gmail.com>
[Bugfix] Improve engine ready timeout error message (#35616) Signed-off-by: damaozi <1811866786@qq.com>
[Test] Add tests for n parameter in chat completions API (#35283) Signed-off-by: KrxGu <krishom70@gmail.com>
[ROCm][CI] Pin TorchCodec to v0.10.0 for ROCm compatibility (#34447) Signed-off-by: Andreas Karatzas <akaratza@amd.com> (cherry picked from commit 4c078fa546016eacab87f833ff625463421f7d29) (cherry picked from commit a976961fb77d38129abf69edd4952101731f2421)
[Bugfix] Fix MTP accuracy for GLM-5 (#34385) Signed-off-by: mgoin <mgoin64@gmail.com> (cherry picked from commit ec12d39d44739bee408ec1473acc09e75daf1a5d)
Patch protobuf for CVE-2026-0994 (#34253) Signed-off-by: Seiji Eicher <seiji@anyscale.com> Co-authored-by: Kevin H. Luu <khluu000@gmail.com> (cherry picked from commit 5045d5c9831a3a4a423a409ccea521d299a43a9a)
[Frontend][last/5] Make pooling entrypoints request schema consensus. (#31127) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
[Bugfix] Disable TRTLLM attention when KV transfer is enabled (#33192) Signed-off-by: Zhanqiu Hu <zh338@cornell.edu>
[BugFix][Spec Decoding] Fix negative accepted tokens metric crash (#33729) Signed-off-by: Nick Hill <nickhill123@gmail.com>
[torch.compile] Don't do the fast moe cold start optimization if there is speculative decoding (#33624) Signed-off-by: Richard Zou <zou3519@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> (cherry picked from commit 5eac9a1b341b93478d0d0d57239c92edd18ad19e)