DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Fix hook count performance regression from v0.18.5 (#7886)

Fixes performance regressions reported in #7882 and #7885. PR #7780 added dynamic hook count computation for reentrant checkpointing correctness, but placed the call inside every gradient hook closure. For a model with n parameter tensors, this creates significant overhead per backward pass.

Summary:
1. Added a `should_refresh_expected_hook_count()` predicate that returns true only at backward phase boundaries (first hook, or new reentrant phase), so `count_used_parameters_in_backward()` is called once per phase instead of once per hook.
2. Applied this predicate in ZeRO-1/2 (`stage_1_and_2.py`) and both ZeRO-3 hook sites (`stage3.py`), reusing the `cached_max_expected_hooks_seen` value when a refresh isn't needed.
3. Changed `enter_backward()` to reset hook counters on the first real backward entry, preventing pollution from pre-user-backward autograd calls (e.g., `TiledFusedLogitsLoss`).

Benchmark: 24-layer transformer, ~267M params (147 parameter tensors), ZeRO-2, 8×H100 80GB, bf16, batch size 8, 20 warmup + 20 measured iterations:
- Before fix: 0.1265 s/iter
- After fix: 0.0505 s/iter

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Ramya Ramineni <rraminen@users.noreply.github.com>
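The "refresh once per phase instead of once per hook" pattern described above can be sketched as follows. This is an illustrative stand-in, not the actual DeepSpeed implementation: the method names mirror the PR, but the tracker class, its fields, and the demo loop are assumptions.

```python
# Hypothetical sketch of the per-phase refresh predicate from the PR.
class HookCountTracker:
    def __init__(self, count_fn):
        self.count_fn = count_fn             # expensive: walks every parameter tensor
        self.hooks_fired_in_phase = 0
        self.cached_max_expected_hooks_seen = 0
        self.refresh_calls = 0               # instrumentation for this demo only

    def should_refresh_expected_hook_count(self):
        # True only at a backward-phase boundary (the first hook of the phase).
        return self.hooks_fired_in_phase == 0

    def on_gradient_hook(self):
        if self.should_refresh_expected_hook_count():
            self.cached_max_expected_hooks_seen = self.count_fn()
            self.refresh_calls += 1
        self.hooks_fired_in_phase += 1       # subsequent hooks reuse the cached value
        return self.cached_max_expected_hooks_seen

    def enter_backward(self):
        # Reset counters at the start of each real backward pass.
        self.hooks_fired_in_phase = 0


tracker = HookCountTracker(count_fn=lambda: 147)
for _ in range(3):                           # three backward passes
    tracker.enter_backward()
    for _ in range(147):                     # one gradient hook per parameter tensor
        tracker.on_gradient_hook()

print(tracker.refresh_calls)                 # 3: once per pass, not once per hook
```

Without the predicate, `count_fn` would run 441 times in this toy run; with it, the cached value is reused for the remaining 146 hooks of each pass.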
Replace torch.jit.script with torch.compile (#7835) (#7840)

Fixes #7835. On torch==2.10.0, importing DeepSpeed emitted deprecation warnings from import-time JIT-decorated helpers. This change updates the compatibility path to align with PyTorch guidance while keeping the import warning-free.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Update PyTorch to v2.9 for modal tests (#7816)

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
fix: avoid IndexError in BF16_Optimizer.destroy() when using DummyOptim (#7763)

Short-circuit `BF16_Optimizer.destroy()` if `using_real_optimizer` is False. When initialized with `optimizer=None` (DummyOptim), `bf16_groups` remains empty, causing an IndexError when it is accessed in `destroy()`.

Resolves #7752
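The short-circuit can be sketched as below. This is a minimal stand-in, not the real `BF16_Optimizer`: only `using_real_optimizer` and `bf16_groups` come from the fix description; the class body and the placeholder group contents are assumptions.

```python
class DummyOptim:
    """Stand-in for the placeholder optimizer used when none is configured."""


class BF16OptimizerSketch:
    def __init__(self, optimizer=None):
        self.using_real_optimizer = (optimizer is not None
                                     and not isinstance(optimizer, DummyOptim))
        # Groups are populated only when wrapping a real optimizer.
        self.bf16_groups = [["param_group_0"]] if self.using_real_optimizer else []

    def destroy(self):
        if not self.using_real_optimizer:
            return                  # nothing was allocated; avoid bf16_groups[0]
        self.bf16_groups[0].clear()  # stand-in for the real teardown


BF16OptimizerSketch(optimizer=None).destroy()   # no IndexError
```

Before the guard, `destroy()` would index into the empty `bf16_groups` list and raise.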
Wall clock timers API (#7714) Make wall clock timers available to clients. --------- Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
README refresh (#7668) Long overdue --------- Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Update email address (#7624) Update contact address Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
[bugfix] fix partition context unpatch (#7566)

## Fix asymmetric patching/unpatching in InsertPostInitMethodToModuleSubClasses

### Problem Description

The `InsertPostInitMethodToModuleSubClasses` context manager patches the `__init__` methods of model classes on entry and unpatches them on exit. However, asymmetric condition checks between patching and unpatching can introduce subtle inheritance bugs.

### Root Cause Analysis

The issue occurs with classes that have multiple inheritance where:

1. **Child class A** does not override `__init__`
2. **Parent class B** does not inherit from `nn.Module`
3. **Parent class C** inherits from `nn.Module`

**Current asymmetric logic:**

```python
# Patching (entry): only patch classes with an explicit __init__
def _enable_class(cls):
    if '__init__' in cls.__dict__:  # ✅ Strict check
        cls._old_init = cls.__init__
        cls.__init__ = partition_after(cls.__init__)

# Unpatching (exit): restore any class with _old_init
def _disable_class(cls):
    if hasattr(cls, '_old_init'):  # ❌ Permissive check
        cls.__init__ = cls._old_init
```

**Execution flow:**

1. **During entry**: Child A is skipped (no explicit `__init__`), Parent C is patched
2. **During exit**: Child A inherits `_old_init` from Parent C and gets incorrectly "restored"

**Result**: Child A's `__init__` points to Parent C's original `__init__`, bypassing Parent B and breaking the inheritance chain.

### Reproduction Case

This pattern is common in Hugging Face models:

```python
class Qwen3ForSequenceClassification(GenericForSequenceClassification, Qwen3PreTrainedModel):
    pass  # No explicit __init__

# GenericForSequenceClassification - not an nn.Module subclass
# Qwen3PreTrainedModel - inherits from nn.Module
```

### Solution

Apply symmetric condition checking in both patch and unpatch operations:

```python
def _disable_class(cls):
    # Match the patching condition: only restore classes we explicitly patched
    if '__init__' in cls.__dict__ and hasattr(cls, '_old_init'):
        cls.__init__ = cls._old_init
        delattr(cls, '_old_init')  # Optional cleanup
```

This ensures that only classes that were explicitly patched during entry get restored during exit.

### Testing

The fix has been validated against the `Qwen3ForSequenceClassification` reproduction case and resolves the inheritance chain corruption.

### Related Issues

- External issue: https://github.com/modelscope/ms-swift/pull/5820

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
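The asymmetry described above can be reproduced in a self-contained way without torch or transformers. The class names below are illustrative stand-ins (`ModuleLike` plays the role of `nn.Module`), and the patch/unpatch loops imitate the context manager's behavior:

```python
# Minimal reproduction of the asymmetric patch/unpatch bug.
calls = []

class ModuleLike:                # stands in for nn.Module
    def __init__(self):
        calls.append("ModuleLike")

class ParentC(ModuleLike):       # has an explicit __init__, so it gets patched
    def __init__(self):
        calls.append("ParentC")
        super().__init__()

class ParentB:                   # not a ModuleLike subclass
    def __init__(self):
        calls.append("ParentB")
        super().__init__()

class ChildA(ParentB, ParentC):  # no explicit __init__
    pass

def wrap(init):
    def wrapped(self, *args, **kwargs):
        init(self, *args, **kwargs)
    return wrapped

# Entry: strict check -- only ParentC is patched (ChildA defines no __init__).
for cls in (ChildA, ParentC):
    if '__init__' in cls.__dict__:
        cls._old_init = cls.__init__
        cls.__init__ = wrap(cls.__init__)

# Exit with the buggy permissive check: ChildA *inherits* _old_init from
# ParentC, so it is given its own __init__ pointing straight at ParentC's.
for cls in (ChildA, ParentC):
    if hasattr(cls, '_old_init'):
        cls.__init__ = cls._old_init

ChildA()
print(calls)    # ParentB is skipped: ['ParentC', 'ModuleLike']
```

With the symmetric check (`'__init__' in cls.__dict__ and hasattr(cls, '_old_init')`), `ChildA` is left untouched on exit and `ChildA()` follows the normal MRO through `ParentB`.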
Add index to HPU devices (#7497)

[PR #7266](https://github.com/deepspeedai/DeepSpeed/pull/7266) requires devices to carry explicit device indices (e.g., 'hpu:0', 'cuda:0'). This PR brings HPU devices in line with that requirement.

Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
`TiledFusedLogitsLoss` bug fix (#7459)

Bug fix: a tuple and a list were mixed up.
Fix: Adapt Llama injection policy for newer transformers versions (#7443)

This PR fixes an `AttributeError` that occurs during `deepspeed.init_inference` when using kernel injection (`replace_with_kernel_inject=True`) with Llama models from recent versions of `transformers`.

**The Bug:** In newer `transformers` versions (e.g., `4.53.3`), configurations like `num_heads` and `rope_theta` were moved from direct attributes of the `LlamaAttention` module into a nested `config` object. The current DeepSpeed injection policy tries to access these attributes from their old, direct location, causing initialization to fail with `AttributeError: 'LlamaAttention' object has no attribute 'num_heads'`.

**The Solution:** This change updates the Llama injection logic to be more robust:
1. It first tries to read attributes like `num_heads` from the new `config` object location.
2. If that fails, it falls back to the legacy direct attribute path.

Signed-off-by: huanyuqu <yc37960@um.edu.mo>
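The config-first-with-legacy-fallback lookup can be sketched as below. The helper name `get_attn_attr` and the stand-in attention classes are hypothetical; only the idea (try the nested `config`, then fall back to the direct attribute) comes from the fix:

```python
def get_attn_attr(attention_module, name):
    """Read an attention attribute from the nested config (newer transformers
    layout) or fall back to the legacy direct attribute."""
    config = getattr(attention_module, "config", None)
    if config is not None and hasattr(config, name):
        return getattr(config, name)
    # Legacy path: raises AttributeError only if neither location has it.
    return getattr(attention_module, name)


# Minimal stand-ins for the two transformers layouts:
class NewStyleConfig:
    num_heads = 32
    rope_theta = 10000.0

class NewStyleAttention:        # attributes live on a nested config object
    config = NewStyleConfig()

class OldStyleAttention:        # attributes live directly on the module
    num_heads = 32
    rope_theta = 10000.0


print(get_attn_attr(NewStyleAttention(), "num_heads"))   # 32
print(get_attn_attr(OldStyleAttention(), "rope_theta"))  # 10000.0
```

Either layout resolves, so the injection policy no longer depends on which transformers version built the module.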
fix: engine initializes optimizer attributes at the beginning (#7410)

`destroy` accesses `self.optimizer`, but the error that triggers the call to `destroy` can happen in `__init__`, even before the optimizer and scheduler are configured. So we need to assign `self.optimizer` at the top to avoid triggering another exception, e.g.:

```logs
File "deepspeed/runtime/engine.py", line 453, in _configure_tensor_parallel_states
    assert self.zero_optimization_stage(
AssertionError: Currently, the compatibility between 'autotp' and 'zero_stage = 3' has not been validated
Exception ignored in: <function DeepSpeedEngine.__del__ at 0x1516c0610820>
Traceback (most recent call last):
  File "deepspeed/runtime/engine.py", line 509, in __del__
    self.destroy()
  File "deepspeed/runtime/engine.py", line 512, in destroy
    if self.optimizer is not None and hasattr(self.optimizer, 'destroy'):
  File "deepspeed/runtime/engine.py", line 621, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DeepSpeedEngine' object has no attribute 'optimizer'
```

Signed-off-by: Hollow Man <hollowman@opensuse.org>
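The ordering fix amounts to defaulting every attribute that `destroy()` (and hence `__del__`) touches before any step that can raise. The sketch below is hypothetical (`EngineSketch` is not the real `DeepSpeedEngine`), but shows why the early assignment prevents the secondary `AttributeError`:

```python
class EngineSketch:
    def __init__(self, fail=False):
        # Defaults first, so destroy() is always safe to call, even if
        # a later validation step raises before configuration finishes.
        self.optimizer = None
        self.lr_scheduler = None
        if fail:
            # Stand-in for a config-validation assertion failing mid-__init__.
            raise AssertionError("config validation failed")
        # ... the real optimizer would be configured here ...

    def __del__(self):
        self.destroy()

    def destroy(self):
        # Safe: self.optimizer exists even on a partially initialized engine.
        if self.optimizer is not None and hasattr(self.optimizer, "destroy"):
            self.optimizer.destroy()


try:
    EngineSketch(fail=True)
except AssertionError:
    pass   # __del__ no longer raises AttributeError afterwards
```

Moving the assignments below the validation step reintroduces the bug: `__del__` would run on an object that never defined `optimizer`.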
Move pytest pinning from individual tests to requirements-dev.txt until fixed. (#7327)

pytest 8.4.0 seems to break a number of our tests; rather than pinning it in each test individually, we should just put the pin in the requirements file until we resolve the issue.

Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Bump to v0.17.0 (#7324) Co-authored-by: Logan Adams <loadams@microsoft.com>
[XPU] Support XCCL on deepspeed side (#7299)

XCCL will be used for the XPU device on PyTorch 2.8. With this support we can drop torch-ccl on the XPU device, while still reserving the old path for when torch-ccl is enabled.

Signed-off-by: yisheng <yi.sheng@intel.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
rollback #6726 (#7258)

This PR rolls back #6726, which caused https://github.com/deepspeedai/DeepSpeed/issues/7116.

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Make sure it's not None before offloading contiguous_grad_buffer (#7227)

Resolves #7223. When DeepCompile is enabled in ZeRO-3, `contiguous_grad_buffer` is released, so we should check that it is not None before we continue.

https://github.com/deepspeedai/DeepSpeed/blob/227a60c0c412ddf4619401b5d8d9d1674aee17b5/deepspeed/compile/init_z3.py#L22-L25

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
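The guard amounts to the sketch below. Only the buffer name `contiguous_grad_buffer` comes from the PR; the function and the placeholder offload step are hypothetical:

```python
def maybe_offload(contiguous_grad_buffer):
    if contiguous_grad_buffer is None:
        # DeepCompile under ZeRO-3 may have released the buffer already.
        return None
    # Stand-in for copying the buffer out (e.g., to CPU memory).
    return list(contiguous_grad_buffer)


print(maybe_offload(None))          # None, instead of raising
print(maybe_offload([0.1, 0.2]))    # [0.1, 0.2]
```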
DeepCompile for enhanced compiler integration (#7154)

This PR introduces *DeepCompile*, a new feature that efficiently integrates compiler optimizations with other DeepSpeed features. DeepCompile utilizes torch's dynamo to capture the computation graph and modifies it to incorporate DeepSpeed's optimizations seamlessly. Currently, DeepCompile supports ZeRO-1 and ZeRO-3, with enhancements such as proactive prefetching and selective unsharding to improve performance. (More details will be added later.)

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: zafarsadiq <zafarsadiq120@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>