DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
Fix hook count performance regression from v0.18.5 (#7886)

Fixes performance regressions reported in #7882 and #7885. PR #7780 added dynamic hook count computation for reentrant checkpointing correctness, but placed the call inside every gradient hook closure. For a model with n parameter tensors, this creates significant overhead per backward pass.

Summary:
1. Added a `should_refresh_expected_hook_count()` predicate that returns true only at backward phase boundaries (first hook, or new reentrant phase), so `count_used_parameters_in_backward()` is called once per phase instead of once per hook.
2. Applied this predicate in ZeRO-1/2 (`stage_1_and_2.py`) and both ZeRO-3 hook sites (`stage3.py`), reusing the `cached_max_expected_hooks_seen` value when a refresh isn't needed.
3. Changed `enter_backward()` to reset hook counters on the first real backward entry, preventing pollution from pre-user-backward autograd calls (e.g., `TiledFusedLogitsLoss`).

Benchmark: 24-layer transformer, ~267M params (147 parameter tensors), ZeRO-2, 8×H100 80GB, bf16, batch size 8, 20 warmup + 20 measured iterations:
- Before fix: 0.1265 s/iter
- After fix: 0.0505 s/iter

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Ramya Ramineni <rraminen@users.noreply.github.com>
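The "refresh once per phase instead of once per hook" pattern described above can be sketched as follows. This is an illustrative stand-in, not the actual DeepSpeed implementation: the method names mirror the PR, but the tracker class, its fields, and the demo loop are assumptions.

```python
# Hypothetical sketch of the per-phase refresh predicate from the PR.
class HookCountTracker:
    def __init__(self, count_fn):
        self.count_fn = count_fn             # expensive: walks every parameter tensor
        self.hooks_fired_in_phase = 0
        self.cached_max_expected_hooks_seen = 0
        self.refresh_calls = 0               # instrumentation for this demo only

    def should_refresh_expected_hook_count(self):
        # True only at a backward-phase boundary (the first hook of the phase).
        return self.hooks_fired_in_phase == 0

    def on_gradient_hook(self):
        if self.should_refresh_expected_hook_count():
            self.cached_max_expected_hooks_seen = self.count_fn()
            self.refresh_calls += 1
        self.hooks_fired_in_phase += 1       # subsequent hooks reuse the cached value
        return self.cached_max_expected_hooks_seen

    def enter_backward(self):
        # Reset counters at the start of each real backward pass.
        self.hooks_fired_in_phase = 0


tracker = HookCountTracker(count_fn=lambda: 147)
for _ in range(3):                           # three backward passes
    tracker.enter_backward()
    for _ in range(147):                     # one gradient hook per parameter tensor
        tracker.on_gradient_hook()

print(tracker.refresh_calls)                 # 3: once per pass, not once per hook
```

Without the predicate, `count_fn` would run 441 times in this toy run; with it, the cached value is reused for the remaining 146 hooks of each pass.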
Replace torch.jit.script with torch.compile (#7835) (#7840)

Fixes #7835. On torch==2.10.0, importing DeepSpeed emitted deprecation warnings from import-time JIT-decorated helpers. This change updates the compatibility path to align with PyTorch guidance while keeping the import warning-free.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Update PyTorch to v2.9 for modal tests (#7816)

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
fix: avoid IndexError in BF16_Optimizer.destroy() when using DummyOptim (#7763)

Short-circuit `BF16_Optimizer.destroy()` if `using_real_optimizer` is False. When initialized with `optimizer=None` (DummyOptim), `bf16_groups` remains empty, causing an IndexError when it is accessed in `destroy()`.

Resolves #7752
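The short-circuit can be sketched as below. This is a minimal stand-in, not the real `BF16_Optimizer`: only `using_real_optimizer` and `bf16_groups` come from the fix description; the class body and the placeholder group contents are assumptions.

```python
class DummyOptim:
    """Stand-in for the placeholder optimizer used when none is configured."""


class BF16OptimizerSketch:
    def __init__(self, optimizer=None):
        self.using_real_optimizer = (optimizer is not None
                                     and not isinstance(optimizer, DummyOptim))
        # Groups are populated only when wrapping a real optimizer.
        self.bf16_groups = [["param_group_0"]] if self.using_real_optimizer else []

    def destroy(self):
        if not self.using_real_optimizer:
            return                  # nothing was allocated; avoid bf16_groups[0]
        self.bf16_groups[0].clear()  # stand-in for the real teardown


BF16OptimizerSketch(optimizer=None).destroy()   # no IndexError
```

Before the guard, `destroy()` would index into the empty `bf16_groups` list and raise.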
Wall clock timers API (#7714) Make wall clock timers available to clients. --------- Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
README refresh (#7668) Long overdue --------- Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com> Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Update email address (#7624) Update contact address Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
[bugfix] fix partition context unpatch (#7566)

## Fix asymmetric patching/unpatching in InsertPostInitMethodToModuleSubClasses

### Problem Description

The `InsertPostInitMethodToModuleSubClasses` context manager patches the `__init__` methods of model classes on entry and unpatches them on exit. However, asymmetric condition checks between patching and unpatching can introduce subtle inheritance bugs.

### Root Cause Analysis

The issue occurs with classes that have multiple inheritance where:

1. **Child class A** does not override `__init__`
2. **Parent class B** does not inherit from `nn.Module`
3. **Parent class C** inherits from `nn.Module`

**Current asymmetric logic:**

```python
# Patching (entry): only patch classes with an explicit __init__
def _enable_class(cls):
    if '__init__' in cls.__dict__:  # ✅ Strict check
        cls._old_init = cls.__init__
        cls.__init__ = partition_after(cls.__init__)

# Unpatching (exit): restore any class with _old_init
def _disable_class(cls):
    if hasattr(cls, '_old_init'):  # ❌ Permissive check
        cls.__init__ = cls._old_init
```

**Execution flow:**

1. **During entry**: Child A is skipped (no explicit `__init__`), Parent C is patched
2. **During exit**: Child A inherits `_old_init` from Parent C and gets incorrectly "restored"

**Result**: Child A's `__init__` points to Parent C's original `__init__`, bypassing Parent B and breaking the inheritance chain.

### Reproduction Case

This pattern is common in Hugging Face models:

```python
class Qwen3ForSequenceClassification(GenericForSequenceClassification, Qwen3PreTrainedModel):
    pass  # No explicit __init__

# GenericForSequenceClassification - not an nn.Module subclass
# Qwen3PreTrainedModel - inherits from nn.Module
```

### Solution

Apply symmetric condition checking in both patch and unpatch operations:

```python
def _disable_class(cls):
    # Match the patching condition: only restore classes we explicitly patched
    if '__init__' in cls.__dict__ and hasattr(cls, '_old_init'):
        cls.__init__ = cls._old_init
        delattr(cls, '_old_init')  # Optional cleanup
```

This ensures that only classes that were explicitly patched during entry get restored during exit.

### Testing

The fix has been validated against the `Qwen3ForSequenceClassification` reproduction case and resolves the inheritance chain corruption.

### Related Issues

- External issue: https://github.com/modelscope/ms-swift/pull/5820

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
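The asymmetry described above can be reproduced in a self-contained way without torch or transformers. The class names below are illustrative stand-ins (`ModuleLike` plays the role of `nn.Module`), and the patch/unpatch loops imitate the context manager's behavior:

```python
# Minimal reproduction of the asymmetric patch/unpatch bug.
calls = []

class ModuleLike:                # stands in for nn.Module
    def __init__(self):
        calls.append("ModuleLike")

class ParentC(ModuleLike):       # has an explicit __init__, so it gets patched
    def __init__(self):
        calls.append("ParentC")
        super().__init__()

class ParentB:                   # not a ModuleLike subclass
    def __init__(self):
        calls.append("ParentB")
        super().__init__()

class ChildA(ParentB, ParentC):  # no explicit __init__
    pass

def wrap(init):
    def wrapped(self, *args, **kwargs):
        init(self, *args, **kwargs)
    return wrapped

# Entry: strict check -- only ParentC is patched (ChildA defines no __init__).
for cls in (ChildA, ParentC):
    if '__init__' in cls.__dict__:
        cls._old_init = cls.__init__
        cls.__init__ = wrap(cls.__init__)

# Exit with the buggy permissive check: ChildA *inherits* _old_init from
# ParentC, so it is given its own __init__ pointing straight at ParentC's.
for cls in (ChildA, ParentC):
    if hasattr(cls, '_old_init'):
        cls.__init__ = cls._old_init

ChildA()
print(calls)    # ParentB is skipped: ['ParentC', 'ModuleLike']
```

With the symmetric check (`'__init__' in cls.__dict__ and hasattr(cls, '_old_init')`), `ChildA` is left untouched on exit and `ChildA()` follows the normal MRO through `ParentB`.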
Add index to HPU devices (#7497)

[PR #7266](https://github.com/deepspeedai/DeepSpeed/pull/7266) requires devices to carry explicit device indices (e.g., 'hpu:0', 'cuda:0'). This PR brings HPU devices in line with that requirement.

Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
`TiledFusedLogitsLoss` bug fix (#7459)

Bug fix: a tuple and a list were mixed up.
Fix: Adapt Llama injection policy for newer transformers versions (#7443)

This PR fixes an `AttributeError` that occurs during `deepspeed.init_inference` when using kernel injection (`replace_with_kernel_inject=True`) with Llama models from recent versions of `transformers`.

**The Bug:** In newer `transformers` versions (e.g., `4.53.3`), configurations like `num_heads` and `rope_theta` were moved from direct attributes of the `LlamaAttention` module into a nested `config` object. The current DeepSpeed injection policy tries to access these attributes from their old, direct location, causing initialization to fail with `AttributeError: 'LlamaAttention' object has no attribute 'num_heads'`.

**The Solution:** This change updates the Llama injection logic to be more robust:
1. It first tries to read attributes like `num_heads` from the new `config` object location.
2. If that fails, it falls back to the legacy direct attribute path.

Signed-off-by: huanyuqu <yc37960@um.edu.mo>
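The config-first-with-legacy-fallback lookup can be sketched as below. The helper name `get_attn_attr` and the stand-in attention classes are hypothetical; only the idea (try the nested `config`, then fall back to the direct attribute) comes from the fix:

```python
def get_attn_attr(attention_module, name):
    """Read an attention attribute from the nested config (newer transformers
    layout) or fall back to the legacy direct attribute."""
    config = getattr(attention_module, "config", None)
    if config is not None and hasattr(config, name):
        return getattr(config, name)
    # Legacy path: raises AttributeError only if neither location has it.
    return getattr(attention_module, name)


# Minimal stand-ins for the two transformers layouts:
class NewStyleConfig:
    num_heads = 32
    rope_theta = 10000.0

class NewStyleAttention:        # attributes live on a nested config object
    config = NewStyleConfig()

class OldStyleAttention:        # attributes live directly on the module
    num_heads = 32
    rope_theta = 10000.0


print(get_attn_attr(NewStyleAttention(), "num_heads"))   # 32
print(get_attn_attr(OldStyleAttention(), "rope_theta"))  # 10000.0
```

Either layout resolves, so the injection policy no longer depends on which transformers version built the module.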
fix: engine initializes optimizer attributes at the beginning (#7410)

`destroy` accesses `self.optimizer`, but the error that triggers the call to `destroy` can happen in `__init__`, even before the optimizer and scheduler are configured. So we need to assign `self.optimizer` at the top to avoid triggering another exception, e.g.:

```logs
File "deepspeed/runtime/engine.py", line 453, in _configure_tensor_parallel_states
    assert self.zero_optimization_stage(
AssertionError: Currently, the compatibility between 'autotp' and 'zero_stage = 3' has not been validated
Exception ignored in: <function DeepSpeedEngine.__del__ at 0x1516c0610820>
Traceback (most recent call last):
  File "deepspeed/runtime/engine.py", line 509, in __del__
    self.destroy()
  File "deepspeed/runtime/engine.py", line 512, in destroy
    if self.optimizer is not None and hasattr(self.optimizer, 'destroy'):
  File "deepspeed/runtime/engine.py", line 621, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DeepSpeedEngine' object has no attribute 'optimizer'
```

Signed-off-by: Hollow Man <hollowman@opensuse.org>
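The ordering fix amounts to defaulting every attribute that `destroy()` (and hence `__del__`) touches before any step that can raise. The sketch below is hypothetical (`EngineSketch` is not the real `DeepSpeedEngine`), but shows why the early assignment prevents the secondary `AttributeError`:

```python
class EngineSketch:
    def __init__(self, fail=False):
        # Defaults first, so destroy() is always safe to call, even if
        # a later validation step raises before configuration finishes.
        self.optimizer = None
        self.lr_scheduler = None
        if fail:
            # Stand-in for a config-validation assertion failing mid-__init__.
            raise AssertionError("config validation failed")
        # ... the real optimizer would be configured here ...

    def __del__(self):
        self.destroy()

    def destroy(self):
        # Safe: self.optimizer exists even on a partially initialized engine.
        if self.optimizer is not None and hasattr(self.optimizer, "destroy"):
            self.optimizer.destroy()


try:
    EngineSketch(fail=True)
except AssertionError:
    pass   # __del__ no longer raises AttributeError afterwards
```

Moving the assignments below the validation step reintroduces the bug: `__del__` would run on an object that never defined `optimizer`.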
Move pytest pinning from individual tests to requirements-dev.txt until fixed. (#7327)

pytest 8.4.0 seems to break a number of our tests; rather than pinning it in each test individually, we should just put the pin in the requirements file until we resolve the issue.

Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Bump to v0.17.0 (#7324) Co-authored-by: Logan Adams <loadams@microsoft.com>
[XPU] Support XCCL on deepspeed side (#7299)

XCCL will be used for the XPU device on PyTorch 2.8. With this support we can drop torch-ccl on the XPU device, while still reserving the old path for when torch-ccl is enabled.

Signed-off-by: yisheng <yi.sheng@intel.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
rollback #6726 (#7258)

This PR rolls back #6726, which caused https://github.com/deepspeedai/DeepSpeed/issues/7116.

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Make sure it's not None before offloading contiguous_grad_buffer (#7227)

Resolves #7223. When DeepCompile is enabled in ZeRO-3, `contiguous_grad_buffer` is released, so we should check that it is not None before we continue.

https://github.com/deepspeedai/DeepSpeed/blob/227a60c0c412ddf4619401b5d8d9d1674aee17b5/deepspeed/compile/init_z3.py#L22-L25

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
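The guard amounts to the sketch below. Only the buffer name `contiguous_grad_buffer` comes from the PR; the function and the placeholder offload step are hypothetical:

```python
def maybe_offload(contiguous_grad_buffer):
    if contiguous_grad_buffer is None:
        # DeepCompile under ZeRO-3 may have released the buffer already.
        return None
    # Stand-in for copying the buffer out (e.g., to CPU memory).
    return list(contiguous_grad_buffer)


print(maybe_offload(None))          # None, instead of raising
print(maybe_offload([0.1, 0.2]))    # [0.1, 0.2]
```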
DeepCompile for enhanced compiler integration (#7154)

This PR introduces *DeepCompile*, a new feature that efficiently integrates compiler optimizations with other DeepSpeed features. DeepCompile utilizes torch's dynamo to capture the computation graph and modifies it to incorporate DeepSpeed's optimizations seamlessly. Currently, DeepCompile supports ZeRO-1 and ZeRO-3, with enhancements such as proactive prefetching and selective unsharding to improve performance. (More details will be added later.)

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: zafarsadiq <zafarsadiq120@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>