With Intel Extension for PyTorch (IPEX) retiring, the XPU device is supported
natively by PyTorch 2.8+, and the dependency on IPEX is no longer needed.
This PR removes the IPEX dependency, adapts to the builder protocol in PyTorch
for XPU, and updates documentation and tests accordingly.
Note that after this update, DeepSpeed will no longer work with older
PyTorch+IPEX setups on XPU devices. We suggest users upgrade to the latest
PyTorch to get the latest XPU features.
This PR also removes InferenceBuilder; the kernels needed by
InferenceBuilder were supported through Intel Extension for PyTorch.
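As a sketch (not part of this PR's code), checking for a native XPU device on PyTorch 2.8+ no longer requires importing IPEX first; the helper below is hypothetical and guarded so it also runs where torch is absent:

```python
import importlib.util

def xpu_available() -> bool:
    # With PyTorch 2.8+, XPU support is native: no
    # `import intel_extension_for_pytorch` is needed before querying
    # the device. Guarded so this sketch also runs without torch.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return hasattr(torch, "xpu") and torch.xpu.is_available()

print(xpu_available())
```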
---------
Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Evoformer tests fail with the following error; we ignore them in the full test
for now.
```
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
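For reference, the fix the error message suggests looks like this minimal, stdlib-only sketch (unrelated to DeepSpeed's harness): use the `spawn` start method so the child process does not inherit an initialized CUDA context from a forked parent:

```python
import multiprocessing as mp

def worker(q):
    # A spawned child starts with a fresh interpreter, so initializing
    # CUDA here would be safe; a forked child would inherit the parent's
    # already-initialized CUDA state and raise the RuntimeError above.
    q.put("ok")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()
```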
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The full unit test workflow has been disabled for a while. This PR
migrates the full test suite to our AWS test infra.
To make the tests pass, we need to merge these PRs:
- #7786
- #7788
- #7789
- #7790
- #7793
- #7794
In addition to merging those PRs, this PR makes the following changes
to the full test workflow and test harness:
- Ignore flags for some known issues:
  - NVMe: requires an actual NVMe device; our CI currently doesn't have
    NVMe storage configured.
  - GDS: requires special kernel drivers and NVIDIA Magnum IO to enable
    direct GPU-to-storage transfers; CI instances don't have this
    configured.
  - ZenFlow: (1) Stage 3 bugs: the ZenFlow + ZeRO Stage 3 implementation
    has pre-existing bugs that cause internal pytest errors and worker
    crashes. (2) CUDA/fork incompatibility: test_zf_torch_adam.py uses
    torch.optim.AdamW, which performs CUDA graph capture checks that fail
    in forked processes (the `--forked` flag; we can simply move it to
    the sequential tests).
- `/mnt/aio` mount for async I/O tests
- CUTLASS installation for Evoformer tests
- Add `DS_DISABLE_REUSE_DIST_ENV` to the test harness to prevent worker
cleanup hangs
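As an illustrative sketch (only the environment variable name comes from this PR; the helper below is hypothetical), the guard might be read like:

```python
import os

def reuse_dist_env_enabled() -> bool:
    # When DS_DISABLE_REUSE_DIST_ENV=1, the harness tears down and
    # recreates the distributed environment for each test instead of
    # reusing it, which avoids hangs during worker cleanup at the cost
    # of some setup time.
    return os.environ.get("DS_DISABLE_REUSE_DIST_ENV", "0") != "1"
```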
Once we merge this PR, we will be able to run the full test manually or
at scheduled times.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The V100 tests are not needed anymore, but their CI cron jobs still get
spun up even though the jobs are disabled; this PR prevents that.
The next step will be to remove the YAML files we no longer use or have
already ported.
The new CI workflows using AWS are not triggered when the path filters
don't match. However, PRs then keep "waiting for status to be reported"
because the workflows are marked as "required."
This PR always launches the workflow but skips the tests when the filter
doesn't match.
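One common way to implement this (a sketch; the actual workflow files, paths, and filters differ) is to always run the workflow and gate the test step on a path-filter output:

```yaml
# Sketch only: job names, paths, and the filter action are illustrative.
jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      src: ${{ steps.filter.outputs.src }}
    steps:
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            src:
              - 'deepspeed/**'
              - 'tests/**'
  unit-tests:
    needs: changes
    runs-on: ubuntu-latest
    steps:
      - name: Run tests (skipped when no relevant paths changed)
        if: needs.changes.outputs.src == 'true'
        run: pytest tests/unit
```

Gating at the step rather than the trigger level means the job always completes and reports a status, so a "required" check is no longer left pending when the filter doesn't match.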
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This PR migrates CI workflows for unit tests to AWS. v1 tests use 4xL40S
and accelerate tests use 1xL40S.
@sfc-gh-truwase This looks to be working now. We could disable the Modal
tests after this PR is merged, or keep both for a while just in case.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The SSL certificate of Intel's wheel server has expired. To unblock PRs,
trust `pytorch-extension.intel.com`.
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Currently only one Modal CI job can run across all PRs, which is not
workable: all running jobs get cancelled when a new PR arrives or an
existing PR is updated. This PR fixes the concurrency-group dependency so
that group concurrency works across PRs and valuable resources are not
wasted.
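The usual fix (sketched here; the real group key may differ) is to scope the concurrency group to the ref, so a new run only cancels earlier runs of the same PR, not runs from other PRs:

```yaml
concurrency:
  # One group per workflow+branch/PR ref: a push to PR A no longer
  # cancels a running job for PR B.
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```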
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
1. `modal-accelerate` now needs `uv` installed explicitly since the
image changed to the 2025 one.
2. Moved the accelerate repo cloning into the job, since the original way
was incorrect: it cached some accelerate version and never updated it.
3. Annotated how to actually test the CI when changing the workflow,
since `pull_request_target` will not run the updated .py/.yaml files.
---------
Signed-off-by: Stas Bekman <stas@stason.org>
The newly released NCCL finally started to use fp32 accumulation for
reduction ops!
* Floating point summation is always done in fp32 accumulators (with the
exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus,
the accuracy with fp8 and fp16 data types should be much improved.
72d2432094
So we should change the fp32-comms default for SP to the same dtype as
the inputs when `nccl>=2.27.3`; the user can still override the default.
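The proposed default selection can be sketched as follows (the function and names are hypothetical, not DeepSpeed's actual API):

```python
def default_sp_comm_dtype(input_dtype: str, nccl_version: tuple) -> str:
    # With nccl>=2.27.3, floating point reductions accumulate in fp32
    # internally, so communicating in the input dtype no longer costs
    # accuracy; older NCCL keeps the safe fp32-comms default.
    if nccl_version >= (2, 27, 3):
        return input_dtype
    return "fp32"
```

For example, `default_sp_comm_dtype("bf16", (2, 27, 3))` would pick `"bf16"`, while `(2, 26, 5)` would fall back to `"fp32"`.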
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
pytest 8.4.0 seems to break a number of our tests. Rather than pinning
in each workflow individually, we should just pin it in the requirements
file until we resolve the issue.
---------
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
This is the DeepSpeed counterpart of
https://github.com/snowflakedb/ArcticTraining/pull/45, as the new
feature(s) require changes on both sides.
For PR reviewers:
Readiness status:
- [x] Code
- [x] Tests
- [ ] Docs - working on it
Features:
- [x] add support for delaying grad addition via
`param.ds_grad_is_ready` flag (used when performing tiled compute in an
autograd function)
- [x] add light sp-only mpu version (Jeff Rasley)
- [x] improved debug
- [x] added `all_gather_object` to `dist`
- [x] `UlyssesSPAttentionHF` (port of UlyssesAttention from
Megatron-Deepspeed plus modern MHA-variations)
- [x] `UlyssesSPDataLoaderAdapter` - DL adapter to shard the normal DL
batches to be used by `UlyssesSPAttentionHF`
- [x] `SequenceTiledCompute` - generic autograd function to perform
compute after tiling on the sequence dimension
- [x] `TiledMLP` - a specific autograd function to perform tiled MLP
(it's much easier to understand before trying to grok
`SequenceTiledCompute`)
- [x] added a differentiable `_DimZeroAllToAll` (Samyam Rajbhandari)
- [x] torch-dist-check now allows `torch.distributed.nn` (which is
needed since deepspeed's dist is not up to date with
`torch.distributed.nn`)
---------
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
These days fp16 is barely ever used, so we should be testing bf16
instead of fp16 where possible.
I had to fix a bunch of tests to adapt to this change, and fixed a few
bugs along the way.
---------
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
This PR introduces *DeepCompile*, a new feature that efficiently
integrates compiler optimizations with other DeepSpeed features.
DeepCompile utilizes torch's dynamo to capture the computation graph and
modifies it to incorporate DeepSpeed’s optimizations seamlessly.
Currently, DeepCompile supports ZeRO-1 and ZeRO-3, with enhancements
such as proactive prefetching and selective unsharding to improve
performance.
(More details will be added later.)
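Conceptually (a minimal sketch assuming the standard `torch.compile` custom-backend hook; the real DeepCompile integration is far more involved), dynamo hands the backend the captured FX graph, which can be rewritten before a callable is returned:

```python
def deepcompile_like_backend(gm, example_inputs):
    # `gm` is the torch.fx.GraphModule dynamo captured. A real backend
    # in the spirit of DeepCompile would rewrite gm.graph here, e.g.
    # inserting proactive prefetch and selective unsharding around
    # parameter uses, before returning the (modified) forward callable.
    return gm.forward

# Usage (requires torch):
#   torch.compile(model, backend=deepcompile_like_backend)
```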
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: zafarsadiq <zafarsadiq120@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Enhancing CI/nightly coverage for the Gaudi2 device.
Tests added:
- test_autotp_training.py
- test_ulysses.py
- test_linear::TestLoRALinear and test_linear::TestBasicLinear
- test_ctx::TestEngine
These provide coverage for model parallelism and the linear feature.
The tests are stable: 10/10 runs pass.
Adding the new tests is expected to increase CI time by 3-4 minutes and
nightly job time by 15 minutes.
Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Unpin transformers version for all workflows except
`nv-torch-latest-v100`, which still has a tolerance issue with some
quantization tests.
Signed-off-by: Logan Adams <loadams@microsoft.com>
These jobs haven't been run in a long time and were originally used when
compatibility with torch <2 was more important.
Signed-off-by: Logan Adams <loadams@microsoft.com>
The latest transformers release causes failures in the cpu-torch-latest
test, so we pin it for now to unblock other PRs.
---------
Signed-off-by: Logan Adams <loadams@microsoft.com>
- Update existing workflows that use cu121 to cu124. Note, this means
that where we download the latest torch, we will now get torch 2.6
rather than the latest torch (2.5) provided with CUDA 12.1.
- Note, nv-nightly is currently failing in master due to unrelated
errors, so it can be ignored in this PR (nv-nightly was tested locally,
where it passes with both 12.1 and 12.4).
---------
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Omar Elayan <oelayan@habana.ai>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com>