325 Commits

Author SHA1 Message Date
Masahiro Tanaka
36f0b0c7bb Add fallback to full test (#7933)
Recent runs of the nightly full test [kept
failing](https://github.com/deepspeedai/DeepSpeed/actions/workflows/aws-torch-latest-full.yml).
We added a fallback to an A100 node on the infra side.
This PR detects the CUDA architecture and the number of GPUs and exports
them as env vars.
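As a minimal sketch of the detection step, the mapping from a detected compute capability and GPU count to env var values could look like this (the env var names here are illustrative, not necessarily the ones the workflow uses):

```python
def arch_env(capability, num_gpus):
    """Map a (major, minor) CUDA compute capability and a GPU count
    to environment-variable values for the test workflow."""
    major, minor = capability
    return {
        "CUDA_ARCH": f"{major}.{minor}",  # e.g. "8.0" on an A100 fallback node
        "NUM_GPUS": str(num_gpus),
    }

print(arch_env((8, 0), 8))  # {'CUDA_ARCH': '8.0', 'NUM_GPUS': '8'}
```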

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2026-03-30 08:16:09 -04:00
Ma, Guokai
a41a96b19f Remove amp() from abstract accelerator (#7879)
PyTorch now provides torch.amp
https://docs.pytorch.org/docs/stable/amp.html as the recommended AMP
API, replacing torch.<device_type>.amp, which is what the DeepSpeed
abstract accelerator's amp() uses. Some PyTorch backends, such as XPU,
do not provide the legacy `torch.xpu.amp` module.

This PR replaces `get_accelerator().amp()` with `torch.amp`, the
recommended way of using AMP.

Related issues and PRs
https://github.com/deepspeedai/DeepSpeed/issues/7876
https://github.com/deepspeedai/DeepSpeed/pull/7877
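For illustration, the recommended device-agnostic API looks like this (a sketch using the CPU backend; DeepSpeed's actual call sites differ):

```python
import torch

# Old, device-specific pattern (missing on some backends such as XPU):
#     with get_accelerator().amp().autocast(): ...
# New, device-agnostic pattern via torch.amp:
with torch.amp.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = torch.randn(4, 4) @ torch.randn(4, 4)

print(y.dtype)  # matmul runs in bf16 under autocast
```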

---------

Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
2026-03-02 14:51:22 -05:00
Ma, Guokai
d8e15da43f XPU use stock pytorch instead of Intel Extension for PyTorch (#7877)
With Intel Extension for PyTorch (IPEX) retiring, the XPU device is
supported natively by PyTorch 2.8+, and the dependency on IPEX is no
longer needed.

This PR removes the IPEX dependency, adapts to the PyTorch builder
protocol for XPU, and updates the documents and tests accordingly.

Note that after this update, DeepSpeed will not work with previous
PyTorch+IPEX setups on XPU devices. Users should upgrade to the latest
PyTorch to get the latest XPU features.

This PR also removes InferenceBuilder; the kernels it needs were only
supported through Intel Extension for PyTorch.

---------

Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2026-03-01 22:37:51 -05:00
Masahiro Tanaka
0416cf68df Schedule nightly full test (#7870)
The full test workflow passed, though it is still flaky
([Success](https://github.com/deepspeedai/DeepSpeed/actions/runs/22269243373)
/
[Failure](https://github.com/deepspeedai/DeepSpeed/actions/runs/22266498530)).

This PR schedules a nightly run of the full test. It is launched only
when there have been updates since the last successful run.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2026-02-24 11:37:10 -08:00
Masahiro Tanaka
c89e0db8e2 Ignore evoformer test (#7815)
Evoformer tests fail with the error below. We ignore them in the full
test for now.

```
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
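The `spawn` start method the error message asks for can be selected per-context, without touching the global default; a minimal sketch:

```python
import multiprocessing as mp

# 'fork' copies the parent's CUDA state, which CUDA cannot re-initialize
# in the child; 'spawn' starts a fresh interpreter instead. A context
# selects the start method locally without changing the global default:
ctx = mp.get_context("spawn")
print(ctx.get_start_method())  # spawn
```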

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2026-01-25 22:18:40 -08:00
Masahiro Tanaka
5aa2d17dd7 Add full test suite workflow (#7795)
The full unit test workflow has been disabled for a while. This PR
migrates the full test to our AWS test infra.
To make the tests pass, we need to merge these PRs:
- #7786
- #7788
- #7789
- #7790
- #7793
- #7794

In addition to having those PRs merged, this PR makes the following
changes in the full test workflow and test harness:
- Ignore flags for some known issues:
  - nvme: requires an actual NVMe device; our CI currently doesn't have
    NVMe storage configured
  - GDS: requires special kernel drivers and NVIDIA Magnum IO to enable
    direct GPU-to-storage transfers; CI instances don't have this
    configured
  - ZenFlow: (1) Stage 3 bugs: the ZenFlow + ZeRO Stage 3 implementation
    has pre-existing bugs that cause internal pytest errors and worker
    crashes; (2) CUDA/fork incompatibility: test_zf_torch_adam.py uses
    torch.optim.AdamW, which performs CUDA graph capture checks that
    fail in forked processes (the `--forked` flag; we can just move it
    to the sequential tests)
- A `/mnt/aio` mount for async I/O tests
- CUTLASS installation for Evoformer tests
- `DS_DISABLE_REUSE_DIST_ENV` added to the test harness to prevent
  worker cleanup hangs

Once we merge this PR, we will be able to run the full test manually or
at scheduled times.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2026-01-20 15:17:19 -08:00
Logan Adams
53c87ce8ed Remove cron/PR triggers for outdated V100 tests (#7777)
The V100 tests are not needed anymore; this PR prevents their CI cron
jobs from being spun up, which happened even though the jobs are
disabled.

The next step will be to remove the yaml files we do not use
anymore/have already ported.
2026-01-13 07:53:09 -08:00
Masahiro Tanaka
52361686fd Add timeout to test workflows (#7774)
This PR adds timeouts to the CI workflows. This will prevent zombie
jobs from holding GPU instances.
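In GitHub Actions terms this is a job-level `timeout-minutes` setting; a sketch (the runner labels and the exact value in the actual workflows may differ):

```yaml
jobs:
  unit-tests:
    runs-on: [self-hosted, gpu]
    timeout-minutes: 180  # kill zombie jobs so they stop holding GPU instances
    steps:
      - run: pytest tests/unit
```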

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2026-01-12 20:06:43 -08:00
Masahiro Tanaka
d988f6ca10 Update workflow trigger (#7768)
The new CI workflows using AWS are not triggered when the path filters
don't match. However, they then stay stuck "waiting for status to be
reported" because they are marked as required.
This PR always launches the workflows but skips the tests when the
filter doesn't match.
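One common way to get this behavior is to always launch the workflow and gate the test steps on a change filter; a hypothetical sketch using the third-party `dorny/paths-filter` action (the actual workflow may implement the skip differently, and the filter paths here are illustrative):

```yaml
jobs:
  unit-tests:
    runs-on: ubuntu-latest   # always launches, so the required status always resolves
    steps:
      - uses: actions/checkout@v4
      - id: filter
        uses: dorny/paths-filter@v3
        with:
          filters: |
            src:
              - 'deepspeed/**'
              - 'tests/**'
      - name: Run tests
        if: steps.filter.outputs.src == 'true'   # skipped, not missing, on no match
        run: pytest tests/unit
```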

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2026-01-09 19:49:42 -08:00
Masahiro Tanaka
bfb66c65c6 Add CI workflow to run tests on AWS (#7753)
This PR migrates CI workflows for unit tests to AWS. v1 tests use 4xL40S
and accelerate tests use 1xL40S.

@sfc-gh-truwase This looks like it's working now. We could disable the
Modal tests after this PR is merged, or keep both for a while just in
case.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2026-01-03 07:06:18 +09:00
Logan Adams
c0e9b2c9b2 Enable python 3.11 and 3.12 tests (#7007)
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>
2026-01-01 09:34:52 +00:00
Masahiro Tanaka
51dc888423 Trust intel server for XPU tests (#7698)
The SSL certificate of Intel's wheel server has expired. To unblock PRs,
trust `pytorch-extension.intel.com`.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-11-19 08:42:49 +09:00
Stas Bekman
dff81cb619 modal ci: fix group concurrency (#7691)
Currently only one Modal CI job can run across all PRs, which is not
workable: all running jobs get cancelled whenever a new PR arrives or
an existing PR is updated. This fixes the concurrency dependency so
that group concurrency works per PR and valuable resources aren't
wasted.
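The usual fix is to key the concurrency group on the PR, so cancellation only applies within a single PR; a sketch (the exact group expression in the workflow may differ):

```yaml
concurrency:
  # one group per PR (falling back to the ref for pushes), so a new PR
  # no longer cancels jobs running for other PRs
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true
```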

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-11-12 13:00:38 -08:00
Stas Bekman
283f6f5fde Disable nv-lightning-v100.yml CI (#7681)
Since we lost the V100s, disable it first so that it stops interfering
with PRs, then port it to Modal.
2025-11-08 09:05:21 -05:00
Stas Bekman
b073a557c1 [modal ci] fixes (#7676)
1. `modal-accelerate` now needs `uv` installed explicitly since the
image changed to the 2025 one.
2. Moved the accelerate repo cloning into the job; the original
approach was incorrect because it cached some accelerate version and
never updated it.
3. Documented how to actually test the CI when changing the workflow,
since `pull_request_target` will not run the updated .py/.yaml files.

---------

Signed-off-by: Stas Bekman <stas@stason.org>
2025-11-06 11:42:22 -08:00
Liangliang Ma
69e03e52d0 [XPU][CI] recover xpu-max1100 workflow (#7630)
Reduce some test scope to recover CI workflow.

Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-13 16:43:17 +00:00
Olatunji Ruwase
64ac13f72e Enable forked PRs (#7486)
Enable forked PRs

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-14 17:43:08 -04:00
Olatunji Ruwase
a12de38db6 Modal CI (#7289)
This is an initial effort to migrate CI onto Modal infra. This PR
creates two new workflows that run on Modal:
1. modal-torch-latest: a subset of nv-torch-latest-v100 that includes
`tests/unit/runtime/zero/test_zero.py`.
2. modal-accelerate: a full copy of nv-accelerate-v100.

Follow up PRs will selectively migrate relevant workflows onto Modal.

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Olatunji Ruwase <tjruwase@gmail.com>
Signed-off-by: Tunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-08-11 20:13:39 +00:00
Olatunji Ruwase
8c83e42ba1 Fix cpu CI (#7481)
Fix torch version

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-11 11:53:09 -07:00
Logan Adams
43f00ba31c Remove additional unused tests (human-eval) (#7445) 2025-07-24 13:16:57 -07:00
Logan Adams
3bf53451e5 Remove tests from README that are already removed. (#7441) 2025-07-21 20:56:11 -07:00
Stas Bekman
affee605e4 trying to fix nv-accelerate-v100.yml CI job (#7424)
Trying an accelerate snapshot from the day before:
1ac8643df7

---------

Signed-off-by: Stas Bekman <stas@stason.org>
2025-07-11 10:07:27 -04:00
Stas Bekman
d3b9cb8c4e sequence parallel default dtype (#7364)
the newly released nccl finally started to use fp32 accumulation for
reduction ops!

* Floating point summation is always done in fp32 accumulators (with the
exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus,
   the accuracy with fp8 and fp16 data types should be much improved.

72d2432094

So we should change the fp32 comms default for SP to the same dtype as
inputs if `nccl>=2.27.3` - the user can still override the default.
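The version gate described above can be sketched as follows (the function name and string dtype tags are illustrative, not DeepSpeed's actual API):

```python
def default_sp_comm_dtype(input_dtype, nccl_version, user_override=None):
    """Pick the sequence-parallel comm dtype: with nccl >= 2.27.3, fp32
    accumulation happens inside NCCL, so comms can default to the input
    dtype; older nccl keeps the fp32-comms default. The user override
    always wins."""
    if user_override is not None:
        return user_override
    return input_dtype if nccl_version >= (2, 27, 3) else "fp32"

print(default_sp_comm_dtype("bf16", (2, 27, 3)))  # bf16
print(default_sp_comm_dtype("bf16", (2, 26, 5)))  # fp32
```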

---------

Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
2025-06-19 18:32:14 +00:00
Olatunji Ruwase
10b106619a Don't break set_start_method (#7349)
Fix #7347
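Judging from the title, the fix is about not clobbering a start method the user already chose; a plausible minimal pattern (not necessarily the exact code in the PR):

```python
import multiprocessing as mp

# Only set a start method if the user (or another library) hasn't
# already; unconditionally calling set_start_method() would raise or
# silently override the user's choice.
if mp.get_start_method(allow_none=True) is None:
    mp.set_start_method("spawn")

print(mp.get_start_method())
```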

---------

Signed-off-by: Tunji Ruwase <tunji@ip-172-31-0-204.us-west-2.compute.internal>
Signed-off-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Tunji Ruwase <tunji@ip-172-31-0-204.us-west-2.compute.internal>
2025-06-11 13:00:58 -04:00
Logan Adams
2ce5505799 Move pytest pinning from individual tests to requirements-dev.txt until fixed. (#7327)
pytest 8.4.0 seems to break a number of our tests. Rather than pinning
it in each workflow individually, we should just put the pin in the
requirements file until we resolve the issue.
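The pin itself is a one-line requirements entry; a sketch (the file path and exact bound are assumptions):

```
# requirements/requirements-dev.txt
pytest<8.4.0  # 8.4.0 breaks a number of tests; revisit once resolved
```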

---------

Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
2025-06-09 22:42:55 +00:00
Raza Sikander
2ad2011cc9 Fix pytest version to 8.3.5 in hpu-gaudi actions (#7337)
This is needed to avoid the CI failure in PR #7330.

Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
2025-06-05 23:10:19 +00:00
Michael Wyatt
720787e79b Bump to v0.17.0 (#7324)
Co-authored-by: Logan Adams <loadams@microsoft.com>
2025-06-02 16:01:44 -07:00
Stas Bekman
4d00b38ada Ulysses SP for HF Integration (#7268)
This is the Deepspeed counterpart of
https://github.com/snowflakedb/ArcticTraining/pull/45 - as the new
feature(s) require changes on both sides.


For PR reviewers: 

Readiness status:
- [x] Code
- [x] Tests
- [ ] Docs - working on it


Features:

- [x] add support for delaying grad addition via
`param.ds_grad_is_ready` flag (used when performing tiled compute in an
autograd function)
- [x] add light sp-only mpu version (Jeff Rasley)
- [x] improved debug
- [x] added `all_gather_object` to `dist`
- [x] `UlyssesSPAttentionHF` (port of UlyssesAttention from
Megatron-Deepspeed plus modern MHA-variations)
- [x] `UlyssesSPDataLoaderAdapter` - DL adapter to shard the normal DL
batches to be used by `UlyssesSPAttentionHF`
- [x] `SequenceTiledCompute` - generic autograd function to perform
compute after tiling on the sequence dimension
- [x] `TiledMLP` - a specific autograd function to perform tiled MLP
(it's much easier to understand before trying to grok
`SequenceTiledCompute`)
- [x] added a differentiable `_DimZeroAllToAll` (Samyam Rajbhandari)
- [x] torch-dist-check now allows `torch.distributed.nn` (which is
needed since deepspeed's dist is not up to date with
`torch.distributed.nn`)
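The sequence-tiling idea behind `SequenceTiledCompute`/`TiledMLP` can be illustrated with a plain-Python sketch (names and shapes are illustrative; the real implementation is an autograd function over tensors):

```python
def tile_sequence(seq, num_tiles):
    """Split a sequence into num_tiles nearly equal contiguous chunks."""
    base, rem = divmod(len(seq), num_tiles)
    tiles, start = [], 0
    for i in range(num_tiles):
        size = base + (1 if i < rem else 0)
        tiles.append(seq[start:start + size])
        start += size
    return tiles

def tiled_apply(fn, seq, num_tiles):
    """Apply fn tile by tile and stitch the results back together, so
    peak activation memory scales with a tile, not the full sequence."""
    out = []
    for tile in tile_sequence(seq, num_tiles):
        out.extend(fn(tile))
    return out

print(tiled_apply(lambda t: [x * 2 for x in t], list(range(10)), 3))
```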

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-05-31 07:25:23 +00:00
Stas Bekman
b66c81077c anchor transformers version (#7316)
Some features require minimal transformers versions, so let's start
anchoring.

This also fixes tests that break with recent transformers.

I need this fixed to be able to merge
https://github.com/deepspeedai/DeepSpeed/pull/7268 which requires
`transformers>=4.51.3`

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-05-29 06:19:54 +00:00
Raza Sikander
ec6b254dce Update gaudi2 nightly,ci to latest 1.21.0 build (#7313)
Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-05-29 02:58:52 +00:00
Stas Bekman
b4cc079eee CI: prefer bf16 over fp16 (#7304)
These days fp16 is barely ever used, so we should test bf16 instead of
fp16 where possible.

Had to fix a bunch of tests to adapt to this change, and a few bugs
along the way.

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-05-28 00:49:21 +00:00
Olatunji Ruwase
0e741714f5 Enable ZeRO set/get APIs for NVMe offload (#7046)
- Extend APIs for
[debugging](https://deepspeed.readthedocs.io/en/latest/zero3.html#debugging)
and
[modifying](https://deepspeed.readthedocs.io/en/latest/zero3.html#modifying-partitioned-states)
ZeRO partitioned states to NVMe offload.
- Add vectorized update API. This is performance-critical for NVMe
offloading scenarios.

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
2025-05-20 00:11:17 +00:00
Logan Adams
d46947db4a Temporarily skip AIO tests due to an issue with runners (#7288)
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-05-18 23:36:06 +00:00
Logan Adams
930ab46e63 Fix issues XPU tests hit with extra-index-url (#7291)
cc: @Liangliang-Ma

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-05-16 19:07:35 -07:00
Liangliang Ma
5a4e7a08ec [XPU] update xpu-max1100 CI workflow to torch 2.7 (#7284)
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2025-05-15 10:02:53 -07:00
Logan Adams
9926879b59 Update CPU torch version to 2.7 (#7241)
Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-04-23 21:58:01 +00:00
Logan Adams
8d2865e014 Revert "Update torch cpu test version"
This reverts commit 00b5678bbf.
2025-04-23 13:26:40 -07:00
Logan Adams
00b5678bbf Update torch cpu test version
Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-04-23 13:26:02 -07:00
Masahiro Tanaka
227a60c0c4 DeepCompile for enhanced compiler integration (#7154)
This PR introduces *DeepCompile*, a new feature that efficiently
integrates compiler optimizations with other DeepSpeed features.
DeepCompile utilizes torch's dynamo to capture the computation graph and
modifies it to incorporate DeepSpeed’s optimizations seamlessly.

Currently, DeepCompile supports ZeRO-1 and ZeRO-3, with enhancements
such as proactive prefetching and selective unsharding to improve
performance.
(More details will be added later.)

---------

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: zafarsadiq <zafarsadiq120@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2025-04-16 04:33:53 +00:00
Logan Adams
3388f8331b Update container version that runs on A6000 tests. (#7153)
Changes from https://github.com/huggingface/transformers/pull/36654 in
transformers cause issues with the torch 2.5 version we were using.
This just updates us to a newer version.

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-03-19 23:42:38 +00:00
Raza Sikander
29e9fd53b5 Enhance Gaudi2 CI/Nightly Coverage with Model Parallelism and Linear Tests (#7146)
Enhancing CI/nightly coverage for the Gaudi2 device.
Tests added:
- test_autotp_training.py
- test_ulysses.py
- test_linear::TestLoRALinear and test_linear::TestBasicLinear
- test_ctx::TestEngine

These provide coverage for the model parallelism and linear features.
The tests are stable: 10/10 runs pass.
The new tests are expected to increase CI time by 3-4 minutes and the
nightly job time by 15 minutes.

Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
2025-03-18 23:49:01 +00:00
Logan Adams
d095b18185 Unpin transformers version for most workflows (#7139)
Unpin transformers version for all workflows except
`nv-torch-latest-v100` as this still has a tolerance issue with some
quantization tests.

Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-03-14 13:52:44 -07:00
Raza Sikander
c1acd49cdf Update gaudi2 nightly,ci to latest 1.20.0 build (#7093)
Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: shaomin <wukon1992@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: siqi <siqi@tecorigin.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Wei Wu <wuwei211x@gmail.com>
Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il>
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Signed-off-by: Hongwei <hongweichen@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: loadams <loadams@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: siqi654321 <siqi202311@163.com>
Co-authored-by: siqi <siqi@tecorigin.com>
Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com>
Co-authored-by: snahir <snahir@habana.ai>
Co-authored-by: Yejing-Lai <yejing.lai@intel.com>
2025-03-07 22:46:47 +00:00
Logan Adams
02bbf50109 Remove workflows for very old torch versions (#7090)
These jobs haven't been run in a long time and were originally used when
compatibility with torch <2 was more important.

Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-02-28 01:33:01 +00:00
Logan Adams
f2ed2531a7 Update parallelism for nv-torch-latest/nightly tests due to more GPUs/runner (#7086)
Signed-off-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2025-02-27 10:47:49 -08:00
Logan Adams
f8d34295d0 Pin transformers version on tests that use latest. (#7085)
Latest transformers causes failures in the cpu-torch-latest test, so we
pin it for now to unblock other PRs.

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-02-27 08:15:11 -08:00
Logan Adams
1d30b58cba Replace calls to python setup.py sdist with python -m build --sdist (#7069)
With future changes coming to pip/python/etc., we need to stop calling
`python setup.py ...` and replace it:
https://packaging.python.org/en/latest/guides/modernize-setup-py-project/#should-setup-py-be-deleted


![image](https://github.com/user-attachments/assets/ea39ef7b-3cbe-4916-86f0-bc46a5fce96d)

This means we need to install the `build` package, which is added here
as well.

Additionally, we pass the `--sdist` flag to build only the sdist rather
than the wheel.

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-02-24 20:40:24 +00:00
Logan Adams
33dd2e2165 nv-ds-chat breaks with latest transformers (#7052)
Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-02-19 15:48:41 +00:00
Logan Adams
079de6bdff Update workflows to cuda 12.4 (#7000)
- Update existing workflows that use cu121 to use cu124. Note, this
means that where we download torch latest, we will now get torch 2.6
rather than the latest torch 2.5 provided with cuda 12.1.
- Note, nv-nightly is currently failing in master due to unrelated
errors, so it can be ignored in this PR (nv-nightly was tested locally,
where it passes with both 12.1 and 12.4).

---------

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Omar Elayan <oelayan@habana.ai>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com>
2025-02-12 15:25:41 -08:00
Logan Adams
a83ab17d3d Update A6000 tests transformers version (#7016)
Signed-off-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-02-08 00:26:02 +00:00