Also made the file more consistent, both internally and with the GPU and TPU test files.
The multi-CPU tests are now combined with the normal JAX tests.
- We can unpin `jax[and-cuda]` now that we've migrated to github runners for GPUs
- We have unpinned `ai-edge-litert` in the requirements files
- We do need to pin `tensorflow[and-cuda]` to 2.20.0 as 2.21 doesn't work with our setup
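The resulting pins can be sketched as a requirements fragment (package spellings taken from the bullets above; exact file layout assumed):

```
jax[and-cuda]                 # unpinned again after migrating GPUs to GitHub runners
ai-edge-litert                # unpinned
tensorflow[and-cuda]==2.20.0  # 2.21 doesn't work with our setup
```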
This is to prevent PRs like this: https://github.com/keras-team/keras/pull/22604
Tests are not run reliably on `master`: when many PRs are merged into `master` back to back, some commits never get a test run: https://github.com/keras-team/keras/commits/master/
This came from a misunderstanding of `cancel-in-progress`. `cancel-in-progress: true` means that any running job is immediately cancelled and replaced by the new one. `cancel-in-progress: false` means that there can be one running job and one pending job, but if more jobs are queued, the existing pending job is cancelled.
For pushes, we don't want to cancel any runs. The way to achieve this is to never share the same `group`, which is done by including `github.run_id`, which is unique per run.
We can now set `cancel-in-progress: true`, since pushes will never be deduped. That's because `github.head_ref` is only populated for `pull_request` events, so pushes fall back to the unique `github.run_id`.
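A sketch of the resulting concurrency block, assuming the expression follows the description above:

```yaml
concurrency:
  # On PRs, github.head_ref is set, so pushes to the same PR share a group
  # and supersede each other. On pushes to master, head_ref is empty, so the
  # unique github.run_id gives every run its own group and nothing is
  # cancelled or deduped.
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true
```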
* Enhance Gemini issue triage workflow
Updated the issue triage workflow to allow multiple labels and added a step to create the .gemini directory.
* fix issue triage
* reset to working version
* Enhance Gemini issue triage workflow
Updated the issue triage workflow to allow multiple labels and added a step to create the .gemini directory.
* fix issue triage
Problem: using `pull_request_target` causes `actions/checkout` to retrieve the code from the `master` branch, thus not checking the code from the PR.
Solution: revert back to `pull_request`.
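As a minimal sketch of the difference (job names assumed): with `pull_request_target`, `actions/checkout` defaults to the base branch, so testing the PR's own code requires the plain `pull_request` event.

```yaml
on:
  pull_request:  # was pull_request_target, which made actions/checkout
                 # fetch the code from master instead of from the PR
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # now checks out the PR's code
```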
Problem: the `trigger_gpu_tpu_tests.yml` workflow doesn't work, and it makes it nearly impossible to trigger the tests manually.
Solution: remove it for now until we find a working solution.
Also simplified the `cancel-in-progress:` condition.
Problem: if the tests are triggered multiple times, for instance when pushing an update to the PR, the old tests are not cancelled. The runners are consumed for tests that are no longer useful.
Solution: configure the `concurrency` of the workflows to cancel the runs on PRs, but not on master and on releases.
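A sketch of such a `concurrency` setup (the exact expression in the PR may differ): superseded runs are cancelled on PRs, but never on `master` or on release tags.

```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  # Only cancel superseded runs for pull requests; runs on master and on
  # release tags are left alone.
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
```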
This change https://github.com/keras-team/keras/pull/22504 causes the `kokoro:force-run` label to be removed automatically. However, unlike removing the label manually, it does not trigger the GPU / TPU tests.
This is a follow-up to https://github.com/keras-team/keras/pull/22504
The workflow is currently failing with a `Label does not exist` error.
Trying to add a delay in case this is a race condition and the labels haven't fully been updated yet.
Kokoro used to automatically remove the `kokoro:force-run` label to trigger the tests. This no longer works as we don't have any Kokoro tests anymore.
This github workflow has the same behavior.
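As a sketch (job and step names assumed), the workflow can mimic Kokoro's old behavior by removing the label itself:

```yaml
jobs:
  remove-label:
    runs-on: ubuntu-latest
    steps:
      - name: Remove kokoro:force-run label
        # Note: label changes made with the default GITHUB_TOKEN may not
        # trigger other workflows, which matches the limitation described
        # in the follow-up above.
        run: gh pr edit "$PR_NUMBER" --repo "$GITHUB_REPOSITORY" --remove-label "kokoro:force-run"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
```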
Currently, we are running multi-device tests with JAX:
- always on CPU
- always on GPU with Kokoro (although both `distribution_lib_test.py` files are skipped)
- never on TPU
[This code](https://github.com/keras-team/keras/blob/master/keras/src/backend/jax/distribution_lib_test.py#L25-L28) is run while collecting the list of unit tests and unintentionally applies to everything instead of just `jax/distribution_lib_test.py`.
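The failure mode can be sketched as follows (the actual file contents differ): pytest imports every test module while collecting, so a module-level side effect runs even when none of that file's tests are selected, and it leaks into the whole session.

```python
import os

# Hypothetical module-level side effect in a test file: it executes during
# pytest collection, so it changes the environment for the entire test
# session rather than just this file's tests.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"


def test_in_this_file():
    # Only this file's tests were meant to see the forced device count...
    assert "device_count=8" in os.environ["XLA_FLAGS"]
```

...but every other test collected in the same session also sees the modified `XLA_FLAGS`, which is the unintended global effect described above.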
We currently use 4 T4s for JAX GPU tests, however https://github.com/keras-team/keras/pull/22462 moves them to a single L4.
A subsequent PR will add multi-TPU tests.
- Makes the normal JAX CPU tests run on a single CPU
- Adds a JAX multi-CPU check to run all the tests tagged with `pytest.mark.multi_device`
- Makes `jax/distribution_lib_test.py` work with any number of devices (as long as it's even and greater than 4) instead of the hardcoded 8
- Tags tests from `jax/distribution_lib_test.py`, `jax/trainer_test.py`, `orbax_checkpoint_test.py` with `pytest.mark.multi_device` so that we can run them with `pytest -m multi_device`
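The mechanism behind the multi-CPU check can be sketched like this (marker name from the list above; the flag value of 8 is just an example): simulate multiple CPU devices by setting `XLA_FLAGS` before JAX is imported.

```python
import os

# Must be set before the first `import jax`: it makes the CPU backend expose
# 8 virtual devices (any even count >= 4 satisfies the suite described above).
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

import jax
import pytest


@pytest.mark.multi_device  # selected via `pytest -m multi_device`
def test_device_count_is_even_and_at_least_4():
    n = jax.device_count()
    assert n >= 4 and n % 2 == 0
```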
We are migrating to GPU custom runners for GPU tests instead of Kokoro.
- Disable TF32 for better numerical accuracy
- Pin tensorflow to 2.20 as 2.21 doesn't work with our L4 + CUDA 13 setup
The GPU-runner-based tests were unintentionally run with the JAX backend, and therefore ran on CPU instead of GPU. For now, this only affected the Torch tests.
In order for the tests to pass on the NVIDIA L4 GPUs that we have, the following changes were needed:
- Installed `build-essential` to provide a C++ compiler, which is needed by Torch Dynamo (Triton)
- Removed extra logging unintentionally added by the `pytest -s` option
- Changed `masking_test.py` and `lstm_test.py` to only use right padded masks (i.e. the Trues are on the left and the Falses on the right), which is required by CuDNN and is the normal use case for sequences.
- Lowered verification precision to 1e-5 for `bidirectional_test.py`, which now matches all the other RNN tests.
- Allowed a fallback for the int8 × int8 matmul `torch._int_mm`, as it is not supported with CUDA 13.
- Turned off CuDNN's TF32 mode, as it caused numerical differences that made some tests fail.
- Skipped broken LSTM tests; previously, the issue was hidden by the fallback to the non-CuDNN implementation.
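The TF32 part of the list can be sketched as follows (these are PyTorch's standard backend switches, not code copied from the PR):

```python
import torch

# On Ampere-class GPUs such as the L4, TF32 matmuls trade precision for
# speed; the resulting numerical drift broke tight test tolerances, so both
# the CuDNN and cuBLAS TF32 paths are turned off.
torch.backends.cudnn.allow_tf32 = False
torch.backends.cuda.matmul.allow_tf32 = False
```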
We are migrating to GPU custom runners for GPU tests instead of Kokoro.
This will be done one backend at a time as the other backends require more fixes. However, the `gpu_tests.yml` file has logic for JAX and TensorFlow already.
Problem: when a PR is approved, `google-ml-butler` adds 2 labels in this order: `kokoro:force-run` and `ready to pull`. The TPU tests workflow is triggered twice, once for each label, but it is skipped for `ready to pull`. While the actual TPU tests are still run, the UI shows the skipped workflow because it was triggered last, so it's not directly possible to see the results of the TPU tests.
Solution: change the workflow to trigger when a label is removed. This works because Kokoro immediately removes the label once it's set, and it only removes one label at a time.
We will probably need to revisit once we migrate off of kokoro.
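As a sketch (job names assumed), the workflow reacts to the `unlabeled` event and runs only when the removed label is `kokoro:force-run`, so the `ready to pull` removal no longer shadows the real TPU run in the UI:

```yaml
on:
  pull_request:
    types: [unlabeled]
jobs:
  tpu-tests:
    if: github.event.label.name == 'kokoro:force-run'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
```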