---
title: 'Getting Started'
permalink: /getting-started/
excerpt: 'First steps with DeepSpeed'
tags: getting-started
---

## Installation

* Installing is as simple as `pip install deepspeed`; [see more details](/tutorials/advanced-install/).
* To get started with DeepSpeed on AzureML, please see the [AzureML Examples GitHub](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/deepspeed).
* DeepSpeed has direct integrations with [HuggingFace Transformers](https://github.com/huggingface/transformers) and [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning). HuggingFace Transformers users can accelerate their models with DeepSpeed through a simple `--deepspeed` flag plus a config file ([see more details](https://huggingface.co/docs/transformers/deepspeed)). PyTorch Lightning provides access to DeepSpeed through the Lightning Trainer ([see more details](https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html?highlight=deepspeed#deepspeed)).
* DeepSpeed on AMD can be used via our [ROCm images](https://hub.docker.com/r/deepspeed/rocm501/tags), e.g., `docker pull deepspeed/rocm501:ds060_pytorch110`.
* DeepSpeed also supports Intel Xeon CPU, Intel Data Center Max Series XPU, Intel Gaudi HPU, Huawei Ascend NPU, etc.; please refer to the [accelerator setup guide](/tutorials/accelerator-setup-guide/).

## Writing DeepSpeed Models

DeepSpeed model training is accomplished using the DeepSpeed engine. The engine
can wrap any arbitrary model of type `torch.nn.Module` and has a minimal set of APIs
for training and checkpointing the model. Please see the tutorials for detailed
examples.

To initialize the DeepSpeed engine:

```python
model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                     model=model,
                                                     model_parameters=params)
```

`deepspeed.initialize` ensures that all of the necessary setup required for
distributed data parallel or mixed precision training is done
appropriately under the hood. In addition to wrapping the model, DeepSpeed can
construct and manage the training optimizer, data loader, and the learning rate
scheduler based on the parameters passed to `deepspeed.initialize` and the
DeepSpeed [configuration file](#deepspeed-configuration). Note that DeepSpeed automatically executes the learning rate schedule at every training step.

If you already have a distributed environment set up, you need to replace:

```python
torch.distributed.init_process_group(...)
```

with:

```python
deepspeed.init_distributed()
```

The default is to use the NCCL backend, which DeepSpeed has been thoroughly tested with, but you can also [override the default](https://deepspeed.readthedocs.io/en/latest/initialize.html#distributed-initialization).

If you don't need the distributed environment set up until after `deepspeed.initialize()`, you don't have to call this function, because DeepSpeed automatically initializes the distributed environment during `initialize`. Either way, you will need to remove `torch.distributed.init_process_group` if you already had it in place.

### Training

Once the DeepSpeed engine has been initialized, it can be used to train the
model using three simple APIs for forward propagation (callable object), backward
propagation (`backward`), and weight updates (`step`).

```python
for step, batch in enumerate(data_loader):
    # forward() method
    loss = model_engine(batch)

    # runs backpropagation
    model_engine.backward(loss)

    # weight update
    model_engine.step()
```

Under the hood, DeepSpeed automatically performs the necessary operations
required for distributed data parallel training, in mixed precision, with a
pre-defined learning rate scheduler:

- Gradient Averaging: in distributed data parallel training, `backward`
  ensures that gradients are averaged across data parallel processes after
  training on a batch of `train_batch_size` samples.

- Loss Scaling: in FP16/mixed precision training, the DeepSpeed
  engine automatically handles scaling the loss to avoid precision loss in the
  gradients.

- Learning Rate Scheduler: when using a DeepSpeed learning rate scheduler (specified in the `ds_config.json` file), DeepSpeed calls the `step()` method of the scheduler at every training step (when `model_engine.step()` is executed). When not using DeepSpeed's learning rate scheduler:
  - if the schedule is supposed to execute at every training step, then the user can pass the scheduler to `deepspeed.initialize` when initializing the DeepSpeed engine and let DeepSpeed manage it for update or save/restore.
  - if the schedule is supposed to execute at any other interval (e.g., training epochs), then the user should NOT pass the scheduler to DeepSpeed during initialization and must manage it explicitly.
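
The loss-scaling item above can be illustrated without a GPU. The sketch below is not DeepSpeed's implementation: it only uses Python's `struct` module to round values to IEEE half precision and shows why a small gradient survives when it is multiplied by a scale factor before the FP16 cast. The gradient value and the scale of `2**16` are arbitrary choices made for illustration.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8         # hypothetical gradient, too small for FP16
loss_scale = 2.0 ** 16   # hypothetical scale factor

unscaled = to_fp16(tiny_grad)              # rounds to zero in FP16
scaled = to_fp16(tiny_grad * loss_scale)   # large enough to be stored in FP16
recovered = scaled / loss_scale            # unscale in higher precision

print(unscaled, recovered)
```

The unscaled value is lost entirely, while the scaled one can be divided by the same factor afterwards to recover a close approximation of the original gradient.
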

### Model Checkpointing

Saving and loading the training state is handled via the `save_checkpoint` and
`load_checkpoint` APIs in DeepSpeed, which take two arguments to uniquely
identify a checkpoint:

- `ckpt_dir`: the directory where checkpoints will be saved.
- `ckpt_id`: an identifier that uniquely identifies a checkpoint in the directory.
  In the following code snippet, we use the loss value as the checkpoint identifier.

```python
# load checkpoint
_, client_sd = model_engine.load_checkpoint(args.load_dir, args.ckpt_id)
step = client_sd['step']

# advance data loader to ckpt step
dataloader_to_step(data_loader, step + 1)

for step, batch in enumerate(data_loader):

    # forward() method
    loss = model_engine(batch)

    # runs backpropagation
    model_engine.backward(loss)

    # weight update
    model_engine.step()

    # save checkpoint every `save_interval` steps
    if step % args.save_interval == 0:
        client_sd['step'] = step
        ckpt_id = loss.item()
        model_engine.save_checkpoint(args.save_dir, ckpt_id, client_sd=client_sd)
```

DeepSpeed can automatically save and restore the model, optimizer, and the
learning rate scheduler states while hiding away these details from the user.
However, the user may want to save additional data that are
unique to a given model training. To support these items, `save_checkpoint`
accepts a client state dictionary `client_sd` for saving. These items can be
retrieved from `load_checkpoint` as a return argument. In the example above,
the `step` value is stored as part of the `client_sd`.
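
The checkpointing snippet calls `dataloader_to_step` without defining it, and it is not part of the DeepSpeed API. A minimal sketch of such a helper follows, assuming the data loader is consumed as a plain Python iterator and that skipping batches one at a time is acceptable. The function body and the `range` stand-in for a data loader are illustrative assumptions.

```python
from itertools import islice


def dataloader_to_step(data_iter, step):
    """Advance an iterator past its first `step` batches (hypothetical helper)."""
    # islice(it, step, step) yields nothing but consumes `step` items.
    next(islice(data_iter, step, step), None)
    return data_iter


batches = iter(range(10))        # stand-in for iter(data_loader)
dataloader_to_step(batches, 3)   # skip batches 0, 1 and 2
first_resumed_batch = next(batches)
```
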

Important: all processes must call `save_checkpoint`, not just the process with rank 0, because
each process needs to save its master weights and scheduler+optimizer states. The call will hang
waiting to synchronize with other processes if only the process with rank 0 makes it.
{: .notice--info}

## DeepSpeed Configuration

DeepSpeed features can be enabled, disabled, or configured using a config JSON
file that should be specified as `args.deepspeed_config`. A sample config file
is shown below. For a full set of features see the [API
doc](/docs/config-json/).

```json
{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015
    }
  },
  "fp16": {
    "enabled": true
  },
  "zero_optimization": true
}
```
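
If the configuration is produced by a script rather than written by hand, a sketch like the following may help. It mirrors the sample config and derives `train_batch_size` as the per-GPU micro batch size times the gradient accumulation steps times the number of GPUs. The GPU count and micro batch size are assumed values chosen so that the result matches the sample.

```python
import json

world_size = 4            # assumed number of GPUs (data parallel processes)
micro_batch_per_gpu = 2   # assumed per-GPU micro batch size
grad_accum_steps = 1

ds_config = {
    # global batch = micro batch per GPU * accumulation steps * GPUs
    "train_batch_size": micro_batch_per_gpu * grad_accum_steps * world_size,
    "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
    "gradient_accumulation_steps": grad_accum_steps,
    "optimizer": {"type": "Adam", "params": {"lr": 0.00015}},
    "fp16": {"enabled": True},
    "zero_optimization": True,  # kept as in the sample config
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

Writing `config_text` to `ds_config.json` gives a file that can be passed with `--deepspeed_config`.
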

# Launching DeepSpeed Training

DeepSpeed installs the entry point `deepspeed` to launch distributed training.
We illustrate an example usage of DeepSpeed with the following assumptions:

1. You have already integrated DeepSpeed into your model
2. `client_entry.py` is the entry script for your model
3. `client args` are the `argparse` command line arguments
4. `ds_config.json` is the configuration file for DeepSpeed

## Resource Configuration (multi-node)

DeepSpeed configures multi-node compute resources with hostfiles that are compatible with
[OpenMPI](https://www.open-mpi.org/) and [Horovod](https://github.com/horovod/horovod).
A hostfile is a list of _hostnames_ (or SSH aliases), which are machines accessible via passwordless
SSH, and _slot counts_, which specify the number of GPUs available on the system. For
example,

```
worker-1 slots=4
worker-2 slots=4
```

specifies that two machines named _worker-1_ and _worker-2_ each have four GPUs to use
for training.

Hostfiles are specified with the `--hostfile` command line option. If no hostfile is
specified, DeepSpeed searches for `/job/hostfile`. If no hostfile is specified or found,
DeepSpeed queries the number of GPUs on the local machine to discover the number of local
slots available.

The following command launches a PyTorch training job across all available nodes and GPUs
specified in `myhostfile`:

```bash
deepspeed --hostfile=myhostfile <client_entry.py> <client args> \
  --deepspeed --deepspeed_config ds_config.json
```

Alternatively, DeepSpeed allows you to restrict distributed training of your model to a
subset of the available nodes and GPUs. This feature is enabled through two command line
arguments: `--num_nodes` and `--num_gpus`. For example, distributed training can be
restricted to use only two nodes with the following command:

```bash
deepspeed --num_nodes=2 \
  <client_entry.py> <client args> \
  --deepspeed --deepspeed_config ds_config.json
```

You can instead include or exclude specific resources using the `--include` and
`--exclude` flags. For example, to use all available resources except GPU 0 on node
_worker-2_ and GPUs 0 and 1 on _worker-3_:

```bash
deepspeed --exclude="worker-2:0@worker-3:0,1" \
  <client_entry.py> <client args> \
  --deepspeed --deepspeed_config ds_config.json
```

Similarly, you can use only GPUs 0 and 1 on _worker-2_:

```bash
deepspeed --include="worker-2:0,1" \
  <client_entry.py> <client args> \
  --deepspeed --deepspeed_config ds_config.json
```
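
The `--include` and `--exclude` strings above separate nodes with `@`, the hostname from its GPU list with `:`, and GPU indices with `,`. The sketch below shows one way such a string could be interpreted against a set of hostfile resources. It is not the launcher's code: the function names are hypothetical, and treating a hostname without a GPU list as "the whole node" is an assumption.

```python
def parse_node_spec(spec):
    """Map each hostname to a list of GPU indices (None means every GPU)."""
    selection = {}
    for node_spec in spec.split("@"):
        host, _, slots = node_spec.partition(":")
        selection[host] = [int(s) for s in slots.split(",")] if slots else None
    return selection


def apply_exclude(resources, spec):
    """Drop the excluded GPUs from {hostname: slot_count} resources."""
    excluded = parse_node_spec(spec)
    remaining = {}
    for host, slot_count in resources.items():
        slots = list(range(slot_count))
        if host in excluded:
            dropped = excluded[host]
            slots = [] if dropped is None else [s for s in slots if s not in dropped]
        if slots:
            remaining[host] = slots
    return remaining


resources = {"worker-1": 4, "worker-2": 4, "worker-3": 4}  # assumed hostfile contents
remaining = apply_exclude(resources, "worker-2:0@worker-3:0,1")
```
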

### Launching without passwordless SSH

DeepSpeed now supports launching training jobs without the need for passwordless SSH. This mode is
particularly useful in cloud environments such as Kubernetes, where flexible container orchestration
is possible, and setting up a leader-worker architecture with passwordless SSH adds unnecessary
complexity.

To use this mode, you need to run the DeepSpeed command separately on all nodes. The command should
be structured as follows:

```bash
deepspeed --hostfile=myhostfile --no_ssh --node_rank=<n> \
  --master_addr=<addr> --master_port=<port> \
  <client_entry.py> <client args> \
  --deepspeed --deepspeed_config ds_config.json
```

- `--hostfile=myhostfile`: Specifies the hostfile that contains information about the nodes and GPUs.
- `--no_ssh`: Enables the no-SSH mode.
- `--node_rank=<n>`: Specifies the rank of the node. This should be a unique integer from 0 to N - 1, where N is the number of nodes.
- `--master_addr=<addr>`: The address of the leader node (rank 0).
- `--master_port=<port>`: The port of the leader node.

In this setup, the hostnames in the hostfile do not need to be reachable via passwordless SSH.
However, the hostfile is still required for the launcher to collect information about the environment,
such as the number of nodes and the number of GPUs per node.

Each node must be launched with a unique `node_rank`, and all nodes must be provided with the address
and port of the leader node (rank 0). This mode causes the launcher to act similarly to the `torchrun`
launcher, as described in the [PyTorch documentation](https://pytorch.org/docs/stable/elastic/run.html).
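
Because the same command has to be issued on every node with a different rank, it can be convenient to generate the per-node command lines from a script. The sketch below only assembles strings from the flags listed above. The leader address `10.0.0.1`, the port `29500`, the two-node count, and the helper name are placeholders assumed for illustration.

```python
import shlex


def no_ssh_command(node_rank, master_addr, master_port, hostfile="myhostfile"):
    """Build the launcher command line for one node (hypothetical helper)."""
    args = [
        "deepspeed",
        f"--hostfile={hostfile}",
        "--no_ssh",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        "client_entry.py",
        "--deepspeed",
        "--deepspeed_config",
        "ds_config.json",
    ]
    return shlex.join(args)


num_nodes = 2  # assumed cluster size
commands = [no_ssh_command(rank, "10.0.0.1", 29500) for rank in range(num_nodes)]
for command in commands:
    print(command)
```

Each printed line is meant to be run on the node whose rank it names, for example from a container entry point.
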

## Multi-Node Environment Variables

When training across multiple nodes we have found it useful to support
propagating user-defined environment variables. By default DeepSpeed will
propagate all NCCL and PYTHON related environment variables that are set. If
you would like to propagate additional variables you can specify them in a
dot-file named `.deepspeed_env` that contains a new-line separated list of
`VAR=VAL` entries. The DeepSpeed launcher will look in the local path you are
executing from and also in your home directory (`~/`). If you would like to
override the default file name, or both its path and name, you
can specify this with the environment variable `DS_ENV_FILE`. This is
mostly useful if you are launching multiple jobs that all require different
variables.

As a concrete example, some clusters require special NCCL variables to be set
prior to training. The user can simply add these variables to a
`.deepspeed_env` file in their home directory that looks like this:

```
NCCL_IB_DISABLE=1
NCCL_SOCKET_IFNAME=eth0
```

DeepSpeed will then make sure that these environment variables are set when
launching each process on every node across the training job.
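
The `VAR=VAL` format described above is simple enough to check before a job is submitted. The following sketch parses such text into a dictionary. It is a stand-alone illustration rather than the launcher's own logic, and rejecting lines without an `=` is an assumption.

```python
def parse_env_file(text):
    """Parse newline-separated VAR=VAL entries into a dict (illustrative)."""
    env = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        name, sep, value = line.partition("=")
        if not sep or not name:
            raise ValueError(f"expected VAR=VAL, got: {raw_line!r}")
        env[name] = value
    return env


env_text = """\
NCCL_IB_DISABLE=1
NCCL_SOCKET_IFNAME=eth0
"""
env_vars = parse_env_file(env_text)
```
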

### MPI and AzureML Compatibility

As described above, DeepSpeed provides its own parallel launcher to help launch
multi-node/multi-gpu training jobs. If you prefer to launch your training job
using MPI (e.g., mpirun), we provide support for this. It should be noted that
DeepSpeed will still use the torch distributed NCCL backend and _not_ the MPI
backend.

To launch your training job with mpirun + DeepSpeed or with AzureML (which uses
mpirun as a launcher backend) you simply need to install the
[mpi4py](https://pypi.org/project/mpi4py/) python package. DeepSpeed will use
this to discover the MPI environment and pass the necessary state (e.g., world
size, rank) to the torch distributed backend.

If you are using model parallelism, pipeline parallelism, or otherwise require
torch.distributed calls before calling `deepspeed.initialize(..)`, we provide
the same MPI support with an additional DeepSpeed API call. Replace your initial
`torch.distributed.init_process_group(..)` call with:

```python
deepspeed.init_distributed()
```

## Resource Configuration (single-node)

When running on a single node (with one or more GPUs),
DeepSpeed _does not_ require a hostfile as described above. If a hostfile is
not detected or passed in, then DeepSpeed will query the number of GPUs on the
local machine to discover the number of slots available. The `--include` and
`--exclude` arguments work as normal, but the user should specify 'localhost'
as the hostname.

Also note that `CUDA_VISIBLE_DEVICES` can be used with `deepspeed` to control
which devices should be used on a single node. So either of these would work
to launch just on devices 0 and 1 of the current node:

```bash
deepspeed --include localhost:0,1 ...
```

```bash
CUDA_VISIBLE_DEVICES=0,1 deepspeed ...
```