# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from collections import OrderedDict
import paddle
from paddle.distributed.auto_parallel.process_mesh import ProcessMesh
from paddle.distributed.auto_parallel.static.dist_attribute import (
OperatorDistAttr,
TensorDistAttr,
)
from paddle.distributed.auto_parallel.static.operators.common import (
is_data_parallel_reduce_op,
is_data_parallel_scale_op,
)
from paddle.distributed.auto_parallel.static.utils import (
find_higher_order_backward_op,
get_var_numel,
insert_dependencies_for_vars,
is_forward_op,
is_loss_grad_op,
is_optimize_op,
ring_id_to_process_group,
)
from paddle.distributed.fleet.meta_optimizers.common import OP_ROLE_KEY, OpRole
from paddle.static import default_main_program
from paddle.utils import unique_name
from .pass_base import PassBase, PassType, register_pass
# add new optimizers supporting rescale_grad here
__rescale_grad_supported_opts__ = [
'lars_momentum',
'sparse_momentum',
'dgc_momentum',
'momentum',
'merge_momentum',
]
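# For these optimizers, rescale_grad multiplies the gradient before the update.
# A rough momentum-style sketch (see each op's definition for exact semantics):
#   velocity = mu * velocity + rescale_grad * grad
#   param = param - lr * velocity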
# a heuristic limit on the number of dp comm streams
__max_stream_num_allow__ = 16
@register_pass("auto_parallel_data_parallel_optimization")
class DataParallelOptimizationPass(PassBase):
"""
Apply optimizations specialized for data parallelism in Auto Parallel:
1. prune grad scaling
2. overlap comm and calc
3. fuse allreduce
"""
def __init__(self):
super().__init__()
# NOTE: this pass does not rely on the loss and param_grads attributes
self.set_attr("dist_context", None)
self.set_attr("global_rank", -1)
self.set_attr("use_sharding", False)
# {grad1: group1, grad2: group1, grad3: group2}
# the insertion order is recorded and used for the fused grad memory layout
self._grad_name_to_group_map = OrderedDict()
# {group1:[grad1, grad2] , group2:[grad3]}
self._group_to_grad_name_map = OrderedDict()
self._support_rescale_grad = False
def _check_self(self):
if self.get_attr("dist_context") is None:
return False
if (not isinstance(self.get_attr("global_rank"), int)) or self.get_attr(
"global_rank"
) < 0:
return False
return True
def _check_conflict(self, other_pass):
return True
def _type(self):
return PassType.COMM_OPT
def _apply_single_impl(self, main_program, startup_program, context):
self.dist_context = self.get_attr("dist_context")
self.global_rank = int(self.get_attr("global_rank"))
self.use_sharding = self.get_attr("use_sharding")
self.coalesce_prefix = 'coalesce_grad'
self.gradient_sync_stream = "gradient_sync_stream"
with paddle.static.program_guard(main_program, startup_program):
self._analyze_program()
# TODO refactor here to first fuse then overlap
if self.is_data_parallel_applied():
self._prune_grad_scaling()
self._calc_comm_overlap()
grad_group = self._fuse_allreduce()
self._add_dependencies(grad_group)
self.summary(grad_group)
def _prune_grad_scaling(self):
if not self._could_be_prune():
return
if self._all_dp_groups_same_degree():
self._scale_backward_initial_grad()
else:
self._update_opt_rescale_grad()
self._remove_grad_scaling()
def _calc_comm_overlap(self):
if not self._could_be_overlap():
return
self._comms_overlap_calc()
self._calc_wait_comms()
def _fuse_allreduce(self):
if not self._could_be_fuse():
return []
grad_group = self._group_grads()
self._update_program(grad_group)
return grad_group
def _analyze_program(self):
"""
build two maps
{param_grad_name: data_parallel_group}
{pdata_parallel_group: aram_grad_name}
"""
block = default_main_program().global_block()
ops = block.ops
scaled_grads = []
for op in ops:
if is_data_parallel_reduce_op(op):
grad_name = op.output_arg_names[0]
if grad_name in self._grad_name_to_group_map:
continue
assert op.has_attr("ring_id"), (
f"Unexpected: comm op [{op}] has NOT ring id."
)
group = ring_id_to_process_group(op.attr("ring_id"))
assert group is not None, (
f"Unexpected: data parallel group of [{grad_name}] from op [{op}] is None"
)
self._grad_name_to_group_map[grad_name] = group
if group not in self._group_to_grad_name_map:
self._group_to_grad_name_map[group] = [grad_name]
else:
self._group_to_grad_name_map[group].append(grad_name)
elif is_data_parallel_scale_op(op):
grad_name = op.output_arg_names[0]
scaled_grads.append(grad_name)
# TODO: support multiple optimizers in one network in the future.
# Here we assume the optimizer is unique in the network.
elif (
is_optimize_op(op)
and op.type in __rescale_grad_supported_opts__
):
self._support_rescale_grad = True
not_synchronized_grads = []
for grad_name in scaled_grads:
if grad_name not in self._grad_name_to_group_map:
not_synchronized_grads.append(grad_name)
assert len(not_synchronized_grads) == 0, (
f"Unexpected: gradients [{not_synchronized_grads}] is scaled BUT NOT synchronized."
)
def is_data_parallel_applied(self):
return len(self._group_to_grad_name_map) > 0
def _could_be_prune(self):
return self.dist_context.gradient_scale and (
self._support_rescale_grad or self._all_dp_groups_same_degree()
)
def _all_dp_groups_same_degree(self):
return (
len(
{
len(group.ranks)
for group in self._group_to_grad_name_map.keys()
}
)
== 1
)
def _scale_backward_initial_grad(self):
block = default_main_program().global_block()
dp_degree = len(next(iter(self._group_to_grad_name_map.keys())).ranks)
for idx, op in reversed(list(enumerate(block.ops))):
if is_loss_grad_op(op):
assert op.type == 'fill_constant', (
"loss_grad_op must be fill_constant op, "
f"but this op is {op.type}"
)
assert op.has_attr('value')
loss_scale = float(op.attr('value'))
loss_scale = loss_scale / dp_degree
op._set_attr('value', loss_scale)
break
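# Worked example: with dp_degree = 4, the fill_constant value 1.0 of the
# initial loss grad becomes 0.25, so the later per-grad scale op (grad / 4)
# can be pruned without changing the math.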
def _remove_grad_scaling(self):
block = default_main_program().global_block()
for op_idx, op in reversed(list(enumerate(block.ops))):
if is_data_parallel_scale_op(op):
block._remove_op(op_idx, False)
block._sync_with_cpp()
def _update_opt_rescale_grad(self):
block = default_main_program().global_block()
scaled_grads = set()
for idx, op in reversed(list(enumerate(block.ops))):
if (
is_optimize_op(op)
and op.type in __rescale_grad_supported_opts__
):
assert op.has_attr('rescale_grad'), (
f"Unexpected: op [{op}] is supported to have [rescale_grad] attribute."
)
assert len(op.input("Grad")) == 1, (
f"Unexpected: op [{op}] is supported to have only one input grad var."
)
grad_name = op.input("Grad")[0]
dp_degree = len(
list(self._grad_name_to_group_map[grad_name].ranks)
)
scaled_grads.add(grad_name)
rescale_grad = float(op.attr('rescale_grad')) / dp_degree
op._set_attr('rescale_grad', rescale_grad)
assert scaled_grads == set(self._grad_name_to_group_map.keys()), (
f"Unexpected: gradients [{set(self._grad_name_to_group_map.keys()) - scaled_grads}] are unscaled."
)
def _could_be_overlap(self):
# NOTE: currently each NCCL communicator uses its own CUDA stream, so if
# there are too many dp groups, too many streams would have to be created
# and synchronized.
# Revise this when the framework supports custom streams in static graph mode.
num_dp_comm_stream = len(set(self._group_to_grad_name_map.keys()))
if num_dp_comm_stream > __max_stream_num_allow__:
return False
if self.use_sharding:
return False
return True
def _comms_overlap_calc(self):
# TODO: support the InterpreterCore executor for overlap.
# InterpreterCore has its own overlapping logic, which differs from
# the use_calc_stream mechanism.
block = default_main_program().global_block()
# make the comm stream wait for the calc stream to finish
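# Illustrative op sequence for one gradient (sketch):
#   before: grad computed -> all_reduce(use_calc_stream=True)
#   after:  grad computed -> c_wait_compute -> all_reduce(use_calc_stream=False)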
for idx, op in reversed(list(enumerate(block.ops))):
if is_data_parallel_reduce_op(op):
assert op.has_attr('ring_id')
op._set_attr('use_calc_stream', False)
ring_id = op.attr("ring_id")
block._insert_op_without_sync(
idx,
type='c_wait_compute',
inputs={'X': []},
outputs={'Out': []},
attrs={'op_role': OpRole.Backward, 'ring_id': ring_id},
)
block._sync_with_cpp()
def _calc_wait_comms(self):
# NOTE: disabled via early return; the required synchronization is instead
# expressed through the dependency ops inserted in _add_dependencies.
return
block = default_main_program().global_block()
# NOTE: the naive overlap implementation in static hybrid parallel only syncs
# the comm stream at the end of the Backward phase, relying on the strong
# constraint that no communicated gradient is used again within Backward.
# This constraint breaks for scenarios like Weight-Sharing and Higher-Order
# Differentiation, where a gradient is consumed between the allreduce kernel's
# submission to the comm stream and the end-of-Backward synchronization.
# To support overlapping in those cases, comm-stream synchronization should be
# inserted according to where the communicated gradients are actually used.
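# Hypothetical weight-sharing hazard, for illustration:
#   all_reduce(emb@GRAD)        # submitted to the comm stream
#   some_backward_op(emb@GRAD)  # reads emb@GRAD on the calc stream
# Without a c_wait_comm in between, the op may read a partially reduced grad.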
ring_id_to_un_sync_grad_map = {}
op_idx_to_sync_ring_id_map = {}
for group in self._group_to_grad_name_map.keys():
ring_id_to_un_sync_grad_map[group.id] = []
# analyze where synchronization is needed
for i, op in enumerate(block.ops):
if is_data_parallel_reduce_op(op):
ring_id = op.attr("ring_id")
grad_name = op.output_arg_names[0]
ring_id_to_un_sync_grad_map[ring_id].append(grad_name)
elif is_data_parallel_scale_op(op):
continue
# other ops that might use communicating grad
else:
for input_var_name in op.input_arg_names:
for (
ring_id,
unsync_grad_names,
) in ring_id_to_un_sync_grad_map.items():
if input_var_name in unsync_grad_names:
# need to sync before op_i
if i in op_idx_to_sync_ring_id_map:
op_idx_to_sync_ring_id_map[i].append(ring_id)
else:
op_idx_to_sync_ring_id_map[i] = [ring_id]
# all grads in this comm stream are synced
ring_id_to_un_sync_grad_map[ring_id] = []
# insert synchronization
indices = list(op_idx_to_sync_ring_id_map.keys())
# TODO: the synchronization could be optimized: record an event when a
# gradient's communication finishes and wait only on that event.
# BUT paddle static graph mode currently has no op for recording an event
# only, so here we wait for all kernels in the comm stream to finish,
# which is suboptimal.
for i in sorted(indices, reverse=True):
for ring_id in op_idx_to_sync_ring_id_map[i]:
block._insert_op_without_sync(
i,
type='c_wait_comm',
inputs={'X': []},
outputs={'Out': []},
attrs={'op_role': OpRole.Backward, 'ring_id': ring_id},
)
block._sync_with_cpp()
def _could_be_fuse(self):
# TODO: support gradient fusion with higher-order gradients; this requires
# analysing gradient dependencies in the backward pass.
if find_higher_order_backward_op(default_main_program()):
return False
if self.use_sharding:
return False
return True
def _group_grads(self):
"""
conditions for gradients to be grouped:
1. group size < max_fuse_numel
2. same dp group
3. same dtype
4. dependency: grad would NOT be used by other ops within group segment
gradients inside same group would be fuse into one coalesce tensor
"""
block = default_main_program().global_block()
ops = block.ops
# group individual grad vars
# TODO: consider fusing gradients for sharding reduce
# TODO: let the user set fuse_grad_size
# emb = 50000 * h, ffn = 8 * h * h, mha = 4 * h * h
h = 2048
ffn_numel = 2 * (4 * h) * h
mha_numel = 3 * h * h + h * h
max_fuse_numel = ffn_numel + mha_numel
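# With h = 2048: ffn_numel = 2 * (4 * 2048) * 2048 = 33_554_432 and
# mha_numel = 4 * 2048 * 2048 = 16_777_216, so max_fuse_numel is
# 50_331_648 elements (~192 MiB per group for fp32 gradients).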
grad_groups = []
cur_group = GradientsGroup(ops, max_fuse_numel)
grouped_grad_names = set()
def collect_group(cur_group, grad_var, ring_id, i):
if len(cur_group.gradients) == 0:
cur_group = None
else:
cur_group.finalize()
grad_groups.append(cur_group)
new_group = GradientsGroup(ops, max_fuse_numel)
if grad_var:
new_group.add(grad_var, ring_id, i)
grouped_grad_names.add(grad_var.name)
return new_group
def op_depend_on_group(op, group):
vars_ = set(op.input_arg_names + op.output_arg_names)
grad_names = {grad.name for grad in group.gradients}
return len(vars_.intersection(grad_names)) > 0
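# Illustrative grouping over hypothetical gradients (see acceptable()):
#   g0 (fp32, ring 0), g1 (fp32, ring 0), g2 (fp16, ring 0), g3 (fp32, ring 1)
#   -> [g0, g1] (same ring & dtype), [g2] (dtype differs), [g3] (ring differs)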
for i, op in enumerate(ops):
if is_data_parallel_reduce_op(op):
ring_id = op.attr("ring_id")
grad_name = op.output_arg_names[0]
grad_var = block.var(grad_name)
grad_numel = get_var_numel(grad_var)
if cur_group.acceptable(grad_var, ring_id):
assert grad_name not in grouped_grad_names
grouped_grad_names.add(grad_name)
cur_group.add(grad_var, ring_id, i)
else:
cur_group = collect_group(cur_group, grad_var, ring_id, i)
else:
if op_depend_on_group(op, cur_group):
cur_group = collect_group(cur_group, None, None, None)
# collect last group
collect_group(cur_group, None, None, None)
return grad_groups
def _update_program(self, grad_groups):
block = default_main_program().global_block()
remove_op_types = [
'scale',
'all_reduce',
'c_wait_compute',
]
for i, group in enumerate(grad_groups[::-1]):
# a single-gradient group (e.g. one big tensor) stays unfused
if len(group.gradients) <= 1:
group.coalesce_var = group.gradients[0]
continue
ref_process_mesh = set()
concated_shapes = []
concated_ranks = []
for grad_ in group.gradients:
grad_dist_attr = (
self.dist_context.get_tensor_dist_attr_for_program(grad_)
)
ref_process_mesh.update(
set(grad_dist_attr.process_mesh.process_ids)
)
shape = grad_.shape
concated_shapes.extend(shape)
concated_ranks.append(len(shape))
# create coalesce tensor
group.coalesce_var = block.create_var(
name=unique_name.generate(self.coalesce_prefix + f'_{i}'),
dtype=group.dtype,
persistable=False,
stop_gradient=True,
)
tensor_dist_attr = TensorDistAttr()
tensor_dist_attr.process_mesh = ProcessMesh(list(ref_process_mesh))
tensor_dist_attr.dims_mapping = []
self.dist_context.set_tensor_dist_attr_for_program(
group.coalesce_var, tensor_dist_attr
)
# update allreduce & scale op
if group.scale_op_idx != -1:
scale_op = block.ops[group.scale_op_idx]
assert scale_op.type == 'scale', (
f"should found scale op but found {scale_op}"
)
scale_op._rename_input(
scale_op.input_arg_names[0], group.coalesce_var.name
)
scale_op._rename_output(
scale_op.output_arg_names[0], group.coalesce_var.name
)
allreduce_op = block.ops[group.allreduce_op_idx]
assert (
allreduce_op.type == 'all_reduce'
and allreduce_op.attr('reduce_type')
== paddle.distributed.ReduceOp.SUM
), f"should found all_reduce sum op but found {allreduce_op}"
allreduce_op_dist_attr = (
self.dist_context.get_op_dist_attr_for_program(allreduce_op)
)
old_in_name = allreduce_op.input_arg_names[0]
new_in_name = group.coalesce_var.name
allreduce_op._rename_input(old_in_name, new_in_name)
input_dist_attr = allreduce_op_dist_attr.get_input_dist_attr(
old_in_name
)
allreduce_op_dist_attr.set_input_dist_attr(
new_in_name, input_dist_attr
)
old_out_name = allreduce_op.output_arg_names[0]
new_out_name = group.coalesce_var.name
allreduce_op._rename_output(old_out_name, new_out_name)
out_dist_attr = allreduce_op_dist_attr.get_output_dist_attr(
old_out_name
)
allreduce_op_dist_attr.set_output_dist_attr(
new_out_name, out_dist_attr
)
# remove the now-unused ops
remove_op_indices = (
group.remove_wait_op_indices
+ group.remove_allreduce_op_indices
+ group.remove_scale_op_indices
)
for idx in sorted(remove_op_indices, reverse=True):
assert block.ops[idx].type in remove_op_types, (
f"Unexpected: try to remove op {block.ops[idx]}"
)
block._remove_op(idx, False)
# insert coalesce op
grad_names = [grad.name for grad in group.gradients]
coalesce_op = block._insert_op_without_sync(
group.coalesce_op_idx,
type="coalesce_tensor",
inputs={"Input": grad_names},
outputs={
"Output": grad_names,
"FusedOutput": group.coalesce_var,
},
attrs={
"copy_data": False,
"use_align": True,
"dtype": group.dtype,
"concated_shapes": concated_shapes,
"concated_ranks": concated_ranks,
OP_ROLE_KEY: OpRole.Backward,
},
)
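# e.g. fusing grads of shapes [4, 8] and [16] yields
# concated_shapes = [4, 8, 16] and concated_ranks = [2, 1], which lets
# coalesce_tensor recover each grad's view inside the fused buffer.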
op_dist_attr = OperatorDistAttr()
op_dist_attr.impl_idx = 0
op_dist_attr.impl_type = "default"
op_dist_attr.process_mesh = ProcessMesh(list(ref_process_mesh))
for in_name in coalesce_op.input_arg_names:
in_var = block.var(in_name)
in_var_dist_attr = (
self.dist_context.get_tensor_dist_attr_for_program(in_var)
)
op_dist_attr.set_input_dims_mapping(
in_name, in_var_dist_attr.dims_mapping
)
for out_name in coalesce_op.output_arg_names:
out_var = block.var(out_name)
out_var_dist_attr = (
self.dist_context.get_tensor_dist_attr_for_program(out_var)
)
op_dist_attr.set_output_dims_mapping(
out_name, out_var_dist_attr.dims_mapping
)
self.dist_context.set_op_dist_attr_for_program(
coalesce_op, op_dist_attr
)
block._sync_with_cpp()
def _add_dependencies(self, grad_groups):
# NOTE: auto_parallel currently has to support two executors: the sequential
# executor (old exe) and the graph-based multi-stream executor (standalone
# exe). This function is only for the standalone exe. Refactor once only
# one executor remains.
if len(grad_groups) == 0:
return
block = default_main_program().global_block()
# Build maps
coalesce_to_vars_map = {}
for group in grad_groups:
coalesce_to_vars_map[group.coalesce_var.name] = group
# analyze dependencies
dep_map = {}
for idx, op in reversed(list(enumerate(block.ops))):
if is_forward_op(op):
break
if is_optimize_op(op):
continue
if is_data_parallel_reduce_op(op):
coalesce_var_name = op.output_arg_names[0]
if self.coalesce_prefix in coalesce_var_name:
group = coalesce_to_vars_map[coalesce_var_name]
dep_map[idx] = [
(
idx,
group.gradients[-1],
group.coalesce_var,
op.attr(OP_ROLE_KEY),
)
]
dep_map[idx].append(
(
idx + 1,
group.coalesce_var,
group.gradients,
op.attr(OP_ROLE_KEY),
)
)
# insert dependency op
indices = sorted(dep_map.keys(), reverse=True)
for i in indices:
for idx, prior_vars, post_vars, op_role in dep_map[i][::-1]:
depend_op = insert_dependencies_for_vars(
block,
idx,
prior_vars,
post_vars,
self.dist_context,
op_role,
is_recompute=False,
sync=False,
op_namescope="data_parallel_overlap_dep",
)
depend_op.dist_attr.execution_stream = self.gradient_sync_stream
block._sync_with_cpp()
# remove naive synchronization & assign allreduce stream
def remove_cond(op):
if op.type != "c_wait_compute":
return False
if len(op.input_arg_names) != 0:
return False
if len(op.output_arg_names) != 0:
return False
return True
for idx, op in reversed(list(enumerate(block.ops))):
if is_data_parallel_reduce_op(op):
op._set_attr('use_calc_stream', True)
op.dist_attr.execution_stream = self.gradient_sync_stream
if remove_cond(op):
block._remove_op(idx, sync=False)
block._sync_with_cpp()
def summary(self, grad_groups=None):
# TODO: add logger module
import logging
if grad_groups is None:
grad_groups = []
self._logger = logging.getLogger()
self._logger.propagate = False
if not self._logger.handlers:
self._logger.setLevel(logging.INFO)
log_handler = logging.StreamHandler()
log_format = logging.Formatter(
'[%(levelname)s %(asctime)s %(filename)s:%(lineno)d] %(message)s'
)
log_handler.setFormatter(log_format)
self._logger.addHandler(log_handler)
if len(grad_groups) > 0:
self._logger.info("Data Parallel Optimization: ")
self._logger.info(
f" {len(self._grad_name_to_group_map.keys())} Allreduce ops are fused into {len(grad_groups)} coalesce allreduce ops."
)
self._logger.debug("gradient fusing group are following: ")
fused_grads = set()
for i, group in enumerate(grad_groups):
self._logger.debug(
f"coalesce gradient [{i}] is composed by: {[grad.name for grad in group.gradients]}"
)
fused_grads.update([grad.name for grad in group.gradients])
individual_grads = set(self._grad_name_to_group_map.keys()) - set(
fused_grads
)
self._logger.debug(
f"the following [{len(individual_grads)}] gradients are not fused: "
)
self._logger.debug(f"individual gradient {individual_grads}")
class GradientsGroup:
def __init__(self, ops, max_group_size):
self.max_group_size = max_group_size
self.ops = ops
self.gradients = []
self.numel = 0
self.dtype = None
self.ring_id = None
self.coalesce_var = None
self.coalesce_op_idx = -1
self.allreduce_op_idx = -1
self.scale_op_idx = -1
self.remove_wait_op_indices = []
self.remove_allreduce_op_indices = []
self.remove_scale_op_indices = []
def acceptable(self, grad_var, ring_id):
if len(self.gradients) == 0:
return True
if ring_id != self.ring_id:
return False
if get_var_numel(grad_var) + self.numel > self.max_group_size:
return False
if grad_var.dtype != self.dtype:
return False
return True
def add(self, grad_var, ring_id, i):
self.gradients.append(grad_var)
self.ring_id = ring_id
self.dtype = grad_var.dtype
self.numel += get_var_numel(grad_var)
# mark this per-grad allreduce (and its auxiliary ops below) for removal
self.remove_allreduce_op_indices.append(i)
# NOTE: this pass relies on the synchronization added by previous passes
# (same stream, or calc_wait_comm & comm_wait_calc) to guarantee the
# correctness of the comm-calc execution order, so the calc_wait_comm
# ops should be kept.
grad_op_idx = i - 1
if i > 0 and self.ops[i - 1].type == 'c_wait_compute':
self.remove_wait_op_indices.append(i - 1)
grad_op_idx -= 1
if i + 1 < len(self.ops) and is_data_parallel_scale_op(self.ops[i + 1]):
self.remove_scale_op_indices.append(i + 1)
if len(self.gradients) == 1:
# TODO: remove this temporary hack for Tensor Parallel; the logic for
# finding grad_op should be more general.
if (
self.ops[grad_op_idx].type == "all_reduce"
and self.ops[grad_op_idx].attr("reduce_type")
== paddle.distributed.ReduceOp.SUM
):
grad_op_idx -= 1
grad_op = self.ops[grad_op_idx]
assert grad_var.name in grad_op.output_arg_names, (
f"grad [{grad_var.name}] should be output of {grad_op}"
)
self.coalesce_op_idx = grad_op_idx
def finalize(self):
self.allreduce_op_idx = self.remove_allreduce_op_indices.pop()
if len(self.remove_wait_op_indices) > 1:
self.remove_wait_op_indices.pop()
if len(self.remove_scale_op_indices) > 1:
self.scale_op_idx = self.remove_scale_op_indices.pop()