Commit Graph

  • 9664b1bc19 [shardformer] hotfix attn mask (#5945) Hongxin Liu 2024-07-29 13:58:27 +08:00
  • c8332b9cb5 Merge pull request #5922 from hpcaitech/kto YeAnbang 2024-07-29 13:27:00 +08:00
  • 6fd9e86864 fix style YeAnbang 2024-07-29 01:29:18 +00:00
  • de1bf08ed0 fix style YeAnbang 2024-07-26 10:07:15 +00:00
  • 8a3ff4f315 fix style YeAnbang 2024-07-26 09:55:15 +00:00
  • ad35a987d3 [Feature] Add a switch to control whether the model checkpoint needs to be saved after each epoch ends (#5941) zhurunhua 2024-07-26 11:15:20 +08:00
  • 2069472e96 [Hotfix] Fix ZeRO typo #5936 Edenzzzz 2024-07-25 09:59:58 +08:00
  • 5fd0592767 [fp8] support all-gather flat tensor (#5932) Hongxin Liu 2024-07-24 16:55:20 +08:00
  • befe3100da [bugfix] colo attn bug fix moe_sp haze188 2024-07-24 08:43:36 +00:00
  • 2d73efdfdd [bugfix] colo attn bug fix haze188 2024-07-24 06:53:24 +00:00
  • 5fb958cc83 [FIX BUG] convert env param to int in (#5934) Gao, Ruiyuan 2024-07-24 10:30:40 +08:00
  • a521ffc9f8 Add n_fused as an input from native_module (#5894) Insu Jang 2024-07-23 11:15:39 -04:00
  • e521890d32 [test] add check hxwang 2024-07-23 09:38:05 +00:00
  • 4b6fbaf956 [moe] deepseek moe sp support haze188 2024-07-23 06:39:49 +00:00
  • 91f84f6a5f [bug] fix: somehow logger hangs the program botbw 2024-07-23 06:17:51 +00:00
  • 9688e19b32 remove real data path YeAnbang 2024-07-22 06:13:02 +00:00
  • b0e15d563e remove real data path YeAnbang 2024-07-22 06:11:38 +00:00
  • 12fe8b5858 refactor evaluation YeAnbang 2024-07-22 05:57:39 +00:00
  • e31d2ebcf7 [test] fix test: test_zero1_2 hxwang 2024-07-22 05:36:20 +00:00
  • c67e553fd3 [moe] remove ops hxwang 2024-07-22 04:00:42 +00:00
  • 05a78d2f41 [chore] solve moe ckpt test failure and some other arg pass failure hxwang 2024-07-22 03:40:34 +00:00
  • c5f582f666 fix test data YeAnbang 2024-07-22 01:31:32 +00:00
  • 4ec17a7cdf [FIX BUG] UnboundLocalError: cannot access local variable 'default_conversation' where it is not associated with a value (#5931) zhurunhua 2024-07-21 19:46:01 +08:00
  • 150505cbb8 Merge branch 'kto' of https://github.com/hpcaitech/ColossalAI into kto YeAnbang 2024-07-19 10:11:05 +00:00
  • d49550fb49 refactor tokenization YeAnbang 2024-07-19 10:10:48 +00:00
  • 9f9e268265 [pre-commit.ci] auto fixes from pre-commit.com hooks pre-commit-ci[bot] 2024-07-19 07:54:40 +00:00
  • c27f5d9731 [chore] minor fix after rebase hxwang 2024-07-19 07:53:40 +00:00
  • 783aafa327 [moe] full test for deepseek and mixtral (pp + sp to fix) hxwang 2024-07-19 06:11:11 +00:00
  • 162e2d935c [moe] finalize test (no pp) hxwang 2024-07-18 13:36:18 +00:00
  • b91cdccf2e moe sp + ep bug fix haze188 2024-07-18 10:08:06 +00:00
  • 8e85523a42 [moe] init moe plugin comm setting with sp hxwang 2024-07-18 08:37:06 +00:00
  • f0599a0c19 [chore] minor fix hxwang 2024-07-18 03:53:51 +00:00
  • 633849f438 [Feature] MoE Ulysses Support (#5918) Haze188 2024-07-18 11:37:56 +08:00
  • c8bf2681e3 [moe] clean legacy code hxwang 2024-07-16 09:08:31 +00:00
  • 8d3d7f3cbd [moe] test deepseek hxwang 2024-07-16 10:10:40 +00:00
  • 335ad3c6fb [moe] implement tp botbw 2024-07-16 06:03:57 +00:00
  • d4a64e355e [test] add mixtral modelling test botbw 2024-07-15 06:43:27 +00:00
  • 18be903ed9 [chore] arg pass & remove drop token hxwang 2024-07-12 09:08:16 +00:00
  • cbcc818d5a [chore] trivial fix botbw 2024-07-12 07:04:17 +00:00
  • 5bc085fc01 [chore] manually revert unintended commit botbw 2024-07-12 03:29:16 +00:00
  • 1b15cc97f5 [moe] add mixtral dp grad scaling when not all experts are activated botbw 2024-07-12 03:27:20 +00:00
  • 2f9bce6686 [moe] implement submesh initialization botbw 2024-07-11 05:50:20 +00:00
  • a613edd517 solve hang when parallel mode = pp + dp haze188 2024-07-11 02:12:44 +00:00
  • 0210bead8c [misc] solve booster hang by rename the variable haze188 2024-07-09 09:44:04 +00:00
  • b303ffe9f3 [zero] solve hang botbw 2024-07-09 08:14:00 +00:00
  • 2431694564 [moe] implement transit between non moe tp and ep botbw 2024-07-08 09:59:46 +00:00
  • dec6e25e99 [test] pass mixtral shardformer test botbw 2024-07-08 05:13:49 +00:00
  • 61109c7843 [zero] solve hang hxwang 2024-07-05 07:19:37 +00:00
  • 000456bf94 [chore] handle non member group hxwang 2024-07-05 07:03:45 +00:00
  • 4fc6f9aa98 [test] mixtra pp shard test hxwang 2024-07-04 06:39:01 +00:00
  • 5a9490a46b [moe] fix plugin hxwang 2024-07-02 09:09:00 +00:00
  • 6a9164a477 [test] add mixtral transformer test hxwang 2024-07-02 09:08:41 +00:00
  • 229db4bc16 [test] add mixtral for sequence classification hxwang 2024-07-02 09:02:21 +00:00
  • d08c99be0d Merge branch 'main' into kto Tong Li 2024-07-19 15:23:31 +08:00
  • f585d4e38e [ColossalChat] Hotfix for ColossalChat (#5910) Tong Li 2024-07-19 13:40:07 +08:00
  • 8cc8f645cd [Examples] Add lazy init to OPT and GPT examples (#5924) Edenzzzz 2024-07-19 10:10:08 +08:00
  • 544b7a38a1 fix style, add kto data sample YeAnbang 2024-07-18 08:38:56 +00:00
  • 62661cde22 Merge pull request #5921 from BurkeHulk/fp8_fix Guangyao Zhang 2024-07-18 16:34:38 +08:00
  • 845ea7214e Merge branch 'main' of https://github.com/hpcaitech/ColossalAI into kto YeAnbang 2024-07-18 07:55:43 +00:00
  • 09d5ffca1a add kto YeAnbang 2024-07-18 07:54:11 +00:00
  • e86127925a [plugin] support all-gather overlap for hybrid parallel (#5919) Hongxin Liu 2024-07-18 15:33:03 +08:00
  • 5b969fd831 fix shardformer fp8 communication training degradation GuangyaoZhang 2024-07-18 07:16:36 +00:00
  • d0bdb51f48 Merge pull request #5899 from BurkeHulk/SP_fp8 Guangyao Zhang 2024-07-18 10:46:59 +08:00
  • 73494de577 [release] update version (#5912) v0.4.1 Hongxin Liu 2024-07-17 17:29:59 +08:00
  • 6a20f07b80 remove all to all GuangyaoZhang 2024-07-17 05:33:38 +00:00
  • 5a310b9ee1 fix rebase GuangyaoZhang 2024-07-17 02:56:07 +00:00
  • 457a0de79f shardformer fp8 GuangyaoZhang 2024-07-08 07:04:48 +00:00
  • 27a72f0de1 [misc] support torch2.3 (#5893) Hongxin Liu 2024-07-11 16:43:18 +08:00
  • 530283dba0 fix object_to_tensor usage when torch>=2.3.0 (#5820) アマデウス 2024-07-04 10:53:58 +08:00
  • 2e28c793ce [compatibility] support torch 2.2 (#5875) Guangyao Zhang 2024-07-04 10:53:09 +08:00
  • 9470701110 Merge pull request #5885 from BurkeHulk/feature/fp8_comm Hanks 2024-07-16 11:37:05 +08:00
  • d8bf7e09a2 Merge pull request #5901 from hpcaitech/colossalchat YeAnbang 2024-07-16 11:07:32 +08:00
  • 1c961b20f3 [ShardFormer] fix qwen2 sp (#5903) Guangyao Zhang 2024-07-15 13:58:06 +08:00
  • 45c49dde96 [Auto Parallel]: Speed up intra-op plan generation by 44% (#5446) Stephan Kö 2024-07-15 12:05:06 +08:00
  • b3594d4d68 fix orpo cross entropy loss YeAnbang 2024-07-15 02:12:05 +00:00
  • 51f916b11d [pre-commit.ci] auto fixes from pre-commit.com hooks pre-commit-ci[bot] 2024-07-12 07:33:44 +00:00
  • 1f1b856354 Merge remote-tracking branch 'origin/feature/fp8_comm' into feature/fp8_comm BurkeHulk 2024-07-12 15:29:41 +08:00
  • 66018749f3 add fp8_communication flag in the script BurkeHulk 2024-07-12 15:26:17 +08:00
  • e88190184a support fp8 communication in pipeline parallelism BurkeHulk 2024-07-12 15:25:25 +08:00
  • 1e1959467e fix scaling algorithm in FP8 casting BurkeHulk 2024-07-12 15:23:37 +08:00
  • c068ef0fa0 [zero] support all-gather overlap (#5898) Hongxin Liu 2024-07-11 18:59:59 +08:00
  • 115c4cc5a4 hotfix citation YeAnbang 2024-07-11 06:05:05 +00:00
  • e7a8634636 fix eval YeAnbang 2024-07-11 03:35:03 +00:00
  • dd9e1cdafe Merge pull request #5850 from hpcaitech/rlhf_SimPO YeAnbang 2024-07-11 09:14:12 +08:00
  • 8a9721bafe [pre-commit.ci] auto fixes from pre-commit.com hooks pre-commit-ci[bot] 2024-07-10 10:44:30 +00:00
  • 33f15203d3 Merge branch 'main' of https://github.com/hpcaitech/ColossalAI into rlhf_SimPO YeAnbang 2024-07-10 10:39:34 +00:00
  • f6ef5c3609 fix style YeAnbang 2024-07-10 10:37:17 +00:00
  • d888c3787c add benchmark for sft, dpo, simpo, orpo. Add benchmarking result. Support lora with gradient checkpoint YeAnbang 2024-07-10 10:17:08 +00:00
  • dbfa7d39fc fix typo GuangyaoZhang 2024-07-10 08:13:26 +00:00
  • 669849d74b [ShardFormer] Add Ulysses Sequence Parallelism support for Command-R, Qwen2 and ChatGLM (#5897) Guangyao Zhang 2024-07-10 11:34:25 +08:00
  • 16f3451fe2 Merge branch 'main' of https://github.com/hpcaitech/ColossalAI into rlhf_SimPO YeAnbang 2024-07-10 02:32:07 +00:00
  • fbf33ecd01 [Feature] Enable PP + SP for llama (#5868) Edenzzzz 2024-07-09 18:05:20 +08:00
  • 66abf1c6e8 [HotFix] CI,import,requirements-test for #5838 (#5892) Runyu Lu 2024-07-08 22:32:06 +08:00
  • cba20525a8 [Feat] Diffusion Model(PixArtAlpha/StableDiffusion3) Support (#5838) Runyu Lu 2024-07-08 16:02:07 +08:00
  • 8ec24b6a4d [Hoxfix] Fix CUDA_DEVICE_MAX_CONNECTIONS for comm overlap Edenzzzz 2024-07-05 20:02:36 +08:00
  • 3420921101 [shardformer] DeepseekMoE support (#5871) Haze188 2024-07-05 16:13:58 +08:00
  • e17f835df7 [pre-commit.ci] auto fixes from pre-commit.com hooks pre-commit-ci[bot] 2024-07-04 12:47:16 +00:00
  • 6991819a97 Merge branch 'hpcaitech:main' into feature/fp8_comm Hanks 2024-07-04 20:34:41 +08:00
  • 7997683aac [pre-commit.ci] pre-commit autoupdate (#5878) pre-commit-ci[bot] 2024-07-04 13:46:41 +08:00
  • 7afbc81d62 [quant] fix bitsandbytes version check (#5882) Hongxin Liu 2024-07-04 11:33:23 +08:00