Commit Graph

  • 7fdef9fd6b [pre-commit.ci] pre-commit autoupdate (#6113) pre-commit-ci[bot] 2025-01-02 10:23:20 +08:00
  • a9bedc7a43 [Shardformer] Support zbv in Shardformer Policy (#6150) duanjunwen 2025-01-02 10:22:26 +08:00
  • af06d162cf [checkpointio] support non blocking pin load (#6172) Hongxin Liu 2024-12-25 17:03:25 +08:00
  • 836992438f [news] release colossalai for sora (#6166) binmakeswell 2024-12-23 21:59:39 +08:00
  • 8b0ed61490 [hotfix] improve compatibility (#6165) Hongxin Liu 2024-12-23 18:57:08 +08:00
  • 5f82bfa636 [doc] add bonus event (#6164) binmakeswell 2024-12-23 17:41:59 +08:00
  • fa9d0318e4 [Hotfix] hotfix normalization (#6163) duanjunwen 2024-12-23 16:29:48 +08:00
  • 130229fdcb [checkpointio]support asyncio for 3d (#6152) flybird11111 2024-12-23 10:24:22 +08:00
  • aaafb38851 [Device]Support npu (#6159) flybird11111 2024-12-17 15:42:39 +08:00
  • e994c64568 [checkpointio] fix async io (#6155) flybird11111 2024-12-16 10:36:28 +08:00
  • de3d371f65 [hotfix] fix zero comm buffer init (#6154) Hongxin Liu 2024-12-10 16:46:15 +08:00
  • 8d826a336e [fix] fix bug caused by perf version (#6156) duanjunwen 2024-12-10 15:03:16 +08:00
  • 6280cb18b8 [checkpointio] support debug log (#6153) Hongxin Liu 2024-12-02 11:29:19 +08:00
  • d6af7be06e fix ckpt_api wangbluo 2024-11-25 17:12:29 +08:00
  • ab856fd308 [checkpointio] fix zero optimizer async save memory (#6151) Hongxin Liu 2024-11-25 14:46:31 +08:00
  • 82c88c1e0d fix wangbluo 2024-11-25 11:58:39 +08:00
  • b83143ee72 fix wangbluo 2024-11-25 10:51:49 +08:00
  • fa0318dba5 Merge branch 'main' into ckpt_api wangbluo 2024-11-25 10:29:34 +08:00
  • fabc12a523 Merge branch 'main' of https://github.com/hpcaitech/ColossalAI into main wangbluo 2024-11-25 10:28:10 +08:00
  • 8ecff0cb7f Merge pull request #6149 from ver217/hotfix/ckpt Wang Binluo 2024-11-21 16:05:19 +08:00
  • 8fddbab04c [checkpointio] disable buffering ver217 2024-11-21 14:33:26 +08:00
  • 152162a80e [doc] update cloud link (#6148) Sze-qq 2024-11-20 22:00:10 +08:00
  • cf519dac6a [optim] hotfix adam load (#6146) Hongxin Liu 2024-11-20 16:36:37 +08:00
  • 5caad13055 [doc] add hpc cloud intro (#6147) Sze-qq 2024-11-20 15:47:30 +08:00
  • 64f74a157e [NPU]support npu (#6089) support-npu flybird11111 2024-11-20 15:28:35 +08:00
  • cf2e9ed345 fix wangbluo 2024-11-20 14:12:01 +08:00
  • e0c68ab6d3 [Zerobubble] merge main. (#6142) duanjunwen 2024-11-19 19:00:36 +08:00
  • 2aa6e44355 fix wangbluo 2024-11-19 17:32:20 +08:00
  • 65c487499e fix wangbluo 2024-11-19 17:29:22 +08:00
  • 0009a4dbbe fix wangbluo 2024-11-19 16:46:46 +08:00
  • 603d06ad56 fix wangbluo 2024-11-19 11:55:02 +08:00
  • 945a67dd61 fix wangbluo 2024-11-19 11:52:51 +08:00
  • f4f3d52924 fix wangbluo 2024-11-18 17:51:15 +08:00
  • 974449ace0 fix wangbluo 2024-11-18 07:06:04 +00:00
  • 184a653704 [checkpointio] fix pinned state dict ver217 2024-11-19 11:40:42 +08:00
  • 5fa657f0a1 [checkpointio] fix size compute ver217 2024-11-18 19:12:24 +08:00
  • eb69e640e5 [async io] support async io (#6137) flybird11111 2024-11-18 17:52:24 +08:00
  • b90835bd32 [checkpointio] fix performance issue (#6139) Hongxin Liu 2024-11-18 16:41:29 +08:00
  • 8e08c27e19 [ckpt] Add async ckpt api (#6136) Wang Binluo 2024-11-15 18:19:16 +08:00
  • d4a436051d [checkpointio] support async model save (#6131) Hongxin Liu 2024-11-14 11:38:10 +08:00
  • 810cafb2f9 Merge pull request #6114 from duanjunwen/dev/zero_bubble feature/zerobubble duanjunwen 2024-11-18 17:38:49 +08:00
  • 41fdd2139b [fix] rm unused comments duanjunwen 2024-11-18 16:48:21 +08:00
  • dafda0fb70 [fix] remove debug info; duanjunwen 2024-11-18 03:32:04 +00:00
  • 9a21f87ed6 [fix] fix wait handle in run_fwd_bwd duanjunwen 2024-11-18 02:50:14 +00:00
  • f48a85e91d [fix] fix test_lora in llama policy duanjunwen 2024-11-15 10:27:13 +00:00
  • 2980da559f [fix] fix test_lora duanjunwen 2024-11-15 10:26:30 +00:00
  • 0fb500c7d4 [fix] rm debug info; update llama policy; update wait handle duanjunwen 2024-11-15 09:47:05 +00:00
  • cf86c1b1c5 [fix] fix zbv wait_handle duanjunwen 2024-11-15 07:56:14 +00:00
  • 5c2ebbfd48 [fix] fix mixtral modeling & policy; update wait handles; doing benchmarking for llama hybrid; duanjunwen 2024-11-15 05:58:56 +00:00
  • 5a03d2696d [cli] support run as module option (#6135) Hongxin Liu 2024-11-14 18:10:37 +08:00
  • cc40fe0e6f [fix] multi-node backward slowdown (#6134) Hanks 2024-11-14 17:45:49 +08:00
  • 014afbdb59 [fix] fix attn duanjunwen 2024-11-14 09:43:47 +00:00
  • 1bc4dba3a3 [fix] fix p2p error in zbv duanjunwen 2024-11-14 09:40:38 +00:00
  • c2fe3137e2 [hotfix] fix flash attn window_size err (#6132) duanjunwen 2024-11-14 17:11:35 +08:00
  • b6d5e61809 [feat] update mixtral policy & bert policy for zerobubble duanjunwen 2024-11-14 02:51:34 +00:00
  • 80b04d7855 [feat] support mixtral policy with zbv tp_Linear & non_tp_Linear duanjunwen 2024-11-12 07:28:49 +00:00
  • a2596519fd [zero] support extra dp (#6123) feature/async-io Hongxin Liu 2024-11-12 11:20:46 +08:00
  • 337debcf2a [feat] fix testcase; duanjunwen 2024-11-11 11:34:29 +00:00
  • 12919de424 [fix] fix send_tensor_metadata & send_grad_metadata; duanjunwen 2024-11-11 08:54:39 +00:00
  • 30a9443132 [Coati] Refine prompt for better inference (#6117) Tong Li 2024-11-08 11:00:37 +08:00
  • 7a60161035 update readme (#6116) Tong Li 2024-11-06 17:24:08 +08:00
  • 0d6d40ccc6 [fix] fix zbv llama pp4 duanjunwen 2024-11-06 03:35:12 +00:00
  • a15ab139ad [plugin] support get_grad_norm (#6115) Hongxin Liu 2024-11-05 18:12:47 +08:00
  • 4fc92aa77d [feat] support no_tp Linear for sharderformer.llama duanjunwen 2024-11-05 05:55:42 +00:00
  • 37b23e32b1 Merge pull request #6107 from duanjunwen/dev/zero_bubble duanjunwen 2024-11-05 11:31:48 +08:00
  • 13ffa08cfa [release] update version (#6109) v0.4.6 Hongxin Liu 2024-11-04 17:26:28 +08:00
  • 8e40087633 [fix] fix model zoo init duanjunwen 2024-11-01 09:02:07 +00:00
  • 0218e673db [fix] fix use_fp8 flag duanjunwen 2024-11-01 07:05:24 +00:00
  • 5b5fbcff09 [fix] fix hybridparall use_fp8 config duanjunwen 2024-11-01 05:27:11 +00:00
  • 3b5c314bea [fix] fix fp8 args in HybridParallel duanjunwen 2024-11-01 03:54:08 +00:00
  • c82c75a9b4 Merge branch 'feature/zerobubble' of github.com:hpcaitech/ColossalAI into dev/zero_bubble duanjunwen 2024-11-01 03:32:18 +00:00
  • 1d328ff651 Merge branch 'main' into dev/zero_bubble duanjunwen 2024-11-01 03:10:53 +00:00
  • 2f583c1549 [pre-commit.ci] pre-commit autoupdate (#6078) pre-commit-ci[bot] 2024-10-31 18:18:01 +08:00
  • aed20fb2df [feat] support zbv in mixtral benchmark; (#6083) duanjunwen 2024-10-31 18:17:29 +08:00
  • c2e8f61592 [checkpointio] fix hybrid plugin model save (#6106) Hongxin Liu 2024-10-31 17:04:53 +08:00
  • 5f0924361d [fix] fix linear (no tp) ops func name; duanjunwen 2024-10-31 08:18:28 +00:00
  • d2e05a99b3 [feat] support no tensor parallel Linear in shardformer; Add test for use weightGradStore and not use WeightGradStore duanjunwen 2024-10-30 02:54:32 +00:00
  • 982e4ee1f8 [fix] fix comment in llama & benchmark duanjunwen 2024-10-29 07:35:50 +00:00
  • fa3ccda8ee [fix] fix send recv signature; duanjunwen 2024-10-29 03:33:58 +00:00
  • fafe049b83 [fix] fix handle name; rm useless comments; duanjunwen 2024-10-29 03:24:15 +00:00
  • 5aee4261a6 [fix] fix test zerobubble duanjunwen 2024-10-28 06:06:07 +00:00
  • 6377aa0fff [fix] fix test_shard_llama ci; duanjunwen 2024-10-28 02:42:33 +00:00
  • 03fa79a55c [fix] fix llama modeling policy; duanjunwen 2024-10-25 10:17:06 +00:00
  • cc0dfddcbc [fix] fix test_shard_llama duanjunwen 2024-10-25 09:01:13 +00:00
  • d0ec221b38 [fix] fix fail case test_shard_llama duanjunwen 2024-10-25 02:28:55 +00:00
  • 89a9a600bc [MCTS] Add self-refined MCTS (#6098) Tong Li 2024-10-24 17:51:19 +08:00
  • 2eca112c90 [feat] support meta cache, meta_grad_send, meta_tensor_send; fix runtime too long in Recv Bwd; benchmark for llama + Hybrid(tp+pp); duanjunwen 2024-10-24 07:30:19 +00:00
  • 4294ae83bb [doc] sora solution news (#6100) binmakeswell 2024-10-24 13:24:37 +08:00
  • 80a8ca916a [extension] hotfix compile check (#6099) Hongxin Liu 2024-10-24 11:11:44 +08:00
  • dee63cc5ef Merge pull request #6096 from BurkeHulk/hotfix/lora_ckpt Hanks 2024-10-21 14:13:04 +08:00
  • 6d6cafabe2 pre-commit fix BurkeHulk 2024-10-21 14:04:32 +08:00
  • b10339df7c fix lora ckpt save format (ColoTensor to Tensor) BurkeHulk 2024-10-21 13:55:43 +08:00
  • 19baab5fd5 [release] update version (#6094) v0.4.5 Hongxin Liu 2024-10-21 10:19:08 +08:00
  • 58d8b8a2dd [misc] fit torch api upgrade and remove legacy import (#6093) Hongxin Liu 2024-10-18 16:48:52 +08:00
  • 5ddad486ca [fp8] add fallback and make compile option configurable (#6092) Hongxin Liu 2024-10-18 13:55:31 +08:00
  • 3b1d7d1ae8 [chore] refactor botbw 2024-10-14 09:41:25 +00:00
  • 2bcd0b6844 [ckpt] add safetensors util botbw 2024-10-14 07:32:16 +00:00
  • cd61353bae [pipeline] hotfix backward for multiple outputs (#6090) Hongxin Liu 2024-10-16 17:27:33 +08:00
  • 705b18e1e7 [fix] add & fix llama test duanjunwen 2024-10-16 03:58:50 +00:00
  • e76308c6e6 [fix] rm use_zbv flag in Shardconfig; rm debug info; duanjunwen 2024-10-16 03:25:04 +00:00
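A recurring theme in the commits above is asynchronous checkpoint IO (e.g. `[checkpointio] support async model save`, `[checkpointio] support non blocking pin load`, `[ckpt] Add async ckpt api`). The general pattern — snapshot the state, then write it to disk on a background thread so training can continue — can be sketched in plain Python. This is a minimal illustration with hypothetical names (`AsyncCheckpointWriter`, `save`, `wait`), not ColossalAI's actual API; a JSON write stands in for serializing tensors copied into pinned host buffers.

```python
import json
import tempfile
import threading
from pathlib import Path


class AsyncCheckpointWriter:
    """Write a state snapshot on a background thread so the training
    loop can continue while the checkpoint hits disk."""

    def __init__(self):
        self._thread = None

    def save(self, state: dict, path: Path) -> None:
        # Snapshot eagerly (analogous to copying GPU tensors into pinned
        # host buffers), then perform the slow write asynchronously.
        snapshot = dict(state)
        self.wait()  # allow only one in-flight write at a time
        self._thread = threading.Thread(
            target=lambda: Path(path).write_text(json.dumps(snapshot))
        )
        self._thread.start()

    def wait(self) -> None:
        # Block until the previous write finishes, e.g. before loading
        # the checkpoint or reusing the staging buffer.
        if self._thread is not None:
            self._thread.join()
            self._thread = None


ckpt_path = Path(tempfile.gettempdir()) / "ckpt_demo.json"
writer = AsyncCheckpointWriter()
writer.save({"step": 42}, ckpt_path)
writer.wait()
print(json.loads(ckpt_path.read_text())["step"])  # prints 42
```

The `wait()` call before each new save mirrors why several fixes above deal with pinned-buffer reuse and memory growth: without it, overlapping writes could race on the same staging storage.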