
Releases: hpcaitech/ColossalAI

Version v0.1.11rc3 Released Today!

13 Nov 07:37
b42b672

What's Changed

Release

Tutorial

Example

Sc

NFC

  • [NFC] polish colossalai/amp/naive_amp/__init__.py code style (#1905) by Junming Wu
  • [NFC] remove redundant dependency (#1869) by binmakeswell
  • [NFC] polish .github/workflows/scripts/build_colossalai_wheel.py code style (#1856) by yuxuan-lou
  • [NFC] polish .github/workflows/scripts/generate_release_draft.py code style (#1855) by Ofey Chan
  • [NFC] polish workflows code style (#1854) by Kai Wang (Victor Kai)
  • [NFC] polish colossalai/amp/apex_amp/__init__.py code style (#1853) by LuGY
  • [NFC] polish .readthedocs.yaml code style (#1852) by nuszzh
  • [NFC] polish .github/workflows/release_nightly.yml code style (#1851) by RichardoLuo
  • [NFC] polish amp.naive_amp.grad_scaler code style by zbian
  • [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/operator_handler.py code style (#1845) by HELSON
  • [NFC] polish ./colossalai/amp/torch_amp/__init__.py code style (#1836) by Genghan Zhang
  • [NFC] polish .github/workflows/build.yml code style (#1837) by xyupeng
  • [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/conv_handler.py code style (#1829) by Sze-qq
  • [NFC] polish colossalai/amp/torch_amp/_grad_scaler.py code style (#1823) by Ziyue Jiang
  • [NFC] polish .github/workflows/release_docker.yml code style by Maruyama_Aya
  • [NFC] polish .github/workflows/submodule.yml code style (#1822) by shenggan
  • [NFC] polish .github/workflows/draft_github_release_post.yml code style (#1820) by Arsmart1
  • [NFC] polish colossalai/amp/naive_amp/_fp16_optimizer.py code style (#1819) by Fazzie-Maqianli
  • [NFC] polish colossalai/amp/naive_amp/_utils.py code style (#1816) by CsRic
  • [NFC] polish .github/workflows/build_gpu_8.yml code style (#1813) by Zangwei Zheng
  • [NFC] polish MANIFEST.in code style (#1814) by Zirui Zhu
  • [NFC] polish strategies_constructor.py code style (#1806) by binmakeswell

Doc

Zero

Autoparallel

Fx

Hotfix

Inference

  • [inference] overlap comm and compute in Linear1D_Row when stream_chunk_num > 1 (#1876) by Jiarui Fang
  • [inference] streaming Linear 1D Row inference (#1874) by Jiarui Fang
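
The streaming Linear1D_Row change is easiest to see as a pattern: kick off each output chunk's all-reduce asynchronously so the next chunk's matmul overlaps the communication. Below is a minimal sketch of that pattern (illustrative only, not the code merged in #1874/#1876; it assumes an initialized process group and a weight pre-split into chunks):

```python
import torch
import torch.distributed as dist

# Sketch of comm/compute overlap for a row-parallel linear: each rank holds a
# partial result, so every output chunk needs an all-reduce. Launching the
# reduce with async_op=True lets chunk i+1's matmul run while chunk i's
# communication is still in flight.
def row_parallel_linear_streamed(x, weight_chunks):
    outputs, handles = [], []
    for w in weight_chunks:                 # w: [out_chunk, in_local]
        y = x @ w.t()                       # local partial result for this chunk
        handles.append(dist.all_reduce(y, async_op=True))
        outputs.append(y)
    for h in handles:
        h.wait()                            # join all pending reductions
    return torch.cat(outputs, dim=-1)
```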

AMP

Diffusion

Utils

Full Changelog: v0.1.11rc2...v0.1.11rc3

Version v0.1.11rc2 Released Today!

08 Nov 14:44
4ac7d3e

What's Changed

Autoparallel

Kernel

Gemini

CheckpointIO

  • [CheckpointIO] a uniform checkpoint I/O module (#1689) by ver217
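
For readers new to the module: the point of a uniform checkpoint I/O layer is that callers save and load through one interface regardless of storage backend. A hedged sketch of the shape such an abstraction takes (class names here are illustrative, not necessarily those merged in #1689):

```python
import torch

class CheckpointIO:
    """One interface for saving/loading, so training code never branches
    on the storage layout (single file, sharded, etc.)."""
    def save_model(self, model, path):
        raise NotImplementedError

    def load_model(self, model, path):
        raise NotImplementedError

class GeneralCheckpointIO(CheckpointIO):
    def save_model(self, model, path):
        torch.save(model.state_dict(), path)   # plain single-file save

    def load_model(self, model, path):
        model.load_state_dict(torch.load(path, map_location="cpu"))
        return model
```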

Doc

Example

NFC

  • [NFC] update gitignore remove DS_Store (#1830) by Jiarui Fang
  • [NFC] polish type hint for shape consistency (#1801) by Jiarui Fang
  • [NFC] polish tests/test_layers/test_3d/test_3d.py code style (#1740) by Ziheng Qin
  • [NFC] polish tests/test_layers/test_3d/checks_3d/common.py code style (#1733) by lucasliunju
  • [NFC] polish colossalai/nn/metric/_utils.py code style (#1727) by Sze-qq
  • [NFC] polish tests/test_layers/test_3d/checks_3d/check_layer_3d.py code style (#1731) by Xue Fuzhao
  • [NFC] polish tests/test_layers/test_sequence/checks_seq/check_layer_seq.py code style (#1723) by xyupeng
  • [NFC] polish accuracy_2d.py code style (#1719) by Ofey Chan
  • [NFC] polish .github/workflows/scripts/build_colossalai_wheel.py code style (#1721) by Arsmart1
  • [NFC] polish _checkpoint_hook.py code style (#1722) by LuGY
  • [NFC] polish test_2p5d/checks_2p5d/check_operation_2p5d.py code style (#1718) by Kai Wang (Victor Kai)
  • [NFC] polish colossalai/zero/sharded_param/__init__.py code style (#1717) by CsRic
  • [NFC] polish colossalai/nn/lr_scheduler/linear.py code style (#1716) by yuxuan-lou
  • [NFC] polish tests/test_layers/test_2d/checks_2d/check_operation_2d.py code style (#1715) by binmakeswell
  • [NFC] polish colossalai/nn/metric/accuracy_2p5d.py code style (#1714) by shenggan

Fx

Hotfix

Pipeline

CI

Compatibility

Feat

Fx/profiler

  • [fx/profiler] debug the fx.profiler / add an example test script for fx.profiler (#1730) by Super Daniel
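
As background for #1730: an fx-based profiler starts from a traced graph and walks its nodes. Here is a self-contained taste of that starting point (plain torch.fx, not the colossalai profiler itself):

```python
import torch
import torch.fx as fx

class TwoLayer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(16, 32)
        self.fc2 = torch.nn.Linear(32, 8)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

gm = fx.symbolic_trace(TwoLayer())
for node in gm.graph.nodes:
    # each node exposes its op kind and target -- the raw material a profiler
    # annotates with memory and FLOP estimates
    print(node.op, node.target)
```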

Workflow

  • [workflow] handled the git directory ownership error (#1741) by Frank Lee

Full Changelog: v0.1.11rc1...v0.1.11rc2

Version v0.1.11rc1 Released Today!

19 Oct 03:49
d373e67

What's Changed

Hotfix

Release

Doc

Zero

  • [zero] add chunk init function for users (#1729) by HELSON
  • [zero] add constant placement policy (#1705) by HELSON
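
The idea behind a constant placement policy is to keep CUDA usage fixed: chunks fill a set GPU budget and everything else stays on CPU. A hedged sketch of that decision rule follows (illustrative only, not the policy merged in #1705):

```python
import torch

# Fill a fixed CUDA byte budget, largest chunks first; spill the rest to CPU
# so GPU memory usage stays constant from iteration to iteration.
def place_chunks(chunks, cuda_budget_bytes):
    placement, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.numel() * c.element_size(),
                        reverse=True):
        nbytes = chunk.numel() * chunk.element_size()
        if used + nbytes <= cuda_budget_bytes:
            placement.append((chunk, "cuda"))
            used += nbytes
        else:
            placement.append((chunk, "cpu"))
    return placement

# e.g. place_chunks([torch.empty(2048), torch.empty(1024)], 4 * 2048)
# keeps the 2048-element chunk (8 KB) on GPU and spills the other to CPU.
```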

Pre-commit

Autoparallel

Fx/meta/rpc

  • [fx/meta/rpc] move _meta_registration.py to fx folder / register fx functions with compatibility checks / remove color debug (#1710) by Super Daniel

Embeddings

Unittest

  • [unittest] added doc for the pytest wrapper (#1704) by Frank Lee
  • [unittest] supported conditional testing based on env var (#1701) by Frank Lee
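
#1701's pattern in miniature: gate expensive tests on an environment variable so they only run where the hardware exists. A sketch with plain pytest (the real wrapper's name and semantics may differ):

```python
import os
import pytest

def run_on_environment_flag(name):
    """Skip the decorated test unless the named env var is set to 1."""
    enabled = os.environ.get(name, "0") == "1"
    return pytest.mark.skipif(not enabled, reason=f"{name} is not set to 1")

@run_on_environment_flag("RUN_8_GPU_TESTS")
def test_eight_gpu_training():
    ...  # only runs on CI machines that export RUN_8_GPU_TESTS=1
```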

Embedding

Fx/profiler

  • [fx/profiler] assigned UUID to each unrecorded tensor/ improved performance on GPT-2 (#1679) by Super Daniel
  • [fx/profiler] provide a table of sum...

Version v0.1.10 Released Today!

08 Sep 10:03
b0f4c0b

What's Changed

Embedding

  • [embedding] cache_embedding small improvement (#1564) by CsRic
  • [embedding] polish parallel embedding tablewise (#1545) by Jiarui Fang
  • [embedding] freq_aware_embedding: add small functions for caller application (#1537) by CsRic
  • [embedding] fix a bug in table wise sharding (#1538) by Jiarui Fang
  • [embedding] tablewise sharding polish (#1535) by Jiarui Fang
  • [embedding] add tablewise sharding for FAW (#1526) by CsRic
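
Tablewise sharding, in contrast to row/column sharding, assigns each embedding table wholly to one rank. A hedged sketch of the assignment step (illustrative of the idea in #1526, not CsRic's implementation):

```python
# Greedily balance whole tables across ranks by row count: biggest tables
# first, each to the currently lightest rank.
def assign_tables(table_num_rows, world_size):
    load = [0] * world_size
    assignment = {}
    for table_id, rows in sorted(enumerate(table_num_rows),
                                 key=lambda kv: kv[1], reverse=True):
        rank = min(range(world_size), key=lambda r: load[r])
        assignment[table_id] = rank
        load[rank] += rows
    return assignment

# e.g. assign_tables([100, 10, 90, 20], world_size=2)
# -> {0: 0, 2: 1, 3: 1, 1: 0}; rank loads end up 110 vs 110.
```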

NFC

Pipeline/tuning

  • [pipeline/tuning] improve dispatch performance both time and space cost (#1544) by Kirigaya Kazuto

Fx

Autoparallel

Utils

  • [utils] refactor parallel layers checkpoint and bcast model on loading checkpoint (#1548) by ver217
  • [utils] optimize partition_tensor_parallel_state_dict (#1546) by ver217
  • [utils] Add use_reentrant=False in utils.activation_checkpoint (#1460) by Boyuan Yao
  • [utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442) by ver217
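
On the use_reentrant=False change (#1460): this flag is stock PyTorch, so the effect is easy to demo. Activations are recomputed during backward via the non-reentrant autograd path, which handles more autograd cases than the reentrant one:

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(128, 128)
x = torch.randn(4, 128, requires_grad=True)

# activations of `layer` are recomputed during backward instead of stored;
# use_reentrant=False selects the non-reentrant autograd implementation
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```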

Hotfix

Pipeline/pipleline_process_group

  • [pipeline/pipleline_process_group] finish PipelineProcessGroup to manage local and global rank in TP, DP and PP (#1508) by Kirigaya Kazuto

Doc

Autoparallel

FAW

Pipeline/rpc

  • [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy (#1497) by Kirigaya Kazuto
  • [pipeline/rpc] implement distributed optimizer | test with assert_close (#1486) by Kirigaya Kazuto
  • [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B (#1483) by Kirigaya Kazuto
  • [pipeline/rpc] implement a demo for PP with cuda rpc framework (#1470) by Kirigaya Kazuto
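
For orientation on the "steady 1F1B" wording in #1483: after a short warmup of forwards, each stage alternates one forward with one backward, which bounds the number of in-flight microbatches. A small schedule generator makes the order concrete (a sketch of the classic schedule, not the RPC dispatcher itself):

```python
# Emit the per-stage 1F1B order: warmup forwards to fill the pipeline,
# a steady one-forward-one-backward alternation, then a drain of backwards.
def one_f_one_b_order(num_microbatches, stage, num_stages):
    warmup = min(num_stages - stage - 1, num_microbatches)
    order, fwd, bwd = [], 0, 0
    for _ in range(warmup):
        order.append(("F", fwd)); fwd += 1
    while fwd < num_microbatches:          # steady state: one F, one B
        order.append(("F", fwd)); fwd += 1
        order.append(("B", bwd)); bwd += 1
    while bwd < num_microbatches:          # drain
        order.append(("B", bwd)); bwd += 1
    return order

# stage 0 of 4 stages, 6 microbatches:
# F0 F1 F2 F3 B0 F4 B1 F5 B2 B3 B4 B5
```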

Tensor

FCE

  • [FCE] update interface for frequency statistics in FreqCacheEmbedding (#1462) by Geng Zhang

Workflow

  • [workflow] added TensorNVMe to compatibility test (#1449) by Frank Lee

Test

Engin/schedule

  • [engin/schedule] use p2p_v2 to ...

Version v0.1.9 Released Today!

11 Aug 13:16
74bee5f

What's Changed

Zero

  • [zero] add chunk_managerV2 for all-gather chunk (#1441) by HELSON
  • [zero] add chunk size searching algorithm for parameters in different groups (#1436) by HELSON
  • [zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk (#1426) by HELSON
  • [zero] add unit test for AgChunk's append, close, access (#1423) by HELSON
  • [zero] add AgChunk (#1417) by HELSON
  • [zero] ZeroDDP supports controlling outputs' dtype (#1399) by ver217
  • [zero] alleviate memory usage in ZeRODDP state_dict (#1398) by HELSON
  • [zero] chunk manager allows filtering ex-large params (#1393) by ver217
  • [zero] zero optim state_dict takes only_rank_0 (#1384) by ver217
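
A note on what "chunk size searching" (#1436) optimizes: packing parameter groups into fixed-size chunks wastes the padding at each chunk's tail, so the search scores candidate sizes by total waste. A toy version of that scoring (illustrative, not HELSON's algorithm; ex-large params are assumed already filtered out, as #1393 does):

```python
def search_chunk_size(param_numels, candidates):
    def waste(chunk_size):
        wasted, used = 0, 0
        for n in param_numels:
            if used + n > chunk_size:        # close this chunk, count padding
                wasted += chunk_size - used
                used = 0
            used += n
        return wasted + (chunk_size - used)  # padding in the final chunk
    return min(candidates, key=waste)

# e.g. search_chunk_size([300, 500, 200, 400], [700, 900, 1100]) -> 900
# (total waste: 700 for 700-sized chunks, 400 for 900, 800 for 1100)
```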

Fx

Recommendation System

Global Tensor

Hotfix

  • [hotfix] zero optim prevents calling inner optim.zero_grad (#1422) by ver217
  • [hotfix] fix CPUAdam kernel nullptr (#1410) by ver217
  • [hotfix] adapt ProcessGroup and Optimizer to ColoTensor (#1388) by HELSON
  • [hotfix] fix a running error in test_colo_checkpoint.py (#1387) by HELSON
  • [hotfix] fix some bugs during gpt2 testing (#1379) by YuliangLiu0306
  • [hotfix] fix zero optim save/load state dict (#1381) by ver217
  • [hotfix] fix zero ddp buffer cast (#1376) by ver217
  • [hotfix] fix no optimizer in save/load (#1363) by HELSON
  • [hotfix] fix megatron_init in test_gpt2.py (#1357) by HELSON
  • [hotfix] ZeroDDP use new process group (#1333) by ver217
  • [hotfix] shared model returns cpu state_dict (#1328) by ver217
  • [hotfix] fix ddp for unit test test_gpt2 (#1326) by HELSON
  • [hotfix] fix unit test test_module_spec (#1321) by HELSON
  • [hotfix] fix PipelineSharedModuleGradientHandler (#1314) by ver217
  • [hotfix] fix ColoTensor GPT2 unitest (#1309) by HELSON
  • [hotfix] add missing file (#1308) by Jiarui Fang
  • [hotfix] remove potential circular import (#1307) by Jiarui Fang
  • [hotfix] skip some unittest due to CI environment. (#1301) by YuliangLiu0306
  • [hotfix] fix shape error in backward when using ColoTensor (#1298) by HELSON
  • [hotfix] Dist Mgr gather torch version (#1284) by Jiarui Fang

Communication

Device

Chunk

DDP

  • [DDP] test ddp state dict uses more strict threshold (#1382) by ver217

Checkpoint

  • [checkpoint] add kwargs for load_state_dict (#1374) by HELSON
  • [checkpoint] use args, kwargs in save_checkpoint, load_checkpoint (#1368) by HELSON
  • [checkpoint] sharded optim save/load grad scaler (#1350) by ver217
  • [checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) by HELSON
  • [checkpoint] add ColoOptimizer checkpointing (#1316) by Jiarui Fang
  • [checkpoint] add test for bert and hotfix save bugs (#1297) by Jiarui Fang
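
The gather_tensor change (#1339) reflects a common checkpoint rule: reassemble sharded tensors on one rank before writing, so the file does not depend on the training layout. A hedged sketch of that step (assumes an initialized process group and equal shard shapes; not the module's actual helper):

```python
import torch
import torch.distributed as dist

def gather_to_rank0(shard):
    """Collect equal-shaped shards from all ranks; only rank 0 keeps the
    concatenated full tensor, ready to be written into a state_dict."""
    buckets = [torch.empty_like(shard) for _ in range(dist.get_world_size())]
    dist.all_gather(buckets, shard)
    if dist.get_rank() == 0:
        return torch.cat(buckets, dim=0)
    return None
```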

Util

NVMe

  • [nvme] CPUAdam and HybridAdam support NVMe offload (#1360) by ver217
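
To illustrate what NVMe offload buys (the TensorNVMe API itself differs): optimizer states are touched once per step, so they can live in a memory-mapped file on disk and be paged in slice by slice. A technique-level sketch, not the library's interface:

```python
import numpy as np
import torch

# Keep the Adam first-moment buffer on NVMe via a memory-mapped file;
# only the slice being updated is resident in RAM at any time.
state = np.memmap("exp_avg.bin", dtype=np.float32, mode="w+",
                  shape=(10_000_000,))

def update_chunk(start, end, grad_chunk, beta=0.9):
    m = torch.from_numpy(state[start:end])          # page this slice in
    m.mul_(beta).add_(grad_chunk, alpha=1 - beta)   # in-place moment update
    state.flush()                                   # persist the slice to disk
```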

ColoTensor

  • [colotensor] use cpu memory to store state_dict (#1367) by HELSON
  • [colotensor] add Tensor.view op and its unit test (#1343) by HELSON

Unit test

  • [unit test] add megatron init test in zero_optim (#1358) by HELSON

Docker

Doc

Refactor

  • [refactor] refactor ColoTensor's unit tests (#1340) by HELSON

Workflow

  • [workflow] update docker build workflow to use proxy (#1334) by Frank Lee
  • [workflow] update 8-gpu test to use torch 1.11 (#1332) by Frank Lee
  • [workflow] roll back to use torch 1.11 for unit testing (#1325) by Frank Lee
  • [workflow] fixed trigger condition for 8-gpu unit test (#1323) by Frank Lee
  • [workflow] updated release bdist workflow (#1318) by Frank Lee
  • [workflow] disable SHM for compatibility CI on rtx3080 (#1315) by Frank Lee
  • [workflow] updated pytorch compatibility test (#1311) by Frank Lee

Test


Version v0.1.8 Released Today!

12 Jul 16:10
7e8114a

What's Changed

Hotfix

Tensor

Fx

Rename

Checkpoint

Polish

  • [polish] polish repr for ColoTensor, DistSpec, ProcessGroup (#1235) by HELSON

Refactor

Context

DDP

ColoTensor

Zero

  • [zero] sharded optim supports loading local state dict (#1170) by ver217
  • [zero] zero optim supports loading local state dict (#1171) by ver217

Workflow

Gemini

Pipeline

Ci

  • [ci] added scripts to auto-generate release post text (#1142) by Frank Lee

Full Changelog: v0.1.7...v0.1.8

Version v0.1.7 Released Today

21 Jun 04:10
6690a61

Highlights

  • Started adopting torch.fx for auto-parallel training
  • Updated the ZeRO mechanism with ColoTensor
  • Fixed various bugs

What's Changed

Hotfix

Zero

  • [zero] avoid zero hook spam by changing log to debug level (#1137) by Frank Lee
  • [zero] added error message to handle on-the-fly import of torch Module class (#1135) by Frank Lee
  • [zero] fixed api consistency (#1098) by Frank Lee
  • [zero] zero optim copy chunk rather than copy tensor (#1070) by ver217

Optim

DDP

  • [ddp] add save/load state dict for ColoDDP (#1127) by ver217
  • [ddp] add set_params_to_ignore for ColoDDP (#1122) by ver217
  • [ddp] supported customized torch ddp configuration (#1123) by Frank Lee

Pipeline

Fx

Gemini

  • [gemini] gemini mgr supports "cpu" placement policy (#1118) by ver217
  • [gemini] zero supports gemini (#1093) by ver217

Test

Release

Tensor

AMP

  • [amp] included dict for type casting of model output (#1102) by Frank Lee
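
What "included dict for type casting" (#1102) addresses: models can return dicts, not just tensors, so the fp32 cast on outputs has to recurse through containers. A sketch of that recursion (illustrative, not the merged code):

```python
import torch

def cast_output_to_fp32(out):
    if isinstance(out, torch.Tensor):
        return out.float()
    if isinstance(out, dict):
        return {k: cast_output_to_fp32(v) for k, v in out.items()}
    if isinstance(out, (list, tuple)):
        return type(out)(cast_output_to_fp32(v) for v in out)
    return out  # non-tensor leaves pass through unchanged
```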

Workflow

Engine

Doc

  • [doc] added documentation to chunk and chunk manager (#1094) by Frank Lee

Context

Refactor

cuDNN

  • [cudnn] set False to cudnn benchmark by default (#1063) by Frank Lee
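
#1063 in one line, with the reasoning as a comment:

```python
import torch

# cuDNN's autotuner benchmarks kernels per input shape; that only pays off
# when shapes are static, and it makes runs non-deterministic, so keep it
# off by default and let users opt in.
torch.backends.cudnn.benchmark = False
```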

Full Changelog: v0.1.6...v0.1.7

v0.1.6 Released!

02 Jun 06:31
b167258

Main Features

  1. ColoTensor supports hybrid parallelism (tensor parallelism and data parallelism)
  2. ColoTensor supports ZeRO (with chunks)
  3. Tensor parallelism can be configured per module via ColoTensor
  4. ZeroInitContext and ShardedModelV2 support loading checkpoints and Hugging Face from_pretrained()

What's Changed

ColoTensor

Zero

  • [zero] add load_state_dict for sharded model by @ver217 in #894
  • [zero] add zero optimizer for ColoTensor by @ver217 in #1046

Hotfix

Unit test

CI

CLI

Documentation

Misc

New Contributors

Full Changelog: v0.1.5...v0.1.6

v0.1.5 Released!

17 May 01:48
5898ccf

Main Features

  1. Enhanced ColoTensor and built a demo to train BERT (from Hugging Face) with tensor parallelism, without modifying the model.

What's Changed

ColoTensor

Pipeline Parallelism

CI

Misc

  • [Bot] Synchronize Submodule References by @github-actions in #907
  • [Bot] Synchronize Submodule References by @github-actions in #912
  • [setup] update cuda ext cc flags by @ver217 in #919
  • [setup] support more cuda architectures by @ver217 in #920
  • [NFC] update results on a single GPU, highlight quick view by @binmakeswell in #981

Full Changelog: v0.1.4...v0.1.5

v0.1.4 Released!

28 Apr 07:56
e1108ca

Main Features

Here are the main improvements of this release:

  1. ColoTensor: A data structure that unifies the Tensor representation of different parallel methods.
  2. Gemini: a more efficient Gemini implementation that reduces the overhead of model data statistics collection.
  3. CLI: a command-line tool that helps users launch distributed training tasks more easily.
  4. Pipeline Parallelism (PP): a more user-friendly API for PP.

What's Changed

ColoTensor

Gemini + ZeRO

  • [zero] add zero tensor shard strategy by @1SAA in #793
  • Revert "[zero] add zero tensor shard strategy" by @feifeibear in #806
  • [gemini] a new tensor structure by @feifeibear in #818
  • [gemini] APIs to set cpu memory capacity by @feifeibear in #809
  • [DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext by @ver217 in #808
  • [gemini] collect cpu-gpu moving volume in each iteration by @feifeibear in #813
  • [gemini] add GeminiMemoryManger by @1SAA in #832
  • [zero] use GeminiMemoryManager when sampling model data by @ver217 in #850
  • [gemini] polish code by @1SAA in #855
  • [gemini] add stateful tensor container by @1SAA in #867
  • [gemini] polish stateful_tensor_mgr by @1SAA in #876
  • [gemini] accelerate adjust_layout() by @ver217 in #878
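
A sketch of what a stateful tensor manager's adjust_layout() amounts to (illustrative of the ideas in #832/#878, not GeminiMemoryManager's code): evict CUDA-resident tensors to CPU until the upcoming step fits in free device memory.

```python
import torch

def adjust_layout(stateful_tensors, bytes_needed):
    free = torch.cuda.mem_get_info()[0]        # (free, total) on current device
    for t in stateful_tensors:                 # e.g. least-recently-used first
        if free >= bytes_needed:
            break
        if t.is_cuda:
            free += t.numel() * t.element_size()
            t.data = t.data.cpu()              # evict to host memory
```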

CLI

Pipeline Parallelism

Misc

  • [hotfix] fix auto tensor placement policy by @ver217 in #775
  • [hotfix] change the check assert in split batch 2d by @Wesley-Jzy in #772
  • [hotfix] fix bugs in zero by @1SAA in #781
  • [hotfix] fix grad offload when enabling reuse_fp16_shard by @ver217 in #784
  • [refactor] moving memtracer to gemini by @feifeibear in #801
  • [log] display tflops if available by @feifeibear in #802
  • [refactor] moving grad acc logic to engine by @feifeibear in #804
  • [log] local throughput metrics by @feifeibear in #811
  • [Bot] Synchronize Submodule References by @github-actions in #810
  • [Bot] Synchronize Submodule References by @github-actions in #819
  • [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. by @feifeibear in #824
  • [setup] allow installation with python 3.6 by @FrankLeeeee in #834
  • Revert "[WIP] Applying ColoTensor on TP-1D-row Linear." by @feifeibear in #835
  • [dependency] removed torchvision by @FrankLeeeee in #833
  • [Bot] Synchronize Submodule References by @github-actions in #827
  • [unittest] refactored unit tests for change in dependency by @FrankLeeeee in #838
  • [setup] use env var instead of option for cuda ext by @FrankLeeeee in #839
  • [hotfix] ColoTensor pin_memory by @feifeibear in #840
  • modified the pp build for ckpt adaptation by @Gy-Lu in #803
  • [hotfix] the bug of numel() in ColoTensor by @feifeibear in #845
  • [hotfix] fix _post_init_method of zero init ctx by @ver217 in #847
  • [hotfix] add deconstructor for stateful tensor by @ver217 in #848
  • [utils] refactor profiler by @ver217 in #837
  • [ci] cache cuda extension by @FrankLeeeee in #860
  • hotfix tensor unittest bugs by @feifeibear in #862
  • [usability] added assertion message in registry by @FrankLeeeee in #864
  • [doc] improved docstring in the communication module by @FrankLeeeee in #863
  • [doc] improved docstring in the logging module by @FrankLeeeee in #861
  • [doc] improved docstring in the amp module by @FrankLeeeee in #857
  • [usability] improved error messages in the context modu...