Releases: hpcaitech/ColossalAI
Version v0.1.11rc3 Release Today!
What's Changed
Release
Tutorial
- [tutorial] polish README and OPT files (#1930) by binmakeswell
- [tutorial] add synthetic dataset for opt (#1924) by ver217
- [tutorial] updated hybrid parallel readme (#1928) by Frank Lee
- [tutorial] added synthetic data for sequence parallel (#1927) by Frank Lee
- [tutorial] removed huggingface model warning (#1925) by Frank Lee
- Hotfix/tutorial readme index (#1922) by Frank Lee
- [tutorial] modify hands-on of auto activation checkpoint (#1920) by Boyuan Yao
- [tutorial] added synthetic data for hybrid parallel (#1921) by Frank Lee
- [tutorial] added synthetic data for hybrid parallel (#1919) by Frank Lee
- [tutorial] added synthetic dataset for auto parallel demo (#1918) by Frank Lee
- [tutorial] updated auto parallel demo with latest data path (#1917) by Frank Lee
- [tutorial] added data script and updated readme (#1916) by Frank Lee
- [tutorial] add cifar10 for diffusion (#1907) by binmakeswell
- [tutorial] removed duplicated tutorials (#1904) by Frank Lee
- [tutorial] edited hands-on practices (#1899) by BoxiangW
Example
- [example] update auto_parallel img path (#1910) by binmakeswell
- [example] add cifar10 dataset for diffusion (#1902) by Fazzie-Maqianli
- [example] migrate diffusion and auto_parallel hands-on (#1871) by binmakeswell
- [example] initialize tutorial (#1865) by binmakeswell
- Merge pull request #1842 from feifeibear/jiarui/polish by Fazzie-Maqianli
- [example] polish diffusion readme by jiaruifang
Sc
- [SC] add GPT example for auto checkpoint (#1889) by Boyuan Yao
- [sc] add examples for auto checkpoint. (#1880) by Super Daniel
Nfc
- [NFC] polish colossalai/amp/naive_amp/__init__.py code style (#1905) by Junming Wu
- [NFC] remove redundant dependency (#1869) by binmakeswell
- [NFC] polish .github/workflows/scripts/build_colossalai_wheel.py code style (#1856) by yuxuan-lou
- [NFC] polish .github/workflows/scripts/generate_release_draft.py code style (#1855) by Ofey Chan
- [NFC] polish workflows code style (#1854) by Kai Wang (Victor Kai)
- [NFC] polish colossalai/amp/apex_amp/__init__.py code style (#1853) by LuGY
- [NFC] polish .readthedocs.yaml code style (#1852) by nuszzh
- [NFC] polish <.github/workflows/release_nightly.yml> code style (#1851) by RichardoLuo
- [NFC] polish amp.naive_amp.grad_scaler code style by zbian
- [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/operator_handler.py code style (#1845) by HELSON
- [NFC] polish ./colossalai/amp/torch_amp/__init__.py code style (#1836) by Genghan Zhang
- [NFC] polish .github/workflows/build.yml code style (#1837) by xyupeng
- [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/conv_handler.py code style (#1829) by Sze-qq
- [NFC] polish colossalai/amp/torch_amp/_grad_scaler.py code style (#1823) by Ziyue Jiang
- [NFC] polish .github/workflows/release_docker.yml code style by Maruyama_Aya
- [NFC] polish .github/workflows/submodule.yml code style (#1822) by shenggan
- [NFC] polish .github/workflows/draft_github_release_post.yml code style (#1820) by Arsmart1
- [NFC] polish colossalai/amp/naive_amp/_fp16_optimizer.py code style (#1819) by Fazzie-Maqianli
- [NFC] polish colossalai/amp/naive_amp/_utils.py code style (#1816) by CsRic
- [NFC] polish .github/workflows/build_gpu_8.yml code style (#1813) by Zangwei Zheng
- [NFC] polish MANIFEST.in code style (#1814) by Zirui Zhu
- [NFC] polish strategies_constructor.py code style (#1806) by binmakeswell
Doc
- [doc] add news (#1901) by binmakeswell
Zero
Autoparallel
- [autoparallel] user-friendly API for CheckpointSolver. (#1879) by Super Daniel
- [autoparallel] fix linear logical convert issue (#1857) by YuliangLiu0306
Fx
- [fx] metainfo_trace as an API. (#1873) by Super Daniel
Hotfix
- [hotfix] pass test_complete_workflow (#1877) by Jiarui Fang
Inference
- [inference] overlap comm and compute in Linear1D_Row when stream_chunk_num > 1 (#1876) by Jiarui Fang
- [inference] streaming Linear 1D Row inference (#1874) by Jiarui Fang
Amp
Diffusion
Utils
- [utils] fixed lazy init context (#1867) by Frank Lee
- [utils] remove lazy_memory_allocate from ColoInitContext (#1844) by Jiarui Fang
Full Changelog: v0.1.11rc3...v0.1.11rc2
Version v0.1.11rc2 Release Today!
What's Changed
Autoparallel
- [autoparallel] fix bugs caused by negative dim key (#1808) by YuliangLiu0306
- [autoparallel] fix bias addition module (#1800) by YuliangLiu0306
- [autoparallel] add batch norm metainfo (#1815) by Boyuan Yao
- [autoparallel] add conv metainfo class for auto parallel (#1796) by Boyuan Yao
- [autoparallel]add essential CommActions for broadcast operands (#1793) by YuliangLiu0306
- [autoparallel] refactor and add rotorc. (#1789) by Super Daniel
- [autoparallel] add getattr handler (#1767) by YuliangLiu0306
- [autoparallel] added matmul handler (#1763) by Frank Lee
- [autoparallel] fix conv handler numerical test (#1771) by YuliangLiu0306
- [autoparallel] move ckpt solvers to autoparallel folder / refactor code (#1764) by Super Daniel
- [autoparallel] add numerical test for handlers (#1769) by YuliangLiu0306
- [autoparallel] update CommSpec to CommActions (#1768) by YuliangLiu0306
- [autoparallel] add numerical test for node strategies (#1760) by YuliangLiu0306
- [autoparallel] refactor the runtime apply pass and add docstring to passes (#1757) by YuliangLiu0306
- [autoparallel] added binary elementwise node handler (#1758) by Frank Lee
- [autoparallel] fix param hook issue in transform pass (#1755) by YuliangLiu0306
- [autoparallel] added addbmm handler (#1751) by Frank Lee
- [autoparallel] shard param and buffer as expected (#1753) by YuliangLiu0306
- [autoparallel] add sequential order to communication actions (#1735) by YuliangLiu0306
- [autoparallel] recovered skipped test cases (#1748) by Frank Lee
- [autoparallel] fixed wrong sharding strategy in conv handler (#1747) by Frank Lee
- [autoparallel] fixed wrong generated strategy for dot op (#1746) by Frank Lee
- [autoparallel] handled illegal sharding strategy in shape consistency (#1744) by Frank Lee
- [autoparallel] handled illegal strategy in node handler (#1743) by Frank Lee
- [autoparallel] handled illegal sharding strategy (#1728) by Frank Lee
Kernel
- [kernel] added jit warmup (#1792) by アマデウス
- [kernel] more flexible flashatt interface (#1804) by oahzxl
- [kernel] skip tests of flash_attn and triton when they are not available (#1798) by Jiarui Fang
Gemini
- [Gemini] make gemini usage simple (#1821) by Jiarui Fang
Checkpointio
Doc
- [doc] polish diffusion README (#1840) by binmakeswell
- [doc] remove obsolete API demo (#1833) by binmakeswell
- [doc] add diffusion (#1827) by binmakeswell
- [doc] add FastFold (#1766) by binmakeswell
Example
- [example] remove useless readme in diffusion (#1831) by Jiarui Fang
- [example] add TP to GPT example (#1828) by Jiarui Fang
- [example] add stable diffuser (#1825) by Fazzie-Maqianli
- [example] simplify the GPT2 huggingface example (#1826) by Jiarui Fang
- [example] opt does not depend on Titans (#1811) by Jiarui Fang
- [example] add GPT by Jiarui Fang
- [example] add opt model in language (#1809) by Jiarui Fang
- [example] add diffusion to example (#1805) by Jiarui Fang
Nfc
- [NFC] update gitignore remove DS_Store (#1830) by Jiarui Fang
- [NFC] polish type hint for shape consistency (#1801) by Jiarui Fang
- [NFC] polish tests/test_layers/test_3d/test_3d.py code style (#1740) by Ziheng Qin
- [NFC] polish tests/test_layers/test_3d/checks_3d/common.py code style (#1733) by lucasliunju
- [NFC] polish colossalai/nn/metric/_utils.py code style (#1727) by Sze-qq
- [NFC] polish tests/test_layers/test_3d/checks_3d/check_layer_3d.py code style (#1731) by Xue Fuzhao
- [NFC] polish tests/test_layers/test_sequence/checks_seq/check_layer_seq.py code style (#1723) by xyupeng
- [NFC] polish accuracy_2d.py code style (#1719) by Ofey Chan
- [NFC] polish .github/workflows/scripts/build_colossalai_wheel.py code style (#1721) by Arsmart1
- [NFC] polish _checkpoint_hook.py code style (#1722) by LuGY
- [NFC] polish test_2p5d/checks_2p5d/check_operation_2p5d.py code style (#1718) by Kai Wang (Victor Kai)
- [NFC] polish colossalai/zero/sharded_param/__init__.py code style (#1717) by CsRic
- [NFC] polish colossalai/nn/lr_scheduler/linear.py code style (#1716) by yuxuan-lou
- [NFC] polish tests/test_layers/test_2d/checks_2d/check_operation_2d.py code style (#1715) by binmakeswell
- [NFC] polish colossalai/nn/metric/accuracy_2p5d.py code style (#1714) by shenggan
Fx
- [fx] add a symbolic_trace api. (#1812) by Super Daniel
- [fx] skip diffusers unitest if it is not installed (#1799) by Jiarui Fang
- [fx] Add linear metainfo class for auto parallel (#1783) by Boyuan Yao
- [fx] support module with bias addition (#1780) by YuliangLiu0306
- [fx] refactor memory utils and extend shard utils. (#1754) by Super Daniel
- [fx] test tracer on diffuser modules. (#1750) by Super Daniel
Hotfix
- [hotfix] fix build error when torch version >= 1.13 (#1803) by xcnick
- [hotfix] polish flash attention (#1802) by oahzxl
- [hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786) by HELSON
- [hotfix] polish chunk import (#1787) by Jiarui Fang
- [hotfix] autoparallel unit test (#1752) by YuliangLiu0306
Pipeline
- [Pipeline]Adapt to Pipelinable OPT (#1782) by Ziyue Jiang
Ci
- [CI] downgrade fbgemm. (#1778) by Super Daniel
Compatibility
- [compatibility] ChunkMgr import error (#1772) by Jiarui Fang
Feat
Fx/profiler
- [fx/profiler] debug the fx.profiler / add an example test script for fx.profiler (#1730) by Super Daniel
Workflow
Full Changelog: v0.1.11rc2...v0.1.11rc1
Version v0.1.11rc1 Release Today!
What's Changed
Hotfix
- [hotfix] resharding cost issue (#1742) by YuliangLiu0306
- [hotfix] solver bug caused by dict type comm cost (#1686) by YuliangLiu0306
- [hotfix] fix wrong type name in profiler (#1678) by Boyuan Yao
- [hotfix]unit test (#1670) by YuliangLiu0306
- [hotfix] add recompile after graph manipulation (#1621) by YuliangLiu0306
- [hotfix] got sliced types (#1614) by YuliangLiu0306
Release
Doc
- [doc] update recommendation system catalogue (#1732) by binmakeswell
- [doc] update recommendation system urls (#1725) by Jiarui Fang
Zero
- [zero] add chunk init function for users (#1729) by HELSON
- [zero] add constant placement policy (#1705) by HELSON
Pre-commit
Autoparallel
- [autoparallel] runtime_backward_apply (#1720) by YuliangLiu0306
- [autoparallel] moved tests to test_tensor_shard (#1713) by Frank Lee
- [autoparallel] resnet block runtime apply (#1709) by YuliangLiu0306
- [autoparallel] fixed broken node handler tests (#1708) by Frank Lee
- [autoparallel] refactored the autoparallel module for organization (#1706) by Frank Lee
- [autoparallel] adapt runtime passes (#1703) by YuliangLiu0306
- [autoparallel] collated all deprecated files (#1700) by Frank Lee
- [autoparallel] init new folder structure (#1696) by Frank Lee
- [autoparallel] adapt solver and CostGraph with new handler (#1695) by YuliangLiu0306
- [autoparallel] add output handler and placeholder handler (#1694) by YuliangLiu0306
- [autoparallel] add pooling handler (#1690) by YuliangLiu0306
- [autoparallel] where_handler_v2 (#1688) by YuliangLiu0306
- [autoparallel] fix C version rotor inconsistency (#1691) by Boyuan Yao
- [autoparallel] added sharding spec conversion for linear handler (#1687) by Frank Lee
- [autoparallel] add reshape handler v2 and fix some previous bug (#1683) by YuliangLiu0306
- [autoparallel] add unary element wise handler v2 (#1674) by YuliangLiu0306
- [autoparallel] add following node generator (#1673) by YuliangLiu0306
- [autoparallel] add layer norm handler v2 (#1671) by YuliangLiu0306
- [autoparallel] fix insecure subprocess (#1680) by Boyuan Yao
- [autoparallel] add rotor C version (#1658) by Boyuan Yao
- [autoparallel] added utils for broadcast operation (#1665) by Frank Lee
- [autoparallel] update CommSpec (#1667) by YuliangLiu0306
- [autoparallel] added bias comm spec to matmul strategy (#1664) by Frank Lee
- [autoparallel] add batch norm handler v2 (#1666) by YuliangLiu0306
- [autoparallel] remove no strategy nodes (#1652) by YuliangLiu0306
- [autoparallel] added compute resharding costs for node handler (#1662) by Frank Lee
- [autoparallel] added new strategy constructor template (#1661) by Frank Lee
- [autoparallel] added node handler for bmm (#1655) by Frank Lee
- [autoparallel] add conv handler v2 (#1663) by YuliangLiu0306
- [autoparallel] adapt solver with gpt (#1653) by YuliangLiu0306
- [autoparallel] implemented all matmul strategy generator (#1650) by Frank Lee
- [autoparallel] change the following nodes strategies generation logic (#1636) by YuliangLiu0306
- [autoparallel] where handler (#1651) by YuliangLiu0306
- [autoparallel] implemented linear projection strategy generator (#1639) by Frank Lee
- [autoparallel] adapt solver with mlp (#1638) by YuliangLiu0306
- [autoparallel] Add pofo sequence annotation (#1637) by Boyuan Yao
- [autoparallel] add elementwise handler (#1622) by YuliangLiu0306
- [autoparallel] add embedding handler (#1620) by YuliangLiu0306
- [autoparallel] protect bcast handler from invalid strategies (#1631) by YuliangLiu0306
- [autoparallel] add layernorm handler (#1629) by YuliangLiu0306
- [autoparallel] recover the merged node strategy index (#1613) by YuliangLiu0306
- [autoparallel] added new linear module handler (#1616) by Frank Lee
- [autoparallel] added new node handler (#1612) by Frank Lee
- [autoparallel]add bcast matmul strategies (#1605) by YuliangLiu0306
- [autoparallel] refactored the data structure for sharding strategy (#1610) by Frank Lee
- [autoparallel] add bcast op handler (#1600) by YuliangLiu0306
- [autoparallel] added all non-bcast matmul strategies (#1603) by Frank Lee
- [autoparallel] added strategy generator and bmm strategies (#1602) by Frank Lee
- [autoparallel] add reshape handler (#1594) by YuliangLiu0306
- [autoparallel] refactored shape consistency to remove redundancy (#1591) by Frank Lee
- [autoparallel] add resnet autoparallel unit test and add backward weight communication cost (#1589) by YuliangLiu0306
- [autoparallel] added generate_sharding_spec to utils (#1590) by Frank Lee
- [autoparallel] added solver option dataclass (#1588) by Frank Lee
- [autoparallel] adapt solver with resnet (#1583) by YuliangLiu0306
Fx/meta/rpc
- [fx/meta/rpc] move _meta_registration.py to fx folder / register fx functions with compatibility checks / remove color debug (#1710) by Super Daniel
Embeddings
- [embeddings] add doc in readme (#1711) by Jiarui Fang
- [embeddings] more detailed timer (#1692) by Jiarui Fang
- [embeddings] cache option (#1635) by Jiarui Fang
- [embeddings] use cache_ratio instead of cuda_row_num (#1611) by Jiarui Fang
- [embeddings] add already_split_along_rank flag for tablewise mode (#1584) by CsRic
Unittest
- [unittest] added doc for the pytest wrapper (#1704) by Frank Lee
- [unittest] supported conditional testing based on env var (#1701) by Frank Lee
Embedding
- [embedding] rename FreqAwareEmbedding -> CachedEmbedding (#1699) by Jiarui Fang
- [embedding] polish async copy (#1657) by Jiarui Fang
- [embedding] add more detail profiling (#1656) by Jiarui Fang
- [embedding] print profiling results (#1654) by Jiarui Fang
- [embedding] non-blocking cpu-gpu copy (#1647) by Jiarui Fang
- [embedding] isolate cache_op from forward (#1645) by CsRic
- [embedding] rollback for better FAW performance (#1625) by Jiarui Fang
- [embedding] updates some default parameters by Jiarui Fang
Fx/profiler
- [fx/profiler] assigned UUID to each unrecorded tensor/ improved performance on GPT-2 (#1679) by Super Daniel
- [fx/profiler] provide a table of sum...
Version v0.1.10 Release Today!
What's Changed
Embedding
- [embedding] cache_embedding small improvement (#1564) by CsRic
- [embedding] polish parallel embedding tablewise (#1545) by Jiarui Fang
- [embedding] freq_aware_embedding: add small functions for caller application (#1537) by CsRic
- [embedding] fix a bug in table wise sharding (#1538) by Jiarui Fang
- [embedding] tablewise sharding polish (#1535) by Jiarui Fang
- [embedding] add tablewise sharding for FAW (#1526) by CsRic
Nfc
- [NFC] polish test component gpt code style (#1567) by アマデウス
- [NFC] polish doc style for ColoTensor (#1457) by Jiarui Fang
- [NFC] global vars should be upper case (#1456) by Jiarui Fang
Pipeline/tuning
- [pipeline/tuning] improve dispatch performance both time and space cost (#1544) by Kirigaya Kazuto
Fx
- [fx] provide a stable but not accurate enough version of profiler. (#1547) by Super Daniel
- [fx] Add common node in model linearize (#1542) by Boyuan Yao
- [fx] support meta tracing for aten level computation graphs like functorch. (#1536) by Super Daniel
- [fx] Modify solver linearize and add corresponding test (#1531) by Boyuan Yao
- [fx] add test for meta tensor. (#1527) by Super Daniel
- [fx]patch nn.functional convolution (#1528) by YuliangLiu0306
- [fx] Fix wrong index in annotation and minimal flops in ckpt solver (#1521) by Boyuan Yao
- [fx] hack torch_dispatch for meta tensor and autograd. (#1515) by Super Daniel
- [fx] Fix activation codegen dealing with checkpointing first op (#1510) by Boyuan Yao
- [fx] fix the discretize bug (#1506) by Boyuan Yao
- [fx] fix wrong variable name in solver rotor (#1502) by Boyuan Yao
- [fx] Add activation checkpoint solver rotor (#1496) by Boyuan Yao
- [fx] add more op patches for profiler and error message for unsupported ops. (#1495) by Super Daniel
- [fx] fixed adaptive pooling size concatenation error (#1489) by Frank Lee
- [fx] add profiler for fx nodes. (#1480) by Super Daniel
- [fx] Fix ckpt functions' definitions in forward (#1476) by Boyuan Yao
- [fx] fix MetaInfoProp for incorrect calculations and add detections for inplace op. (#1466) by Super Daniel
- [fx] add rules to linearize computation graphs for searching. (#1461) by Super Daniel
- [fx] Add use_reentrant=False to checkpoint in codegen (#1463) by Boyuan Yao
- [fx] fix test and algorithm bugs in activation checkpointing. (#1451) by Super Daniel
- [fx] Use colossalai checkpoint and add offload recognition in codegen (#1439) by Boyuan Yao
- [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. (#1446) by Super Daniel
Autoparallel
- [autoparallel]add backward cost info into strategies (#1524) by YuliangLiu0306
- [autoparallel] support function in operator handler (#1529) by YuliangLiu0306
- [autoparallel] change the merge node logic (#1533) by YuliangLiu0306
- [autoparallel] added liveness analysis (#1516) by Frank Lee
- [autoparallel] add more sharding strategies to conv (#1487) by YuliangLiu0306
- [autoparallel] add cost graph class (#1481) by YuliangLiu0306
- [autoparallel] added namespace constraints (#1490) by Frank Lee
- [autoparallel] integrate auto parallel with torch fx (#1479) by Frank Lee
- [autoparallel] added dot handler (#1475) by Frank Lee
- [autoparallel] introduced baseclass for op handler and reduced code redundancy (#1471) by Frank Lee
- [autoparallel] standardize the code structure (#1469) by Frank Lee
- [autoparallel] Add conv handler to generate strategies and costs info for conv (#1467) by YuliangLiu0306
Utils
- [utils] refactor parallel layers checkpoint and bcast model on loading checkpoint (#1548) by ver217
- [utils] optimize partition_tensor_parallel_state_dict (#1546) by ver217
- [utils] Add use_reetrant=False in utils.activation_checkpoint (#1460) by Boyuan Yao
- [utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442) by ver217
Hotfix
- [hotfix] change namespace for meta_trace. (#1541) by Super Daniel
- [hotfix] fix init context (#1543) by ver217
- [hotfix] avoid conflict of meta registry with torch 1.13.0. (#1530) by Super Daniel
- [hotfix] fix coloproxy typos. (#1519) by Super Daniel
Pipeline/pipeline_process_group
- [pipeline/pipeline_process_group] finish PipelineProcessGroup to manage local and global rank in TP, DP and PP (#1508) by Kirigaya Kazuto
Doc
- [doc] docstring for FreqAwareEmbeddingBag (#1525) by Jiarui Fang
- [doc] update readme with the new xTrimoMultimer project (#1477) by Sze-qq
- [doc] update docstring in ProcessGroup (#1468) by Jiarui Fang
- [Doc] add more doc for ColoTensor. (#1458) by Jiarui Fang
Autoparallel
- [autoparallel]add strategies constructor (#1505) by YuliangLiu0306
Faw
- [FAW] cpu caching operations (#1520) by Jiarui Fang
- [FAW] refactor reorder() for CachedParamMgr (#1514) by Jiarui Fang
- [FAW] LFU initialize with dataset freq (#1513) by Jiarui Fang
- [FAW] shrink freq_cnter size (#1509) by CsRic
- [FAW] remove code related to chunk (#1501) by Jiarui Fang
- [FAW] add more docs and fix a warning (#1500) by Jiarui Fang
- [FAW] FAW embedding use LRU as eviction strategy initialized with dataset stats (#1494) by CsRic
- [FAW] LFU cache for the FAW by CsRic
- [FAW] init an LFU implementation for FAW (#1488) by Jiarui Fang
- [FAW] reorganize the inheritance struct of FreqCacheEmbedding (#1448) by Geng Zhang
Pipeline/rpc
- [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy (#1497) by Kirigaya Kazuto
- [pipeline/rpc] implement distributed optimizer | test with assert_close (#1486) by Kirigaya Kazuto
- [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B (#1483) by Kirigaya Kazuto
- [pipeline/rpc] implement a demo for PP with cuda rpc framework (#1470) by Kirigaya Kazuto
Tensor
- [tensor]add 1D device mesh (#1492) by YuliangLiu0306
- [tensor] support runtime ShardingSpec apply (#1453) by YuliangLiu0306
- [tensor] shape consistency generate transform path and communication cost (#1435) by YuliangLiu0306
- [tensor] added linear implementation for the new sharding spec (#1416) by Frank Lee
Fce
- [FCE] update interface for frequency statistics in FreqCacheEmbedding (#1462) by Geng Zhang
Workflow
Test
Engin/schedule
- [engin/schedule] use p2p_v2 to ...
Version v0.1.9 Release Today!
What's Changed
Zero
- [zero] add chunk_managerV2 for all-gather chunk (#1441) by HELSON
- [zero] add chunk size searching algorithm for parameters in different groups (#1436) by HELSON
- [zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk (#1426) by HELSON
- [zero] add unit test for AgChunk's append, close, access (#1423) by HELSON
- [zero] add AgChunk (#1417) by HELSON
- [zero] ZeroDDP supports controlling outputs' dtype (#1399) by ver217
- [zero] alleviate memory usage in ZeRODDP state_dict (#1398) by HELSON
- [zero] chunk manager allows filtering ex-large params (#1393) by ver217
- [zero] zero optim state_dict takes only_rank_0 (#1384) by ver217
Fx
- [fx] add vanilla activation checkpoint search with test on resnet and densenet (#1433) by Super Daniel
- [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages (#1425) by Super Daniel
- [fx] fixed torchaudio conformer tracing (#1392) by Frank Lee
- [fx] patched torch.max and data movement operator (#1391) by Frank Lee
- [fx] fixed indentation error in checkpointing codegen (#1385) by Frank Lee
- [fx] patched torch.full for huggingface opt (#1386) by Frank Lee
- [fx] update split module pass and add customized policy (#1373) by YuliangLiu0306
- [fx] add torchaudio test (#1369) by Super Daniel
- [fx] Add colotracer compatibility test on torchrec (#1370) by Boyuan Yao
- [fx]add gpt2 passes for pipeline performance test (#1366) by YuliangLiu0306
- [fx] added activation checkpoint codegen support for torch < 1.12 (#1359) by Frank Lee
- [fx] added activation checkpoint codegen (#1355) by Frank Lee
- [fx] fixed apex normalization patch exception (#1352) by Frank Lee
- [fx] added activation checkpointing annotation (#1349) by Frank Lee
- [fx] update MetaInfoProp pass to process more complex node.meta (#1344) by YuliangLiu0306
- [fx] refactor tracer to trace complete graph (#1342) by YuliangLiu0306
- [fx] tested the complete workflow for auto-parallel (#1336) by Frank Lee
- [fx]refactor tracer (#1335) by YuliangLiu0306
- [fx] recovered skipped pipeline tests (#1338) by Frank Lee
- [fx] fixed compatibility issue with torch 1.10 (#1331) by Frank Lee
- [fx] fixed unit tests for torch 1.12 (#1327) by Frank Lee
- [fx] add balanced policy v2 (#1251) by YuliangLiu0306
- [fx] Add unit test and fix bugs for transform_mlp_pass (#1299) by XYE
- [fx] added apex normalization to patched modules (#1300) by Frank Lee
Recommendation System
- [FAW] export FAW in _ops (#1438) by Jiarui Fang
- [FAW] move coloparam setting in test code. (#1429) by Jiarui Fang
- [FAW] parallel FreqAwareEmbedding (#1424) by Jiarui Fang
- [FAW] add cache manager for the cached embedding (#1419) by Jiarui Fang
Global Tensor
- [tensor] add shape consistency feature to support auto spec transform (#1418) by YuliangLiu0306
- [tensor]build sharding spec to replace distspec in future. (#1405) by YuliangLiu0306
Hotfix
- [hotfix] zero optim prevents calling inner optim.zero_grad (#1422) by ver217
- [hotfix] fix CPUAdam kernel nullptr (#1410) by ver217
- [hotfix] adapt ProcessGroup and Optimizer to ColoTensor (#1388) by HELSON
- [hotfix] fix a running error in test_colo_checkpoint.py (#1387) by HELSON
- [hotfix] fix some bugs during gpt2 testing (#1379) by YuliangLiu0306
- [hotfix] fix zero optim save/load state dict (#1381) by ver217
- [hotfix] fix zero ddp buffer cast (#1376) by ver217
- [hotfix] fix no optimizer in save/load (#1363) by HELSON
- [hotfix] fix megatron_init in test_gpt2.py (#1357) by HELSON
- [hotfix] ZeroDDP use new process group (#1333) by ver217
- [hotfix] shared model returns cpu state_dict (#1328) by ver217
- [hotfix] fix ddp for unit test test_gpt2 (#1326) by HELSON
- [hotfix] fix unit test test_module_spec (#1321) by HELSON
- [hotfix] fix PipelineSharedModuleGradientHandler (#1314) by ver217
- [hotfix] fix ColoTensor GPT2 unitest (#1309) by HELSON
- [hotfix] add missing file (#1308) by Jiarui Fang
- [hotfix] remove potiential circle import (#1307) by Jiarui Fang
- [hotfix] skip some unittest due to CI environment. (#1301) by YuliangLiu0306
- [hotfix] fix shape error in backward when using ColoTensor (#1298) by HELSON
- [hotfix] Dist Mgr gather torch version (#1284) by Jiarui Fang
Communication
- [communication] add p2p_v2.py to support communication with List[Any] (#1407) by Kirigaya Kazuto
Device
- [device] add DeviceMesh class to support logical device layout (#1394) by YuliangLiu0306
Chunk
- [chunk] add PG check for tensor appending (#1383) by Jiarui Fang
DDP
Checkpoint
- [checkpoint] add kwargs for load_state_dict (#1374) by HELSON
- [checkpoint] use args, kwargs in save_checkpoint, load_checkpoint (#1368) by HELSON
- [checkpoint] sharded optim save/load grad scaler (#1350) by ver217
- [checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) by HELSON
- [checkpoint] add ColoOptimizer checkpointing (#1316) by Jiarui Fang
- [checkpoint] add test for bert and hotfix save bugs (#1297) by Jiarui Fang
Util
Nvme
Colotensor
- [colotensor] use cpu memory to store state_dict (#1367) by HELSON
- [colotensor] add Tensor.view op and its unit test (#1343) by HELSON
Unit test
Docker
Doc
Refactor
Workflow
- [workflow] update docker build workflow to use proxy (#1334) by Frank Lee
- [workflow] update 8-gpu test to use torch 1.11 (#1332) by Frank Lee
- [workflow] roll back to use torch 1.11 for unit testing (#1325) by Frank Lee
- [workflow] fixed trigger condition for 8-gpu unit test (#1323) by Frank Lee
- [workflow] updated release bdist workflow (#1318) by Frank Lee
- [workflow] disable SHM for compatibility CI on rtx3080 (#1315) by Frank Lee
- [workflow] updated pytorch compatibility test (#1311) by Frank Lee
Test
- [test] removed outdated unit test for meta context (#1329) by [Frank Lee](https://api.github.com/users/Fra...
Version v0.1.8 Release Today!
What's Changed
Hotfix
- [hotfix] torchvision fx unittests miss import pytest (#1277) by Jiarui Fang
- [hotfix] fix an assertion bug in base schedule. (#1250) by YuliangLiu0306
- [hotfix] fix sharded optim step and clip_grad_norm (#1226) by ver217
- [hotfix] fx get comm size bugs (#1233) by Jiarui Fang
- [hotfix] fx shard 1d pass bug fixing (#1220) by Jiarui Fang
- [hotfix]fixed p2p process send stuck (#1181) by YuliangLiu0306
- [hotfix]different overflow status lead to communication stuck. (#1175) by YuliangLiu0306
- [hotfix]fix some bugs caused by refactored schedule. (#1148) by YuliangLiu0306
Tensor
- [tensor] distributed checkpointing for parameters (#1240) by Jiarui Fang
- [tensor] redistribute among different process groups (#1247) by Jiarui Fang
- [tensor] a shorter shard and replicate spec (#1245) by Jiarui Fang
- [tensor] redirect .data.get to a tensor instance (#1239) by HELSON
- [tensor] add zero_like colo op, important for Optimizer (#1236) by Jiarui Fang
- [tensor] fix some unittests (#1234) by Jiarui Fang
- [tensor] fix an assertion in colo_tensor cross_entropy (#1232) by HELSON
- [tensor] add unitest for colo_tensor 1DTP cross_entropy (#1230) by HELSON
- [tensor] torch function return colotensor (#1229) by Jiarui Fang
- [tensor] improve robustness of class 'ProcessGroup' (#1223) by HELSON
- [tensor] sharded global process group (#1219) by Jiarui Fang
- [Tensor] add cpu group to ddp (#1200) by Jiarui Fang
- [tensor] remove gpc in tensor tests (#1186) by Jiarui Fang
- [tensor] revert local view back (#1178) by Jiarui Fang
- [Tensor] rename some APIs in TensorSpec and Polish view unittest (#1176) by Jiarui Fang
- [Tensor] rename parallel_action (#1174) by Ziyue Jiang
- [Tensor] distributed view supports inter-process hybrid parallel (#1169) by Jiarui Fang
- [Tensor] remove ParallelAction, use ComputeSpec instead (#1166) by Jiarui Fang
- [tensor] add embedding bag op (#1156) by ver217
- [tensor] add more element-wise ops (#1155) by ver217
- [tensor] fixed non-serializable colo parameter during model checkpointing (#1153) by Frank Lee
- [tensor] dist spec s2s uses all-to-all (#1136) by ver217
- [tensor] added repr to spec (#1147) by Frank Lee
Fx
- [fx] added ndim property to proxy (#1253) by Frank Lee
- [fx] fixed tracing with apex-based T5 model (#1252) by Frank Lee
- [fx] refactored the file structure of patched function and module (#1238) by Frank Lee
- [fx] methods to get fx graph property. (#1246) by YuliangLiu0306
- [fx]add split module pass and unit test from pipeline passes (#1242) by YuliangLiu0306
- [fx] fixed huggingface OPT and T5 results misalignment (#1227) by Frank Lee
- [fx]get communication size between partitions (#1224) by YuliangLiu0306
- [fx] added patches for tracing swin transformer (#1228) by Frank Lee
- [fx] fixed timm tracing result misalignment (#1225) by Frank Lee
- [fx] added timm model tracing testing (#1221) by Frank Lee
- [fx] added torchvision model tracing testing (#1216) by Frank Lee
- [fx] temporarily used (#1215) by XYE
- [fx] added testing for all albert variants (#1211) by Frank Lee
- [fx] added testing for all gpt variants (#1210) by Frank Lee
- [fx]add uniform policy (#1208) by YuliangLiu0306
- [fx] added testing for all bert variants (#1207) by Frank Lee
- [fx] supported model tracing for huggingface bert (#1201) by Frank Lee
- [fx] added module patch for pooling layers (#1197) by Frank Lee
- [fx] patched conv and normalization (#1188) by Frank Lee
- [fx] supported data-dependent control flow in model tracing (#1185) by Frank Lee
Rename
- [rename] convert_to_dist -> redistribute (#1243) by Jiarui Fang
Checkpoint
- [checkpoint] save sharded optimizer states (#1237) by Jiarui Fang
- [checkpoint]support generalized scheduler (#1222) by Yi Zhao
- [checkpoint] make unitest faster (#1217) by Jiarui Fang
- [checkpoint] checkpoint for ColoTensor Model (#1196) by Jiarui Fang
Polish
Refactor
- [refactor] move process group from _DistSpec to ColoTensor. (#1203) by Jiarui Fang
- [refactor] remove gpc dependency in colotensor's _ops (#1189) by Jiarui Fang
- [refactor] move chunk and chunkmgr to directory gemini (#1182) by Jiarui Fang
Context
- [context]support arbitrary module materialization. (#1193) by YuliangLiu0306
- [context]use meta tensor to init model lazily. (#1187) by YuliangLiu0306
Ddp
- [ddp] ColoDDP uses bucket all-reduce (#1177) by ver217
- [ddp] refactor ColoDDP and ZeroDDP (#1146) by ver217
Colotensor
- [ColoTensor] add independent process group (#1179) by Jiarui Fang
- [ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168) by Jiarui Fang
- [ColoTensor] improves init functions. (#1150) by Jiarui Fang
Zero
- [zero] sharded optim supports loading local state dict (#1170) by ver217
- [zero] zero optim supports loading local state dict (#1171) by ver217
Workflow
- [workflow] polish readme and dockerfile (#1165) by Frank Lee
- [workflow] auto-publish docker image upon release (#1164) by Frank Lee
- [workflow] fixed release post workflow (#1154) by Frank Lee
- [workflow] fixed format error in yaml file (#1145) by Frank Lee
- [workflow] added workflow to auto draft the release post (#1144) by Frank Lee
Gemini
Pipeline
- [pipeline]add customized policy (#1139) by YuliangLiu0306
- [pipeline]support more flexible pipeline (#1138) by YuliangLiu0306
Ci
Full Changelog: v0.1.8...v0.1.7
Version v0.1.7 Released Today
Highlights
- Started torch.fx-based tracing for auto-parallel training (see the sketch below)
- Updated the ZeRO mechanism to work with ColoTensor
- Fixed various bugs
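The auto-parallel work listed below builds on torch.fx graph tracing. As background, here is a minimal sketch of plain torch.fx tracing on a hypothetical two-layer model; it uses the stock PyTorch tracer rather than ColossalAI's own tracer, so treat it only as an illustration of the graph representation these passes operate on, not as the project's API.

```python
import torch
import torch.fx

# Hypothetical toy model; any symbolically traceable nn.Module works the same way.
class TwoLayerMLP(torch.nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.fc1 = torch.nn.Linear(dim, dim)
        self.fc2 = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# symbolic_trace records the forward pass as a graph of call_module /
# call_function nodes; auto-parallel passes operate on graphs of this kind.
gm = torch.fx.symbolic_trace(TwoLayerMLP())
print(gm.graph)   # inspect the traced computation graph
print(gm.code)    # the Python code regenerated from the graph
```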
What's Changed
Hotfix
- [hotfix] prevent nested ZeRO (#1140) by ver217
- [hotfix]fix bugs caused by refactored pipeline (#1133) by YuliangLiu0306
- [hotfix] fix param op hook (#1131) by ver217
- [hotfix] fix zero init ctx numel (#1128) by ver217
- [hotfix]change to fit latest p2p (#1100) by YuliangLiu0306
- [hotfix] fix chunk comm src rank (#1072) by ver217
Zero
- [zero] avoid zero hook spam by changing log to debug level (#1137) by Frank Lee
- [zero] added error message to handle on-the-fly import of torch Module class (#1135) by Frank Lee
- [zero] fixed api consistency (#1098) by Frank Lee
- [zero] zero optim copy chunk rather than copy tensor (#1070) by ver217
Optim
Ddp
- [ddp] add save/load state dict for ColoDDP (#1127) by ver217
- [ddp] add set_params_to_ignore for ColoDDP (#1122) by ver217
- [ddp] supported customized torch ddp configuration (#1123) by Frank Lee
Pipeline
- [pipeline]support List of Dict data (#1125) by YuliangLiu0306
- [pipeline] supported more flexible dataflow control for pipeline parallel training (#1108) by Frank Lee
- [pipeline] refactor the pipeline module (#1087) by Frank Lee
Fx
- [fx]add autoparallel passes (#1121) by YuliangLiu0306
- [fx] added unit test for coloproxy (#1119) by Frank Lee
- [fx] added coloproxy (#1115) by Frank Lee
Gemini
- [gemini] gemini mgr supports "cpu" placement policy (#1118) by ver217
- [gemini] zero supports gemini (#1093) by ver217
Test
- [test] fixed hybrid parallel test case on 8 GPUs (#1106) by Frank Lee
- [test] skip tests when not enough GPUs are detected (#1090) by Frank Lee
- [test] ignore 8 gpu test (#1080) by Frank Lee
Release
Tensor
- [tensor] refactor param op hook (#1097) by ver217
- [tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077) by ver217
- [Tensor] fix equal assert (#1091) by Ziyue Jiang
- [Tensor] 1d row embedding (#1075) by Ziyue Jiang
- [tensor] chunk manager monitor mem usage (#1076) by ver217
- [Tensor] fix optimizer for CPU parallel (#1069) by Ziyue Jiang
- [Tensor] add hybrid device demo and fix bugs (#1059) by Ziyue Jiang
Amp
Workflow
- [workflow] fixed 8-gpu test workflow (#1101) by Frank Lee
- [workflow] added regular 8 GPU testing (#1099) by Frank Lee
- [workflow] disable p2p via shared memory on non-nvlink machine (#1086) by Frank Lee
Engine
Doc
Context
- [context] support lazy init of module (#1088) by Frank Lee
- [context] maintain the context object in with statement (#1073) by Frank Lee
Refactory
- [refactory] add nn.parallel module (#1068) by Jiarui Fang
Cudnn
Full Changelog: v0.1.7...v0.1.6
v0.1.6 Released!
Main Features
- ColoTensor supports hybrid parallel (tensor parallel and data parallel)
- ColoTensor supports ZeRO (with chunk)
- Configure tensor parallelism per module via ColoTensor
- ZeroInitContext and ShardedModelV2 support loading checkpoints and Hugging Face from_pretrained() (see the sketch below)
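As a rough illustration of the checkpoint / from_pretrained() support, the sketch below follows the pattern used in ZeRO examples of this era. The import paths, the ZeroInitContext signature, and the choice of TensorShardStrategy are assumptions based on those examples and may differ between 0.1.x versions; the BERT checkpoint name is only a placeholder.

```python
import torch
import colossalai
from colossalai.zero.init_ctx import ZeroInitContext          # assumed path for 0.1.x
from colossalai.zero.shard_utils import TensorShardStrategy   # assumed path for 0.1.x
from transformers import BertForMaskedLM

# Assumes the script is started by a torch.distributed launcher
# (torchrun or the ColossalAI CLI) that sets rank/world-size env vars.
colossalai.launch_from_torch(config={})

# Parameters created inside the context are sharded on the fly, so a
# Hugging Face checkpoint can be loaded without first materializing the
# whole fp32 model on a single device.
with ZeroInitContext(target_device=torch.device('cuda', torch.cuda.current_device()),
                     shard_strategy=TensorShardStrategy(),
                     shard_param=True):
    model = BertForMaskedLM.from_pretrained('bert-base-uncased')
```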
What's Changed
ColoTensor
- [tensor] refactor colo-tensor by @ver217 in #992
- [tensor] refactor parallel action by @ver217 in #1007
- [tensor] impl ColoDDP for ColoTensor by @ver217 in #1009
- [Tensor] add module handler for linear by @Wesley-Jzy in #1021
- [Tensor] add module check and bert test by @Wesley-Jzy in #1031
- [Tensor] add Parameter inheritance for ColoParameter by @Wesley-Jzy in #1041
- [tensor] ColoTensor supports ZeRo by @ver217 in #1015
- [zero] add chunk size search for chunk manager by @ver217 in #1052
Zero
- [zero] add load_state_dict for sharded model by @ver217 in #894
- [zero] add zero optimizer for ColoTensor by @ver217 in #1046
Hotfix
- [hotfix] fix colo init context by @ver217 in #1026
- [hotfix] fix some bugs caused by size mismatch. by @YuliangLiu0306 in #1011
- [kernel] fixed the include bug in dropout kernel by @FrankLeeeee in #999
- fix typo in constants by @ryanrussell in #1027
- [engine] fixed bug in gradient accumulation dataloader to keep the last step by @FrankLeeeee in #1030
- [hotfix] fix dist spec mgr by @ver217 in #1045
- [hotfix] fix import error in sharded model v2 by @ver217 in #1053
Unit test
CI
- [ci] update the docker image name by @FrankLeeeee in #1017
- [ci] added nightly build (#1018) by @FrankLeeeee in #1019
- [ci] fixed nightly build workflow by @FrankLeeeee in #1022
- [ci] fixed nightly build workflow by @FrankLeeeee in #1029
- [ci] fixed nightly build workflow by @FrankLeeeee in #1040
CLI
- [cli] remove unused imports by @FrankLeeeee in #1001
Documentation
- Hotfix/format by @binmakeswell in #987
- [doc] update docker instruction by @FrankLeeeee in #1020
Misc
- [NFC] Hotfix/format by @binmakeswell in #984
- Revert "[NFC] Hotfix/format" by @ver217 in #986
- remove useless import in tensor dir by @feifeibear in #997
- [NFC] fix download link by @binmakeswell in #998
- [Bot] Synchronize Submodule References by @github-actions in #1003
- [NFC] polish colossalai/kernel/cuda_native/csrc/colossal_C_frontend.c… by @zhengzangw in #1010
- [NFC] fix paper link by @binmakeswell in #1012
- [p2p]add object list send/recv by @YuliangLiu0306 in #1024
- [Bot] Synchronize Submodule References by @github-actions in #1034
- [NFC] add inference by @binmakeswell in #1044
- [titans]remove model zoo by @YuliangLiu0306 in #1042
- [NFC] add inference submodule in path by @binmakeswell in #1047
- [release] update version.txt by @FrankLeeeee in #1048
- [Bot] Synchronize Submodule References by @github-actions in #1049
- updated collective ops api by @kurisusnowdeng in #1054
- [pipeline]refactor ppschedule to support tensor list by @YuliangLiu0306 in #1050
New Contributors
- @ryanrussell made their first contribution in #1027
Full Changelog: v0.1.5...v0.1.6
v0.1.5 Released!
Main Features
- Enhanced ColoTensor and built a demo that trains BERT (from Hugging Face) with Tensor Parallelism without modifying the model.
What's Changed
ColoTensor
- [Tensor] add ColoTensor TP1Dcol Embedding by @Wesley-Jzy in #899
- [Tensor] add embedding tp1d row by @Wesley-Jzy in #904
- [Tensor] update pytest.mark.parametrize in tensor tests by @Wesley-Jzy in #913
- [Tensor] init ColoParameter by @feifeibear in #914
- [Tensor] add a basic bert. by @Wesley-Jzy in #911
- [Tensor] polish model test by @feifeibear in #915
- [Tensor] fix test_model by @Wesley-Jzy in #916
- [Tensor] add 1d vocab loss by @Wesley-Jzy in #918
- [Graph] building computing graph with ColoTensor, Linear only by @feifeibear in #917
- [Tensor] add from_pretrained support and bert pretrained test by @Wesley-Jzy in #921
- [Tensor] test pretrain loading on multi-process by @feifeibear in #922
- [tensor] hijack addmm for colo tensor by @ver217 in #923
- [tensor] colo tensor overrides mul by @ver217 in #927
- [Tensor] simplify named param by @Wesley-Jzy in #928
- [Tensor] fix init context by @Wesley-Jzy in #931
- [Tensor] add optimizer to bert test by @Wesley-Jzy in #933
- [tensor] design DistSpec and DistSpecManager for ColoTensor by @ver217 in #934
- [Tensor] add DistSpec for loss and test_model by @Wesley-Jzy in #947
- [tensor] derive compute pattern from dist spec by @ver217 in #971
Pipeline Parallelism
- [pipelinable]use pipelinable to support GPT model. by @YuliangLiu0306 in #903
CI
- [CI] add CI for releasing bdist wheel by @ver217 in #901
- [CI] fix release bdist CI by @ver217 in #902
- [ci] added wheel build scripts by @FrankLeeeee in #910
Misc
- [Bot] Synchronize Submodule References by @github-actions in #907
- [Bot] Synchronize Submodule References by @github-actions in #912
- [setup] update cuda ext cc flags by @ver217 in #919
- [setup] support more cuda architectures by @ver217 in #920
- [NFC] update results on a single GPU, highlight quick view by @binmakeswell in #981
Full Changelog: v0.1.4...v0.1.5
v0.1.4 Released!
Main Features
Here are the main improvements of this release:
- ColoTensor: a data structure that unifies the tensor representation of different parallel methods.
- Gemini: a more efficient Gemini implementation that reduces the overhead of collecting model data statistics.
- CLI: a command-line tool that helps users launch distributed training tasks more easily (see the launch sketch below).
- Pipeline Parallelism (PP): a more user-friendly API for PP.
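To make the CLI/launcher item concrete, here is a minimal launch sketch based on the common launch_from_torch pattern; the empty config dict is a placeholder for a real config that would declare parallel settings, and the exact CLI invocation used to start such a script is left out since its flags varied across versions.

```python
import colossalai
import torch.distributed as dist

def main():
    # launch_from_torch reads the rank / world-size environment variables set by
    # the launcher process (e.g. torchrun or the ColossalAI CLI) and initializes
    # the global context. The empty dict stands in for a real config file that
    # would declare tensor / pipeline parallel sizes, AMP settings, etc.
    colossalai.launch_from_torch(config={})
    print(f'initialized rank {dist.get_rank()} of world size {dist.get_world_size()}')

if __name__ == '__main__':
    main()
```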
What's Changed
ColoTensor
- [tensor]fix colo_tensor torch_function by @Wesley-Jzy in #825
- [tensor]fix test_linear by @Wesley-Jzy in #826
- [tensor] ZeRO use ColoTensor as the base class. by @feifeibear in #828
- [tensor] revert zero tensors back by @feifeibear in #829
- [Tensor] overriding parameters() for Module using ColoTensor by @feifeibear in #889
- [tensor] refine linear and add gather for layernorm by @Wesley-Jzy in #893
- [Tensor] test parameters() as member function by @feifeibear in #896
- [Tensor] activation is an attr of ColoTensor by @feifeibear in #897
- [Tensor] initialize the ColoOptimizer by @feifeibear in #898
- [tensor] reorganize files by @feifeibear in #820
- [Tensor] apply ColoTensor on Torch functions by @feifeibear in #821
- [Tensor] update ColoTensor torch_function by @feifeibear in #822
- [tensor] lazy init by @feifeibear in #823
- [WIP] Applying ColoTensor on TP-1D-row Linear. by @feifeibear in #831
- Init Context supports lazy allocation of model memory by @feifeibear in #842
- [Tensor] TP Linear 1D row by @Wesley-Jzy in #843
- [Tensor] add assert for colo_tensor 1Drow by @Wesley-Jzy in #846
- [Tensor] init a simple network training with ColoTensor by @feifeibear in #849
- [Tensor] Add 1Drow weight reshard by spec by @Wesley-Jzy in #854
- [Tensor] add layer norm Op by @feifeibear in #852
- [tensor] an initial idea of tensor spec by @feifeibear in #865
- [Tensor] colo init context add device attr. by @feifeibear in #866
- [tensor] add cross_entropy_loss by @feifeibear in #868
- [Tensor] Add function to spec and update linear 1Drow and unit tests by @Wesley-Jzy in #869
- [tensor] customized op returns ColoTensor by @feifeibear in #875
- [Tensor] get named parameters for model using ColoTensors by @feifeibear in #874
- [Tensor] Add some attributes to ColoTensor by @feifeibear in #877
- [Tensor] make a simple net works with 1D row TP by @feifeibear in #879
- [tensor] wrap function in the torch_tensor to ColoTensor by @Wesley-Jzy in #881
- [Tensor] make ColoTensor more robust for getattr by @feifeibear in #886
- [Tensor] test model check results for a simple net by @feifeibear in #887
- [tensor] add ColoTensor 1Dcol by @Wesley-Jzy in #888
Gemini + ZeRO
- [zero] add zero tensor shard strategy by @1SAA in #793
- Revert "[zero] add zero tensor shard strategy" by @feifeibear in #806
- [gemini] a new tensor structure by @feifeibear in #818
- [gemini] APIs to set cpu memory capacity by @feifeibear in #809
- [DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext by @ver217 in #808
- [gemini] collect cpu-gpu moving volume in each iteration by @feifeibear in #813
- [gemini] add GeminiMemoryManger by @1SAA in #832
- [zero] use GeminiMemoryManager when sampling model data by @ver217 in #850
- [gemini] polish code by @1SAA in #855
- [gemini] add stateful tensor container by @1SAA in #867
- [gemini] polish stateful_tensor_mgr by @1SAA in #876
- [gemini] accelerate adjust_layout() by @ver217 in #878
CLI
- [cli] added distributed launcher command by @YuliangLiu0306 in #791
- [cli] added micro benchmarking for tp by @YuliangLiu0306 in #789
- [cli] add missing requirement by @FrankLeeeee in #805
- [cli] fixed a bug in user args and refactored the module structure by @FrankLeeeee in #807
- [cli] fixed single-node process launching by @FrankLeeeee in #812
- [cli] added check installation cli by @FrankLeeeee in #815
- [CLI] refactored the launch CLI and fixed bugs in multi-node launching by @FrankLeeeee in #844
- [cli] refactored micro-benchmarking cli and added more metrics by @FrankLeeeee in #858
Pipeline Parallelism
- [pipelinable]use pipelinable context to initialize non-pipeline model by @YuliangLiu0306 in #816
- [pipelinable]use ColoTensor to replace dummy tensor. by @YuliangLiu0306 in #853
Misc
- [hotfix] fix auto tensor placement policy by @ver217 in #775
- [hotfix] change the check assert in split batch 2d by @Wesley-Jzy in #772
- [hotfix] fix bugs in zero by @1SAA in #781
- [hotfix] fix grad offload when enabling reuse_fp16_shard by @ver217 in #784
- [refactor] moving memtracer to gemini by @feifeibear in #801
- [log] display tflops if available by @feifeibear in #802
- [refactor] moving grad acc logic to engine by @feifeibear in #804
- [log] local throughput metrics by @feifeibear in #811
- [Bot] Synchronize Submodule References by @github-actions in #810
- [Bot] Synchronize Submodule References by @github-actions in #819
- [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. by @feifeibear in #824
- [setup] allow installation with python 3.6 by @FrankLeeeee in #834
- Revert "[WIP] Applying ColoTensor on TP-1D-row Linear." by @feifeibear in #835
- [dependency] removed torchvision by @FrankLeeeee in #833
- [Bot] Synchronize Submodule References by @github-actions in #827
- [unittest] refactored unit tests for change in dependency by @FrankLeeeee in #838
- [setup] use env var instead of option for cuda ext by @FrankLeeeee in #839
- [hotfix] ColoTensor pin_memory by @feifeibear in #840
- modified the pp build for ckpt adaptation by @Gy-Lu in #803
- [hotfix] the bug of numel() in ColoTensor by @feifeibear in #845
- [hotfix] fix _post_init_method of zero init ctx by @ver217 in #847
- [hotfix] add deconstructor for stateful tensor by @ver217 in #848
- [utils] refactor profiler by @ver217 in #837
- [ci] cache cuda extension by @FrankLeeeee in #860
- hotfix tensor unittest bugs by @feifeibear in #862
- [usability] added assertion message in registry by @FrankLeeeee in #864
- [doc] improved docstring in the communication module by @FrankLeeeee in #863
- [doc] improved docstring in the logging module by @FrankLeeeee in #861
- [doc] improved docstring in the amp module by @FrankLeeeee in #857
- [usability] improved error messages in the context modu...