Releases: hpcaitech/ColossalAI
Version v0.1.11rc3 Release Today!
What's Changed
Release
Tutorial
- [tutorial] polish README and OPT files (#1930) by binmakeswell
- [tutorial] add synthetic dataset for opt (#1924) by ver217
- [tutorial] updated hybrid parallel readme (#1928) by Frank Lee
- [tutorial] added synthetic data for sequence parallel (#1927) by Frank Lee
- [tutorial] removed huggingface model warning (#1925) by Frank Lee
- Hotfix/tutorial readme index (#1922) by Frank Lee
- [tutorial] modify hands-on of auto activation checkpoint (#1920) by Boyuan Yao
- [tutorial] added synthetic data for hybrid parallel (#1921) by Frank Lee
- [tutorial] added synthetic data for hybrid parallel (#1919) by Frank Lee
- [tutorial] added synthetic dataset for auto parallel demo (#1918) by Frank Lee
- [tutorial] updated auto parallel demo with latest data path (#1917) by Frank Lee
- [tutorial] added data script and updated readme (#1916) by Frank Lee
- [tutorial] add cifar10 for diffusion (#1907) by binmakeswell
- [tutorial] removed duplicated tutorials (#1904) by Frank Lee
- [tutorial] edited hands-on practices (#1899) by BoxiangW
Example
- [example] update auto_parallel img path (#1910) by binmakeswell
- [example] add cifar10 dataset for diffusion (#1902) by Fazzie-Maqianli
- [example] migrate diffusion and auto_parallel hands-on (#1871) by binmakeswell
- [example] initialize tutorial (#1865) by binmakeswell
- Merge pull request #1842 from feifeibear/jiarui/polish by Fazzie-Maqianli
- [example] polish diffusion readme by jiaruifang
Sc
- [SC] add GPT example for auto checkpoint (#1889) by Boyuan Yao
- [sc] add examples for auto checkpoint. (#1880) by Super Daniel
Nfc
- [NFC] polish colossalai/amp/naive_amp/__init__.py code style (#1905) by Junming Wu
- [NFC] remove redundant dependency (#1869) by binmakeswell
- [NFC] polish .github/workflows/scripts/build_colossalai_wheel.py code style (#1856) by yuxuan-lou
- [NFC] polish .github/workflows/scripts/generate_release_draft.py code style (#1855) by Ofey Chan
- [NFC] polish workflows code style (#1854) by Kai Wang (Victor Kai)
- [NFC] polish colossalai/amp/apex_amp/__init__.py code style (#1853) by LuGY
- [NFC] polish .readthedocs.yaml code style (#1852) by nuszzh
- [NFC] polish <.github/workflows/release_nightly.yml> code style (#1851) by RichardoLuo
- [NFC] polish amp.naive_amp.grad_scaler code style by zbian
- [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/operator_handler.py code style (#1845) by HELSON
- [NFC] polish ./colossalai/amp/torch_amp/__init__.py code style (#1836) by Genghan Zhang
- [NFC] polish .github/workflows/build.yml code style (#1837) by xyupeng
- [NFC] polish colossalai/auto_parallel/tensor_shard/deprecated/op_handler/conv_handler.py code style (#1829) by Sze-qq
- [NFC] polish colossalai/amp/torch_amp/_grad_scaler.py code style (#1823) by Ziyue Jiang
- [NFC] polish .github/workflows/release_docker.yml code style by Maruyama_Aya
- [NFC] polish .github/workflows/submodule.yml code style (#1822) by shenggan
- [NFC] polish .github/workflows/draft_github_release_post.yml code style (#1820) by Arsmart1
- [NFC] polish colossalai/amp/naive_amp/_fp16_optimizer.py code style (#1819) by Fazzie-Maqianli
- [NFC] polish colossalai/amp/naive_amp/_utils.py code style (#1816) by CsRic
- [NFC] polish .github/workflows/build_gpu_8.yml code style (#1813) by Zangwei Zheng
- [NFC] polish MANIFEST.in code style (#1814) by Zirui Zhu
- [NFC] polish strategies_constructor.py code style (#1806) by binmakeswell
Doc
- [doc] add news (#1901) by binmakeswell
Zero
Autoparallel
- [autoparallel] user-friendly API for CheckpointSolver. (#1879) by Super Daniel
- [autoparallel] fix linear logical convert issue (#1857) by YuliangLiu0306
Fx
- [fx] metainfo_trace as an API. (#1873) by Super Daniel
Hotfix
- [hotfix] pass test_complete_workflow (#1877) by Jiarui Fang
Inference
- [inference] overlap comm and compute in Linear1D_Row when stream_chunk_num > 1 (#1876) by Jiarui Fang
- [inference] streaming Linear 1D Row inference (#1874) by Jiarui Fang
Amp
Diffusion
Utils
- [utils] fixed lazy init context (#1867) by Frank Lee
- [utils] remove lazy_memory_allocate from ColoInitContext (#1844) by Jiarui Fang
Full Changelog: v0.1.11rc3...v0.1.11rc2
Version v0.1.11rc2 Release Today!
What's Changed
Autoparallel
- [autoparallel] fix bugs caused by negative dim key (#1808) by YuliangLiu0306
- [autoparallel] fix bias addition module (#1800) by YuliangLiu0306
- [autoparallel] add batch norm metainfo (#1815) by Boyuan Yao
- [autoparallel] add conv metainfo class for auto parallel (#1796) by Boyuan Yao
- [autoparallel]add essential CommActions for broadcast operands (#1793) by YuliangLiu0306
- [autoparallel] refactor and add rotorc. (#1789) by Super Daniel
- [autoparallel] add getattr handler (#1767) by YuliangLiu0306
- [autoparallel] added matmul handler (#1763) by Frank Lee
- [autoparallel] fix conv handler numerical test (#1771) by YuliangLiu0306
- [autoparallel] move ckpt solvers to autoparallel folder / refactor code (#1764) by Super Daniel
- [autoparallel] add numerical test for handlers (#1769) by YuliangLiu0306
- [autoparallel] update CommSpec to CommActions (#1768) by YuliangLiu0306
- [autoparallel] add numerical test for node strategies (#1760) by YuliangLiu0306
- [autoparallel] refactor the runtime apply pass and add docstring to passes (#1757) by YuliangLiu0306
- [autoparallel] added binary elementwise node handler (#1758) by Frank Lee
- [autoparallel] fix param hook issue in transform pass (#1755) by YuliangLiu0306
- [autoparallel] added addbmm handler (#1751) by Frank Lee
- [autoparallel] shard param and buffer as expected (#1753) by YuliangLiu0306
- [autoparallel] add sequential order to communication actions (#1735) by YuliangLiu0306
- [autoparallel] recovered skipped test cases (#1748) by Frank Lee
- [autoparallel] fixed wrong sharding strategy in conv handler (#1747) by Frank Lee
- [autoparallel] fixed wrong generated strategy for dot op (#1746) by Frank Lee
- [autoparallel] handled illegal sharding strategy in shape consistency (#1744) by Frank Lee
- [autoparallel] handled illegal strategy in node handler (#1743) by Frank Lee
- [autoparallel] handled illegal sharding strategy (#1728) by Frank Lee
Kernel
- [kernel] added jit warmup (#1792) by アマデウス
- [kernel] more flexible flashatt interface (#1804) by oahzxl
- [kernel] skip tests of flash_attn and triton when they are not available (#1798) by Jiarui Fang
Gemini
- [Gemini] make gemini usage simple (#1821) by Jiarui Fang
Checkpointio
Doc
- [doc] polish diffusion README (#1840) by binmakeswell
- [doc] remove obsolete API demo (#1833) by binmakeswell
- [doc] add diffusion (#1827) by binmakeswell
- [doc] add FastFold (#1766) by binmakeswell
Example
- [example] remove useless readme in diffusion (#1831) by Jiarui Fang
- [example] add TP to GPT example (#1828) by Jiarui Fang
- [example] add stable diffuser (#1825) by Fazzie-Maqianli
- [example] simplify the GPT2 huggingface example (#1826) by Jiarui Fang
- [example] opt does not depend on Titans (#1811) by Jiarui Fang
- [example] add GPT by Jiarui Fang
- [example] add opt model in language (#1809) by Jiarui Fang
- [example] add diffusion to example (#1805) by Jiarui Fang
Nfc
- [NFC] update gitignore remove DS_Store (#1830) by Jiarui Fang
- [NFC] polish type hint for shape consistency (#1801) by Jiarui Fang
- [NFC] polish tests/test_layers/test_3d/test_3d.py code style (#1740) by Ziheng Qin
- [NFC] polish tests/test_layers/test_3d/checks_3d/common.py code style (#1733) by lucasliunju
- [NFC] polish colossalai/nn/metric/_utils.py code style (#1727) by Sze-qq
- [NFC] polish tests/test_layers/test_3d/checks_3d/check_layer_3d.py code style (#1731) by Xue Fuzhao
- [NFC] polish tests/test_layers/test_sequence/checks_seq/check_layer_seq.py code style (#1723) by xyupeng
- [NFC] polish accuracy_2d.py code style (#1719) by Ofey Chan
- [NFC] polish .github/workflows/scripts/build_colossalai_wheel.py code style (#1721) by Arsmart1
- [NFC] polish _checkpoint_hook.py code style (#1722) by LuGY
- [NFC] polish test_2p5d/checks_2p5d/check_operation_2p5d.py code style (#1718) by Kai Wang (Victor Kai)
- [NFC] polish colossalai/zero/sharded_param/__init__.py code style (#1717) by CsRic
- [NFC] polish colossalai/nn/lr_scheduler/linear.py code style (#1716) by yuxuan-lou
- [NFC] polish tests/test_layers/test_2d/checks_2d/check_operation_2d.py code style (#1715) by binmakeswell
- [NFC] polish colossalai/nn/metric/accuracy_2p5d.py code style (#1714) by shenggan
Fx
- [fx] add a symbolic_trace api. (#1812) by Super Daniel
- [fx] skip diffusers unitest if it is not installed (#1799) by Jiarui Fang
- [fx] Add linear metainfo class for auto parallel (#1783) by Boyuan Yao
- [fx] support module with bias addition (#1780) by YuliangLiu0306
- [fx] refactor memory utils and extend shard utils. (#1754) by Super Daniel
- [fx] test tracer on diffuser modules. (#1750) by Super Daniel
Hotfix
- [hotfix] fix build error when torch version >= 1.13 (#1803) by xcnick
- [hotfix] polish flash attention (#1802) by oahzxl
- [hotfix] fix zero's incompatibility with checkpoint in torch-1.12 (#1786) by HELSON
- [hotfix] polish chunk import (#1787) by Jiarui Fang
- [hotfix] autoparallel unit test (#1752) by YuliangLiu0306
Pipeline
- [Pipeline]Adapt to Pipelinable OPT (#1782) by Ziyue Jiang
Ci
- [CI] downgrade fbgemm. (#1778) by Super Daniel
Compatibility
- [compatibility] ChunkMgr import error (#1772) by Jiarui Fang
Feat
Fx/profiler
- [fx/profiler] debug the fx.profiler / add an example test script for fx.profiler (#1730) by Super Daniel
Workflow
Full Changelog: v0.1.11rc2...v0.1.11rc1
Version v0.1.11rc1 Release Today!
What's Changed
Hotfix
- [hotfix] resharding cost issue (#1742) by YuliangLiu0306
- [hotfix] solver bug caused by dict type comm cost (#1686) by YuliangLiu0306
- [hotfix] fix wrong type name in profiler (#1678) by Boyuan Yao
- [hotfix]unit test (#1670) by YuliangLiu0306
- [hotfix] add recompile after graph manipulation (#1621) by YuliangLiu0306
- [hotfix] got sliced types (#1614) by YuliangLiu0306
Release
Doc
- [doc] update recommendation system catalogue (#1732) by binmakeswell
- [doc] update recommendation system urls (#1725) by Jiarui Fang
Zero
- [zero] add chunk init function for users (#1729) by HELSON
- [zero] add constant placement policy (#1705) by HELSON
Pre-commit
Autoparallel
- [autoparallel] runtime_backward_apply (#1720) by YuliangLiu0306
- [autoparallel] moved tests to test_tensor_shard (#1713) by Frank Lee
- [autoparallel] resnet block runtime apply (#1709) by YuliangLiu0306
- [autoparallel] fixed broken node handler tests (#1708) by Frank Lee
- [autoparallel] refactored the autoparallel module for organization (#1706) by Frank Lee
- [autoparallel] adapt runtime passes (#1703) by YuliangLiu0306
- [autoparallel] collated all deprecated files (#1700) by Frank Lee
- [autoparallel] init new folder structure (#1696) by Frank Lee
- [autoparallel] adapt solver and CostGraph with new handler (#1695) by YuliangLiu0306
- [autoparallel] add output handler and placeholder handler (#1694) by YuliangLiu0306
- [autoparallel] add pooling handler (#1690) by YuliangLiu0306
- [autoparallel] where_handler_v2 (#1688) by YuliangLiu0306
- [autoparallel] fix C version rotor inconsistency (#1691) by Boyuan Yao
- [autoparallel] added sharding spec conversion for linear handler (#1687) by Frank Lee
- [autoparallel] add reshape handler v2 and fix some previous bug (#1683) by YuliangLiu0306
- [autoparallel] add unary element wise handler v2 (#1674) by YuliangLiu0306
- [autoparallel] add following node generator (#1673) by YuliangLiu0306
- [autoparallel] add layer norm handler v2 (#1671) by YuliangLiu0306
- [autoparallel] fix insecure subprocess (#1680) by Boyuan Yao
- [autoparallel] add rotor C version (#1658) by Boyuan Yao
- [autoparallel] added utils for broadcast operation (#1665) by Frank Lee
- [autoparallel] update CommSpec (#1667) by YuliangLiu0306
- [autoparallel] added bias comm spec to matmul strategy (#1664) by Frank Lee
- [autoparallel] add batch norm handler v2 (#1666) by YuliangLiu0306
- [autoparallel] remove no strategy nodes (#1652) by YuliangLiu0306
- [autoparallel] added compute resharding costs for node handler (#1662) by Frank Lee
- [autoparallel] added new strategy constructor template (#1661) by Frank Lee
- [autoparallel] added node handler for bmm (#1655) by Frank Lee
- [autoparallel] add conv handler v2 (#1663) by YuliangLiu0306
- [autoparallel] adapt solver with gpt (#1653) by YuliangLiu0306
- [autoparallel] implemented all matmul strategy generator (#1650) by Frank Lee
- [autoparallel] change the following nodes strategies generation logic (#1636) by YuliangLiu0306
- [autoparallel] where handler (#1651) by YuliangLiu0306
- [autoparallel] implemented linear projection strategy generator (#1639) by Frank Lee
- [autoparallel] adapt solver with mlp (#1638) by YuliangLiu0306
- [autoparallel] Add pofo sequence annotation (#1637) by Boyuan Yao
- [autoparallel] add elementwise handler (#1622) by YuliangLiu0306
- [autoparallel] add embedding handler (#1620) by YuliangLiu0306
- [autoparallel] protect bcast handler from invalid strategies (#1631) by YuliangLiu0306
- [autoparallel] add layernorm handler (#1629) by YuliangLiu0306
- [autoparallel] recover the merged node strategy index (#1613) by YuliangLiu0306
- [autoparallel] added new linear module handler (#1616) by Frank Lee
- [autoparallel] added new node handler (#1612) by Frank Lee
- [autoparallel]add bcast matmul strategies (#1605) by YuliangLiu0306
- [autoparallel] refactored the data structure for sharding strategy (#1610) by Frank Lee
- [autoparallel] add bcast op handler (#1600) by YuliangLiu0306
- [autoparallel] added all non-bcast matmul strategies (#1603) by Frank Lee
- [autoparallel] added strategy generator and bmm strategies (#1602) by Frank Lee
- [autoparallel] add reshape handler (#1594) by YuliangLiu0306
- [autoparallel] refactored shape consistency to remove redundancy (#1591) by Frank Lee
- [autoparallel] add resnet autoparallel unit test and add backward weight communication cost (#1589) by YuliangLiu0306
- [autoparallel] added generate_sharding_spec to utils (#1590) by Frank Lee
- [autoparallel] added solver option dataclass (#1588) by Frank Lee
- [autoparallel] adapt solver with resnet (#1583) by YuliangLiu0306
Fx/meta/rpc
- [fx/meta/rpc] move _meta_registration.py to fx folder / register fx functions with compatibility checks / remove color debug (#1710) by Super Daniel
Embeddings
- [embeddings] add doc in readme (#1711) by Jiarui Fang
- [embeddings] more detailed timer (#1692) by Jiarui Fang
- [embeddings] cache option (#1635) by Jiarui Fang
- [embeddings] use cache_ratio instead of cuda_row_num (#1611) by Jiarui Fang
- [embeddings] add already_split_along_rank flag for tablewise mode (#1584) by CsRic
Unittest
- [unittest] added doc for the pytest wrapper (#1704) by Frank Lee
- [unittest] supported conditional testing based on env var (#1701) by Frank Lee
Embedding
- [embedding] rename FreqAwareEmbedding -> CachedEmbedding (#1699) by Jiarui Fang
- [embedding] polish async copy (#1657) by Jiarui Fang
- [embedding] add more detail profiling (#1656) by Jiarui Fang
- [embedding] print profiling results (#1654) by Jiarui Fang
- [embedding] non-blocking cpu-gpu copy (#1647) by Jiarui Fang
- [embedding] isolate cache_op from forward (#1645) by CsRic
- [embedding] rollback for better FAW performance (#1625) by Jiarui Fang
- [embedding] updates some default parameters by Jiarui Fang
Fx/profiler
- [fx/profiler] assigned UUID to each unrecorded tensor/ improved performance on GPT-2 (#1679) by Super Daniel
- [fx/profiler] provide a table of sum...
Version v0.1.10 Release Today!
What's Changed
Embedding
- [embedding] cache_embedding small improvement (#1564) by CsRic
- [embedding] polish parallel embedding tablewise (#1545) by Jiarui Fang
- [embedding] freq_aware_embedding: add small functions for caller application (#1537) by CsRic
- [embedding] fix a bug in table wise sharding (#1538) by Jiarui Fang
- [embedding] tablewise sharding polish (#1535) by Jiarui Fang
- [embedding] add tablewise sharding for FAW (#1526) by CsRic
Nfc
- [NFC] polish test component gpt code style (#1567) by アマデウス
- [NFC] polish doc style for ColoTensor (#1457) by Jiarui Fang
- [NFC] global vars should be upper case (#1456) by Jiarui Fang
Pipeline/tuning
- [pipeline/tuning] improve dispatch performance both time and space cost (#1544) by Kirigaya Kazuto
Fx
- [fx] provide a stable but not accurate enough version of profiler. (#1547) by Super Daniel
- [fx] Add common node in model linearize (#1542) by Boyuan Yao
- [fx] support meta tracing for aten level computation graphs like functorch. (#1536) by Super Daniel
- [fx] Modify solver linearize and add corresponding test (#1531) by Boyuan Yao
- [fx] add test for meta tensor. (#1527) by Super Daniel
- [fx]patch nn.functional convolution (#1528) by YuliangLiu0306
- [fx] Fix wrong index in annotation and minimal flops in ckpt solver (#1521) by Boyuan Yao
- [fx] hack torch_dispatch for meta tensor and autograd. (#1515) by Super Daniel
- [fx] Fix activation codegen dealing with checkpointing first op (#1510) by Boyuan Yao
- [fx] fix the discretize bug (#1506) by Boyuan Yao
- [fx] fix wrong variable name in solver rotor (#1502) by Boyuan Yao
- [fx] Add activation checkpoint solver rotor (#1496) by Boyuan Yao
- [fx] add more op patches for profiler and error message for unsupported ops. (#1495) by Super Daniel
- [fx] fixed adaptive pooling size concatenation error (#1489) by Frank Lee
- [fx] add profiler for fx nodes. (#1480) by Super Daniel
- [fx] Fix ckpt functions' definitions in forward (#1476) by Boyuan Yao
- [fx] fix MetaInfoProp for incorrect calculations and add detections for inplace op. (#1466) by Super Daniel
- [fx] add rules to linearize computation graphs for searching. (#1461) by Super Daniel
- [fx] Add use_reentrant=False to checkpoint in codegen (#1463) by Boyuan Yao
- [fx] fix test and algorithm bugs in activation checkpointing. (#1451) by Super Daniel
- [fx] Use colossalai checkpoint and add offload recognition in codegen (#1439) by Boyuan Yao
- [fx] fix the false interpretation of algorithm 3 in https://arxiv.org/abs/1604.06174. (#1446) by Super Daniel
Autoparallel
- [autoparallel]add backward cost info into strategies (#1524) by YuliangLiu0306
- [autoparallel] support function in operator handler (#1529) by YuliangLiu0306
- [autoparallel] change the merge node logic (#1533) by YuliangLiu0306
- [autoparallel] added liveness analysis (#1516) by Frank Lee
- [autoparallel] add more sharding strategies to conv (#1487) by YuliangLiu0306
- [autoparallel] add cost graph class (#1481) by YuliangLiu0306
- [autoparallel] added namespace constraints (#1490) by Frank Lee
- [autoparallel] integrate auto parallel with torch fx (#1479) by Frank Lee
- [autoparallel] added dot handler (#1475) by Frank Lee
- [autoparallel] introduced baseclass for op handler and reduced code redundancy (#1471) by Frank Lee
- [autoparallel] standardize the code structure (#1469) by Frank Lee
- [autoparallel] Add conv handler to generate strategies and costs info for conv (#1467) by YuliangLiu0306
Utils
- [utils] refactor parallel layers checkpoint and bcast model on loading checkpoint (#1548) by ver217
- [utils] optimize partition_tensor_parallel_state_dict (#1546) by ver217
- [utils] Add use_reetrant=False in utils.activation_checkpoint (#1460) by Boyuan Yao
- [utils] Impl clip_grad_norm for ColoTensor and ZeroOptimizer (#1442) by ver217
Hotfix
- [hotfix] change namespace for meta_trace. (#1541) by Super Daniel
- [hotfix] fix init context (#1543) by ver217
- [hotfix] avoid conflict of meta registry with torch 1.13.0. (#1530) by Super Daniel
- [hotfix] fix coloproxy typos. (#1519) by Super Daniel
Pipeline/pipeline_process_group
- [pipeline/pipeline_process_group] finish PipelineProcessGroup to manage local and global rank in TP, DP and PP (#1508) by Kirigaya Kazuto
Doc
- [doc] docstring for FreqAwareEmbeddingBag (#1525) by Jiarui Fang
- [doc] update readme with the new xTrimoMultimer project (#1477) by Sze-qq
- [doc] update docstring in ProcessGroup (#1468) by Jiarui Fang
- [Doc] add more doc for ColoTensor. (#1458) by Jiarui Fang
Autoparallel
- [autoparallel]add strategies constructor (#1505) by YuliangLiu0306
Faw
- [FAW] cpu caching operations (#1520) by Jiarui Fang
- [FAW] refactor reorder() for CachedParamMgr (#1514) by Jiarui Fang
- [FAW] LFU initialize with dataset freq (#1513) by Jiarui Fang
- [FAW] shrink freq_cnter size (#1509) by CsRic
- [FAW] remove code related to chunk (#1501) by Jiarui Fang
- [FAW] add more docs and fix a warning (#1500) by Jiarui Fang
- [FAW] FAW embedding use LRU as eviction strategy initialized with dataset stats (#1494) by CsRic
- [FAW] LFU cache for the FAW by CsRic
- [FAW] init an LFU implementation for FAW (#1488) by Jiarui Fang
- [FAW] reorganize the inheritance struct of FreqCacheEmbedding (#1448) by Geng Zhang
Pipeline/rpc
- [pipeline/rpc] update outstanding mechanism | optimize dispatching strategy (#1497) by Kirigaya Kazuto
- [pipeline/rpc] implement distributed optimizer | test with assert_close (#1486) by Kirigaya Kazuto
- [pipeline/rpc] support interleaving | fix checkpoint bug | change logic when dispatch data in work_list to ensure steady 1F1B (#1483) by Kirigaya Kazuto
- [pipeline/rpc] implement a demo for PP with cuda rpc framework (#1470) by Kirigaya Kazuto
Tensor
- [tensor]add 1D device mesh (#1492) by YuliangLiu0306
- [tensor] support runtime ShardingSpec apply (#1453) by YuliangLiu0306
- [tensor] shape consistency generate transform path and communication cost (#1435) by YuliangLiu0306
- [tensor] added linear implementation for the new sharding spec (#1416) by Frank Lee
Fce
- [FCE] update interface for frequency statistics in FreqCacheEmbedding (#1462) by Geng Zhang
Workflow
Test
Engin/schedule
- [engin/schedule] use p2p_v2 to ...
Version v0.1.9 Release Today!
What's Changed
Zero
- [zero] add chunk_managerV2 for all-gather chunk (#1441) by HELSON
- [zero] add chunk size searching algorithm for parameters in different groups (#1436) by HELSON
- [zero] add has_inf_or_nan in AgChunk; enhance the unit test of AgChunk (#1426) by HELSON
- [zero] add unit test for AgChunk's append, close, access (#1423) by HELSON
- [zero] add AgChunk (#1417) by HELSON
- [zero] ZeroDDP supports controlling outputs' dtype (#1399) by ver217
- [zero] alleviate memory usage in ZeRODDP state_dict (#1398) by HELSON
- [zero] chunk manager allows filtering ex-large params (#1393) by ver217
- [zero] zero optim state_dict takes only_rank_0 (#1384) by ver217
Fx
- [fx] add vanilla activation checkpoint search with test on resnet and densenet (#1433) by Super Daniel
- [fx] modify the calculation of node_size in MetaInfoProp for activation checkpointing usages (#1425) by Super Daniel
- [fx] fixed torchaudio conformer tracing (#1392) by Frank Lee
- [fx] patched torch.max and data movement operator (#1391) by Frank Lee
- [fx] fixed indentation error in checkpointing codegen (#1385) by Frank Lee
- [fx] patched torch.full for huggingface opt (#1386) by Frank Lee
- [fx] update split module pass and add customized policy (#1373) by YuliangLiu0306
- [fx] add torchaudio test (#1369) by Super Daniel
- [fx] Add colotracer compatibility test on torchrec (#1370) by Boyuan Yao
- [fx]add gpt2 passes for pipeline performance test (#1366) by YuliangLiu0306
- [fx] added activation checkpoint codegen support for torch < 1.12 (#1359) by Frank Lee
- [fx] added activation checkpoint codegen (#1355) by Frank Lee
- [fx] fixed apex normalization patch exception (#1352) by Frank Lee
- [fx] added activation checkpointing annotation (#1349) by Frank Lee
- [fx] update MetaInfoProp pass to process more complex node.meta (#1344) by YuliangLiu0306
- [fx] refactor tracer to trace complete graph (#1342) by YuliangLiu0306
- [fx] tested the complete workflow for auto-parallel (#1336) by Frank Lee
- [fx]refactor tracer (#1335) by YuliangLiu0306
- [fx] recovered skipped pipeline tests (#1338) by Frank Lee
- [fx] fixed compatibility issue with torch 1.10 (#1331) by Frank Lee
- [fx] fixed unit tests for torch 1.12 (#1327) by Frank Lee
- [fx] add balanced policy v2 (#1251) by YuliangLiu0306
- [fx] Add unit test and fix bugs for transform_mlp_pass (#1299) by XYE
- [fx] added apex normalization to patched modules (#1300) by Frank Lee
Recommendation System
- [FAW] export FAW in _ops (#1438) by Jiarui Fang
- [FAW] move coloparam setting in test code. (#1429) by Jiarui Fang
- [FAW] parallel FreqAwareEmbedding (#1424) by Jiarui Fang
- [FAW] add cache manager for the cached embedding (#1419) by Jiarui Fang
Global Tensor
- [tensor] add shape consistency feature to support auto spec transform (#1418) by YuliangLiu0306
- [tensor]build sharding spec to replace distspec in future. (#1405) by YuliangLiu0306
Hotfix
- [hotfix] zero optim prevents calling inner optim.zero_grad (#1422) by ver217
- [hotfix] fix CPUAdam kernel nullptr (#1410) by ver217
- [hotfix] adapt ProcessGroup and Optimizer to ColoTensor (#1388) by HELSON
- [hotfix] fix a running error in test_colo_checkpoint.py (#1387) by HELSON
- [hotfix] fix some bugs during gpt2 testing (#1379) by YuliangLiu0306
- [hotfix] fix zero optim save/load state dict (#1381) by ver217
- [hotfix] fix zero ddp buffer cast (#1376) by ver217
- [hotfix] fix no optimizer in save/load (#1363) by HELSON
- [hotfix] fix megatron_init in test_gpt2.py (#1357) by HELSON
- [hotfix] ZeroDDP use new process group (#1333) by ver217
- [hotfix] shared model returns cpu state_dict (#1328) by ver217
- [hotfix] fix ddp for unit test test_gpt2 (#1326) by HELSON
- [hotfix] fix unit test test_module_spec (#1321) by HELSON
- [hotfix] fix PipelineSharedModuleGradientHandler (#1314) by ver217
- [hotfix] fix ColoTensor GPT2 unitest (#1309) by HELSON
- [hotfix] add missing file (#1308) by Jiarui Fang
- [hotfix] remove potiential circle import (#1307) by Jiarui Fang
- [hotfix] skip some unittest due to CI environment. (#1301) by YuliangLiu0306
- [hotfix] fix shape error in backward when using ColoTensor (#1298) by HELSON
- [hotfix] Dist Mgr gather torch version (#1284) by Jiarui Fang
Communication
- [communication] add p2p_v2.py to support communication with List[Any] (#1407) by Kirigaya Kazuto
Device
- [device] add DeviceMesh class to support logical device layout (#1394) by YuliangLiu0306
Chunk
- [chunk] add PG check for tensor appending (#1383) by Jiarui Fang
DDP
Checkpoint
- [checkpoint] add kwargs for load_state_dict (#1374) by HELSON
- [checkpoint] use args, kwargs in save_checkpoint, load_checkpoint (#1368) by HELSON
- [checkpoint] sharded optim save/load grad scaler (#1350) by ver217
- [checkpoint] use gather_tensor in checkpoint and update its unit test (#1339) by HELSON
- [checkpoint] add ColoOptimizer checkpointing (#1316) by Jiarui Fang
- [checkpoint] add test for bert and hotfix save bugs (#1297) by Jiarui Fang
Util
Nvme
Colotensor
- [colotensor] use cpu memory to store state_dict (#1367) by HELSON
- [colotensor] add Tensor.view op and its unit test (#1343) by HELSON
Unit test
Docker
Doc
Refactor
Workflow
- [workflow] update docker build workflow to use proxy (#1334) by Frank Lee
- [workflow] update 8-gpu test to use torch 1.11 (#1332) by Frank Lee
- [workflow] roll back to use torch 1.11 for unit testing (#1325) by Frank Lee
- [workflow] fixed trigger condition for 8-gpu unit test (#1323) by Frank Lee
- [workflow] updated release bdist workflow (#1318) by Frank Lee
- [workflow] disable SHM for compatibility CI on rtx3080 (#1315) by Frank Lee
- [workflow] updated pytorch compatibility test (#1311) by Frank Lee
Test
- [test] removed outdated unit test for meta context (#1329) by [Frank Lee](https://api.github.com/users/Fra...
Version v0.1.8 Release Today!
What's Changed
Hotfix
- [hotfix] torchvision fx unittests miss import pytest (#1277) by Jiarui Fang
- [hotfix] fix an assertion bug in base schedule. (#1250) by YuliangLiu0306
- [hotfix] fix sharded optim step and clip_grad_norm (#1226) by ver217
- [hotfix] fx get comm size bugs (#1233) by Jiarui Fang
- [hotfix] fx shard 1d pass bug fixing (#1220) by Jiarui Fang
- [hotfix]fixed p2p process send stuck (#1181) by YuliangLiu0306
- [hotfix]different overflow status lead to communication stuck. (#1175) by YuliangLiu0306
- [hotfix]fix some bugs caused by refactored schedule. (#1148) by YuliangLiu0306
Tensor
- [tensor] distributed checkpointing for parameters (#1240) by Jiarui Fang
- [tensor] redistribute among different process groups (#1247) by Jiarui Fang
- [tensor] a shorter shard and replicate spec (#1245) by Jiarui Fang
- [tensor] redirect .data.get to a tensor instance (#1239) by HELSON
- [tensor] add zero_like colo op, important for Optimizer (#1236) by Jiarui Fang
- [tensor] fix some unittests (#1234) by Jiarui Fang
- [tensor] fix an assertion in colo_tensor cross_entropy (#1232) by HELSON
- [tensor] add unitest for colo_tensor 1DTP cross_entropy (#1230) by HELSON
- [tensor] torch function return colotensor (#1229) by Jiarui Fang
- [tensor] improve robustness of class 'ProcessGroup' (#1223) by HELSON
- [tensor] sharded global process group (#1219) by Jiarui Fang
- [Tensor] add cpu group to ddp (#1200) by Jiarui Fang
- [tensor] remove gpc in tensor tests (#1186) by Jiarui Fang
- [tensor] revert local view back (#1178) by Jiarui Fang
- [Tensor] rename some APIs in TensorSpec and Polish view unittest (#1176) by Jiarui Fang
- [Tensor] rename parallel_action (#1174) by Ziyue Jiang
- [Tensor] distributed view supports inter-process hybrid parallel (#1169) by Jiarui Fang
- [Tensor] remove ParallelAction, use ComputeSpec instead (#1166) by Jiarui Fang
- [tensor] add embedding bag op (#1156) by ver217
- [tensor] add more element-wise ops (#1155) by ver217
- [tensor] fixed non-serializable colo parameter during model checkpointing (#1153) by Frank Lee
- [tensor] dist spec s2s uses all-to-all (#1136) by ver217
- [tensor] added repr to spec (#1147) by Frank Lee
Fx
- [fx] added ndim property to proxy (#1253) by Frank Lee
- [fx] fixed tracing with apex-based T5 model (#1252) by Frank Lee
- [fx] refactored the file structure of patched function and module (#1238) by Frank Lee
- [fx] methods to get fx graph property. (#1246) by YuliangLiu0306
- [fx]add split module pass and unit test from pipeline passes (#1242) by YuliangLiu0306
- [fx] fixed huggingface OPT and T5 results misalignment (#1227) by Frank Lee
- [fx]get communication size between partitions (#1224) by YuliangLiu0306
- [fx] added patches for tracing swin transformer (#1228) by Frank Lee
- [fx] fixed timm tracing result misalignment (#1225) by Frank Lee
- [fx] added timm model tracing testing (#1221) by Frank Lee
- [fx] added torchvision model tracing testing (#1216) by Frank Lee
- [fx] temporarily used (#1215) by XYE
- [fx] added testing for all albert variants (#1211) by Frank Lee
- [fx] added testing for all gpt variants (#1210) by Frank Lee
- [fx]add uniform policy (#1208) by YuliangLiu0306
- [fx] added testing for all bert variants (#1207) by Frank Lee
- [fx] supported model tracing for huggingface bert (#1201) by Frank Lee
- [fx] added module patch for pooling layers (#1197) by Frank Lee
- [fx] patched conv and normalization (#1188) by Frank Lee
- [fx] supported data-dependent control flow in model tracing (#1185) by Frank Lee
Rename
- [rename] convert_to_dist -> redistribute (#1243) by Jiarui Fang
Checkpoint
- [checkpoint] save sharded optimizer states (#1237) by Jiarui Fang
- [checkpoint]support generalized scheduler (#1222) by Yi Zhao
- [checkpoint] make unitest faster (#1217) by Jiarui Fang
- [checkpoint] checkpoint for ColoTensor Model (#1196) by Jiarui Fang
Polish
Refactor
- [refactor] move process group from _DistSpec to ColoTensor. (#1203) by Jiarui Fang
- [refactor] remove gpc dependency in colotensor's _ops (#1189) by Jiarui Fang
- [refactor] move chunk and chunkmgr to directory gemini (#1182) by Jiarui Fang
Context
- [context]support arbitrary module materialization. (#1193) by YuliangLiu0306
- [context]use meta tensor to init model lazily. (#1187) by YuliangLiu0306
Ddp
- [ddp] ColoDDP uses bucket all-reduce (#1177) by ver217
- [ddp] refactor ColoDDP and ZeroDDP (#1146) by ver217
Colotensor
- [ColoTensor] add independent process group (#1179) by Jiarui Fang
- [ColoTensor] rename APIs and add output_replicate to ComputeSpec (#1168) by Jiarui Fang
- [ColoTensor] improves init functions. (#1150) by Jiarui Fang
Zero
- [zero] sharded optim supports loading local state dict (#1170) by ver217
- [zero] zero optim supports loading local state dict (#1171) by ver217
Workflow
- [workflow] polish readme and dockerfile (#1165) by Frank Lee
- [workflow] auto-publish docker image upon release (#1164) by Frank Lee
- [workflow] fixed release post workflow (#1154) by Frank Lee
- [workflow] fixed format error in yaml file (#1145) by Frank Lee
- [workflow] added workflow to auto draft the release post (#1144) by Frank Lee
Gemini
Pipeline
- [pipeline]add customized policy (#1139) by YuliangLiu0306
- [pipeline]support more flexible pipeline (#1138) by YuliangLiu0306
Ci
Full Changelog: v0.1.8...v0.1.7
Version v0.1.7 Released Today
Highlights
- Started torch.fx-based tracing for auto-parallel training (see the sketch below)
- Updated the ZeRO mechanism to work with ColoTensor
- Fixed various bugs
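The auto-parallel work listed below builds on torch.fx graph tracing. As background, here is a minimal sketch of plain torch.fx tracing on a hypothetical two-layer model; it uses the stock PyTorch tracer rather than ColossalAI's own tracer, so treat it only as an illustration of the graph representation these passes operate on, not as the project's API.

```python
import torch
import torch.fx

# Hypothetical toy model; any symbolically traceable nn.Module works the same way.
class TwoLayerMLP(torch.nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.fc1 = torch.nn.Linear(dim, dim)
        self.fc2 = torch.nn.Linear(dim, dim)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# symbolic_trace records the forward pass as a graph of call_module /
# call_function nodes; auto-parallel passes operate on graphs of this kind.
gm = torch.fx.symbolic_trace(TwoLayerMLP())
print(gm.graph)   # inspect the traced computation graph
print(gm.code)    # the Python code regenerated from the graph
```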
What's Changed
Hotfix
- [hotfix] prevent nested ZeRO (#1140) by ver217
- [hotfix]fix bugs caused by refactored pipeline (#1133) by YuliangLiu0306
- [hotfix] fix param op hook (#1131) by ver217
- [hotfix] fix zero init ctx numel (#1128) by ver217
- [hotfix]change to fit latest p2p (#1100) by YuliangLiu0306
- [hotfix] fix chunk comm src rank (#1072) by ver217
Zero
- [zero] avoid zero hook spam by changing log to debug level (#1137) by Frank Lee
- [zero] added error message to handle on-the-fly import of torch Module class (#1135) by Frank Lee
- [zero] fixed api consistency (#1098) by Frank Lee
- [zero] zero optim copy chunk rather than copy tensor (#1070) by ver217
Optim
Ddp
- [ddp] add save/load state dict for ColoDDP (#1127) by ver217
- [ddp] add set_params_to_ignore for ColoDDP (#1122) by ver217
- [ddp] supported customized torch ddp configuration (#1123) by Frank Lee
Pipeline
- [pipeline]support List of Dict data (#1125) by YuliangLiu0306
- [pipeline] supported more flexible dataflow control for pipeline parallel training (#1108) by Frank Lee
- [pipeline] refactor the pipeline module (#1087) by Frank Lee
Fx
- [fx]add autoparallel passes (#1121) by YuliangLiu0306
- [fx] added unit test for coloproxy (#1119) by Frank Lee
- [fx] added coloproxy (#1115) by Frank Lee
Gemini
- [gemini] gemini mgr supports "cpu" placement policy (#1118) by ver217
- [gemini] zero supports gemini (#1093) by ver217
Test
- [test] fixed hybrid parallel test case on 8 GPUs (#1106) by Frank Lee
- [test] skip tests when not enough GPUs are detected (#1090) by Frank Lee
- [test] ignore 8 gpu test (#1080) by Frank Lee
Release
Tensor
- [tensor] refactor param op hook (#1097) by ver217
- [tensor] refactor chunk mgr and impl MemStatsCollectorV2 (#1077) by ver217
- [Tensor] fix equal assert (#1091) by Ziyue Jiang
- [Tensor] 1d row embedding (#1075) by Ziyue Jiang
- [tensor] chunk manager monitor mem usage (#1076) by ver217
- [Tensor] fix optimizer for CPU parallel (#1069) by Ziyue Jiang
- [Tensor] add hybrid device demo and fix bugs (#1059) by Ziyue Jiang
Amp
Workflow
- [workflow] fixed 8-gpu test workflow (#1101) by Frank Lee
- [workflow] added regular 8 GPU testing (#1099) by Frank Lee
- [workflow] disable p2p via shared memory on non-nvlink machine (#1086) by Frank Lee
Engine
Doc
Context
- [context] support lazy init of module (#1088) by Frank Lee
- [context] maintain the context object in with statement (#1073) by Frank Lee
Refactory
- [refactory] add nn.parallel module (#1068) by Jiarui Fang
Cudnn
Full Changelog: v0.1.7...v0.1.6
v0.1.6 Released!
Main Features
- ColoTensor supports hybrid parallel (tensor parallel and data parallel)
- ColoTensor supports ZeRO (with chunk)
- Configure tensor parallelism per module via ColoTensor
- ZeroInitContext and ShardedModelV2 support loading checkpoints and Hugging Face from_pretrained() (see the sketch below)
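As a rough illustration of the checkpoint / from_pretrained() support, the sketch below follows the pattern used in ZeRO examples of this era. The import paths, the ZeroInitContext signature, and the choice of TensorShardStrategy are assumptions based on those examples and may differ between 0.1.x versions; the BERT checkpoint name is only a placeholder.

```python
import torch
import colossalai
from colossalai.zero.init_ctx import ZeroInitContext          # assumed path for 0.1.x
from colossalai.zero.shard_utils import TensorShardStrategy   # assumed path for 0.1.x
from transformers import BertForMaskedLM

# Assumes the script is started by a torch.distributed launcher
# (torchrun or the ColossalAI CLI) that sets rank/world-size env vars.
colossalai.launch_from_torch(config={})

# Parameters created inside the context are sharded on the fly, so a
# Hugging Face checkpoint can be loaded without first materializing the
# whole fp32 model on a single device.
with ZeroInitContext(target_device=torch.device('cuda', torch.cuda.current_device()),
                     shard_strategy=TensorShardStrategy(),
                     shard_param=True):
    model = BertForMaskedLM.from_pretrained('bert-base-uncased')
```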
What's Changed
ColoTensor
- [tensor] refactor colo-tensor by @ver217 in #992
- [tensor] refactor parallel action by @ver217 in #1007
- [tensor] impl ColoDDP for ColoTensor by @ver217 in #1009
- [Tensor] add module handler for linear by @Wesley-Jzy in #1021
- [Tensor] add module check and bert test by @Wesley-Jzy in #1031
- [Tensor] add Parameter inheritance for ColoParameter by @Wesley-Jzy in #1041
- [tensor] ColoTensor supports ZeRo by @ver217 in #1015
- [zero] add chunk size search for chunk manager by @ver217 in #1052
Zero
- [zero] add load_state_dict for sharded model by @ver217 in #894
- [zero] add zero optimizer for ColoTensor by @ver217 in #1046
Hotfix
- [hotfix] fix colo init context by @ver217 in #1026
- [hotfix] fix some bugs caused by size mismatch. by @YuliangLiu0306 in #1011
- [kernel] fixed the include bug in dropout kernel by @FrankLeeeee in #999
- fix typo in constants by @ryanrussell in #1027
- [engine] fixed bug in gradient accumulation dataloader to keep the last step by @FrankLeeeee in #1030
- [hotfix] fix dist spec mgr by @ver217 in #1045
- [hotfix] fix import error in sharded model v2 by @ver217 in #1053
Unit test
CI
- [ci] update the docker image name by @FrankLeeeee in #1017
- [ci] added nightly build (#1018) by @FrankLeeeee in #1019
- [ci] fixed nightly build workflow by @FrankLeeeee in #1022
- [ci] fixed nightly build workflow by @FrankLeeeee in #1029
- [ci] fixed nightly build workflow by @FrankLeeeee in #1040
CLI
- [cli] remove unused imports by @FrankLeeeee in #1001
Documentation
- Hotfix/format by @binmakeswell in #987
- [doc] update docker instruction by @FrankLeeeee in #1020
Misc
- [NFC] Hotfix/format by @binmakeswell in #984
- Revert "[NFC] Hotfix/format" by @ver217 in #986
- remove useless import in tensor dir by @feifeibear in #997
- [NFC] fix download link by @binmakeswell in #998
- [Bot] Synchronize Submodule References by @github-actions in #1003
- [NFC] polish colossalai/kernel/cuda_native/csrc/colossal_C_frontend.c… by @zhengzangw in #1010
- [NFC] fix paper link by @binmakeswell in #1012
- [p2p]add object list send/recv by @YuliangLiu0306 in #1024
- [Bot] Synchronize Submodule References by @github-actions in #1034
- [NFC] add inference by @binmakeswell in #1044
- [titans]remove model zoo by @YuliangLiu0306 in #1042
- [NFC] add inference submodule in path by @binmakeswell in #1047
- [release] update version.txt by @FrankLeeeee in #1048
- [Bot] Synchronize Submodule References by @github-actions in #1049
- updated collective ops api by @kurisusnowdeng in #1054
- [pipeline]refactor ppschedule to support tensor list by @YuliangLiu0306 in #1050
New Contributors
- @ryanrussell made their first contribution in #1027
Full Changelog: v0.1.5...v0.1.6
v0.1.5 Released!
Main Features
- Enhanced ColoTensor and built a demo that trains BERT (from Hugging Face) with Tensor Parallelism without modifying the model.
What's Changed
ColoTensor
- [Tensor] add ColoTensor TP1Dcol Embedding by @Wesley-Jzy in #899
- [Tensor] add embedding tp1d row by @Wesley-Jzy in #904
- [Tensor] update pytest.mark.parametrize in tensor tests by @Wesley-Jzy in #913
- [Tensor] init ColoParameter by @feifeibear in #914
- [Tensor] add a basic bert. by @Wesley-Jzy in #911
- [Tensor] polish model test by @feifeibear in #915
- [Tensor] fix test_model by @Wesley-Jzy in #916
- [Tensor] add 1d vocab loss by @Wesley-Jzy in #918
- [Graph] building computing graph with ColoTensor, Linear only by @feifeibear in #917
- [Tensor] add from_pretrained support and bert pretrained test by @Wesley-Jzy in #921
- [Tensor] test pretrain loading on multi-process by @feifeibear in #922
- [tensor] hijack addmm for colo tensor by @ver217 in #923
- [tensor] colo tensor overrides mul by @ver217 in #927
- [Tensor] simplify named param by @Wesley-Jzy in #928
- [Tensor] fix init context by @Wesley-Jzy in #931
- [Tensor] add optimizer to bert test by @Wesley-Jzy in #933
- [tensor] design DistSpec and DistSpecManager for ColoTensor by @ver217 in #934
- [Tensor] add DistSpec for loss and test_model by @Wesley-Jzy in #947
- [tensor] derive compute pattern from dist spec by @ver217 in #971
Pipeline Parallelism
- [pipelinable]use pipelinable to support GPT model. by @YuliangLiu0306 in #903
CI
- [CI] add CI for releasing bdist wheel by @ver217 in #901
- [CI] fix release bdist CI by @ver217 in #902
- [ci] added wheel build scripts by @FrankLeeeee in #910
Misc
- [Bot] Synchronize Submodule References by @github-actions in #907
- [Bot] Synchronize Submodule References by @github-actions in #912
- [setup] update cuda ext cc flags by @ver217 in #919
- [setup] support more cuda architectures by @ver217 in #920
- [NFC] update results on a single GPU, highlight quick view by @binmakeswell in #981
Full Changelog: v0.1.4...v0.1.5
v0.1.4 Released!
Main Features
Here are the main improvements of this release:
- ColoTensor: a data structure that unifies the tensor representation of different parallel methods.
- Gemini: a more efficient Gemini implementation that reduces the overhead of collecting model data statistics.
- CLI: a command-line tool that helps users launch distributed training tasks more easily (see the launch sketch below).
- Pipeline Parallelism (PP): a more user-friendly API for PP.
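To make the CLI/launcher item concrete, here is a minimal launch sketch based on the common launch_from_torch pattern; the empty config dict is a placeholder for a real config that would declare parallel settings, and the exact CLI invocation used to start such a script is left out since its flags varied across versions.

```python
import colossalai
import torch.distributed as dist

def main():
    # launch_from_torch reads the rank / world-size environment variables set by
    # the launcher process (e.g. torchrun or the ColossalAI CLI) and initializes
    # the global context. The empty dict stands in for a real config file that
    # would declare tensor / pipeline parallel sizes, AMP settings, etc.
    colossalai.launch_from_torch(config={})
    print(f'initialized rank {dist.get_rank()} of world size {dist.get_world_size()}')

if __name__ == '__main__':
    main()
```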
What's Changed
ColoTensor
- [tensor]fix colo_tensor torch_function by @Wesley-Jzy in #825
- [tensor]fix test_linear by @Wesley-Jzy in #826
- [tensor] ZeRO use ColoTensor as the base class. by @feifeibear in #828
- [tensor] revert zero tensors back by @feifeibear in #829
- [Tensor] overriding parameters() for Module using ColoTensor by @feifeibear in #889
- [tensor] refine linear and add gather for layernorm by @Wesley-Jzy in #893
- [Tensor] test parameters() as member function by @feifeibear in #896
- [Tensor] activation is an attr of ColoTensor by @feifeibear in #897
- [Tensor] initialize the ColoOptimizer by @feifeibear in #898
- [tensor] reorganize files by @feifeibear in #820
- [Tensor] apply ColoTensor on Torch functions by @feifeibear in #821
- [Tensor] update ColoTensor torch_function by @feifeibear in #822
- [tensor] lazy init by @feifeibear in #823
- [WIP] Applying ColoTensor on TP-1D-row Linear. by @feifeibear in #831
- Init Context supports lazy allocation of model memory by @feifeibear in #842
- [Tensor] TP Linear 1D row by @Wesley-Jzy in #843
- [Tensor] add assert for colo_tensor 1Drow by @Wesley-Jzy in #846
- [Tensor] init a simple network training with ColoTensor by @feifeibear in #849
- [Tensor] Add 1Drow weight reshard by spec by @Wesley-Jzy in #854
- [Tensor] add layer norm Op by @feifeibear in #852
- [tensor] an initial idea of tensor spec by @feifeibear in #865
- [Tensor] colo init context add device attr. by @feifeibear in #866
- [tensor] add cross_entropy_loss by @feifeibear in #868
- [Tensor] Add function to spec and update linear 1Drow and unit tests by @Wesley-Jzy in #869
- [tensor] customized op returns ColoTensor by @feifeibear in #875
- [Tensor] get named parameters for model using ColoTensors by @feifeibear in #874
- [Tensor] Add some attributes to ColoTensor by @feifeibear in #877
- [Tensor] make a simple net works with 1D row TP by @feifeibear in #879
- [tensor] wrap function in the torch_tensor to ColoTensor by @Wesley-Jzy in #881
- [Tensor] make ColoTensor more robust for getattr by @feifeibear in #886
- [Tensor] test model check results for a simple net by @feifeibear in #887
- [tensor] add ColoTensor 1Dcol by @Wesley-Jzy in #888
Gemini + ZeRO
- [zero] add zero tensor shard strategy by @1SAA in #793
- Revert "[zero] add zero tensor shard strategy" by @feifeibear in #806
- [gemini] a new tensor structure by @feifeibear in #818
- [gemini] APIs to set cpu memory capacity by @feifeibear in #809
- [DO NOT MERGE] [zero] init fp16 params directly in ZeroInitContext by @ver217 in #808
- [gemini] collect cpu-gpu moving volume in each iteration by @feifeibear in #813
- [gemini] add GeminiMemoryManger by @1SAA in #832
- [zero] use GeminiMemoryManager when sampling model data by @ver217 in #850
- [gemini] polish code by @1SAA in #855
- [gemini] add stateful tensor container by @1SAA in #867
- [gemini] polish stateful_tensor_mgr by @1SAA in #876
- [gemini] accelerate adjust_layout() by @ver217 in #878
CLI
- [cli] added distributed launcher command by @YuliangLiu0306 in #791
- [cli] added micro benchmarking for tp by @YuliangLiu0306 in #789
- [cli] add missing requirement by @FrankLeeeee in #805
- [cli] fixed a bug in user args and refactored the module structure by @FrankLeeeee in #807
- [cli] fixed single-node process launching by @FrankLeeeee in #812
- [cli] added check installation cli by @FrankLeeeee in #815
- [CLI] refactored the launch CLI and fixed bugs in multi-node launching by @FrankLeeeee in #844
- [cli] refactored micro-benchmarking cli and added more metrics by @FrankLeeeee in #858
Pipeline Parallelism
- [pipelinable]use pipelinable context to initialize non-pipeline model by @YuliangLiu0306 in #816
- [pipelinable]use ColoTensor to replace dummy tensor. by @YuliangLiu0306 in #853
Misc
- [hotfix] fix auto tensor placement policy by @ver217 in #775
- [hotfix] change the check assert in split batch 2d by @Wesley-Jzy in #772
- [hotfix] fix bugs in zero by @1SAA in #781
- [hotfix] fix grad offload when enabling reuse_fp16_shard by @ver217 in #784
- [refactor] moving memtracer to gemini by @feifeibear in #801
- [log] display tflops if available by @feifeibear in #802
- [refactor] moving grad acc logic to engine by @feifeibear in #804
- [log] local throughput metrics by @feifeibear in #811
- [Bot] Synchronize Submodule References by @github-actions in #810
- [Bot] Synchronize Submodule References by @github-actions in #819
- [refactor] moving InsertPostInitMethodToModuleSubClasses to utils. by @feifeibear in #824
- [setup] allow installation with python 3.6 by @FrankLeeeee in #834
- Revert "[WIP] Applying ColoTensor on TP-1D-row Linear." by @feifeibear in #835
- [dependency] removed torchvision by @FrankLeeeee in #833
- [Bot] Synchronize Submodule References by @github-actions in #827
- [unittest] refactored unit tests for change in dependency by @FrankLeeeee in #838
- [setup] use env var instead of option for cuda ext by @FrankLeeeee in #839
- [hotfix] ColoTensor pin_memory by @feifeibear in #840
- modified the pp build for ckpt adaptation by @Gy-Lu in #803
- [hotfix] the bug of numel() in ColoTensor by @feifeibear in #845
- [hotfix] fix _post_init_method of zero init ctx by @ver217 in #847
- [hotfix] add deconstructor for stateful tensor by @ver217 in #848
- [utils] refactor profiler by @ver217 in #837
- [ci] cache cuda extension by @FrankLeeeee in #860
- hotfix tensor unittest bugs by @feifeibear in #862
- [usability] added assertion message in registry by @FrankLeeeee in #864
- [doc] improved docstring in the communication module by @FrankLeeeee in #863
- [doc] improved docstring in the logging module by @FrankLeeeee in #861
- [doc] improved docstring in the amp module by @FrankLeeeee in #857
- [usability] improved error messages in the context modu...