AIGC's deployment barrier plummets: hardware cost cut to as little as 1/46, and one line of code enables automatic parallelization.

Mingmin, reporting from Aofeisi. QbitAI | WeChat official account QbitAI
From AI painting to large NLP models, the deployment cost of AIGC has just been slashed across the board!
Without further ado, here are the results:
- Stable Diffusion 2.0 training/fine-tuning/inference: up to 5.6x less memory consumption, hardware cost cut to as little as 1/46, enabled with a single line of code;
- BLOOM, the 175-billion-parameter large model: single-node inference with 4x less memory consumption, hardware cost cut to one-tenth;
- One line of code for automatic search of the best parallel strategy, greatly lowering the barrier to distributed training, with native support for popular AI model libraries such as Hugging Face and Timm.
On the other side of the AIGC boom, high costs plague the entire industry.
Last week, StockAI, one of the earliest AI painting platforms, was forced to announce its shutdown. The reason, according to its founder, was simple:
The company's operating costs are too high, and current revenue cannot sustain them.
Even ChatGPT, backed by OpenAI and Microsoft, announced limits on users' daily usage just a few weeks after the platform opened.
The implication boils down to four words: can't afford the burn.
In short, reducing the deployment cost of large AI models is an urgent problem for the industry.
Meanwhile, the open-source large-model solution Colossal-AI has rapidly gained popularity over the past year, collecting 7k+ stars on GitHub.
The cost reductions listed above all come from this project.
How exactly is this achieved? Read on.
Open-source address: https://github.com/hpcaitech/ColossalAI
Low-cost training/fine-tuning/inference for Stable Diffusion 2.0
Compared with version 1.0, Stable Diffusion 2.0 not only raises the resolution of generated images, but also adds the Depth2img model, a text-guided inpainting model, and other features.
This wave of new features caught users pleasantly off guard.
(After all, many of us haven't fully digested 1.0 yet.)
But the old problem remains: deploying an AIGC model is expensive.
Take Stable Diffusion as an example: Stability AI, the company behind it, maintains a cluster of more than 4,000 NVIDIA A100 GPUs and has spent over 50 million dollars on operating costs.
With models, algorithms, and downstream tasks iterating so quickly, reducing the cost of applying them has become the core issue in making AIGC truly practical.
Stable Diffusion 2.0 is built on the easy-to-use PyTorch Lightning framework.
As PyTorch Lightning's official large-model solution, Colossal-AI was quick to follow up.
The specifics are as follows:
- Memory consumption reduced by up to 5.6x, hardware cost cut to as little as 1/46;
- Fast, personalized DreamBooth fine-tuning on a single GPU;
- Inference memory consumption reduced by 2.5x.
The solution will also be merged into Hugging Face in the near future, making it even easier to use.
Training
Using a larger batch size is a widely adopted way to speed up training and reduce training cost. However, the limited memory capacity of GPUs severely constrains the batch size and pushes up the hardware requirements for training.
Through a series of GPU memory optimizations, Colossal-AI cuts the per-GPU memory needed to train Stable Diffusion with a large batch size of 16 from 64.5 GB to 11.6 GB, a 5.6x saving, and the setup extends to single-GPU or multi-GPU parallel training.
Instead of a state-of-the-art A100 80GB, consumer graphics cards such as the RTX 3060 are enough, cutting hardware cost to as low as 1/46.
As a result, more users can research and apply Stable Diffusion at low cost on consumer GPUs.
GPU memory optimization
Flash Attention
As early as Stable Diffusion 1.0, Colossal-AI was among the first to introduce Flash Attention, boosting attention speed by 104% and reducing peak end-to-end training memory by 23%.
Flash Attention is an accelerated implementation of attention for long sequences. It uses tiling to reduce the number of memory reads/writes between the GPU's high-bandwidth memory (HBM) and on-chip SRAM, and it also provides a block-sparse approximate attention algorithm that is faster than existing approximate attention methods.
In Stable Diffusion 1.0, the Diffusion Model contains only a small number of attention layers, so Flash Attention had not yet shown its full performance advantage.
In Stable Diffusion 2.0, a large number of convolution layers are replaced with attention layers, which further unlocks Flash Attention's memory-optimization potential.
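For intuition, here is a didactic sketch of the idea behind Flash Attention's memory savings (not the actual fused GPU kernel): attention is computed tile by tile with an online softmax, so the full attention matrix is never materialized.

import torch

def blockwise_attention(q, k, v, block_size=128):
    # Process key/value tiles one at a time with an online softmax, so the full
    # (n x n) attention matrix is never held in memory.
    n, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)
    for start in range(0, k.shape[0], block_size):
        k_blk, v_blk = k[start:start + block_size], v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                       # one tile of the score matrix
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)            # rescale earlier accumulators
        p = torch.exp(scores - new_max)
        out = out * correction + p @ v_blk
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(1024, 64) for _ in range(3))
reference = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), reference, atol=1e-4)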
ZeRO + Gemini
Colossal-AI supports the ZeRO redundancy optimizer to eliminate memory redundancy. Compared with the classic data-parallel strategy, it greatly improves memory efficiency without sacrificing computational granularity or communication efficiency.
In addition, Colossal-AI introduces a Chunk mechanism that further improves ZeRO's performance.
A set of parameters that are consecutive in execution order is stored in a Chunk (a contiguous block of memory), and all Chunks have the same size.
Organizing memory in Chunks ensures efficient use of PCI-e and GPU-GPU network bandwidth, reduces the number of communication operations, and avoids potential memory fragmentation.
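A rough sketch of the Chunk idea (a hypothetical helper, not Colossal-AI's actual Chunk manager): parameters are packed, in execution order, into equally sized contiguous buffers, so offloading and communication operate on a few large tensors instead of many small ones.

import torch

def build_chunks(params, chunk_numel=1_000_000):
    # Pack parameters in execution order into fixed-size flat buffers ("chunks").
    # Assumes each individual parameter fits inside one chunk.
    chunks, current, used = [], torch.empty(chunk_numel), 0
    for p in params:
        n = p.numel()
        if used + n > chunk_numel:               # current chunk is full, open a new one
            chunks.append(current)
            current, used = torch.empty(chunk_numel), 0
        current[used:used + n].copy_(p.detach().flatten())
        p.data = current[used:used + n].view(p.shape)   # the parameter now lives inside the chunk
        used += n
    chunks.append(current)
    return chunks

model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.Linear(256, 256))
chunks = build_chunks(model.parameters(), chunk_numel=70_000)
print(len(chunks))   # parameters packed into equally sized chunks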
Meanwhile, Colossal-AI's heterogeneous memory manager Gemini supports offloading optimizer states from GPU to CPU to save GPU memory.
GPU memory and CPU memory (CPU DRAM or NVMe SSD) can be used together, breaking through the single-GPU memory wall and further scaling up the size of trainable models.
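The offloading idea itself can be sketched in plain PyTorch (a simplified illustration of the concept, not Gemini's actual policy): parameters stay on the GPU for compute, while the optimizer and its state buffers live in CPU memory.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)

# CPU copies of the parameters; Adam's momentum/variance buffers will live on CPU too.
cpu_params = [p.detach().to("cpu").clone().requires_grad_() for p in model.parameters()]
optimizer = torch.optim.Adam(cpu_params, lr=1e-3)

x = torch.randn(32, 1024, device=device)
model(x).pow(2).mean().backward()              # forward/backward on the GPU

for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
    cpu_p.grad = gpu_p.grad.to("cpu")          # ship gradients to CPU
optimizer.step()                               # the update runs entirely in CPU memory
for gpu_p, cpu_p in zip(model.parameters(), cpu_params):
    gpu_p.data.copy_(cpu_p.to(device))         # copy the updated weights back to the GPU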
One line of code to get started quickly
Thanks to the official partnership with PyTorch Lightning, the Colossal-AI memory optimizations described above can be enabled with just one line of code.
from lightning.pytorch import Trainer, LightningModule
from lightning.pytorch.strategies import ColossalAIStrategy

Mystrategy = ColossalAIStrategy(use_chunk=True, enable_distributed_storage=True, placement_policy="auto")
trainer = Trainer(accelerator="gpu", devices=4, precision=16, strategy=Mystrategy)
trainer.fit(model)
DreamBooth fine-tuning
Alongside the Stable Diffusion 2.0 acceleration scheme, Colossal-AI also released a fine-tuning scheme for the DreamBooth model.
DreamBooth is the model Google released in August this year: with only 3-5 images of a subject plus a text prompt, it can transfer the specified subject to other scenes or styles.
Its biggest difference from DALL-E 2, Imagen, and the like is that DreamBooth can faithfully reproduce the chosen subject.
With this scheme, users only need to run train_dreambooth_colossalai.py directly to take full advantage of Colossal-AI's memory optimizations in this fine-tuning task and quickly fine-tune their own personalized text-to-image model, greatly lowering the barrier to use.
Inference
Because model inference is insensitive to numerical precision, low-precision, low-cost inference becomes possible.
For the Stable Diffusion 2.0 model, adding a single line of code enables Int8 quantized inference, reducing memory consumption by 2.5x, to just 3.1 GB, without causing significant performance loss.
model = replace_module(model)
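As intuition for why Int8 shrinks the memory footprint, here is a generic sketch of weight quantization (an illustration of the principle, not the internals of replace_module): weights are stored as 8-bit integers plus one scale per output channel and dequantized on the fly.

import torch

def quantize_int8(w):
    # Symmetric per-output-channel quantization: int8 weights plus one fp16 scale
    # per row take roughly half the memory of fp16 weights.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = (w / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale.half()

def dequantize_int8(q, scale):
    return q.float() * scale.float()

w = torch.randn(4096, 4096)
q, scale = quantize_int8(w)
error = (w - dequantize_int8(q, scale)).abs().max()
print(q.element_size(), w.element_size(), error)   # 1 byte vs 4 bytes per element, small error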
Inference for the 175-billion-parameter BLOOM model on an RTX 3090
While AI painting explodes, the trend toward large NLP models continues.
In July this year, Hugging Face released BLOOM, an open-source model with 175 billion parameters, whose training took 384 A100 GPUs.
If inference is run directly with the common FP32/FP16 formats using model parallelism across 8 GPUs on a single node, each GPU needs at least 87.5 GB / 43.8 GB of memory.
Such a large memory footprint means that even the most advanced 8xA100 (80GB/40GB) server cannot deploy the inference service directly, while multi-node inference brings heavy extra costs and communication overhead.
Given this situation, Colossal-AI implements efficient Int8 quantization and model-parallel inference, which allows inference for large models such as the 175-billion-parameter BLOOM to be deployed on an 8-card server of consumer graphics cards such as the 3090/4090, without a significant increase in CPU memory usage or loss of performance.
Compared with the original A100 setup, the hardware deployment cost drops to one-tenth.
By quantizing the model to Int8, Colossal-AI reduces its total memory footprint from 352.3 GB (FP16) to 185.6 GB, and with Colossal-AI's model parallelism, the footprint per graphics card drops to 23.2 GB.
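The per-GPU figures above follow from simple arithmetic over the parameter count (weights only, in decimal GB; the measured 352.3 GB and 185.6 GB totals are a bit higher because of additional buffers and overhead):

# Rough memory footprint of ~175B parameters under different precisions,
# split 8 ways with model parallelism (weights only, no activations/overhead).
params = 175e9
for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("Int8", 1)]:
    total_gb = params * bytes_per_param / 1e9
    print(f"{name}: {total_gb:.1f} GB total, {total_gb / 8:.1f} GB per GPU")
# FP32: 700.0 GB total, 87.5 GB per GPU
# FP16: 350.0 GB total, 43.8 GB per GPU
# Int8: 175.0 GB total, 21.9 GB per GPU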
For model parallelism, in order not to increase CPU memory usage, Colossal-AI quantizes and partitions the model in the main process, while every other process calls lazy_init to obtain a meta model that takes up almost no CPU or GPU memory; the model parameters are then transferred between processes via the gloo backend.
With this scheme, peak CPU memory usage reaches the theoretically optimal level even without loading the model parameters in stages. And compared with a pipeline-style distribution that splits the model layer by layer, model parallelism also improves memory utilization efficiency when requests are not dense.
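The "meta model" trick can be illustrated with plain PyTorch (a sketch of the concept; Colossal-AI's lazy_init handles whole models and the actual parameter transfer):

import torch

# Build a layer on the meta device: shapes and dtypes are recorded,
# but no real storage is allocated.
layer = torch.nn.Linear(14336, 14336, device="meta")
print(layer.weight.device, tuple(layer.weight.shape))   # meta (14336, 14336)

# Later, materialize only the weights this rank actually receives
# (here a stand-in tensor; in practice it would arrive via the gloo backend).
received = torch.empty(14336, 14336)
layer.weight = torch.nn.Parameter(received)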
Automatic parallelization with one line of code
Distributed hybrid deployment of large models is a very complex problem.
Today's common distributed training schemes for large models all rely on users' repeated trial and error, plus configuration and deployment by systems experts based on experience.
This is very unfriendly to most AI developers, who do not want to spend so much time and energy studying distributed systems and trial-and-error tuning.
Colossal-AI's efficient and easy-to-use automatic parallel system addresses exactly this pressing need.
By adding just one line of code, providing the cluster information and a single-machine training model is enough to obtain distributed training capability, and popular AI model libraries such as Hugging Face and Timm are natively supported.
# wrap the model using auto_engine
model, optimizer = auto_engine(model, optimizer, cluster_info)
# normal training loop
As a result, Colossal-AI greatly lowers the barrier for AI developers to train and fine-tune large models with distributed techniques. At the same time, by searching parallel strategies at a finer granularity, the automatic parallel system can find more efficient parallel schemes.
Graph Tracing
Colossal-AI is the first automatic parallel system based on the PyTorch framework that uses static graph analysis.
PyTorch is a dynamic-graph framework, and obtaining its static execution plan has long been a research problem in the machine learning systems field.
Colossal-AI uses ColoTracer, built on the torch.FX Tracer, to derive and record the meta-information of every tensor during tracing, such as tensor shape, dims, and dtype, which aids the subsequent search for automatic parallel strategies.
As a result, Colossal-AI generalizes better across models, rather than relying on model names or manual modification to adapt parallel strategies.
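The flavor of this can be reproduced with vanilla torch.fx (ColoTracer records richer metadata and handles more model patterns, so this is only an approximation of the idea): trace the model into a static graph, then propagate shapes and dtypes so every node carries tensor meta-information a strategy search could consume.

import torch
import torch.fx
from torch.fx.passes.shape_prop import ShapeProp

# Trace the model into a static graph, then run shape propagation.
model = torch.nn.Sequential(torch.nn.Linear(512, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 512))
graph_module = torch.fx.symbolic_trace(model)
ShapeProp(graph_module).propagate(torch.randn(8, 512))

for node in graph_module.graph.nodes:
    meta = node.meta.get("tensor_meta")
    if meta is not None:
        print(node.op, node.name, tuple(meta.shape), meta.dtype)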
Fine-grained distributed training strategy search
Within a given memory budget, Colossal-AI searches, for each op, for the strategy with the fastest running time, and finally produces the actual training strategy: how each tensor is partitioned, which communication operators to insert between different compute nodes, whether to replace operators, and so on.
Tensor parallelism and data parallelism in existing systems, as well as hybrid parallelism such as the column-wise and row-wise partitioning used by NVIDIA in parallel systems like Megatron-LM, are all subsets of the strategy space that automatic parallelism can search.
Beyond these manually specifiable parallel modes, Colossal-AI's automatic parallel system can assign a distinct parallel mode to each individual op, so it can potentially find better parallel strategies than manual partitioning based on expert experience and trial-and-error configuration.
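As a toy illustration of the search objective (made-up numbers and op names; the real solver also accounts for resharding and communication costs between adjacent ops and searches globally rather than greedily): for each op, pick the fastest candidate strategy whose memory footprint still fits within the budget.

# Candidate strategies per op: (name, run time in ms, memory in GB) -- made-up numbers.
candidates = {
    "matmul_1": [("row-split", 1.0, 4.0), ("col-split", 1.2, 3.0), ("replicate", 0.8, 8.0)],
    "matmul_2": [("row-split", 2.0, 6.0), ("col-split", 1.8, 5.0), ("replicate", 1.5, 12.0)],
}

memory_budget_gb = 16.0
plan, used_gb = {}, 0.0
for op, options in candidates.items():
    feasible = [o for o in options if used_gb + o[2] <= memory_budget_gb]
    name, time_ms, mem_gb = min(feasible, key=lambda o: o[1])   # fastest feasible strategy
    plan[op], used_gb = name, used_gb + mem_gb

print(plan, used_gb)   # {'matmul_1': 'replicate', 'matmul_2': 'col-split'} 13.0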
Distributed tensor and shape consistency system
Similar to the recently released PyTorch DTensor, Colossal-AI also uses a device mesh to manage the cluster abstractly.
Specifically, Colossal-AI uses sharding specs to annotate the distributed storage state of tensors, and a shape consistency manager to automatically convert a tensor between different sharding specs.
This greatly improves Colossal-AI's generality and ease of use: with the shape consistency manager, tensors can be partitioned freely, without worrying that the output of an upstream op and the input of a downstream op are stored differently across the cluster.
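In plain-Python terms (an illustrative sketch, not Colossal-AI's actual classes), a sharding spec records, for each tensor dimension, whether it is sharded along a device-mesh axis or replicated, and converting between specs is what the shape consistency manager automates:

import torch

full = torch.randn(8, 4)

# Spec [S0, R]: dim 0 sharded across mesh axis 0 (2 devices here), dim 1 replicated.
shards = list(full.chunk(2, dim=0))        # what each of the two devices would hold

# Converting [S0, R] -> [R, R] corresponds to an all-gather along dim 0;
# here the gather is simulated locally with a concatenation.
regathered = torch.cat(shards, dim=0)
assert torch.equal(regathered, full)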
Compared with PyTorch DTensor, Colossal-AI has three main advantages:
- Colossal-AI's device mesh can profile cluster performance metrics and estimate the time cost of different communication operators.
- Colossal-AI's shape consistency manager greedily searches for the conversion path between sharding specs instead of naively converting one dimension at a time, which finds more efficient conversion paths and therefore lowers the communication overhead of converting between sharding specs.
- The addition of the all_to_all operation makes Colossal-AI more scalable, which shows a clear advantage when training on large-scale clusters.

Combined with activation checkpoint
Activation checkpointing is an essential memory-compression technique in large-model training, and Colossal-AI also provides an automatic search function for it.
Unlike most solutions that aim for maximum memory compression, Colossal-AI's search objective is the fastest activation checkpoint scheme that stays within the memory budget.
At the same time, to avoid the explosion in search time that would come from folding the activation checkpoint search into an SPMD solver, Colossal-AI uses a two-stage search, so an effective and feasible distributed training scheme can be found within a reasonable time.
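What activation checkpointing itself does can be shown with PyTorch's built-in utility (the automatic search decides where and how much to checkpoint; this snippet only demonstrates the recompute-for-memory trade-off):

import torch
from torch.utils.checkpoint import checkpoint_sequential

# Split the model into segments; only each segment's input is kept during the
# forward pass, and intermediate activations are recomputed during backward.
model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)])
x = torch.randn(16, 1024, requires_grad=True)

out = checkpoint_sequential(model, 4, x)   # 4 checkpointed segments
out.sum().backward()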
About Colossal-AI
Colossal-AI is a general-purpose deep learning system for the era of large models. It enables efficient and rapid deployment of large-model training and inference and reduces the cost of applying large AI models.
Since going open source, Colossal-AI has repeatedly topped the GitHub trending list, earned over 7,000 GitHub stars, and been selected for official tutorials at top international AI and HPC conferences such as SC, AAAI, and PPoPP.
Colossal-AI's solutions have been successfully applied by well-known companies in autonomous driving, cloud computing, retail, medicine, chips, and other industries, receiving wide acclaim.
For example, the recently viral ChatGPT is not open source and has no internet connectivity; Colossal-AI has already helped a Fortune 500 company develop a chatbot model with enhanced online search-engine capabilities.
Portal
Open-source address: https://github.com/hpcaitech/ColossalAI
Reference link: https://www.hpc-ai.tech/blog/colossal-ai-0-2-0