Name:
Yanyu Li
Title:
Accelerating Large-Scale Generative AI: A Comprehensive Study
Date:
8/16/2024
Time:
11:00:00 AM
Committee Members:
Prof. Yanzhi Wang (Advisor)
Prof. David Kaeli
Prof. Kaushik Chowdhury
Abstract:
Deep learning has achieved great success across domains, from the emerging large language models (LLMs) and Artificial General Intelligence (AGI), to diffusion models for image and video generation, to classic vision tasks such as classification, segmentation, and detection. Built from linear, convolution, and attention blocks, Deep Neural Networks (DNNs) are central to this performance revolution. However, powerful DNNs often demand tremendous computation and storage, which hinders their wide adoption. For instance, LLMs and diffusion models generally have billions of parameters and require hundreds of GMACs, which is prohibitive for edge deployment. As a result, Efficient AI has become an active research area. In this work, through algorithm optimizations and co-design with hardware platforms, we pursue the appealing properties of edge and user-end AI: lower energy consumption, shorter response latency, smaller model storage, no dependence on cloud servers, and better protection of user privacy.
First, we systematically investigate quantization, pruning, and architecture search techniques for efficient vision backbones. We conduct a comprehensive study of quantization number systems and precisions, and propose a novel mixed-scheme, mixed-precision quantization technique that maximizes hardware utilization while minimizing performance loss. For network pruning, we propose a novel indicator-based approach, named Pruning-as-Search, that is fully differentiable and automatically determines pruning policies, outperforming manually tuned methods in both accuracy and efficiency (a sketch of the indicator idea appears below). Further, we address the long-standing issue of rigid network width design, proposing a family of flexible-width pruned networks with minimal per-layer redundancy. For architecture search, we formulate a joint optimization objective over model size and latency, releasing a series of efficient Vision Transformers, named EfficientFormer (V1 and V2), that serve as strong vision backbones with MobileNet-level size and millisecond-level latency on mobile phones.
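To make the indicator-based, differentiable pruning idea concrete, the following PyTorch sketch attaches a learnable per-channel gate to a convolution and regularizes the gates toward sparsity, so pruning decisions are learned jointly with the weights. The class name ChannelIndicator, the sigmoid relaxation, and the penalty weight are illustrative assumptions, not the exact Pruning-as-Search formulation.

import torch
import torch.nn as nn

class ChannelIndicator(nn.Module):
    """Learnable per-channel gate used as a differentiable pruning indicator.
    Hypothetical illustration; not the thesis implementation."""

    def __init__(self, num_channels: int):
        super().__init__()
        # One trainable score per channel, initialized to favor keeping channels.
        self.scores = nn.Parameter(torch.ones(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sigmoid relaxation keeps the gate differentiable during the search phase.
        gate = torch.sigmoid(self.scores)
        # Scale each channel of an (N, C, H, W) feature map by its gate.
        return x * gate.view(1, -1, 1, 1)

    def sparsity_loss(self) -> torch.Tensor:
        # L1-style penalty drives gates toward zero; channels whose gate falls
        # below a threshold after training are pruned from the network.
        return torch.sigmoid(self.scores).sum()

# Minimal usage: gate the output channels of a convolution and add the
# sparsity term to the task loss so the pruning policy is learned end to end.
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1)
indicator = ChannelIndicator(128)
x = torch.randn(2, 64, 32, 32)
out = indicator(conv(x))
loss = out.pow(2).mean() + 1e-3 * indicator.sparsity_loss()
loss.backward()

After the search phase, channels with near-zero gates can be removed and the remaining weights fine-tuned, which is what makes the per-layer pruning ratios an outcome of optimization rather than manual tuning.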
Second, we make dedicated optimizations for large-scale generative tasks, namely Stable Diffusion (SD) for text-to-image generation, serving as pioneering work that enables its mobile deployment. With the proposed efficient architecture design and a novel step distillation scheme, we reduce the generation latency of SD by an order of magnitude, from more than one minute for a 512×512 image to 1 to 2 seconds, while preserving the stunning generative quality. We further extend this work to the more challenging task of video generation, combining 2-bit inference with single-step adversarial distillation to speed up video diffusion models by an order of magnitude.
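To illustrate the general mechanism behind step distillation, the PyTorch sketch below shows a progressive-distillation style objective in which a student model is trained so that one denoising step matches two consecutive teacher DDIM steps. The function names, the model call signature model(x, t, cond), and the cumulative-alpha lookup are assumptions for illustration; this is a simplified stand-in rather than the distillation objective used in the thesis.

import torch
import torch.nn.functional as F

def ddim_step(x_t, eps, alpha_t, alpha_prev):
    # Deterministic DDIM update: recover the clean-sample estimate from the
    # predicted noise, then re-noise it to the target timestep.
    x0 = (x_t - (1.0 - alpha_t).sqrt() * eps) / alpha_t.sqrt()
    return alpha_prev.sqrt() * x0 + (1.0 - alpha_prev).sqrt() * eps

def step_distillation_loss(student, teacher, x_t, t, t_mid, t_prev, alphas, cond):
    # Progressive step distillation: the student's single step from t to t_prev
    # is trained to match two consecutive teacher steps (t -> t_mid -> t_prev).
    with torch.no_grad():
        eps1 = teacher(x_t, t, cond)
        x_mid = ddim_step(x_t, eps1, alphas[t], alphas[t_mid])
        eps2 = teacher(x_mid, t_mid, cond)
        x_target = ddim_step(x_mid, eps2, alphas[t_mid], alphas[t_prev])
    eps_student = student(x_t, t, cond)
    x_pred = ddim_step(x_t, eps_student, alphas[t], alphas[t_prev])
    return F.mse_loss(x_pred, x_target)

Applied repeatedly, this halving of the step count is what turns a sampler requiring dozens of denoising steps into one that needs only a handful, which, together with an efficient backbone, accounts for the order-of-magnitude latency reduction.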