Compression/ Decompression Primitives for accelerating AI and HPC workloads

This post discusses Compression and Decompression two important primitives which needs attention both for data at rest and data in motion especially from AI and HPC Servers which are heavily dependent on memory and storage capacity besides the speed of access to data sets used for training and inference use cases. It is common knowledge that language models are sharded across different servers for concurrency both for the model framework itself and data sets used with large number of parameters (~N trillions) for training. Active working set is accessed in these workloads from memory local and remote to servers both for training and inference purposes (e.g: HBM(x=2/3/4). This is used for speed of access of hot data and DDR-5/6/7 used for working set capacity related to warm data besides fast storage on IO’s used for infrequent data accesses). Majority of data capacity resides for such use cases in fast storage (either NVMe or aggregated pools) connected to these servers and moved to active memory (HBM/DDR) as per demand, there is separate criterion used to move from DDR to HBMx which is not discussed in this post. Modern Data layout (which is a separate topic in itself) techniques to maximize the capacity in DDR and HBM memories relies on finding good tradeoffs on Compression ratio and Decompression bandwidth for accesses from these memories. Compression and Decompression is also used extensively for tabular data sets used for databases (e.g: in memory databases) especially for searching, indexing and DML operations besides KV caches used in AI inference process. Then there are myriads of video related applications which are large consumers of compression and decompression algorithms. In this post I am not going to discuss very low-level compressed layout like used for sparse matrices (e.g: CSR/CSC/…) used to speed up accesses for matrix arithmetic’s though these are very important for Transformer(LLM) use cases. Here most of content relates to compression and decompression of data at rest and data in motion used primarily on memories/storage and in highspeed links used for data access and sharing.

In general Compression is computationally expensive and invoked less frequently for most workloads. Decompression on other hand is needed very often for making use of data which has been compressed and kept in memory or fast storage.

Type of compression chosen depends mainly on two prime use cases, data at rest applicable to storage and dense memory whereas data in motion compression used in high-speed links used to connect these memories/storage devices to server as well as on interconnects used for data movements among servers. Compression is also used in modern system on chip parts for data movement on silicon itself among different compute clusters or cores. Therefore, it is important to consider few aspects of these compression algorithms, one is primarily compression ratio(CR) which determines saving in capacity for data at rest type of use cases primarily, second is Decompression speed which is very useful for real time data fetches on high speed links in compressed form to save link bandwidth and decompress it on silicon for its target consumers. Target consumers could be last level cache inside system on chip, could be scratch pad memory for keeping relatively static data (e.g: GPU constant memory or shared memory) or fixed function hardware having its own sram being populated from main memory by making use of decompression of data for intended its computation.

Following table lists some of compression/decompression algorithms with its compression ratios and decompression speeds, primarily following is quoted for text/web-based data obtained from Corpuses like canterburry ,enwik8 and web content corpuses that are not very large and typically 100 to 300MB in size.

Algorithm	Typical Compression Ratio	Typical De-compression Speed	Typical used application	Hardware friendly
Zstandard	4-6.0	>~300MB/s	Databases, Archiving and to some extent AI	Yes, compression speed(fast)
Brotli	3.4-6.9	334-508MB/s	Web, General	Yes, compression speed (moderate)
Deflate (Zlib)	2.9-5.5	323-425MB/s	Legacy	No, compression speed (moderate)
LZ4	2.1-3.3	~500MB/s	In Memory Caches, AI hardware applicable	Yes, Compression Speed (Very Fast)
LZO	1.4-2.5	~500+MB/s	System On Chip Silicon, Fast Storage	Yes, Compression Speed (very fast)
Snappy	2-2.5	~500+MB/s	System On Chip Silicon, Fast Storage, low latency, analytics	Yes, Compression Speed (very fast)
bzip2	3.7-5.8	30-52MB/s	Storage	No, Compression Speed (Slow)
ZPAQ	5.7-19	30-50MB/s	Archive	No, Compression Speed (very slow)

I need to mention based on above table you may see the applicability to underlying place to implement these algorithms in AI or general server (Note it place to implement these is very important and make a large difference in performance). But suffice to say Dictionary-based algorithms (examples: LZ77, LZ78, Brotli, Zstandard, DEFLATE) are useful since they give good tradeoff between decompression speed and compression Ratio’s besides being hardware friendly in terms of incorporating it in server design. There is another class of algorithms like RLE and they are very friendly for system on chip type of design. To give a flavor:

RLE (high repetition)	10x-33x (but can be Nx faster based on place of implementation)	Extremely High	Compression speed could be very high as well based on implementation
RLE (Random data)	No Gain	Fast	Fast

It goes without saying server performance can be changed dramatically based on algorithms discussed in this post.