Fast enough to change the architecture

Compression algorithms have often been a black box to me. But recently I've been working on low-level kernels and columnar data, and have needed to balance memory pressure, CPU budgets and data transfer.

As part of this work, LZ4 kept impressing me. It is a very simple algorithm that still manages to achieve decent compression. It's the wrong choice for cold storage, where you probably want to optimise for file size, but on hot code paths it can be an excellent fit.

This article attempts to explain LZ4: how it works and why you should consider it.

What LZ4 is

LZ4 is an LZ77-style block compressor. A compressed block is a series of sequences, and each sequence has two parts:

Literals: raw bytes copied into the output
Match: copy N bytes from an earlier offset in the output buffer

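The block format uses a one-byte token with two 4-bit fields: the high four bits hold the literal length and the low four bits hold the match length. After the token come optional literal-length extension bytes, the literal bytes, a 16-bit little-endian offset, and optional match-length extension bytes. Matches are at least 4 bytes long, so the stored match length is the real length minus 4. There is no entropy coder in the block format, unlike DEFLATE/gzip, which adds Huffman coding on top of LZ77.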

That simplicity is the point. Official LZ4 documentation cites compression above ~500 MB/s per core and decompression in multiple GB/s per core on modern hardware. LZ4_HC trades CPU for ratio; the fast mode trades ratio for speed.
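The token layout described above is small enough to sketch in a few lines of Python. This is an illustration of how one sequence is laid out, not a compressor: it does no match finding, and the function names `encode_sequence` and `encode_length_extension` are made up for this sketch rather than taken from any LZ4 library.

```python
def encode_length_extension(remainder: int) -> bytes:
    """Encode the part of a length that did not fit in a 4-bit field.

    The remainder is written as a run of 255-valued bytes followed by one
    final byte below 255 (which may be 0).
    """
    out = bytearray()
    while remainder >= 255:
        out.append(255)
        remainder -= 255
    out.append(remainder)
    return bytes(out)


def encode_sequence(literals: bytes, offset: int, match_len: int) -> bytes:
    """Lay out one sequence: token, literals, 16-bit offset, match length."""
    if not 1 <= offset <= 0xFFFF:
        raise ValueError("offset must be non-zero and fit in 16 bits")
    if match_len < 4:
        raise ValueError("LZ4 matches are at least 4 bytes long")

    lit_len = len(literals)
    stored_match = match_len - 4  # the format stores match length minus 4
    # High nibble: literal length. Low nibble: stored match length.
    # A nibble of 15 means "extension bytes follow".
    token = (min(lit_len, 15) << 4) | min(stored_match, 15)

    out = bytearray([token])
    if lit_len >= 15:
        out += encode_length_extension(lit_len - 15)
    out += literals
    out += offset.to_bytes(2, "little")
    if stored_match >= 15:
        out += encode_length_extension(stored_match - 15)
    return bytes(out)


# Three literals "abc", then copy 6 bytes starting 3 bytes back.
print(encode_sequence(b"abc", 3, 6).hex())  # 326162630300
```

The token `0x32` packs both lengths into one byte: 3 literals in the high nibble, and 6 - 4 = 2 in the low nibble. Lengths only cost extra bytes once they reach 15.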

Family	Rough intuition
gzip/DEFLATE	LZ77 + Huffman
Brotli	Strong web compression, slower at high levels
Zstd	Excellent modern general-purpose trade-off
LZ4	Lower ratio, extremely fast decode
Snappy	Similar “fast compression” niche

How an LZ4 sequence works

A tiny example: the input below shrinks only modestly, to about 92% of its original size, but it is easy to read.

Input (uncompressed)

abcabcabcxyz

LZ4 walks the input left to right. Each sequence emits a token byte, optional literal bytes, then an optional match offset.

This is a simplified view of LZ4 block sequences, and the extended match bytes are illustrative. Real frames add headers, checksums, and variable-length integer encoding.
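To make the walk concrete, here is a minimal Python sketch of a block decoder applied to the example input above. The block bytes in `BLOCK` are hand-assembled for this sketch: they are one possible simplified encoding, not the output of a real LZ4 encoder. A real encoder also follows end-of-block rules (the final bytes of a block are always literals), so it would store an input this short as literals only. The sketch checks the match offset but is otherwise not hardened against malformed input.

```python
def read_length(block: bytes, pos: int, nibble: int):
    """Return (length, new_pos), consuming extension bytes if nibble == 15."""
    length = nibble
    if nibble == 15:
        while True:
            extra = block[pos]
            pos += 1
            length += extra
            if extra != 255:  # a byte below 255 ends the extension
                break
    return length, pos


def decode_block(block: bytes) -> bytes:
    """Decode a raw LZ4 block (no frame header, no checksum)."""
    out = bytearray()
    pos = 0
    while pos < len(block):
        token = block[pos]
        pos += 1

        # 1. Literals: copy raw bytes straight to the output.
        lit_len, pos = read_length(block, pos, token >> 4)
        out += block[pos:pos + lit_len]
        pos += lit_len

        if pos >= len(block):
            break  # the last sequence carries literals only

        # 2. Match: a 16-bit little-endian offset back into the output.
        offset = int.from_bytes(block[pos:pos + 2], "little")
        pos += 2
        match_len, pos = read_length(block, pos, token & 0x0F)
        match_len += 4  # stored value is the real length minus 4
        if offset == 0 or offset > len(out):
            raise ValueError("invalid match offset")

        # Copy byte by byte so a match may overlap the bytes it is writing.
        start = len(out) - offset
        for i in range(match_len):
            out.append(out[start + i])
    return bytes(out)


# Hand-built block: literals "abc" + match (offset 3, length 6), then "xyz".
BLOCK = bytes([0x32]) + b"abc" + bytes([0x03, 0x00]) + bytes([0x30]) + b"xyz"
print(decode_block(BLOCK))  # b'abcabcabcxyz'
```

The match is longer than its offset: 6 bytes are copied from only 3 bytes back. That overlap is how a short repeating pattern expands into a long run, and it is why the copy loop goes one byte at a time.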

Why speed changes the trade-off

The trade-off flips depending on the path.

If you compress once and archive forever, ratio wins and LZ4 is usually not your first pick. But most of the systems I care about do not look like cold storage. Data gets loaded, scanned, filtered, passed to another thread, partially aggregated, and thrown away. On that path you are balancing bytes moved (disk, network, postMessage, shared memory), bytes expanded (decompression CPU), and bytes used (parse, materialise, or run a kernel over the result).

LZ4 matters because decompression is cheap enough that moved bytes and post-decode work often dominate. You are not trying to shrink a backup tape. You are trying to keep hot paths from drowning in memory bandwidth and boundary copies.

Crossing thread and process boundaries

Every hand-off has a tax. In the browser that is postMessage between the main thread and workers. On the server it is channels between threads or processes. Even with transferable ArrayBuffers, which avoid a copy by moving ownership, you still fetched, allocated, and queued those bytes somewhere.

That is where fast compression earns its keep: apply it before the boundary. Ship a smaller payload, decompress on the far side, run your work there. Compression on the send side is slower than LZ4 decode, but you move fewer bytes across the boundary and into each worker's address space.

In columnar query engines this shows up directly: each worker receives a compressed chunk, decompresses page payloads inside the kernel, then runs filters and aggregates without round-tripping through JavaScript. The orchestrator owns I/O and scheduling; the kernel owns decode plus compute. The same pattern applies on native thread pools. Parallelism only pays off if each task carries a lean payload.

The win is not automatic. If compression ratio is poor, or you immediately need the full uncompressed column in JS anyway, you have added work for nothing. But when the consumer is a tight Wasm or native kernel that only touches part of the data, compress-on-send / decompress-in-kernel is often the faster end-to-end path.
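A back-of-envelope model makes "not automatic" easier to reason about. The Python sketch below assumes the payload is moved and then decoded in sequence, ignores send-side compression cost, and uses made-up throughput figures. The function names and every number are hypothetical, so treat it as a way to frame a measurement, not a prediction.

```python
def transfer_seconds(raw_bytes, link_bytes_per_s, ratio=1.0,
                     decode_bytes_per_s=None):
    """Seconds to move a payload and, if it was compressed, decode it.

    ratio is raw size / compressed size. decode_bytes_per_s is measured in
    decoded (raw) bytes per second; None means the payload is sent raw.
    """
    seconds = (raw_bytes / ratio) / link_bytes_per_s
    if decode_bytes_per_s is not None:
        seconds += raw_bytes / decode_bytes_per_s
    return seconds


def break_even_decode_rate(link_bytes_per_s, ratio):
    """Decode throughput above which the compressed path is faster.

    From N/(r*B) + N/D < N/B it follows that D > B * r / (r - 1).
    """
    if ratio <= 1.0:
        return float("inf")  # no size reduction, so decoding is pure overhead
    return link_bytes_per_s * ratio / (ratio - 1.0)


GB = 1e9  # hypothetical figures below, chosen only to show the shape
raw = transfer_seconds(1 * GB, link_bytes_per_s=1 * GB)
good = transfer_seconds(1 * GB, 1 * GB, ratio=2.0, decode_bytes_per_s=4 * GB)
poor = transfer_seconds(1 * GB, 1 * GB, ratio=1.1, decode_bytes_per_s=4 * GB)
print(raw, good, poor)
```

With these assumed numbers, a 2x ratio beats the raw path, while a 1.1x ratio loses to it even though the decoder is just as fast. The break-even decode rate grows quickly as the ratio approaches 1.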

When decode is cheaper than reading raw

This still feels backwards the first time you see it: why decompress inside a filter loop instead of storing plain arrays?

Because many hot kernels are memory-bound, not compute-bound. A scan over millions of floats or dictionary ids is mostly loads from RAM and cache fills. LZ4 decode runs at multiple GB/s per core on modern hardware. Reading two or three times as many raw bytes through the same memory subsystem often costs more than the decompress step.

On repetitive columnar data, reading fewer bytes, decompressing quickly, and scanning a compact typed column often beats reading the full raw column and scanning it immediately. The decompressor is simple enough to stay inline in the hot loop without blowing the instruction cache. Gzip, by contrast, drags entropy decoding into the read path.

Layout matters more than the compressor brand. A dumb LZ77 implementation can win if the bytes you feed it compress well and the kernel sees a tight column after decode.

LZ4 wins on decode throughput, not compression ratio. That is enough to make compression a systems design tool, not a storage trick.

JSON vs binary vs columnar

The sections above assume you are already holding bytes in a shape the kernel can use. In practice you choose that shape long before you pick LZ4.

JSON compresses well because of repeated keys and punctuation, but you still parse text after decode. Typed arrays skip that parse step, yet raw floats often compress poorly because similar numbers do not look similar as bytes. Columnar layout groups similar values together so LZ4 sees repeated byte runs, and a filter kernel can walk one typed column instead of skipping fields across wide rows.

JSON is fine for small config blobs. Typed arrays are fine when the data is already compact. Columnar layout is what makes the compress → decompress → scan pipeline pay off at scale.
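As a rough illustration of the three shapes, the Python sketch below takes the same rows and produces a JSON payload and a set of packed columns. The row schema (`ts`, `sensor_id`, `temp_c`) and the helper names are invented for this example, and it compares sizes before any compression.

```python
import json
import struct

# Hypothetical rows: a timestamp, a small id, and a slowly changing reading.
rows = [
    {"ts": 1_700_000_000 + i, "sensor_id": i % 4, "temp_c": 20.0 + (i % 10) * 0.5}
    for i in range(1000)
]


def to_json_bytes(rows):
    """Row-oriented text: keys and punctuation repeat in every row."""
    return json.dumps(rows).encode("utf-8")


def to_columns(rows):
    """Column-oriented binary: one packed little-endian buffer per field."""
    n = len(rows)
    return {
        "ts": struct.pack(f"<{n}q", *(r["ts"] for r in rows)),          # int64
        "sensor_id": struct.pack(f"<{n}B", *(r["sensor_id"] for r in rows)),  # uint8
        "temp_c": struct.pack(f"<{n}f", *(r["temp_c"] for r in rows)),  # float32
    }


def read_column(buf, fmt):
    """Unpack one column; fmt is a single struct code such as 'q' or 'f'."""
    n = len(buf) // struct.calcsize("<" + fmt)
    return list(struct.unpack(f"<{n}{fmt}", buf))


columns = to_columns(rows)
json_size = len(to_json_bytes(rows))
column_size = sum(len(buf) for buf in columns.values())
print(json_size, column_size)

# A filter that only needs sensor_id touches one small buffer, not every row.
hits = sum(1 for s in read_column(columns["sensor_id"], "B") if s == 2)
```

The columnar form is smaller before compression, and a scan over one field reads only that field's buffer. Whether it also compresses better depends on the data, which is where the reshaping in the next section comes in.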

Shaping numeric data

LZ4 does not understand numbers. It sees bytes. Timestamps, ids, and slowly changing floats all look like random 4- or 8-byte patterns unless you reshape them first. Delta encoding, byte shuffling, scaling to integers, null masks, and column separation arrange data so a simple compressor can win and the kernel after decode does less work per row.

Data	Transform	Why
Timestamps / ids	Delta	Neighbours become small integers
Small signed deltas	Delta + zigzag	Small unsigned bytes
Slow floats	Scale → int → delta	Repeated byte runs
Multi-byte integers	Byte shuffle	Group low/high bytes
Wide rows	Columnar split	Similar values adjacent
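Three of the transforms in the table fit in a few lines each. The Python sketch below shows delta, zigzag, and byte shuffle with their inverses. The function names are chosen for this example, and the integer versions rely on Python's unbounded ints, so a fixed-width implementation in a kernel would need to handle overflow explicitly.

```python
import struct


def delta_encode(values):
    """Replace each value with its difference from the previous one."""
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def zigzag_encode(n):
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def zigzag_decode(z):
    return (z >> 1) if z & 1 == 0 else -((z + 1) >> 1)


def byte_shuffle(data, width):
    """Group byte 0 of every element, then byte 1, and so on."""
    if len(data) % width:
        raise ValueError("data length must be a multiple of width")
    return b"".join(data[i::width] for i in range(width))


def byte_unshuffle(data, width):
    """Inverse of byte_shuffle."""
    n = len(data) // width
    out = bytearray(len(data))
    for i in range(width):
        out[i::width] = data[i * n:(i + 1) * n]
    return bytes(out)


timestamps = [1_700_000_000, 1_700_000_010, 1_700_000_020, 1_700_000_029]
print(delta_encode(timestamps))  # [1700000000, 10, 10, 9]

packed = struct.pack("<4I", 1, 2, 3, 4)
print(byte_shuffle(packed, 4).hex())  # 01020304000000000000000000000000
```

After the shuffle, the twelve zero high bytes sit next to each other as one run, which is exactly the kind of repetition an LZ77-style matcher can pick up.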

What to benchmark

Micro-benchmarks of compression ratio alone will lie to you. Measure compressed size on the path you actually use (storage, transfer, postMessage payload), compression throughput on the write path, decompression throughput on the read path, kernel time after decode, and end-to-end time to answer a simple query across workers.

For browser workloads, include JS/Wasm boundary costs and whether results come back as transferable buffers or get copied into the main thread.
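A minimal harness for the per-stage numbers listed above might look like the Python sketch below. It takes the codec and kernel as plain callables, so the names `time_stage` and `benchmark_pipeline` are inventions of this example. LZ4 is not in the Python standard library, so the usage lines plug in `zlib` at level 1 purely as a stand-in; swap in whichever LZ4 binding you actually use. It measures a single process only and does not cover worker or boundary costs.

```python
import time
import zlib


def time_stage(fn, *args, repeats=5):
    """Run fn several times; return (best wall-clock seconds, last result)."""
    best, result = float("inf"), None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


def benchmark_pipeline(raw, compress, decompress, kernel, repeats=5):
    """Time compress, decompress, and the kernel that runs after decode."""
    compress_s, packed = time_stage(compress, raw, repeats=repeats)
    decompress_s, unpacked = time_stage(decompress, packed, repeats=repeats)
    if unpacked != raw:
        raise ValueError("round trip mismatch")
    kernel_s, answer = time_stage(kernel, unpacked, repeats=repeats)
    return {
        "raw_bytes": len(raw),
        "compressed_bytes": len(packed),
        "compress_s": compress_s,
        "decompress_s": decompress_s,
        "kernel_s": kernel_s,
        "read_path_s": decompress_s + kernel_s,  # what a query actually pays
        "answer": answer,
    }


# Stand-in codec (zlib level 1, NOT LZ4) and a trivial "kernel".
raw = bytes(i % 7 for i in range(1_000_000))
report = benchmark_pipeline(
    raw,
    compress=lambda b: zlib.compress(b, 1),
    decompress=zlib.decompress,
    kernel=lambda b: b.count(0),
)
print(report)
```

Reporting `read_path_s` next to the compressed size keeps the comparison honest: a codec that wins on size but loses on decode plus kernel time is the wrong pick for a hot path.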

Conclusion

I reach for LZ4 on hot read paths where decode CPU needs to stay predictable: worker handoffs, column pages, browser-local analytics. I skip it for cold storage, already-compressed payloads, and anywhere I have not tried reshaping the data first. Benchmark the full pipeline on the path users hit, not the ratio chart in isolation.