For the "all zeros" case, my concern is that you said you're forcing a reset every 1024 words. This implies that if you have N kilowords of zero data, then it takes N times as much space as a single kiloword of data.
Good compression algorithms effectively use the same storage for highly-redundant data (not limited to all zeros or even all the same single word, though all zeros can sometimes be a bit smaller), whether it's 1 kiloword or 1 gigaword (there might be a couple bytes difference since they need to specify a longer variable-size integer).
And this does not require giving up on random-access if you care about that - you can just separately include an "extent table" (works for large regular repeats - you will have to detect this anyway for other compression strategies, which normally give up on random-access), or (for small repeats only) use strides, or ...
For reference, BTRFS uses 128KiB chunks for its compression to support mmap and seeking. Of course, the caller should make sure to keep decompressed chunks in cache.
Yes because you have to have some metadata that describes how to decompress the compressed data. This is the case in all compression algorithms I know.
As an example Lz4 and zstd also have a compressBound() function that calculates this.
I'm pretty sure it's the case in all compression algorithms, period — if you make some inputs smaller, you have to make other inputs larger. There's a pigeonhole style argument for it. The trick of course is to make the inputs you expect to actually encounter smaller, ideally while enlarging other inputs as little as possible.
This is missing a lot of context.
What integer patterns does it do well on, and what patterns does it do poorly on?
How many strategies does it support? It only mentions delta, which is not compression. What about Huffman, RLE, variable-length encoding, ...?
Does it really just "give up" at C/1024 compression if your input is a gigabyte of zeros?
Working on improving and clarifying this!
It only does delta and bitpacking now.
It should do fairly well for a bunch of zeroes because it does bitpacking.
I’m working on adding RLE/FFOR, clarifying the strategy, and making the format flexible to change internally without breaking the API.
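To make the delta + bit-packing point concrete, here is a minimal sketch (hypothetical helpers, not this crate's actual API or on-disk format): the deltas of a constant block are all zero, so the packed width drops to 0 bits and a 1024-word block of zeros costs little more than its block header.

    // Hypothetical delta + bit-packing sketch, not the crate's real format.
    fn delta_encode(block: &[u64]) -> (u64, Vec<u64>) {
        // Keep the first value as a base, then store differences to the previous value.
        let base = block.first().copied().unwrap_or(0);
        let deltas = block.windows(2).map(|w| w[1].wrapping_sub(w[0])).collect();
        (base, deltas)
    }

    fn packed_bit_width(deltas: &[u64]) -> u32 {
        // Width of the widest delta; an all-zero block needs 0 bits per value.
        deltas.iter().map(|d| 64 - d.leading_zeros()).max().unwrap_or(0)
    }

    fn main() {
        let zeros = vec![0u64; 1024];
        let (base, deltas) = delta_encode(&zeros);
        println!("base = {base}, bit width = {}", packed_bit_width(&deltas)); // base = 0, bit width = 0
    }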
For the "all zeros" case, my concern is that you said you're forcing a reset every 1024 words. This implies that if you have N kilowords of zero data, then it takes N times as much space as a single kiloword of data.
Good compression algorithms effectively use the same storage for highly redundant data whether it's 1 kiloword or 1 gigaword (there might be a couple of bytes' difference, since they need to specify a longer variable-size integer). This isn't limited to all zeros, or even to a single repeated word, though all zeros can sometimes come out a bit smaller.
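As a rough sketch of why the size can be independent of N (a hypothetical encoding, not any particular library's format): store a repeated word as (value, varint count), and 1 kiloword versus 1 gigaword of zeros then differ only by a few count bytes.

    // Hypothetical run record: (value, LEB128-style count).
    fn encode_run(value: u64, count: u64, out: &mut Vec<u8>) {
        out.extend_from_slice(&value.to_le_bytes());
        let mut n = count;
        loop {
            let byte = (n & 0x7f) as u8;
            n >>= 7;
            if n == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80);
        }
    }

    fn main() {
        for count in [1u64 << 10, 1 << 30] {
            let mut out = Vec::new();
            encode_run(0, count, &mut out);
            println!("{count} zero words -> {} bytes", out.len()); // 10 vs 13 bytes
        }
    }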
And this does not require giving up on random access, if you care about that. You can separately include an "extent table" (this works for large regular repeats, which you will have to detect anyway for other compression strategies that normally give up on random access), or use strides for small repeats only, or ...
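A sketch of what such an extent table could look like (hypothetical layout, purely to show that random access survives): each extent covers a run of one repeated word, so a point lookup inside a huge repeat is a binary search rather than a decode.

    // Hypothetical extent table: sorted, non-overlapping runs of one repeated word.
    struct Extent {
        start: u64, // first logical index covered
        len: u64,   // number of repeated words
        value: u64, // the repeated word
    }

    fn lookup(extents: &[Extent], index: u64) -> Option<u64> {
        // Find the first extent that ends after `index`, then check that it contains `index`.
        let pos = extents.partition_point(|e| e.start + e.len <= index);
        extents.get(pos).filter(|e| index >= e.start).map(|e| e.value)
    }

    fn main() {
        let extents = vec![Extent { start: 0, len: 1 << 30, value: 0 }];
        assert_eq!(lookup(&extents, 12_345), Some(0)); // inside the big zero run
        assert_eq!(lookup(&extents, 1 << 30), None);   // past it: fall back to normal blocks
    }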
For reference, BTRFS uses 128KiB chunks for its compression to support mmap and seeking. Of course, the caller should make sure to keep decompressed chunks in cache.
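For the caching point, a minimal sketch of chunked random access (the chunk size and the decompress_chunk closure are stand-ins, not anything BTRFS or this crate actually exposes): decode at fixed chunk boundaries and memoize the decoded chunks so repeated reads don't re-decompress.

    use std::collections::HashMap;

    const CHUNK_WORDS: usize = 16 * 1024; // 128 KiB of u64s, purely illustrative

    struct ChunkCache {
        decoded: HashMap<usize, Vec<u64>>, // chunk index -> decompressed words
    }

    impl ChunkCache {
        fn get_word(&mut self, index: usize, decompress_chunk: impl Fn(usize) -> Vec<u64>) -> u64 {
            let chunk = index / CHUNK_WORDS;
            let words = self.decoded.entry(chunk).or_insert_with(|| decompress_chunk(chunk));
            words[index % CHUNK_WORDS]
        }
    }

    fn main() {
        let mut cache = ChunkCache { decoded: HashMap::new() };
        // Fake codec: chunk i decodes to the value i repeated CHUNK_WORDS times.
        let word = cache.get_word(CHUNK_WORDS * 3 + 7, |i| vec![i as u64; CHUNK_WORDS]);
        println!("word = {word}"); // 3
    }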
Sorry for asking, but there are too many weird projects in this field. Can compressed_size be bigger than input.len?
Yes, because you have to store some metadata that describes how to decompress the compressed data. This is the case in every compression algorithm I know of.
As an example, LZ4 and zstd also provide a compressBound() function (LZ4_compressBound() and ZSTD_compressBound(), respectively) that calculates this worst case.
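For illustration, a compressBound-style worst case could have the following shape (the constants are invented, and this is not LZ4's or zstd's actual formula): a fixed stream header plus a small per-block overhead on top of the raw payload, so incompressible input comes out slightly larger than it went in.

    // Hypothetical worst-case bound; constants are made up for illustration.
    const HEADER_BYTES: usize = 16;  // stream header
    const BLOCK_WORDS: usize = 1024; // words per block
    const BLOCK_OVERHEAD: usize = 4; // per-block metadata

    fn compress_bound(input_words: usize) -> usize {
        let blocks = (input_words + BLOCK_WORDS - 1) / BLOCK_WORDS;
        HEADER_BYTES + blocks * BLOCK_OVERHEAD + input_words * 8
    }

    fn main() {
        // 1024 incompressible words = 8192 payload bytes, bounded by 8212 bytes here.
        println!("{}", compress_bound(1024)); // 8212
    }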
I'm pretty sure it's the case in all compression algorithms, period — if you make some inputs smaller, you have to make other inputs larger. There's a pigeonhole style argument for it. The trick of course is to make the inputs you expect to actually encounter smaller, ideally while enlarging other inputs as little as possible.
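Spelling out the counting argument (standard pigeonhole reasoning, nothing specific to any library): a lossless compressor must be injective, and

    \#\{\text{inputs of exactly } n \text{ bits}\} = 2^n,
    \qquad
    \#\{\text{outputs of fewer than } n \text{ bits}\} = \sum_{k=0}^{n-1} 2^k = 2^n - 1,

so at least one n-bit input cannot map to a shorter output, and once any fixed header is added that input necessarily grows.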