Grouping

Grouping method is my current project which I am trying to improve.
I created this method because I saw some bytes content which will be unable to be compressed,
by the other methods(RLE, Huffman, LZW, etc.).
Also I see that this method can be done recursively, it depends on the the compressed content.

This method is different with other compression method, the differences are:
- Does not use or have dictionary to store duplicated words.
- Does not search for repetition character (not necessary).

Like the name implies, it search for bytes which are in a group (must be in a row at the moment),
the bytes in group must fall in to the group size, e.g.: 64, it will search for bytes which fall into that
range, it could be any bytes from and between (ASCII) 150 to 213 or between 0 to 63.

At this moment as far as my research goes, this grouping method gain ratio is very small compare to the other methods, the gain only in the bits count, if we use 64 as the group size then if means we use 6 bits only for each ASCII (8 bits), therefore we gain 2 bits for each ASCII, small isn't it ?
but if we could have or make all the value content in like this then we can get quite big gain ratio,
if we use group size 64 then maximum gain ratio is 25% (one time compress), but if we use group size of 128 then maximum gain ratio is 12.5% (one time compress).

The group size should be equal to the bit value, otherwise we will have no idea about how many bits we gain, e.g.: 16(4 bits), 64(6 bits), 128(7 bits).

It has to use a byte flag, which indicate the the Lowest value.
For example if the Lowest is 100, the group should content value between 100 to 163.
And like other compression method, it has a minimum limit for Total-In-Group, e.g.: 24 bytes.

Algorithm is very simple, example: group size is 64 (6 bits) after we get the Lowest and
Total-In-Group equal or greater than minimum then the next step is to reduce the content value
by Lowest:

 
Content Value
Reduced by Lowest = 63 ( New Value )
Binary Value (left to right)
1
69
6
011000
2
99
36
001001
3
98
35
110001
4
72
9
100100
5
63
0
000000
6
123
60
001111
7
81
18
010010
8
74
11
110100
9
75
12
011100
10
95
32
000001
11
104
41
110011
12
78
15
111100
13
96
33
100001
14
115
52
001011
15
114
51
110011
16
77
14
011100

At this point we get the compressed content by combining all 6 bits into 8 bits, we get:
01100000-10011100-01100100-00000000-11110100-10110100 = 6-57-38-0-47-45
01110000-00011100-11111100-10000100-10111100-11011100 = 14-56-63-33-61-49

Original size = 16 bytes * 8 bits = 128 bits
Compressed size = 12 bytes * 8 bits = 96 bits
Gain = 128 bits - 96 bits = 32 bits (4 bytes)

The 4 bytes gain is not so bad, and if we are lucky the compressed content could be recompress,
the sample above can be recompressed (recursive compression).

Lowest = 0, because Highest ASCII is 63 (all ASCII less than 64)

 
Content Value
Reduced by Lowest = 0 (New Value )
Binary Value (left to right)
1
6
6
011000
2
57
57
100111
3
38
38
011001
4
0
0
000000
5
47
47
111101
6
45
45
101101
7
14
14
011100
8
56
56
000111
9
63
63
111111
10
33
33
100001
11
61
61
101111
12
49
49
100011

Again, we should combine these 6 bits into 8 bits, we get:
01100010-01110110-01000000 = 70-110-2
11110110-11010111-00000111 = 111-235-224
11111110-00011011-11100011 = 127-216-199

Original size = 12 bytes * 8 bits = 96 bits
Compressed size = 9 bytes * 8 bits = 72 bits
Gain = 96 bits - 72 bits = 24 bits (3 bytes)

After this compression it seem the current compressed content is cant be re-compress because Lowest is 2 and Highest is 235, the gap is greater than 64, see the result.

So, the final result only required 9 bytes from original 16 bytes, we gain 7 bytes ( 43.75%)

To Decompress is very simple just reverse the algorithm by:
1. Separate 8 bits into 6 bits.
2. Add (the 6 bits value ) by Lowest.
3. (goto step number 1 if recursive, do until all recursive is done)

It is very simple, isn't it ? BUT this kind of content is quite rare, especially a compressed content by Huffman or LZW, because this compressed content ASCII value very often have a wide gap in between, for example: 10, 200, 30, 255, 0, 150, soon.....

So, I am working on some math calculation such as Flattening, which can make this kind of content to be more friendly, less gap value, more compressible. So far I need quite much additional bytes, these additional bytes is often unaffordable.

So anyone have any idea how to arrange these ASCII into more compressible content with minimal additional bytes ? or even better without additional byte at all !!!
(" Two or more head better than one ")

NOTE: If you have question, idea or correction to the description above, please do not be hesitate to contact me.

LZW

Author Site Map Disclaimer
HMaxF Ultimate Recursive Lossless Compression Research
2001 - 2003 (c) All Rights Reserved.