|
Grouping
Grouping method is my current project which I am trying to improve.
I created this method because I saw some bytes content which will be unable
to be compressed,
by the other methods(RLE, Huffman,
LZW, etc.).
Also I see that this method can be done recursively, it depends on the
the compressed content.
This method is different with other compression method, the differences
are:
- Does not use or have dictionary to store duplicated words.
- Does not search for repetition character (not necessary).
Like the name implies, it search for bytes which are in a group (must
be in a row at the moment),
the bytes in group must fall in to the group size, e.g.: 64, it will search
for bytes which fall into that
range, it could be any bytes from and between (ASCII) 150 to 213 or between
0 to 63.
At this moment as far as my research goes, this grouping method gain
ratio is very small compare to the other methods, the gain only in the
bits count, if we use 64 as the group size then if means we use 6 bits
only for each ASCII (8 bits), therefore we gain 2 bits for each ASCII,
small isn't it ?
but if we could have or make all the value content in like this then we
can get quite big gain ratio,
if we use group size 64 then maximum gain ratio is 25% (one time compress),
but if we use group size of 128 then maximum gain ratio is 12.5% (one
time compress).
The group size should be equal to the bit value, otherwise we will have
no idea about how many bits we gain, e.g.: 16(4 bits), 64(6 bits), 128(7
bits).
It has to use a byte flag, which indicate the the Lowest value.
For example if the Lowest is 100, the group should content value between
100 to 163.
And like other compression method, it has a minimum limit for Total-In-Group,
e.g.: 24 bytes.
Algorithm is very simple, example: group size is 64 (6 bits) after we
get the Lowest and
Total-In-Group equal or greater than minimum then the next step is to
reduce the content value
by Lowest:
| |
Content Value
|
Reduced by Lowest = 63 ( New Value )
|
Binary Value (left to right)
|
| 1 |
69
|
6
|
011000
|
| 2 |
99
|
36
|
001001
|
| 3 |
98
|
35
|
110001
|
| 4 |
72
|
9
|
100100
|
| 5 |
63
|
0
|
000000
|
| 6 |
123
|
60
|
001111
|
| 7 |
81
|
18
|
010010
|
| 8 |
74
|
11
|
110100
|
| 9 |
75
|
12
|
011100
|
| 10 |
95
|
32
|
000001
|
| 11 |
104
|
41
|
110011
|
| 12 |
78
|
15
|
111100
|
| 13 |
96
|
33
|
100001
|
| 14 |
115
|
52
|
001011
|
| 15 |
114
|
51
|
110011
|
| 16 |
77
|
14
|
011100
|
At this point we get the compressed content by combining all 6 bits into
8 bits, we get:
01100000-10011100-01100100-00000000-11110100-10110100 = 6-57-38-0-47-45
01110000-00011100-11111100-10000100-10111100-11011100 = 14-56-63-33-61-49
Original size = 16 bytes * 8 bits = 128 bits
Compressed size = 12 bytes * 8 bits = 96 bits
Gain = 128 bits - 96 bits = 32 bits (4 bytes)
The 4 bytes gain is not so bad, and if we are lucky the compressed content
could be recompress,
the sample above can be recompressed (recursive compression).
Lowest = 0, because Highest ASCII is 63 (all ASCII less than 64)
| |
Content Value
|
Reduced by Lowest = 0 (New Value )
|
Binary Value (left to right)
|
| 1 |
6
|
6
|
011000
|
| 2 |
57
|
57
|
100111
|
| 3 |
38
|
38
|
011001
|
| 4 |
0
|
0
|
000000
|
| 5 |
47
|
47
|
111101
|
| 6 |
45
|
45
|
101101
|
| 7 |
14
|
14
|
011100
|
| 8 |
56
|
56
|
000111
|
| 9 |
63
|
63
|
111111
|
| 10 |
33
|
33
|
100001
|
| 11 |
61
|
61
|
101111
|
| 12 |
49
|
49
|
100011
|
Again, we should combine these 6 bits into 8 bits, we get:
01100010-01110110-01000000 = 70-110-2
11110110-11010111-00000111 = 111-235-224
11111110-00011011-11100011 = 127-216-199
Original size = 12 bytes * 8 bits = 96 bits
Compressed size = 9 bytes * 8 bits = 72 bits
Gain = 96 bits - 72 bits = 24 bits (3 bytes)
After this compression it seem the current compressed content is cant be re-compress because Lowest is 2 and Highest is 235, the gap is greater than 64, see the result.
So, the final result only required 9 bytes from original 16 bytes, we
gain 7 bytes ( 43.75%)
To Decompress is very simple just reverse the algorithm by:
1. Separate 8 bits into 6 bits.
2. Add (the 6 bits value ) by Lowest.
3. (goto step number 1 if recursive, do until all recursive is done)
It is very simple, isn't it ? BUT this kind of content is quite rare,
especially a compressed content by Huffman or
LZW, because this compressed content ASCII value
very often have a wide gap in between, for example: 10, 200, 30, 255,
0, 150, soon.....
So, I am working on some math calculation such as Flattening,
which can make this kind of content to be more friendly, less gap value,
more compressible. So far I need quite much additional bytes, these additional
bytes is often unaffordable.
So anyone have any idea how to arrange these ASCII into more compressible
content with minimal additional bytes ? or even better without additional
byte at all !!!
(" Two or more head better than one ")
NOTE: If you have question, idea or correction to the description above,
please do not be hesitate to contact me.
|