Analysis on compressed file content

What is this analysis for? I provide it to show what compressed data content typically looks like, so that we know how to compress it again (recompress it) or make it more compressible. The content of compressed files is usually similar from one file to another: it contains all 256 possible byte values (which I call ASCII values here), each value occurs about equally often, and the most frequent and the least frequent values do not differ by much. Each ASCII value is also spread almost evenly through the content. Below is an example of what a compressed file's content may look like.

[Chart: distribution of ASCII values in an example compressed file]

As you can see from the chart above, the content is quite balanced. An uncompressed file, on the other hand, may or may not contain every ASCII value, and even when it does, the frequencies of the values differ greatly: the most frequent value might occur 1000 (one thousand) times and the least frequent only 10 (ten) times. Every normal uncompressed file I have encountered so far is in this condition. Below is an example of what a normal uncompressed file's content may look like.
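The difference between the two kinds of content can also be measured directly. The following is a minimal sketch in Python, not the tool used for this research: the helper `byte_stats`, the word list, and the use of `zlib` as the compressor are all assumptions made for illustration. It counts how often each of the 256 byte values occurs and reports the Shannon entropy in bits per byte, where 8.0 would mean perfectly balanced content.

```python
import math
import random
import zlib
from collections import Counter


def byte_stats(data):
    """Summarise how evenly the 256 possible byte values occur in data."""
    counts = Counter(data)
    total = len(data)
    # Shannon entropy in bits per byte; 8.0 would mean perfectly even usage.
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return {
        "distinct": len(counts),                      # how many of the 256 values appear
        "most": max(counts.values()),                 # count of the most frequent value
        "least": min(counts[v] for v in range(256)),  # 0 if some value never appears
        "entropy": entropy,
    }


# Hypothetical sample: pseudo-random word text stands in for a normal file.
words = ["the", "file", "content", "value", "compress", "byte", "block", "ratio"]
rng = random.Random(0)
plain = " ".join(rng.choice(words) for _ in range(200_000)).encode("ascii")
packed = zlib.compress(plain, 9)  # zlib is an assumed stand-in compressor

plain_stats = byte_stats(plain)
packed_stats = byte_stats(packed)
print("uncompressed:", plain_stats)
print("compressed:  ", packed_stats)
```

On a sample like this, the uncompressed text uses only a small part of the 256 values and its entropy is far below 8, while the compressed output should come close to 8 bits per byte, which is the balanced condition described above.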

So what can we do about this? How can we make compressed file content look like uncompressed file content, so that it can be recompressed with a good ratio gain?

Since I started my research I have found two solutions. There may be more that I do not know of yet, but my solutions are:
Make the content unbalanced. Make one or more ASCII values occur much more often than the rest. How? Convert or eliminate some ASCII values; this will surely make the content compressible. Please read about Flattening to find out more about this technique.
ASCII gathering (separating). Gather some ASCII values in one block and the others in another block, so that they are no longer well spread. This technique is useful because it lets us do block-by-block compression. It does not matter whether the values all have equal totals: because they are separated, they are easier to compress.
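The second idea can be sketched as follows. This is only one possible reading of gathering, written in Python under stated assumptions; it is not the actual method used in this research. The functions `gather` and `scatter`, the split at value 128, and the one-bit-per-byte flag layout are all hypothetical. The sketch moves byte values 0 to 127 into one block and 128 to 255 into another, and keeps a flag bit per source byte so that the original order can be restored.

```python
import random


def gather(data):
    """Split data into a low block (values 0-127) and a high block (128-255).

    Returns (flags, low, high). flags holds one bit per source byte that
    records which block the byte went to (assumed layout, for illustration).
    """
    flags = bytearray((len(data) + 7) // 8)
    low, high = bytearray(), bytearray()
    for i, b in enumerate(data):
        if b >= 128:
            flags[i // 8] |= 1 << (i % 8)  # mark position i as "high block"
            high.append(b)
        else:
            low.append(b)
    return bytes(flags), bytes(low), bytes(high)


def scatter(flags, low, high):
    """Inverse of gather: rebuild the original byte order from the flags."""
    out = bytearray()
    low_it, high_it = iter(low), iter(high)
    for i in range(len(low) + len(high)):
        if (flags[i // 8] >> (i % 8)) & 1:
            out.append(next(high_it))
        else:
            out.append(next(low_it))
    return bytes(out)


# Pseudo-random bytes stand in for balanced (compressed-looking) content.
rng = random.Random(1)
data = bytes(rng.randrange(256) for _ in range(4096))

flags, low, high = gather(data)
restored = scatter(flags, low, high)
print(len(data), len(low), len(high), len(flags), restored == data)
```

Each block now contains only half of the possible values, so it is no longer well spread, but the transformation is reversible only because the flag bytes are stored alongside the blocks.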

If I know these answers, then I must have an ultimate compressor, right? Well, not really. To make this happen I need some additional bytes for flagging in the resulting file, and so far these additional bytes are a little too many: around one eighth of the source size.
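To see what this overhead means in practice, here is a small arithmetic sketch in Python. The function `net_gain` and the sample sizes are hypothetical, and the flag cost is assumed to be exactly one eighth of the source size. It shows that the rest of the output has to shrink by more than one eighth before there is any gain at all.

```python
import math


def net_gain(source_size, payload_size, flag_fraction=1 / 8):
    """Bytes saved once the flag bytes are counted (negative: output grew)."""
    flag_bytes = math.ceil(source_size * flag_fraction)  # assumed flag cost
    return source_size - (payload_size + flag_bytes)


source = 80_000
print(net_gain(source, 80_000))  # payload not reduced at all      -> -10000
print(net_gain(source, 70_000))  # payload reduced by one eighth   -> 0
print(net_gain(source, 65_000))  # payload reduced by more than that -> 5000
```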

Remember that these are not the only solutions. With more time and more creative ideas, sooner or later we will have an improved (better) solution.

 


HMaxF Ultimate Recursive Lossless Compression Research
2001 - 2003 (c) All Rights Reserved.