Sunday, May 11, 2008

A better compression algorithm

I am presented with an interesting problem. Usually, my employer burns money indiscriminately, but lately, with the market in a tailspin, all costs are being evaluated. To avoid being one of those costs, I need to find a way to save money for the company. One of those ways is file storage space for document management systems. Unlike the basic $50, 100 GB drives you buy at Circuit City for your Dell, corporate disk storage is highly expensive, with EMC SRDF storage running around $1M per TB.

Audit and regulatory rules require that essentially all files be kept. A large number of those files are data feeds from external systems. The files are structured and are in a readable format such as fixed length, delimited, or XML. My idea, which is not that unique, is to apply a heuristic compression algorithm to the data files. I am going to leverage the work done by the FIX Protocol committee on the FAST specification, which defines a number of efficient heuristic encoding schemes. FAST defines a compression algorithm for market data, but the same principles apply to file storage.

http://www.fixprotocol.org/fast

The concept is quite interesting. The compression algorithm basically attempts to find data patterns in the file and encode them away. Let's say you have a column that's an incrementing number: 1, 2, 3, ... n, n+1. The encoder will identify that this is an incrementing column and encode it as a rule: { previous + 1, starting from 1 }. We've just encoded away an entire column, and the rule itself takes almost no space. Let's try another example: abcdefg, abcdefe, abcdefn, abcdef5, etc. In this case, the prefix "abcdef" is the same in every value and only the last character changes. We can encode the prefix as a constant and only store the last character: g, e, n, 5, etc.
There are far more sophisticated operators defined in the FAST protocol, but you get the idea; a rough sketch of these two operators follows.
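To make the two examples concrete, here is a minimal Python sketch of the idea. These are my own toy operators, not the actual FAST wire encoding:

```python
# Toy illustration of two FAST-style field operators (a simplification,
# not the real FAST wire format).
import os

def encode_increment(values, start=1, step=1):
    """Keep only the values that break the 'previous + 1' rule."""
    exceptions = {}
    expected = start
    for i, v in enumerate(values):
        if v != expected:
            exceptions[i] = v          # store only the surprises
            expected = v
        expected += step
    return {"op": "increment", "start": start, "step": step,
            "length": len(values), "exceptions": exceptions}

def decode_increment(spec):
    values, expected = [], spec["start"]
    for i in range(spec["length"]):
        v = spec["exceptions"].get(i, expected)
        values.append(v)
        expected = v + spec["step"]
    return values

def encode_common_prefix(values):
    """Store the shared prefix once, plus only the varying suffixes."""
    prefix = os.path.commonprefix(values)
    return {"op": "prefix", "prefix": prefix,
            "suffixes": [v[len(prefix):] for v in values]}

def decode_common_prefix(spec):
    return [spec["prefix"] + s for s in spec["suffixes"]]

ids = [1, 2, 3, 4, 5]
assert decode_increment(encode_increment(ids)) == ids
names = ["abcdefg", "abcdefe", "abcdefn", "abcdef5"]
assert decode_common_prefix(encode_common_prefix(names)) == names
```

When the data actually follows the rule, the exception map stays empty and the whole column collapses into a one-line description.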

The data in the file starts to mean something. The encoder actually attempts to represent the patterns present in the file. The patterns have the potential to save a lot more space than a traditional compression algorithm based on Huffman encoding. How much space? An average case of better than 80%, compared with a best case of around 40% for ZIP. And don't forget, the result can still be zipped.

The program will read a file, scan through all the data points, figure out the optimal encoding algorithm, and then actually do the compression. The encoding algorithm will be needed to decompress the file, so the first field in the file will carry the length in bytes of the encoding description, followed by the encoding description itself, and finally the data. This allows us to store the encoding scheme with the file.
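A minimal sketch of that layout, assuming the encoding description can be serialized (JSON here purely for illustration):

```python
import json, struct

def write_encoded_file(path, encoding_spec, payload):
    """Layout: [4-byte length of encoding description][encoding description][encoded data]."""
    header = json.dumps(encoding_spec).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack(">I", len(header)))   # first field: size of the encoding description
        f.write(header)                            # the encoding scheme itself
        f.write(payload)                           # the encoded data

def read_encoded_file(path):
    with open(path, "rb") as f:
        (header_len,) = struct.unpack(">I", f.read(4))
        encoding_spec = json.loads(f.read(header_len).decode("utf-8"))
        payload = f.read()
    return encoding_spec, payload
```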

One enhancement to FAST would be to allow the pre-processor to re-arrange the file. Delta encoding is mostly based on previous records, so the more similar subsequent entries are, the higher the compression rate. Another enhancement may be to bitmap away typed fields: if a million-entry file has 100 unique types, it might be more efficient to encode the type map separately and then encode away the type id. Another extension may be to see whether a correlation exists between fields rather than between subsequent records. A rough sketch of the first two ideas follows.
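Here is one way those two pre-processing steps might look; the field names and the idea of storing the sort permutation are my own assumptions:

```python
def dictionary_encode(values):
    """Replace a low-cardinality column (e.g. 100 unique types) with a small
    dictionary plus integer ids; the dictionary is stored once, separately."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return {"dictionary": dictionary, "ids": [index[v] for v in values]}

def sort_for_locality(records, key_fields):
    """Reorder records so similar rows sit next to each other, which helps
    previous-record deltas. The permutation must be stored so the original
    order can be restored on decode."""
    order = sorted(range(len(records)),
                   key=lambda i: [records[i][k] for k in key_fields])
    reordered = [records[i] for i in order]
    return reordered, order   # order[j] = original position of reordered[j]
```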

Another extension to this architecture is to write the files in a way that improves lookup cost: index the files and provide an intuitive UI so the user can jump to the needed entry.
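As a sketch, assuming the encoded file stays record-addressable (one record per line here), a simple byte-offset index would let the UI seek straight to an entry:

```python
def build_offset_index(path, key_from_line):
    """Map each record's key to its byte offset in the file."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            index.setdefault(key_from_line(line), offset)
            offset += len(line)
    return index

# Lookup: seek directly to the record instead of scanning the whole file.
# with open(path, "rb") as f:
#     f.seek(index[some_key])
#     record = f.readline()
```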

I have high hopes for this algorithm. If it can really encode away 90% of the file, then the space savings just might save my job. Well, at least until the next round of cost cutting.
