Sunday, May 11, 2008

J2EE, JEE, EJB 2 and 3, Spring, WebLogic

Bla, bla, bla, bla. First, I'd like to say that I hate EJB 2, and I'm starting to seriously dislike EJB 3. J2EE, and the stupid rebranding to JEE. There are many things wrong with EJB 2. One of them is the security model, another is the QL language, another is its half-baked persistence approach. The only gain, and even that's a stretch, would be the transaction management. But even there, the EJB authors completely f'd up and created a convoluted model.

EJBs, originally, were supposed to save the day: usher in an era where corporate code monkeys could get a bit dumber, focus on the "business logic," and forget all that complicated plumbing like transaction management and threading. Threading? Ha, said the EJB spec writers. No threads needed. The container will take care of that for you. Don't worry your pretty little head, little developer, just go play in your little sandbox with your pretty little business logic. Well, many developers did just that. And in time, they realized that not only did EJBs complicate everything, but you needed EJB experts just to support the Frankenstein monstrosities that got produced.

Somewhere around there, Spring dawned and flowers bloomed. Now, Rod had the right idea, but the implementation lacked. Spring has proved to be another monster. The issue really is the complexity. Getting any value from Spring, or even understanding how Spring works, requires deep internal knowledge, something that both Spring and EJB fail to mention. Spring has an insane dependency on configuration, to such a degree that it makes the code unreadable. A developer now has to funnel through massive amounts of XML config, plus massive amounts of useless interfaces, only to find some impl class that's AOP'd into the damn DAO. Now, don't get me wrong, all you Spring wankers, I am more than aware of all the benefits of IoC and AOP and unit testing. I know what's running through your head: heck, he doesn't understand proper test-driven development, bla, bla, bla, how dare he criticize that which is holy and has saved the day. Well, I don't like it. It replaces something really bad; I agree with that. I believe in the simplicity of design and the readability of the code. I agree with the principal value of AOP and IoC, but I wonder if there is a better and cleaner way to achieve some of the same things Spring set out to do.

The EJB 3.0 persistence part is yet another over-arching bit of spec-iness from the same writers. It's basically useless for anything slightly larger than a pet-store website. The goals are definitely admirable, but they must have known how far off the target they would actually land. They are attempting to map POJOs onto a relational table structure. What they don't tell you is that it's impossible to accomplish without seriously hampering the design of your relational model. Now, if you had an object database, perhaps I wouldn't be saying that. Perhaps a more valid attempt at this will come from the space of semantic networks, and from companies like Business Objects with the concept of a universe design, which provides a layer above the raw data.
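For the record, here's roughly what that mapping looks like with JPA annotations. The entities below are made up, purely to show the flavor: every choice in the object graph, the collection, the cascade, the foreign key, leaks straight into the table layout.

```java
import java.util.List;
import javax.persistence.CascadeType;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;
import javax.persistence.Table;

// Hypothetical order / line-item pair, just to illustrate the POJO-to-table mapping.
@Entity
@Table(name = "PURCHASE_ORDER")
public class PurchaseOrder {

    @Id
    @GeneratedValue
    private Long id;

    // The object model wants a plain collection; the relational model now has
    // to grow an ORDER_ID foreign key shaped around this object graph.
    @OneToMany(mappedBy = "order", cascade = CascadeType.ALL)
    private List<LineItem> items;

    // getters/setters omitted
}

@Entity
@Table(name = "LINE_ITEM")
class LineItem {

    @Id
    @GeneratedValue
    private Long id;

    @ManyToOne
    @JoinColumn(name = "ORDER_ID")
    private PurchaseOrder order;

    @Column(name = "SYMBOL", length = 12)
    private String symbol;
}
```

Clean enough on a toy example; the trouble starts when the DBA wants the schema to serve anything other than this one object graph.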

To return to EJB and Spring bashing, I disagree with the basic notion of the goals. Each attempts to reduce how much a developer needs to know. Of course, the reason for that is to dumb down the developer, replace him or her with a body from a cheap foreign country, and keep on churning code. A different model would be to replace the army of code monkeys with a few diligent developers, but move the responsibility for business development to the BUSINESS. At the end of the day, no one knows the end goal except the business, and no developer will ever be better than the business user at solving the business problem. Now, a lot of developers try, and they are usually bad developers. So they are bad at business, and they are bad at programming. And yes, they need yet another framework to keep them from hurting themselves. I think we should focus our attention on reducing our field to a few competent experts who deserve the title of developer, and who focus on the technical development that enables business users to do the thing most developers do today: business coding.

A better compression algorithm

I am presented with an interesting problem. Usually, my employer burns money indiscriminately, but lately, with the market in a tailspin, all costs are being evaluated. To avoid being one of those costs, I need to find a way to save money for the company. One of those ways is file storage space for document management systems. Unlike the basic $50, 100 GB drive you buy at Circuit City for your Dell, corporate disk storage is highly expensive, with EMC SRDF storage running around $1M for 1 TB.

Audit and regulatory rules require that basically all files are kept. A large number of those files are data feeds from external systems. The files are structured and in a readable format such as fixed-length, delimited, or XML. My idea, which is not that unique, is to apply a heuristic compression algorithm to the data files. I am going to leverage the work done by the FIX Protocol committee on the FAST specification, which defines a number of optimal heuristic encoding schemes. FAST defines a compression algorithm for market data, but the same principles apply to file storage.

http://www.fixprotocol.org/fast

The concept is quite interesting. The compression algorithm basically attempts to find data patterns in the file and encode them away. Let's say you have a column that's an incrementing number: 1, 2, 3, ... n, n+1. The encoder will identify that this is an incrementing column and encode it as algo: { previous + 1, starting with 0 }. We've just encoded away an entire column and took no space to do it. Let's try another example: abcdefg, abcdefe, abcdefn, abcdef5, etc. In this case, the first "abcdef" is the same in every value, and only the last character changes. We can encode the prefix as a constant and only store the last character: g, e, n, 5, etc.
There are a lot more sophisticated algorithms defined in the FAST protocol, but you get the idea.
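To make the idea concrete, here's a toy sketch in Java of how a scanner might pick a per-column operator. The operator names and the class are my own simplification, not the FAST spec's vocabulary; real FAST templates define per-field operators (constant, copy, increment, delta, and so on) with far more nuance.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnAnalyzer {

    enum Operator { CONSTANT, INCREMENT, COMMON_PREFIX, RAW }

    // Look at one column's values and pick the cheapest way to encode it away.
    static Operator pickOperator(List<String> values) {
        if (values.isEmpty()) {
            return Operator.RAW;
        }
        String first = values.get(0);
        boolean constant = true;
        boolean increment = true;

        for (int i = 1; i < values.size(); i++) {
            if (!values.get(i).equals(first)) {
                constant = false;
            }
            try {
                long prev = Long.parseLong(values.get(i - 1));
                long curr = Long.parseLong(values.get(i));
                if (curr != prev + 1) {
                    increment = false;
                }
            } catch (NumberFormatException e) {
                increment = false;
            }
        }
        if (constant) {
            return Operator.CONSTANT;   // store the value once; nothing per record
        }
        if (increment && values.size() > 1) {
            return Operator.INCREMENT;  // store "previous + 1"; nothing per record
        }

        // Fall back to a shared prefix: abcdefg, abcdefe, ... -> prefix "abcdef",
        // so only the varying tail needs to be stored per record.
        String prefix = first;
        for (String v : values) {
            int max = Math.min(prefix.length(), v.length());
            int j = 0;
            while (j < max && prefix.charAt(j) == v.charAt(j)) {
                j++;
            }
            prefix = prefix.substring(0, j);
        }
        return prefix.isEmpty() ? Operator.RAW : Operator.COMMON_PREFIX;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("1", "2", "3", "4");
        List<String> codes = Arrays.asList("abcdefg", "abcdefe", "abcdefn", "abcdef5");
        System.out.println(pickOperator(ids));    // INCREMENT
        System.out.println(pickOperator(codes));  // COMMON_PREFIX
    }
}
```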

The data in the file starts to mean something: the encoder actually attempts to represent the patterns present in the file. These patterns have the potential to save a lot more space than a traditional compression algorithm based on Huffman encoding. How much space? How about an average case of > 80%, compared with a best case of around 40% for ZIP. And don't forget, the result can still be zipped.
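Back of the envelope, assuming those ratios hold against the EMC figure above: on 1 TB of feeds at roughly $1M per TB, ZIP at 40% frees about 400 GB, call it $400K worth of storage, while an 80% heuristic encoding frees about 800 GB, around $800K, before you even zip the result.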

The program will read a file, scan through all the data points, figure out the optimal encoding algorithm, and then actually do the compression. The encoding algorithm will be needed to decompress the file, so the first field in the file will carry the number of bytes needed for the encoding algorithm, followed by the encoding algorithm itself, and finally the data. This allows us to store the encoding scheme with the file.
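A minimal sketch of that layout, assuming the encoder hands us its template already serialized to bytes. The class and method names here are mine, not part of FAST; the decompressor just reads the int back, slurps that many template bytes, and then knows how to decode the rest.

```java
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Container layout: [template length][template bytes][encoded data].
public class EncodedFileWriter {

    public static void write(String path, byte[] template, byte[] encodedData) throws IOException {
        DataOutputStream out = new DataOutputStream(new FileOutputStream(path));
        try {
            out.writeInt(template.length); // first field: bytes needed for the encoding algorithm
            out.write(template);           // the encoding algorithm (template) itself
            out.write(encodedData);        // finally, the encoded records
        } finally {
            out.close();
        }
    }
}
```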

One enhancement to FAST would be to allow the pre-processor to re-arrange the file. The encoding is mostly based on previous records, so the more similar subsequent entries are, the higher the compression rate. Another enhancement may be to bitmap away typed fields: if a million-entry file has only 100 unique types, it might be more optimal to encode the bitmap separately and then encode away the type id. Another extension may be to see whether a correlation exists between fields, rather than between subsequent records.
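As a sketch of that first enhancement, the pre-processor could simply group records by a type column before encoding. The delimiter and column index below are assumptions, and a real version would also have to record the original ordering so the file can be reconstructed exactly.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

// Sort a delimited file by one column so that similar records sit next to
// each other, which is exactly where previous-record delta encoding wins.
public class RecordRearranger {

    public static void rearrange(Path in, Path out, String delimiter, int typeColumn) throws IOException {
        final Pattern split = Pattern.compile(Pattern.quote(delimiter));
        List<String> records = Files.readAllLines(in);
        records.sort(Comparator.comparing(line -> split.split(line)[typeColumn]));
        Files.write(out, records);
    }

    public static void main(String[] args) throws IOException {
        // e.g. group a pipe-delimited feed by its third field before encoding
        rearrange(Paths.get("feed.dat"), Paths.get("feed.sorted.dat"), "|", 2);
    }
}
```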

Another extension to this architecture would be to write the files in a way that improves lookup cost: index the files, and build an intuitive UI for the user to jump to the needed entry.

I have high hopes for this algorithm. If it can really encode away 90% of the file, then the space savings just might save my job. Well, at least until the next round of cost cutting.