Friday, September 14, 2007

Data Warehousing

I have recently been immersed in the world of BI, OLAP, XMLA, MDX, DW, UDM, Cube, ROLAP, MOLAP, HOLAP, star schema, snowflake, dimensions and facts.

A data warehouse is a special form of repository that sacrifices storage space for ease of retrieval. The data is stored in a special normalized form that literally looks like a star schema. If you change one attribute, an entire row is duplicated. The data is normalized in a way to ease retrieval and reduce table joins. The data warehouse is nothing but a giant relational database whose schema design makes using plain SQL downright ugly. On top of this repository, lies one or more cubes that represent an aggregated view of the massive amounts of data. There are multiple forms of the cube: multi-dimensional online analytical processing(MOLAP), relational online analytical processing (ROLAP), and hybrid online analytical processing (HOLAP). A ROLAP cube is nothing but a special engine that converts the user requests into SQL and passes it to the relational database, a MOLAP is a pre-aggregated cube that allows the user fast retrieval without consistently requiring the underling data store, and a HOLAP is a hybrid of those two approaches. The reason for the cube technology is that it allows the user to slice and dice massive amounts of data online without any developers involvement. On top of the cube technology, there are a set of user front ends either web based or desktop. One such company is Panorama. Each GUI tool communicated with the cube in a standard language called MDX. A multi-dimensional expression language. An XML version of this language is the XMLA protocol, which was originally invented by the cube GUI company Panorama. Microsoft bought out their original tool, and further developed it into what is today called Microsoft Analysis Services 2005, which is a leading cube framework.

So to summarize:
UDB(Relational Database)
Microsoft Analysis Services 2005 (Cube)
Panorama (GUI)

Now the price, well, for a full blow BI (Business Intelligence) solution, you're easily looking into millions just on storage alone not to mention the license costs of the products. There are free solutions at least on the GUI side: one good one is jpivot.

A Data warehouse is a very powerful concept. It allows you to literally analyze your data in real-time. The business users use a friendly GUI to slice and dice their data, aggregate the numbers in different ways, generate reports, etc... The concept allows you to see ALL your data in any way the user imagines or at least in the number of dimensions defined on your cube. A dimension, by the way, is an attribute that it makes sense to slice by. For example, dates or type columns are good dimensions. A fact on the other hand is the business item that you're aggregating. For example, a trade would be considered a fact.

Once you have a data warehouse the next logical extension is KPI (Key Performance Indicators). Imagine looking at a dashboard, and have pretty colors with green, yellow, and red telling you how much money you're making/losing at that point. KPI are special rules that are applied to the data at the lowest level. When you aggregate up, the colors change depending on how you're slicing the data. This allows you to start at the very top of which region isn't doing so well, and then drill down to the very desk that's loosing money.

A further extension of data warehousing is data mining. This is a off-shot of AI and covers areas such as cluster detection, association rules, etc... There will be further blogs covering this in more detail.

So, if you have a huge budget, I recommend you give this a try. Your company will thank you for it(later). And if you don't have a huge budget, understand whether your problem fits in the BI world, and ask for a huge budget. I've seen too many companies take a cheap route and end up with half baked solutions that have no future.