Tuesday, July 15, 2014

Time to retire B Tree from database storage? Rise of cross correlation and the need for change in design

B Tree has been the standard data structure to store data in a database. Pages for a database are maintained in a BTree with various semantics. BTree has been the stable choice for such applications for a long time. BTree is perfect for many reasons. It has more width, allowing for lesser disk access even for a large number of pages, allowing for better storage structures, faster access of pages.

But, BTree has a drawback, which is actually a drawback for all one dimensional data structures. BTree gives no way to track correlations between multiple attributes or capture the "one-ness" of the various attributes present in the same tuple.

Building a BTree index on top of the data doesnt help as well. All attributes are treated independently and the data is arranged per page or per attribute.

The problem here is that the current design can lead to query planner not knowing anything about correlated data or the possibility of correlation between the actual data present in two attributes of a relation. This can lead to query planner not using the correlation to generate more optimal plans.

Let us discuss a case where this kind of design can lead to sub optimal query plans. Consider the following data:

Minimum Stopping at 40 kph

A strong correlation is present between car age and minimum stopping at 40 kph. Looking for a query which could be a part of an analytical application. A predictive application trying to look at predicting the next tuple should get plans based on the already existing correlation between car age and minimum stopping at 40 kph.

If we use a data structure like Doubly Chained Tree (http://en.wikipedia.org/wiki/Doubly_chained_tree), we can effectively capture the data that is "multi dimensional". The capturing can be as simple maintaining pointers to data items for the same attribute to having a complex design capturing the actual correlation in terms of median and other statistical terms.

Using a multi dimensional data structure also allows for faster lookups from disk for queries that involve fetching large data for a single attribute and then joining it to another attribute of the same table. Using DCT, we can simply go and fetch the values for both the attributes simultaneously, using multiple pointers present.

This shall require a lot of research. Lets hope we get positive results for this one.

For more details, please look at:  https://github.com/atris/Concerted