Tuesday, July 9, 2013

Make me a data store please!

Today, I am going to talk about something which requires a bit of creativity, intuition and experience, as well as a *lot* of research(doesn't everything require that?). I will start with two *real* real world cases(yup,they are absolutely real and happened in real life) and then move on to discuss the point I am trying to highlight there.

Case 1:

We got a case on PostgreSQL IRC where a guy was trying to store his massive amount of data in a postgres database, and wanted to optimize his queries's execution times, so wanted to know the best way to design his schemas and tables. We asked him about his data, and he wanted to store his data in
hstore data type i.e. in key value pairs form. When a lot of pairs are stored in a single hstore value, it can get a bit inefficient, especially when you have a lot of writes, since a change in any key value pair will lead to a write of the entire value to the disk. So, suppose we have 100 key value pairs in one store value, a change in any of them will lead to a write to disk of the entire 100 key value pairs.

I recommended him to go and use a graph database instead, as his data had a lot of connectedness in it. That seemed to the right idea to him, and he went off with it.

Case 2:

A friend of mine, who works for a scientific development organization, recently had some queries regarding the design of his data storage. He had a lot of key value pairs sort of data and wanted to store them in a form where his queries get efficient(same thing as everybody,eh?). The punch was that he wanted postgres to store his main data, as he wanted the stability and maturity of postgres( :) ) but wanted his queries to become better.

I told him to use a NoSQL database as a cache (Memcache would be good here, I believe) and keep the hottest key value pairs there. This would require some work for cache invalidation and updating, probably manually or through some application, but would work for him.

Now,what exactly am I trying to highlight here?

The main point I am trying to highlight is that *data storage requires thought*. You cannot do the same design as your next door neighbor and expect it to work for you as you would expect.

Another thing it requires is flexibility. If your data design till now used relational databases, it may or may not be necessary that your data will fit in nicely in the relational model anymore. And vice versa.

One more point I would like to discuss is scalability. Of course, when you start your video sharing website, do not plan on the scale of YouTube (at least, not yet). But, do not design and plan strictly according to your current workload. Think for the future, and design your data stores with the future in mind.

Another thing I commonly observe is how people try to use relational databases (I have seen that for postgres, I cannot speak for other relational DBMS) as NoSQL, or graph databases. This hurts me quite a bit. The fundamental thing that people have to understand is that relational databases are built on the principals of relational algebra and the relational model. If we try to violate the principals of relational model, we are trying to violate the fundamental principals that run our database. If we persist in doing that, how can we expect the performance that relational databases normally deliver?

Think of it like a sandwich. Take the bread, which has been developed, refined, tested with time and holds the experience of a lot of bakers, and add your own toppings till the sandwich is good enough to make you very,very happy. But do not try to make whole wheat bread taste like honey bread, and vice versa.

Next time, when you design your data stores, think, think, and think.

Till next time,



No comments:

Post a Comment