Tuesday, July 9, 2013

Make me a data store please!

Today, I am going to talk about something which requires a bit of creativity, intuition and experience, as well as a *lot* of research(doesn't everything require that?). I will start with two *real* real world cases(yup,they are absolutely real and happened in real life) and then move on to discuss the point I am trying to highlight there.

Case 1:

We got a case on PostgreSQL IRC where a guy was trying to store his massive amount of data in a postgres database, and wanted to optimize his queries's execution times, so wanted to know the best way to design his schemas and tables. We asked him about his data, and he wanted to store his data in
hstore data type i.e. in key value pairs form. When a lot of pairs are stored in a single hstore value, it can get a bit inefficient, especially when you have a lot of writes, since a change in any key value pair will lead to a write of the entire value to the disk. So, suppose we have 100 key value pairs in one store value, a change in any of them will lead to a write to disk of the entire 100 key value pairs.

I recommended him to go and use a graph database instead, as his data had a lot of connectedness in it. That seemed to the right idea to him, and he went off with it.

Case 2:

A friend of mine, who works for a scientific development organization, recently had some queries regarding the design of his data storage. He had a lot of key value pairs sort of data and wanted to store them in a form where his queries get efficient(same thing as everybody,eh?). The punch was that he wanted postgres to store his main data, as he wanted the stability and maturity of postgres( :) ) but wanted his queries to become better.

I told him to use a NoSQL database as a cache (Memcache would be good here, I believe) and keep the hottest key value pairs there. This would require some work for cache invalidation and updating, probably manually or through some application, but would work for him.

Now,what exactly am I trying to highlight here?

The main point I am trying to highlight is that *data storage requires thought*. You cannot do the same design as your next door neighbor and expect it to work for you as you would expect.

Another thing it requires is flexibility. If your data design till now used relational databases, it may or may not be necessary that your data will fit in nicely in the relational model anymore. And vice versa.

One more point I would like to discuss is scalability. Of course, when you start your video sharing website, do not plan on the scale of YouTube (at least, not yet). But, do not design and plan strictly according to your current workload. Think for the future, and design your data stores with the future in mind.

Another thing I commonly observe is how people try to use relational databases (I have seen that for postgres, I cannot speak for other relational DBMS) as NoSQL, or graph databases. This hurts me quite a bit. The fundamental thing that people have to understand is that relational databases are built on the principals of relational algebra and the relational model. If we try to violate the principals of relational model, we are trying to violate the fundamental principals that run our database. If we persist in doing that, how can we expect the performance that relational databases normally deliver?

Think of it like a sandwich. Take the bread, which has been developed, refined, tested with time and holds the experience of a lot of bakers, and add your own toppings till the sandwich is good enough to make you very,very happy. But do not try to make whole wheat bread taste like honey bread, and vice versa.

Next time, when you design your data stores, think, think, and think.

Till next time,



Friday, July 5, 2013

Scale Me Up -- Musings of a scalability researcher

Scalability. What does it imply?

More connections? Yes
More traffic? Yes
More concurrency issues? Yes
More performance issues? Yes

My work at EnterpriseDB leads me down a path less trodden, the intricate and fine details of scalability.
My musings lie here, in this blog post, which you are reading currently.

Having recently completed the connection throttler project in PEM, I moved on to research optimal architectures and protocols to establish a connection between a client and host databases running on different servers, with PEM server as the intermediary. The architecture had to be scalable. This was a new experience for me, as I had to research network protocols which may work well for our use cases. I was a bit intimated at first, but then remembered a quote I had read somewhere:

"If you are getting comfortable, move out."

This is what lead me to get down and start working on it. A lot of interesting points came up, and I learnt some things which are invaluable.

I was aware of the standard protocols, SSH, TLS for security etc. Dave advised me to go and research IPSec tunnel as well, which I did.

Some points here:

1) When looking at protocols for scalability, look at the performance. See how much time computations for the protocol would take, and how much CPU may be used for the same. To take an example, in my research, I found that TLS's authentication through PKI is time consuming and CPU consuming, compared to IPSec's pre shared key, which is much faster.

2) Look at the protocols your protocol is dependent upon. For e.g. IPSec depends on UDP, which does not support fragmentation, and hence cannot support large data packets. This is a disadvantage over TLS and TCP.

So, again, you have a plus on each side, and a corresponding minus on each side as well.

The above points are pretty specific to a part of scalability components, but demonstrate an important point, which I am trying to highlight here:

There is no clear answer.

Wow, where did that come from?

Well, I feel that this is the answer to most of the things we do, decision we make. Every component, every option has its own pros and cons, and we need to evaluate everything on the same level, with an open mind.

All solutions are applicable, we just need to find the one that best suits our use cases and work loads.

I will keep you all updated.