I am currently working as an intern at EnterpriseDB (EDB), where I am involved in solving scalability issues for PEM, a major EDB product that monitors your Postgres databases.
This is the first time I have worked on such a highly concurrent system. It is a different experience altogether, and the issues you face are very different from what you normally get when working on a single-threaded system.
First up is sheer numbers. Highly concurrent systems, as the name suggests, usually have a large number of threads executing at the same time, and that alone has consequences, even when it does not cause outright problems.
Then, control. Highly concurrent systems can get out of control quickly and, if they are not planned and designed carefully, can behave unpredictably. Ensuring serial behavior where it is required is one of the challenges here.
Then, performance. Contrary to the common belief that threading *always* leads to better performance, in some cases the overhead of creating and maintaining threads can slow things down, which can hurt your business.
Then, finally, serialization, concurrent access control, and lock management. Every concurrent system faces this issue, not only the highly concurrent ones. In a highly concurrent system it becomes all the more challenging: you have to ensure that locks are acquired and released at the appropriate points, and that every lock that is acquired really does get released.
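To make the "ensure that they are released" point concrete, here is a minimal Python sketch. It is not PEM code (PEM's internals are not shown in this post), and the `Counter` class is purely hypothetical. It relies on the `with` statement, which releases a `threading.Lock` when the block exits, even if an exception is raised inside it.

```python
import threading


class Counter:
    """Hypothetical shared counter, used only to illustrate lock handling."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        # The lock is acquired on entry and released on exit from the block.
        with self._lock:
            self.value += 1

    def fail_while_locked(self):
        # Even when the critical section raises, the lock is released.
        with self._lock:
            raise RuntimeError("simulated failure inside the critical section")

    def is_locked(self):
        return self._lock.locked()


def run_increments(counter, num_threads=8, per_thread=1000):
    """Hammer the counter from several threads and return the final value."""
    def worker():
        for _ in range(per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

The same guarantee can be had with an explicit `try`/`finally` around `acquire()` and `release()`; the `with` form is just harder to get wrong.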
My current project is about limiting the number of active connections. This means designing a system that efficiently queues the waiting threads, i.e. the threads that cannot be granted a connection immediately.
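As a rough illustration of capping active connections, the sketch below uses a bounded semaphore in Python. This is an assumption for the sake of example, not how PEM does it: `ConnectionLimiter` and `run_workers` are made-up names, and a short `time.sleep` stands in for real work on a connection.

```python
import threading
import time


class ConnectionLimiter:
    """Hypothetical limiter: at most max_active holders at any moment."""

    def __init__(self, max_active):
        self._slots = threading.BoundedSemaphore(max_active)
        self._stats_lock = threading.Lock()
        self.active = 0  # holders right now
        self.peak = 0    # highest number of simultaneous holders seen

    def __enter__(self):
        self._slots.acquire()  # blocks while every slot is taken
        with self._stats_lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._stats_lock:
            self.active -= 1
        self._slots.release()
        return False  # do not swallow exceptions


def run_workers(limiter, num_workers, hold_seconds=0.01):
    """Start num_workers threads that each hold a slot briefly."""
    def worker():
        with limiter:
            time.sleep(hold_seconds)  # stand-in for using a connection

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Note that a plain semaphore only limits how many threads get in. It makes no promise about the order in which waiting threads are woken, so it does not queue them fairly on its own.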
Now, when you start modifying existing behavior, you need to ensure that you don't break existing functionality. Beyond not breaking it, you shouldn't noticeably affect the cases that already work. One way of doing this (as I have done in this project) is to use the existing code and build on top of it. Some modifications may be required, but reusing existing code that has been tested over time is almost always a good idea. It also saves you from duplicating functionality, and is good practice in general.
Another point to look at is testing. For such a highly concurrent system, you need to ensure that your functionality works with large test data, and that it works *over time*. What I mean here is that a highly concurrent system may execute things in a different order on each run of the same functionality, so your code has to cope with all of those orderings. A single passing run proves little; you need to test repeatedly, over time.
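One way to sketch "test it over many runs" in Python is shown below. Everything here is hypothetical: `run_once` is a toy scenario, and the random start delays are only a crude way to shake up the order in which threads finish. The idea is to check an invariant that must hold whatever the order turns out to be.

```python
import random
import threading
import time


def run_once(num_threads, seed):
    """Run one toy scenario; return worker ids in the order they finished."""
    rng = random.Random(seed)
    delays = [rng.uniform(0, 0.002) for _ in range(num_threads)]
    lock = threading.Lock()
    finished = []

    def worker(worker_id):
        time.sleep(delays[worker_id])  # jitter to vary the interleaving
        with lock:
            finished.append(worker_id)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finished


def stress(runs=20, num_threads=8):
    """Repeat the scenario; return how many distinct finish orders were seen."""
    orders = set()
    for seed in range(runs):
        finished = run_once(num_threads, seed)
        # Invariant: every worker finishes exactly once, in whatever order.
        assert sorted(finished) == list(range(num_threads))
        orders.add(tuple(finished))
    return len(orders)
```

The return value of `stress` is only informational. On most machines it will be greater than one, which is exactly why a single run is not enough.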
Also, you need to test other important properties, like ordering. One important requirement in my current project is that the order in which threads arrive is the order in which they are executed later on, once they are allowed to establish a connection. This needs some thinking, and some good test cases.
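The ordering requirement can be sketched with a ticket queue guarded by a condition variable. This Python version is only an illustration under my own assumptions (`FifoGate` and `demo_order` are hypothetical names, not PEM's design). "Arrival order" here means the order in which threads enter `acquire()`.

```python
import threading
import time
from collections import deque


class FifoGate:
    """Admits at most max_active threads at once, strictly in arrival order."""

    def __init__(self, max_active):
        self._cond = threading.Condition()
        self._max_active = max_active
        self._active = 0
        self._next_ticket = 0
        self._waiting = deque()  # tickets of queued threads, oldest first

    def acquire(self):
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            self._waiting.append(ticket)
            # Proceed only when at the head of the queue AND a slot is free.
            while self._waiting[0] != ticket or self._active >= self._max_active:
                self._cond.wait()
            self._waiting.popleft()
            self._active += 1
            # Wake the others so the new head can re-check for a free slot.
            self._cond.notify_all()

    def release(self):
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def waiting_count(self):
        with self._cond:
            return len(self._waiting)


def demo_order(num_workers=5):
    """Queue workers in a known order behind a held slot; return admission order."""
    gate = FifoGate(max_active=1)
    gate.acquire()  # hold the only slot so every worker has to queue
    admitted = []

    def worker(worker_id):
        gate.acquire()
        admitted.append(worker_id)  # runs while holding the only slot
        gate.release()

    threads = []
    for worker_id in range(num_workers):
        t = threading.Thread(target=worker, args=(worker_id,))
        t.start()
        # Wait until this worker is queued before starting the next one,
        # so the arrival order is known.
        while gate.waiting_count() < worker_id + 1:
            time.sleep(0.001)
        threads.append(t)

    gate.release()
    for t in threads:
        t.join()
    return admitted
```

`demo_order` doubles as a test case: it forces a known arrival order and then checks that admission follows it. Waking every waiter with `notify_all()` is simple but wasteful with many threads; a per-thread event or condition would scale better.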
These are the basics. Let's think them over, and continue our discussion in the next blog post.
Peace,
Atri