Wednesday, April 2, 2014

Query plan risk estimation -- 'Query Fudge Factor'

Hi all,

A recent discussion on PG-Hackers led me to write this blog post. Query plan risk estimation is yet another idea that treads the same familiar path: improving the estimates made by the query planner.

The idea started with a discussion about planner hints and why PostgreSQL doesn't have them, or shouldn't have them. The main point is that the user can't be allowed to hand a plan to the query planner and force its execution, but the user knows more about the data than we do, hence there may be patterns in the data, which we aren't aware of, that lead to good or bad plans.

Let's take an example. One of the most common cases we see is correlation between multiple attributes in a table. For many use cases the data is real-world data (insurance data, for example). The PostgreSQL query planner does not take cross-column statistics into consideration, so there may be patterns that we are missing, as the sketch below illustrates.
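To make that concrete, here is a small Python sketch (not PostgreSQL code, just an illustration with made-up insurance-style data). It compares the selectivity an independence-assuming planner would estimate for an AND clause, by multiplying per-column selectivities, against the actual joint selectivity when the columns are correlated:

    import random

    random.seed(42)

    # Hypothetical, strongly correlated data: almost every row with
    # policy_type = 'auto' also has vehicle_owned = True.
    rows = []
    for _ in range(100_000):
        if random.random() < 0.3:
            rows.append(('auto', True))                      # correlated pair
        else:
            rows.append(('home', random.random() < 0.05))

    def selectivity(pred):
        return sum(1 for r in rows if pred(r)) / len(rows)

    sel_auto  = selectivity(lambda r: r[0] == 'auto')
    sel_owned = selectivity(lambda r: r[1] is True)

    # What an independence-assuming planner would estimate for the AND clause
    estimated = sel_auto * sel_owned
    # What the data actually contains
    actual    = selectivity(lambda r: r[0] == 'auto' and r[1] is True)

    print(f"estimated joint selectivity: {estimated:.4f}")
    print(f"actual joint selectivity:    {actual:.4f}")

The estimate comes out roughly three times smaller than reality, and that error then cascades into wrong row-count estimates for any joins sitting above this clause in the plan.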

Is it really a good idea to let the user dictate a plan, or even part of a plan, and have the planner follow only that? Maybe not, but we could still take hints from the user about the selectivity they expect for joining attributes. We need not have the kind of hints Oracle has; we could have something much cleaner and easier to use, though. That will need more thinking and planning than what we currently have.

There is one thing that can be done which could prove helpful, though. One source of problems in the query planner arises for a reason I shall explain below:

Suppose a table is created and populated with some data. The query planner maintains statistics about the distribution of values in each attribute of the table. These statistics are kept in two forms: MCVs and a histogram of the remaining values. MCVs (Most Common Values) are the values that occur most frequently in the attribute, and their exact frequencies are stored.
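As a rough sketch of how such statistics get used (a simplification in Python, not the actual PostgreSQL code), an equality selectivity estimate first checks the MCV list, and otherwise assumes the leftover frequency is spread evenly over the remaining distinct values:

    # Simplified per-column statistics, similar in spirit to what ANALYZE
    # collects: the most common values with their frequencies, plus the
    # number of distinct values in the column. (Illustrative numbers only.)
    mcv = {'NY': 0.20, 'CA': 0.15, 'TX': 0.10}   # value -> fraction of rows
    n_distinct = 50                               # distinct values in the column

    def eq_selectivity(value):
        """Estimate the fraction of rows where col = value."""
        if value in mcv:
            # Exact frequency is stored for MCVs.
            return mcv[value]
        # Otherwise, spread the leftover frequency evenly across the
        # remaining (non-MCV) distinct values.
        leftover = 1.0 - sum(mcv.values())
        return leftover / (n_distinct - len(mcv))

    print(eq_selectivity('NY'))   # 0.20, taken straight from the MCV list
    print(eq_selectivity('WY'))   # ~0.0117, the "flat" fallback assumption

For range predicates the planner instead walks the histogram of the non-MCV values, but the principle is the same: everything hinges on the stored statistics matching the data that is actually in the table.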

These statistics are refreshed by running the ANALYZE command, either manually or automatically via autovacuum's analyze runs.

Now suppose a large amount of data is suddenly inserted into or deleted from the table. The autovacuum run is still pending, so the actual data distribution has changed a lot while the stats known to the query planner are now outdated. Selectivity estimates based on those stale stats will be wrong. Very wrong.
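A tiny simulation of the problem, again purely illustrative: we take a statistics snapshot once (as ANALYZE would), then bulk-delete most of the rows matching a value before any re-analyze happens, and compare the stale estimate with reality:

    from collections import Counter

    rows = ['error'] * 50_000 + ['ok'] * 50_000

    def analyze(data):
        """Snapshot of per-value frequencies, as ANALYZE would take."""
        counts = Counter(data)
        total = len(data)
        return {v: c / total for v, c in counts.items()}

    stats = analyze(rows)                    # stats taken now...

    # ...then a sudden bulk deletion happens before autovacuum gets a
    # chance to re-analyze the table.
    rows = [r for r in rows if r != 'error'] + ['error'] * 10

    stale_estimate = stats['error'] * len(rows)            # planner's belief
    actual         = sum(1 for r in rows if r == 'error')  # the truth

    print(f"estimated matching rows: {stale_estimate:.0f}")   # ~25,000
    print(f"actual matching rows:    {actual}")               # 10

An estimate of twenty-five thousand rows versus an actual ten is easily the difference between choosing a sequential scan and choosing an index scan.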

This is not the only case. For example, as mentioned by Tom Lane, LIMIT plans on the assumption that the matching rows are uniformly distributed through the table, which can be quite incorrect.
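To see why that hurts LIMIT in particular, here is another illustrative sketch (not planner code, and the numbers are invented): if the matches are spread evenly, the scan can stop early and looks cheap, but if the matches are actually clustered at the end of the table, almost the whole table has to be read first:

    N = 1_000_000          # rows in the table
    selectivity = 0.01     # fraction of rows the planner thinks will match
    limit = 10

    # Assuming matches are spread evenly, the planner expects to find the
    # first `limit` matches after scanning only this many rows:
    expected_scan = limit / selectivity        # 1,000 rows -> looks very cheap

    # But if every matching row happens to live in the last 1% of the table
    # (e.g. recently inserted data), the scan has to read ~99% of the table
    # before the first match shows up:
    actual_scan = int(N * 0.99) + limit

    print(f"rows the planner expects to scan: {expected_scan:.0f}")
    print(f"rows actually scanned:            {actual_scan}")

The estimated cost of the "cheap" plan is off by three orders of magnitude, yet nothing in the current planner flags that plan as risky.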

A recent discussion around this was the introduction of query plan risk estimation. The point is to add a new factor to query planning: the amount of confidence we have in the plan, or in the selectivity estimate the planner has just produced.

One way is to use a statistical method to predict how far the current actual distribution of an attribute has drifted from the histogram the query planner knows about. The greater the distance, the lower the confidence in the plan. Something along these lines could be achieved using the Central Limit Theorem.
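As a very rough sketch of what such a risk factor could look like (entirely hypothetical; the distance metric and the mapping to a confidence score are chosen arbitrarily for illustration): take the distribution the planner recorded at the last ANALYZE, take a fresh sample of the column, and turn the total variation distance between the two into a confidence score:

    def total_variation_distance(p, q):
        """Half the L1 distance between two discrete distributions."""
        keys = set(p) | set(q)
        return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

    # Distribution the planner recorded at the last ANALYZE (hypothetical).
    known = {'auto': 0.30, 'home': 0.60, 'life': 0.10}

    # Distribution observed in a small fresh sample of the column after a
    # bulk load changed the data.
    current = {'auto': 0.70, 'home': 0.25, 'life': 0.05}

    distance = total_variation_distance(known, current)
    confidence = 1.0 - distance     # arbitrary mapping: farther apart -> riskier

    print(f"distance:   {distance:.2f}")    # 0.40
    print(f"confidence: {confidence:.2f}")  # 0.60

A low confidence score could then nudge the planner toward plans that degrade gracefully when the row-count estimate turns out to be badly off, instead of plans that are only fast if the estimate is exactly right.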

A method I proposed is:

http://www.postgresql.org/message-id/CAOeZVieVa+V-6NYzLLofG6uXfs9Y7N4cUzeoYrYKaQg23nohuw@mail.gmail.com

I haven't researched it in depth yet, but plan to soon enough. I would love to get some feedback on it, though.

I hope we get something along the lines of a query risk factor in soon, so that our query plans perform better.

Peace,

Atri