Query optimization in relational databases

Lucas Misael
Jun 8, 2021


Take a minute to think about how much data is traveling around during one single day.

How about the data your company generates daily, weekly, and monthly? Rest assured, that volume only keeps growing.

We know it’s a lot of data, and the number will keep growing, because we are saving new data all the time.

As the amount of data grows, the speed of retrieving information drops, and at that point we find ourselves needing to optimize our queries.

It’s like a drawer where we keep our personal documents. In the beginning, when there are only a few documents, we find what we need quickly; but over time, as we put more and more documents in the drawer, the time needed to find a specific one increases, unless we look for a way to organize the documents into folders or otherwise separate them. The key is to optimize the search with the least possible effort.

Our databases are no different, except they don’t use folders: they use the DBMS’s query optimizer.

According to C.J. Date in his book An Introduction to Database Systems, query optimization in relational databases is challenging, especially for large companies, which demand good performance from their systems; at that point a query optimizer becomes necessary. But it is also an opportunity: precisely because relational expressions sit at a high semantic level, optimizing them automatically is feasible.

As stated above, databases use the query optimizer to find data in the cheapest way possible, and you may be asking yourself… but Lucas, what is a query optimizer?

A query optimizer analyzes your query and chooses the most efficient way to execute it. It does this through cost calculations over relational algebra, determining the least costly way to fetch the data and freeing the user from that complexity: it enumerates the possible plans for retrieving the requested information, estimates each one, and picks the best. In short, the query optimizer estimates which execution plan has the lowest operating cost.
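As a small illustration, most engines let you ask the optimizer which plan it chose. Here is a sketch using SQLite through Python (the `orders` table is a hypothetical example; Oracle and SQL Server expose similar EXPLAIN facilities):

```python
import sqlite3

# In-memory database with a small sample table (hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")

# Ask the optimizer how it would execute this query, without running it
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'Ana'"
).fetchall()
for row in plan:
    print(row[3])  # with no index available, SQLite reports a full table scan
```

The last column of each plan row describes the chosen access path; this is exactly the output the optimizer produces after comparing the candidate plans.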

Many factors feed into this cost estimate, such as whether the data is stored contiguously, whether it is ordered, and so on.

To estimate the size of intermediate results and their costs, the optimizer keeps some statistics, such as:

  • The number of tuples in each relation
  • The number of distinct values in each secondary index
  • The height of the B-tree for each index
  • The maximum and minimum values of the existing keys

With this information the query optimizer calculates and compares the execution costs of the possible plans and chooses the least costly one.
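In SQLite, for example, these statistics are gathered by the ANALYZE command and stored in the `sqlite_stat1` catalog table (a sketch with a made-up table `t`; Oracle, SQL Server, and PostgreSQL keep equivalent statistics in their own catalogs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (k INTEGER)")
conn.execute("CREATE INDEX idx_k ON t(k)")
# 1000 rows, 10 distinct values of k (so ~100 rows per distinct key)
conn.executemany("INSERT INTO t VALUES (?)", [(i % 10,) for i in range(1000)])

# Collect statistics (row counts, index selectivity) for the optimizer
conn.execute("ANALYZE")

# The gathered statistics are visible in sqlite_stat1
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)  # the stat string starts with the total row count
```

The `stat` column encodes the tuple count and the average rows per distinct index value, which is exactly the kind of input the cost formulas above consume.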

Cool, so databases have an optimizer that does all the hard work for us, but can we do something to improve our queries? The answer is yes!

There are many ways to improve a query, especially when we have a large amount of data.

Use Indexes

One of the factors with the biggest impact on a SQL statement is the use of indexes, all the more so on tables holding large amounts of data. Using indexes well speeds up execution and drastically decreases the time spent searching for data.
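You can see the effect directly in the execution plan. In this SQLite sketch (hypothetical `users` table), the same query switches from a full scan to an index lookup once the index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"user{i}@example.com",) for i in range(10_000)])

query = "SELECT id FROM users WHERE email = 'user42@example.com'"

# Without an index on email, the optimizer has no choice but a full scan
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# With an index on the searched column, the plan becomes an index lookup
conn.execute("CREATE INDEX idx_users_email ON users(email)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(before[0][3])  # a SCAN over the whole table
print(after[0][3])   # a SEARCH using idx_users_email
```

On a table with millions of rows, that difference between scanning everything and searching a B-tree is where the drastic speedup comes from.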

Explore Parallelism

Nowadays, multi-core microprocessors are dominant; even portable devices like smartphones and tablets ship with multiple cores. Yet many applications don’t take advantage of this to extract the best available performance. Although that is mostly a concern at the application development stage, we can also think about parallelism in our databases: in Oracle Database and SQL Server, for example, there are ways to check whether your statement was executed in parallel.

Be careful with joins

For anyone who works with SQL, it’s important to remember that joining many tables can seriously hurt a query’s performance.

Divide a complex statement into several smaller ones

Mauro Pichiliani says that the best way to get a fast query isn’t to write a single, big, complex statement that fetches all the data or calculates everything at once. Instead, when developing a report, we can break the work into small pieces, which also makes the code easier to maintain.

But what happens if I need to run more than one query to build a report?

In these cases there are some options we may consider, e.g. using stored procedures or views.
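For instance, a view can encapsulate one step of the report so that each consumer query stays short. A SQLite sketch with a hypothetical `sales` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100.0), ("north", 50.0), ("south", 80.0)])

# Encapsulate one piece of the report (the aggregation) in a view...
conn.execute("""
    CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
""")

# ...so every report query that builds on it stays small and readable
rows = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(rows)  # [('north', 150.0), ('south', 80.0)]
```

Each report step can then query the view instead of repeating the big aggregation, which is the maintenance win the article describes.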

Set Theory

This is one of the most important and widespread techniques in databases: replacing row-by-row (cursor) statements with statements based on set theory (set-based). Instead of searching row by row, we operate on whole sets of data at once. In many situations this is very useful when developing our queries.

Besides being very powerful mathematically speaking, if you know how to apply this theory you can get great results in query performance.
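A tiny sketch of the difference, using SQLite and a hypothetical `products` table: the cursor style issues one UPDATE per row from application code, while the set-based style does the same work in a single statement the engine can optimize:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO products (price) VALUES (?)",
                 [(10.0,), (20.0,), (30.0,)])

# Cursor style: fetch every row, then send one UPDATE per row (a 10% raise)
for pid, price in conn.execute("SELECT id, price FROM products").fetchall():
    conn.execute("UPDATE products SET price = ? WHERE id = ?", (price * 1.10, pid))

# Set-based style: one statement, the engine applies it to the whole set
conn.execute("UPDATE products SET price = price * 1.10")

total = conn.execute("SELECT ROUND(SUM(price), 2) FROM products").fetchone()[0]
print(total)  # both raises applied: (10+20+30) * 1.1 * 1.1 = 72.6
```

Both approaches reach the same result, but the cursor version pays one round trip and one statement execution per row, which is what set-based SQL avoids.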

Modify or Delete a small amount of data at a time

It’s important to know that every statement that modifies data, whether inside an implicit or an explicit transaction (BEGIN TRANSACTION), must be recorded in a log, called the transaction log.

Writing to this transaction log can hurt your application’s performance when a statement touches a large volume of data, because it triggers a large number of writes to the log file. When we use small transactions instead, we hold locks for less time and put less pressure on the log.

In other words, we break a big problem into small ones to solve.
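A sketch of the batching idea in SQLite (hypothetical `log_events` table): instead of one huge DELETE, we delete in chunks, so each transaction, and the log activity and locking it implies, stays small:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log_events (id INTEGER PRIMARY KEY, old INTEGER)")
conn.executemany("INSERT INTO log_events (old) VALUES (?)", [(1,)] * 10_000)

BATCH = 1_000
deleted = 0
while True:
    with conn:  # each batch commits in its own short transaction
        cur = conn.execute(
            "DELETE FROM log_events WHERE id IN "
            "(SELECT id FROM log_events WHERE old = 1 LIMIT ?)",
            (BATCH,))
    if cur.rowcount == 0:
        break  # nothing left to delete
    deleted += cur.rowcount

print(deleted)  # 10000 rows removed, 1000 at a time
```

Each iteration is a small transaction, so no single commit has to flush ten thousand log records or hold its locks for the whole cleanup.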

As we’ve seen, there are many ways to improve query performance, and knowing these concepts and our data models well is essential for writing better queries. Below I leave some of the articles and books I used as a basis for this article.

Book:

An Introduction to Database Systems — C.J. Date

Articles:

10 Técnicas para otimização de instruções SQL

Otimização de Consultas SQL

Otimização de Consultas Relacionais

Processamento e Otimização de Consultas
