Optimizing MySQL queries
One suboptimal database query can effectively paralyse whole application. Tracking down such bottleneck is one thing but speeding it up is not a trivial task. This article is intended to show how one can analyse and refactor slow database query.
In all examples I was using MySQL Workbench and Sakila example database.
DB's structure is illustrated here. Please open it and refer to before reading any assignment's SQL.
Assignment
Let's say we want to have a...payment amount of a latest rental date per film. Query I wrote:
SELECT title, amount FROM ( SELECT film.title, payment.amount FROM film JOIN inventory ON inventory.film_id = film.film_id JOIN rental ON rental.inventory_id = inventory.inventory_id JOIN payment ON payment.rental_id = rental.rental_id ORDER BY film.title, rental.rental_id DESC, amount DESC ) AS f GROUP BY title
On the bright side it shows all relations between tables in a very readable, straightforward way. Too bad it is suboptimal. I could easily support this claim using MySQL's EXPLAIN, but MySQL Workbench has a very useful feature of presenting EXPLAIN's results in a graphical form:
One should read this diagram starting from the lower-left corner. This query is actually very poor when it comes to efficiency. Despite the fact it utilizes indexes JOINing another tables, in the end it has to perform grouping operation and reject almost all results. This stage (rejecting) is depicted as "Full Table Scan" rectangle. This will become even more problematic over time because our tables will grow, not shrink. Especially rentals (history) table.
Let's consider another approach to the same problem:
SELECT film.title, payment.amount FROM film JOIN rental ON rental.rental_id = ( SELECT max(rental_id) FROM rental WHERE inventory_id IN ( SELECT inventory_id FROM inventory WHERE film_id = film.film_id ) ) JOIN payment ON payment.rental_id = rental.rental_id ORDER BY amount DESC
It may look complicated owing to usage of a dependant subquery. So, what is it anyway? Using example is the best way to describe it. Going step by step, we need some columns from film and payment tables. However, there is no way to JOIN them directly and this is why we need rental table for. Unfortunately, there is no straightforward way to JOIN it either.
SELECT film.title, payment.amount FROM film JOIN rental ON rental.rental_id = -- ? JOIN payment ON payment.rental_id = rental.rental_id
Rentals are associated with inventory table rows by a relation called many-to-one. In other words, there may be multiple rentals of the same inventory item which is directly associated with a film (finally). So in order to reach a latest rental I wrote a subquery:
SELECT max(rental_id) FROM rental WHERE inventory_id IN ( SELECT inventory_id FROM inventory WHERE film_id = film.film_id -- <-------- DEPENDENCE! )
It is dependant on the outer query because to calculate outcome it needs a film_id. Therefore, DB engine has to execute it for each row from film table. It may sound as if it was an inefficient solution. Let's dispute this claim using MySQL Workbench:
As you can see, it's not completely bad after all. We see that there are multiple indexes usages, so despite the fact we execute subquery per each row from film table, we still get latest rental quite efficiently.
On a side note - I can't see the reason why MySQL engine used Full Index Scan on the film table. It makes no sense because query returns all rows from film table anyway - I did not introduce any kind of filtering.
Resultant query is quite satisfactory. There is no waste - no rows are rejected at all. It also leaves a room for easy filtering film table.
Comments powered by Disqus.