Post

Optimizing MySQL queries

One suboptimal database query can effectively paralyse whole application. Tracking down such bottleneck is one thing but speeding it up is not a trivial task. This article is intended to show how one can analyse and refactor slow database query.

In all examples I was using MySQL Workbench and Sakila example database.

DB's structure is illustrated here. Please open it and refer to before reading any assignment's SQL.

Assignment

Let's say we want to have a...payment amount of a latest rental date per film. Query I wrote:

SELECT title, amount FROM (
	SELECT film.title, payment.amount
	FROM film
	JOIN inventory
		ON inventory.film_id = film.film_id
	JOIN rental
		ON rental.inventory_id = inventory.inventory_id
	JOIN payment
		ON payment.rental_id = rental.rental_id
	ORDER BY film.title, rental.rental_id DESC, amount DESC
) AS f
GROUP BY title

On the bright side it shows all relations between tables in a very readable, straightforward way. Too bad it is suboptimal. I could easily support this claim using MySQL's EXPLAIN, but MySQL Workbench has a very useful feature of presenting EXPLAIN's  results in a graphical form:

One should read this diagram starting from the lower-left corner. This query is actually very poor when it comes to efficiency. Despite the fact it utilizes indexes JOINing another tables, in the end it has to perform grouping operation and reject almost all results. This stage (rejecting) is depicted as "Full Table Scan" rectangle.  This will become even more problematic over time because our tables will grow, not shrink. Especially rentals (history) table.

Let's consider another approach to the same problem:

SELECT film.title, payment.amount 
FROM film
JOIN rental
	ON rental.rental_id = (
		SELECT max(rental_id)
		FROM rental
		WHERE inventory_id IN (
			SELECT inventory_id
			FROM inventory 
			WHERE film_id = film.film_id
		)
	)
JOIN payment
	ON payment.rental_id = rental.rental_id
ORDER BY amount DESC

It may look complicated owing to usage of a dependant subquery. So, what is it anyway? Using example is the best way to describe it. Going step by step, we need some columns from film and payment tables. However, there is no way to JOIN them directly and this is why we need rental table for. Unfortunately, there is no straightforward way to JOIN it either.

SELECT film.title, payment.amount 
FROM film
JOIN rental
	ON rental.rental_id = -- ?
JOIN payment
	ON payment.rental_id = rental.rental_id

Rentals are associated with inventory table rows by a relation called many-to-one. In other words, there may be multiple rentals of the same inventory item which is directly associated with a film (finally). So in order to reach a latest rental I wrote a subquery:

SELECT max(rental_id)
FROM rental
WHERE inventory_id IN (
    SELECT inventory_id
    FROM inventory 
    WHERE film_id = film.film_id -- <-------- DEPENDENCE!
)

It is dependant on the outer query because to calculate outcome it needs a film_id. Therefore, DB engine has to execute it for each row from film table. It may sound as if it was an inefficient solution. Let's dispute this claim using MySQL Workbench:

As you can see, it's not completely bad after all. We see that there are multiple indexes usages, so despite the fact we execute subquery per each row from film table, we still get latest rental quite efficiently.

On a side note - I can't see the reason why MySQL engine used Full Index Scan on the film table. It makes no sense because query returns all rows from film table anyway - I did not introduce any kind of filtering.

Resultant query is quite satisfactory. There is no waste - no rows are rejected at all. It also leaves a room for easy filtering film table.

This post is licensed under CC BY 4.0 by the author.

Comments powered by Disqus.