Using redundant conditions to unlock indexes in MySQL #164

Open
luantranminh opened this issue Jun 11, 2023 · 0 comments
https://planetscale.com/blog/redundant-and-approximate-conditions

Let's say you have a todos table with a created_at column that records a timestamp of when the record was created.

CREATE TABLE `todos` (
  `id` int NOT NULL AUTO_INCREMENT,
  `title` varchar(255) NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `created_at` (`created_at`)
)

Obfuscated indexes

In this table, we've added an index to the created_at column to quickly filter by that timestamp. When we query against the created_at column to find records created in the last 24 hours, we see that MySQL is using the index as we'd expect:

EXPLAIN SELECT * FROM todos WHERE created_at > NOW() - INTERVAL 24 HOUR;

-- | id | type  | possible_keys | key        | key_len | ref | rows | filtered | Extra                 |
-- |----|-------|---------------|------------|---------|-----|------|----------|-----------------------|
-- |  1 | range | created_at    | created_at | 4       |     |    1 |   100.00 | Using index condition |

However, if we wrap this column in a function, we're obfuscating the column from MySQL, and it can no longer use the index.

EXPLAIN SELECT * FROM todos WHERE YEAR(created_at) = 2023;

-- | id | type | possible_keys | key | key_len | ref | rows  | filtered | Extra       |
-- |----|------|---------------|-----|---------|-----|-------|----------|-------------|
-- |  1 | ALL  |               |     |         |     | 39746 |   100.00 | Using where |

In some cases, there are ways around index obfuscation. In this example, we could use a range scan instead of the YEAR function to obtain the same result.

EXPLAIN SELECT * FROM todos WHERE created_at BETWEEN '2023-01-01 00:00:00' AND '2023-12-31 23:59:59';

-- | id | type  | possible_keys | key        | key_len | ref | rows | filtered | Extra                 |
-- |----|-------|---------------|------------|---------|-----|------|----------|-----------------------|
-- |  1 | range | created_at    | created_at | 4       |     |    1 |   100.00 | Using index condition |
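
A half-open range is a common alternative to BETWEEN for this kind of filter. The sketch below assumes the same todos table. It avoids hard-coding 23:59:59 as the upper bound, which would miss rows in the final second if created_at ever stored fractional seconds (for example as timestamp(6), which is not what the table above uses).

```sql
-- Same rows as YEAR(created_at) = 2023, written as a half-open range:
-- inclusive lower bound, exclusive upper bound.
EXPLAIN SELECT * FROM todos
WHERE created_at >= '2023-01-01 00:00:00'
  AND created_at <  '2024-01-01 00:00:00';
```

Because created_at is still compared directly against constants, this form should remain eligible for the same range scan on the created_at index.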

Redundant conditions in MySQL

What are redundant conditions?

A redundant condition is one that can be added to a query without changing its results. Let's take a look at a contrived example to illustrate the point. In this example, we're selecting the todos with an id of less than five.

SELECT
  *
FROM
  todos
WHERE
  id < 5
  AND
  id < 10; -- This does... nothing

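In this case, the redundant condition is id < 10: every row with an id below five also has an id below ten, so the extra condition cannot change the result set.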
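
One way to convince yourself that a condition really is redundant is to compare row counts with and without it. This is only a quick sanity check against the current data, not a proof, and the column aliases are made up for illustration.

```sql
-- Both counts should be identical if id < 10 is truly redundant here.
SELECT
  (SELECT COUNT(*) FROM todos WHERE id < 5)             AS without_redundant,
  (SELECT COUNT(*) FROM todos WHERE id < 5 AND id < 10) AS with_redundant;
```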

Use case

We're going to expand our todos table definition a little bit to add due_date and due_time columns. (Storing date and time separately is usually not advised, but it helps us prove the point.)

CREATE TABLE `todos` (
  `id` int NOT NULL AUTO_INCREMENT,
  `title` varchar(255) NOT NULL,
  `due_date` date NOT NULL,
  `due_time` time NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `due_date` (`due_date`),
  KEY `created_at` (`created_at`)
)

Given this table, if you want to query for todos that are due in the next day, you're stuck using the ADDTIME function:

EXPLAIN SELECT
  *
FROM
  todos
WHERE
  ADDTIME(due_date, due_time) BETWEEN NOW() AND NOW() + INTERVAL 1 DAY;

-- | id | type | possible_keys | key | key_len | ref | rows  | filtered | Extra       |
-- |----|------|---------------|-----|---------|-----|-------|----------|-------------|
-- |  1 | ALL  |               |     |         |     | 39746 |   100.00 | Using where |

We do have an index on due_date, but the index cannot be used because we're performing an operation on it (adding the time). There is no easy way to de-obfuscate this column either since the due_time is different for every row.

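To work around this, let's add a redundant condition on due_date alone. When adding the condition, we need to make sure that it cannot logically change the result set, which means our redundant condition should be broader than our actual condition. (The condition below relies on due_time always holding a time of day from 00:00:00 to 23:59:59. MySQL's TIME type can also store negative values and values of 24 hours or more, and such rows could break that guarantee.)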

EXPLAIN SELECT
  *
FROM
  todos
WHERE
  -- The real condition
  ADDTIME(due_date, due_time) BETWEEN NOW() AND NOW() + INTERVAL 1 DAY
  AND
  -- The redundant condition 
  due_date BETWEEN CURRENT_DATE AND CURRENT_DATE + INTERVAL 1 DAY

-- | id | type  | possible_keys | key      | key_len | ref | rows | filtered | Extra                              |
-- |----|-------|---------------|----------|---------|-----|------|----------|------------------------------------|
-- |  1 | range | due_date      | due_date | 3       |     |    1 |   100.00 | Using index condition; Using where |

MySQL will first use the index to eliminate most of the table, then the slower ADDTIME will be used to eliminate the few remaining false positives. The redundant condition is doing its job perfectly!
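
Before relying on a redundant condition like this, it can be worth checking that it never excludes a row the real condition would keep. The queries below are a rough sketch of such a check against the todos table above. They assume due_time is meant to be a time of day; the aliases lost_rows and out_of_range_times are hypothetical names used only for this example.

```sql
-- Rows that pass the real condition but fail the redundant one.
-- If the redundant condition is truly broader, this should return 0.
SELECT COUNT(*) AS lost_rows
FROM todos
WHERE ADDTIME(due_date, due_time) BETWEEN NOW() AND NOW() + INTERVAL 1 DAY
  AND NOT (due_date BETWEEN CURRENT_DATE AND CURRENT_DATE + INTERVAL 1 DAY);

-- Rows whose due_time is not a plain time of day. Any such rows weaken
-- the assumption the redundant condition depends on.
SELECT COUNT(*) AS out_of_range_times
FROM todos
WHERE due_time < '00:00:00' OR due_time >= '24:00:00';
```

A zero from the first query only describes the data at the moment it runs, so the second query is the more useful one to keep an eye on over time.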

Domain-specific redundant conditions

In the case of our todos table, let's add an updated_at column that will be populated with the timestamp of the last time the record was changed.

CREATE TABLE `todos` (
  `id` int NOT NULL AUTO_INCREMENT,
  `title` varchar(255) NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` timestamp DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `created_at` (`created_at`)
)

In this scenario, we still only have an index on created_at, but if we want to query against updated_at, we might be able to add a redundant condition based on our knowledge of the application. If, given our understanding of the application, we can be sure that created_at is always equal to or earlier than updated_at, we can use this to our advantage.

This query, which looks for records that were last modified before January 1st of 2023, will scan the entire table because there is no index on updated_at:

SELECT
  *
FROM
  todos
WHERE
  updated_at < '2023-01-01 00:00:00'

Provided created_at is never later than updated_at, this query returns the same results, but it can use the created_at index to eliminate records before filtering out the false positives.

SELECT
  *
FROM
  todos
WHERE
  updated_at < '2023-01-01 00:00:00'
  AND 
  created_at < '2023-01-01 00:00:00'
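
Since this trick only holds while created_at is never later than updated_at, it may be worth verifying that assumption against the real data first. A minimal check, with violations as a made-up alias:

```sql
-- Rows that break the assumed invariant created_at <= updated_at.
-- Any non-zero count means the redundant condition could drop valid rows.
SELECT COUNT(*) AS violations
FROM todos
WHERE created_at > updated_at;
```

If this returns anything other than 0, the extra created_at condition is no longer redundant and should not be added.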