MySQL, Sub-selects and IN()

I ran across a bug in MySQL that was causing a simple query to take thirty minutes to an hour (yes, an hour) to return results on three million rows in an InnoDB table. The end of about an hour’s worth the digging yielded a known issue in the MySQL 5.x series which is a most unusual bug. The heart of the issue is in the implementation of IN using sub-selects inside of the IN clause. If you use a select statement inside the IN statement, then the query planner refuses to use an index for the outer select statement (even if you FORCE INDEX). This is apparently a known bug in MySQL 5.x, and is supposed to be fixed in MySQL 6.0 (presently in Alpha).

One of the very strange observations about this bug is that the issue does not appear to occur with “NOT IN”; It would seem fairly safe to assume that they both share similar logic because of their similarity, but that appears to not be the case. In the query that was coming across this issue we have a mix of IN and NOT IN statements, both of which were using sub-selects, and yet once I removed the IN sub-select, everything worked fine.

In fact, removing the sub-select from the IN clause reduced query time from over an hour to a few seconds at the most; I’d call that a pretty big improvement.

The query itself seemed pretty innocuous – it looked something like this:

SELECT *
  FROM animals
 WHERE (species_id IN (SELECT species_id
                         FROM enemy_species
                        WHERE enemy_id = 123));

Pretty simple – the outer query uses the contents of the inner sub-select. You would expect the outer query, which resolves to use an index from the animals table, as the intermediate result looks something like this:

SELECT *
  FROM animals
 WHERE (species_id IN (1, 3, 5));

Imagine my surprise when this query, which accesses a table of 3M records with proper indices, was taking well over thirty minutes to run in some cases.

Solution: Pre-load the contents of the sub-select

The quickest solution is likely to re-write the query with a pre-loaded set of data (i.e pre-fetch the results to be used for the condition of the inner IN), and then just produce a query like what the query planner should make. In this case, something like this:

species_ids = Species.find(:conditions => {:enemy_id => 123}).collect({|s| s.id})
animals = Animal.find(:conditions => ["species_id IN ?", species_ids])

Solution: Re-write the query as a JOIN or use a temporary table

Some queries you could re-write as a JOIN statement, or use a temporary table with an index to optimize your sub-query results if need be.