Matthias Pigulla wrote:
> Hi all,
>
> can someone explain me why this query gets executed the way it is? I
> simplified it as much as possible and think it's a conceptual/logical
> thing, so I'll omit - at least for now - the table definitions and
> sample data for brevity.
>
> -- "Superquery"
> EXPLAIN SELECT t1.id, t1.name
> FROM document AS t1
> WHERE t1.id
> IN (
> -- "Subquery"
> SELECT DISTINCT data_id
> FROM wfd_reference
> WHERE wfd_field_id =699
> AND ref_data_id
> IN ( 171 )
> )
>
> This gives:
>
> *************************** 1. row ***************************
> id: 1
> select_type: PRIMARY
> table: t1
> type: ALL
> possible_keys: NULL
> key: NULL
> key_len: NULL
> ref: NULL
> rows: 277
> Extra: Using where
> *************************** 2. row ***************************
> id: 2
> select_type: DEPENDENT SUBQUERY
> table: wfd_reference
> type: ref
> possible_keys: field_data,test
> key: field_data
> key_len: 4
> ref: const,func
> rows: 4
> Extra: Using where; Using temporary
>
> However - why do we need the *dependent* subquery at all? The subquery
> can be executed on its own, as it does not depend on any information of
> the "superquery".
>
> EXPLAIN SELECT DISTINCT data_id
> FROM wfd_reference
> WHERE wfd_field_id =699
> AND ref_data_id
> IN ( 171 )
>
> *************************** 1. row ***************************
> id: 1
> select_type: SIMPLE
> table: wfd_reference
> type: ref
> possible_keys: field_data,test
> key: test
> key_len: 4
> ref: const,const
> rows: 9
> Extra: Using where; Using temporary
>
> Now if I just take the result of this query, concat the data_ids on the
> application level and build the superquery as follows:
>
> -- "two-staged superquery variant"
> EXPLAIN SELECT t1.id, t1.name
> FROM document AS t1
> WHERE t1.id
> IN (
> 32, 31, 30, 53, 56, 57, 58, 59, 60, 111
> )
>
> *************************** 1. row ***************************
> id: 1
> select_type: SIMPLE
> table: t1
> type: range
> possible_keys: PRIMARY
> key: PRIMARY
> key_len: 2
> ref: NULL
> rows: 10
> Extra: Using where
>
> the results seem to be much better.
>
> I was afraid of the dependent subquery for the "ALL" scan of t1, as t1
> will become huge. OTOH, the subquery will be very restrictive: The
> number of data_ids will always be very small, at least compared to the
> number of rows in t1.
>
> Are there any non-obvious reasons for the behaviour described above? Is
> that something that cannot be optimized right now? Am I too
> short-sighted with my "optimization" approach?
>
> Thanks a lot,
> Matthias
As others have already pointed out, mysql often doesn't optimize subqueries
properly, and fixing that doesn't seem to be a high priority right now.
Subqueries are relatively new in mysql, so it probably shouldn't be
surprising that they aren't optimized as well as other things. Fortunately,
most subqueries can be rewritten as joins
<http://dev.mysql.com/doc/mysql/en/rewriting-subqueries.html>, which mysql
does a good job of optimizing.
I believe your query is equivalent to
SELECT DISTINCT t1.id, t1.name
FROM document AS t1
JOIN wfd_reference AS wfd ON t1.id = wfd.data_id
WHERE wfd.wfd_field_id = 699
AND wfd.ref_data_id IN (171);
which should perform better than the (incorrectly) dependent subquery,
because it will be properly optimized. Mysql will use the 'test' index in
wfd_reference to pick the rows which match the conditions, then use t1's
primary key to get the corresponding rows from t1.
I notice the DISTINCT in your subquery. I take it that there may be
multiple rows in wfd_reference with the same data_id, wfd_field_id = 699,
and ref_data_id IN (171)? (If not, just leave out the DISTINCT in the
above.) In that case, it may be possible that your 2-step approach will be
even faster than the JOIN with DISTINCT. You can do this directly in mysql,
instead of in your app, by using a temporary table to store the inner query
result. Something like
CREATE TEMPORARY TABLE matches
SELECT DISTINCT data_id
FROM wfd_reference
WHERE wfd_field_id = 699
AND ref_data_id IN (171);
SELECT t1.id, t1.name
FROM document AS t1
JOIN matches ON t1.id = matches.data_id;
DROP TABLE matches;
Michael