Hi Johnny,
I just gave that a try but it didn't help as I suspected.
I still believe the problem is in mysql not being able to handle set
subtractions. Therefore, it has to perform the work harder to return the
rows that represent a "no match" with NULL values in place so they can then
be filtered by the WHERE clause.
This type of query seems to be a corner case in mysql one should be aware
about when working with large datasets.
Cheers,
Leonardo Borges
www.leonardoborges.com
On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers <johnny@stripped>wrote:
> Leonardo,
>
> I think a new compound key on email_id and activity in the activities table
> may help.
>
> I'm not sure if this will help or not, Its hard to test w/o having a large
> data set to test against.
>
>
> On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges <
> leonardoborges.rj@stripped> wrote:
>
>> Sure can:
>>
>> show create table activities;
>>
>> CREATE TABLE `activities` (
>> `id` int(11) NOT NULL AUTO_INCREMENT,
>> `user_id` int(11) DEFAULT NULL,
>> `email` varchar(100) DEFAULT NULL,
>> `country_iso` varchar(2) DEFAULT NULL,
>> `tags` varchar(255) DEFAULT NULL,
>> `postcode` int(11) DEFAULT NULL,
>> `activity` varchar(100) DEFAULT NULL,
>> `page_id` int(11) DEFAULT NULL,
>> `donation_frequency` varchar(100) DEFAULT NULL,
>> `email_id` int(11) DEFAULT NULL,
>> `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
>> CURRENT_TIMESTAMP,
>> PRIMARY KEY (`id`),
>> KEY `user_idx` (`user_id`),
>> KEY `email_idx` (`email_id`),
>> KEY `activity_idx` (`activity`)
>> ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1
>>
>>
>> And the explain:
>>
>>
>>
> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>> | id | select_type | table | type | possible_keys | key
>> | key_len | ref | rows | Extra |
>>
>>
> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>> | 1 | SIMPLE | u | index | NULL |
>> id_idx | 5 | NULL | 972064 | Using index |
>> | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx |
>> user_idx | 5 | getup.u.id | 20 | Using where |
>>
>>
> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>>
>>
>> Cheers,
>> Leonardo Borges
>> www.leonardoborges.com
>>
>>
>> On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers
> <johnny@stripped>wrote:
>>
>>> Can you post show create table for activity and explain output of the
>>> problem query?
>>>
>>> On Jul 7, 2011 8:51 PM, "Leonardo Borges"
> <leonardoborges.rj@stripped>
>>> wrote:
>>>
>>> Hello everyone,
>>>
>>> I have an increasingly popular web application running on top of mysql
>>> and
>>> due to its popularity, I'm running into performance issues. After
>>> carefully
>>> examining database indexes and tuning queries I was able to pin down the
>>> slowest part of the system.
>>>
>>> The app's got a user segmentation tool that allows you to filter users
>>> based
>>> on a range of criteria from which the slowest is: "Select all users that
>>> did
>>> not receive the email of id 100"
>>>
>>> To answer this question we turn to the activities table, which is
>>> basically
>>> a denormalized log of actions taken by the user in this format:
>>> user_id | activity | email_id | ...
>>> 10 | email_sent | 100 | ...
>>> 10 | subscribed | NULL | ...
>>> 10 | email_open | 100 | ...
>>>
>>>
>>> Given this table and the question above, the usual way of finding out all
>>> users who did not receive this email is through the use of a left outer
>>> join, such as:
>>>
>>> select u.id
>>> from users u
>>> left outer join activities a
>>> on u.id = a.user_id
>>> and a.activity = 'email_sent'
>>> and a.email_id = 100
>>> where a.user_id is null
>>>
>>> That's all fine for medium-ish tables. However our current activities
>>> table
>>> has over 13 million rows, slowing the hell out of this left outer join,
>>> taking about 52 seconds in my machine.
>>>
>>> What this query is trying to do is to get the relative complement of set
>>> A(users) to B(activities). As far as I know mysql doesn't support set
>>> subtraction, thus the reason for these queries being slow.
>>>
>>> Based on that I've setup a test database on Postgresql, which supports
>>> this
>>> very set operation and rewrote the query to look like this:
>>>
>>> select u.id
>>> from users u
>>> except
>>> select a.user_id
>>> from activities a
>>> where a.activity = 'email_sent'
>>> and a.email_id = 100;
>>>
>>> The fact that postgresql knows how to subtract sets brought this query
>>> down
>>> to only 4 seconds.
>>>
>>> My question then is: since this is a somewhat common query in our system,
>>> are there any workarounds I could use in mysql to improve things?
>>>
>>> I did find one myself, but it's a bit convoluted and might not perform
>>> well
>>> under load, but the following sql script gives me similar performance in
>>> mysql:
>>>
>>> create temporary table email_sent_100
>>> select a.user_id
>>> from user_activity_events a
>>> where a.activity = 'email_sent'
>>>
>>>
>>> and a.email_id = 100;
>>>
>>> create index user_id_idx on email_sent_100(user_id); //this could
>>> potentially bring the runtime down in the case of a larg temp table.
>>>
>>> select count(u.id)
>>> from users u
>>> left outer join email_sent_100 s
>>> on u.id = s.user_id
>>> and s.user_id is null;
>>>
>>> A lot more lines and a lot more complex, but does the job in this
>>> example.
>>>
>>> I'd appreciate your thoughts.
>>>
>>> Cheers,
>>> Leonardo Borges
>>> www.leonardoborges.com
>>>
>>>
>>
>
>
> --
> -----------------------------
> Johnny Withers
> 601.209.4985
> johnny@stripped
>