List:General Discussion« Previous MessageNext Message »
From:Gregory Magarshak Date:March 31 2011 12:32pm
Subject:Re: A common request
View as plain text  
By the way, sorry ... I wanted to clarify one thing:

I am trying to FILTER by the unique index (fb_uid) in this case, but 
JOIN on the primary key (id) and I don't need to use any other fields. I 
am guessing that the MySQL indexes map indexed fields (fb_uid) to the 
primary key (id) so I wouldn't have to touch the disk. Am I right about 
that?



On 3/31/11 8:29 AM, Gregory Magarshak wrote:
> Thanks for your insight! But I'm still worried about the performance 
> of IN ( big list of values ). Can you tell me how it is implemented?
>
> Suppose I have SELECT a FROM b WHERE c IN (1, 4, 5, 117, 118, 119, 
> ..., 387945)
>
> 1) If I put 200 values there, does it do 200 individual SELECTs 
> internally, and union them? Or does it notice that c has a "UNIQUE" 
> index and thus at most one row can be returned per SELECT, and does 
> them all at once?
>
> 2) If I want to get just the primary key, or join with another table 
> based on just the primary key, does this query ever touch the disk 
> (assuming the index is in memory, which I think it always is -- 
> correct me if I'm wrong about that).
>
> The way I would recommend doing it (for BTREE indexes, anyway) is to 
> sort the values in ascending order, and do the search in one pass 
> through the index. The index is already in memory, and it would be 
> straightforward to modify a binary search algorithm to find the rows 
> corresponding to monotonically ascending values of the primary key, 
> all in one pass.
>
> Even if the binary search algorithm is run 200 or 2000 times for a 
> list, it would still be faster than hitting the disk. (Even though the 
> CPU cache performance would be worse.)
>
> Can you let me know the specifics of it, and especially how I can 
> avoid hitting the I/O bottlenecks?
>
> Thank you,
> Greg
>
> On 3/29/11 4:17 PM, Peter Brawley wrote:
>> > Why not optimize the IN ( ... ) to do the same type of thing?
>>
>> If the argument to IN() is a list of values, it'll be OK. If it's a 
>> SELECT, in 5.0 it will be slower than molasses (see "The unbearable 
>> slowness of IN()" at http://www.artfulsoftware.com/queries.php.
>>
>> > I always tried to avoid joins because I am planning to horizontally 
>> partition my data.
>>
>> A severe & unfortunate constraint. Can't help you there.
>>
>> PB
>>
>> -----
>>
>> On 3/29/2011 1:27 PM, Gregory Magarshak wrote:
>>> Yes, this would be fine. But often, the list of friends is obtained 
>>> from a social network like facebook, and is not stored internally. 
>>> Basically, I obtain the friend list in a request to facebook, and 
>>> then see which of those users have created things. So would I have 
>>> to create a temporary table and insert all those uids just to make a 
>>> join? Why not optimize the IN ( ... ) to do the same type of thing?
>>>
>>> There is also a second problem: I want to use MySQL Cluster, because 
>>> I expect to have many users. Would it be efficient to use JOIN 
>>> between the friends table and the articles table? Both tables are 
>>> partitioned by user_id as the primary key, so the join would have to 
>>> hit many different nodes. I always tried to avoid joins because I am 
>>> planning to horizontally partition my data. But if MySQL cluster can 
>>> handle this join transparently and split it up based on the 
>>> partition, then that's fine. Do you have any info on this?
>>>
>>> Greg
>>>
>>> On 3/29/11 2:10 PM, Peter Brawley wrote:
>>>> > How can I quickly find all the articles written by this user's 
>>>> friends, and not just random articles?
>>>>
>>>> Taking the simplest possible case, with table 
>>>> friends(userID,friendID) where each friendID refers to a userID in 
>>>> another row, the friends of userID u are ...
>>>>
>>>> select friendID from user where userID=u;
>>>>
>>>> so articles by those friends of u are ...
>>>>
>>>> select a.* from article a join (  select friendID from user where 
>>>> userID=u ) f on a.userID=f.friendID;
>>>>
>>>> PB
>>>>
>>>> -----
>>>>
>>>> On 3/29/2011 12:50 PM, Gregory Magarshak wrote:
>>>>> Hey there. My company writes a lot of social applications, and 
>>>>> there is one operation that is very common, but I don't know if 
>>>>> MySQL supports it in a good way. I thought I'd write to this list 
>>>>> for two reasons:
>>>>>
>>>>>     1) Maybe MySQL has a good way to do this, and I just don't 
>>>>> know about it
>>>>>
>>>>>     2) Propose to MySQL developers a simple algorithm which would 
>>>>> greatly improve MySQL support for social networking apps.
>>>>>
>>>>>     Here is the situation. Let's say I have built a social 
>>>>> networking application where people create and edit some item 
>>>>> (article, photo, music mix, whatever). Now, a typical user logs 
>>>>> in, and this user has 3000 friends. How can I quickly find all the 
>>>>> articles written by this user's friends, and not just random 
>>>>> articles?
>>>>>
>>>>>     Ideally, I would want to write something like this:
>>>>>
>>>>>     SELECT * FROM article WHERE user_id IN (345789, 324875, 398, 
>>>>> ..., 349580)
>>>>>
>>>>>     basically, execute a query with a huge IN ( ... ). Maybe if 
>>>>> this would exceed the buffer size for the MySQL wire protocol, I 
>>>>> would break up the list into several lists, and execute several 
>>>>> queries, and union the results together myself.
>>>>>
>>>>>     But my point is, this is very common for social networking 
>>>>> apps. Every app wants to show "the X created by your friends", or 
>>>>> "friends of yours (given some list from a social network) who have 
>>>>> taken action X".
>>>>>
>>>>>     Here is how I would do it if I had raw access to the MySQL 
>>>>> index in memory:
>>>>>
>>>>>     a) Sort the list of entries in the IN, in ascending order.
>>>>>
>>>>>     b) Do *ONE* binary search through the index (assuming it's a 
>>>>> BTREE index) and get them all in one pass. If it's a HASH index or 
>>>>> something, I would have to look up each one individually.
>>>>>
>>>>>     The benefits of this approach would be that this common 
>>>>> operation would be done extremely quickly. If the index fits 
>>>>> entirely in memory, and I just want to get the primary keys (i.e. 
>>>>> get the list of friends who did X), the disk isn't even touched. 
>>>>> In addition, for BTREE indexes, I would just need ONE binary 
>>>>> search, because the entries have been sorted in ascending order.
>>>>>
>>>>>     Does MySQL have something like this? And if not, perhaps you 
>>>>> can add it in the next version? It would really boost MySQL's 
>>>>> support for social networking apps tremendously. Alternative, how 
>>>>> can I add this to my MySQL? Any advice would be appreciated.
>>>>>
>>>>> Sincerely,
>>>>> Gregory Magarshak
>>>>> Qbix
>>>>>
>>>
>>>
>

Thread
A common requestGregory Magarshak29 Mar
  • Re: A common requestPeter Brawley29 Mar
    • Re: A common requestGregory Magarshak29 Mar
      • Re: A common requestPeter Brawley29 Mar
        • Re: A common requestGregory Magarshak31 Mar
          • Re: A common requestGregory Magarshak31 Mar
            • Re: A common requestJohan De Meersman31 Mar
          • Re: A common requestmos31 Mar
            • Re: A common requestJohan De Meersman31 Mar
            • Re: A common requestmos31 Mar
              • Re: A common requestWm Mussatto31 Mar
  • Re: A common requestSander de Bruijne29 Mar