By the way, sorry ... I wanted to clarify one thing:
I am trying to FILTER by the unique index (fb_uid) in this case, but
JOIN on the primary key (id) and I don't need to use any other fields. I
am guessing that the MySQL indexes map indexed fields (fb_uid) to the
primary key (id) so I wouldn't have to touch the disk. Am I right about
that?
On 3/31/11 8:29 AM, Gregory Magarshak wrote:
> Thanks for your insight! But I'm still worried about the performance
> of IN ( big list of values ). Can you tell me how it is implemented?
>
> Suppose I have SELECT a FROM b WHERE c IN (1, 4, 5, 117, 118, 119,
> ..., 387945)
>
> 1) If I put 200 values there, does it do 200 individual SELECTs
> internally, and union them? Or does it notice that c has a "UNIQUE"
> index and thus at most one row can be returned per SELECT, and does
> them all at once?
>
> 2) If I want to get just the primary key, or join with another table
> based on just the primary key, does this query ever touch the disk
> (assuming the index is in memory, which I think it always is --
> correct me if I'm wrong about that).
>
> The way I would recommend doing it (for BTREE indexes, anyway) is to
> sort the values in ascending order, and do the search in one pass
> through the index. The index is already in memory, and it would be
> straightforward to modify a binary search algorithm to find the rows
> corresponding to monotonically ascending values of the primary key,
> all in one pass.
>
> Even if the binary search algorithm is run 200 or 2000 times for a
> list, it would still be faster than hitting the disk. (Even though the
> CPU cache performance would be worse.)
>
> Can you let me know the specifics of it, and especially how I can
> avoid hitting the I/O bottlenecks?
>
> Thank you,
> Greg
>
> On 3/29/11 4:17 PM, Peter Brawley wrote:
>> > Why not optimize the IN ( ... ) to do the same type of thing?
>>
>> If the argument to IN() is a list of values, it'll be OK. If it's a
>> SELECT, in 5.0 it will be slower than molasses (see "The unbearable
>> slowness of IN()" at http://www.artfulsoftware.com/queries.php.
>>
>> > I always tried to avoid joins because I am planning to horizontally
>> partition my data.
>>
>> A severe & unfortunate constraint. Can't help you there.
>>
>> PB
>>
>> -----
>>
>> On 3/29/2011 1:27 PM, Gregory Magarshak wrote:
>>> Yes, this would be fine. But often, the list of friends is obtained
>>> from a social network like facebook, and is not stored internally.
>>> Basically, I obtain the friend list in a request to facebook, and
>>> then see which of those users have created things. So would I have
>>> to create a temporary table and insert all those uids just to make a
>>> join? Why not optimize the IN ( ... ) to do the same type of thing?
>>>
>>> There is also a second problem: I want to use MySQL Cluster, because
>>> I expect to have many users. Would it be efficient to use JOIN
>>> between the friends table and the articles table? Both tables are
>>> partitioned by user_id as the primary key, so the join would have to
>>> hit many different nodes. I always tried to avoid joins because I am
>>> planning to horizontally partition my data. But if MySQL cluster can
>>> handle this join transparently and split it up based on the
>>> partition, then that's fine. Do you have any info on this?
>>>
>>> Greg
>>>
>>> On 3/29/11 2:10 PM, Peter Brawley wrote:
>>>> > How can I quickly find all the articles written by this user's
>>>> friends, and not just random articles?
>>>>
>>>> Taking the simplest possible case, with table
>>>> friends(userID,friendID) where each friendID refers to a userID in
>>>> another row, the friends of userID u are ...
>>>>
>>>> select friendID from user where userID=u;
>>>>
>>>> so articles by those friends of u are ...
>>>>
>>>> select a.* from article a join ( select friendID from user where
>>>> userID=u ) f on a.userID=f.friendID;
>>>>
>>>> PB
>>>>
>>>> -----
>>>>
>>>> On 3/29/2011 12:50 PM, Gregory Magarshak wrote:
>>>>> Hey there. My company writes a lot of social applications, and
>>>>> there is one operation that is very common, but I don't know if
>>>>> MySQL supports it in a good way. I thought I'd write to this list
>>>>> for two reasons:
>>>>>
>>>>> 1) Maybe MySQL has a good way to do this, and I just don't
>>>>> know about it
>>>>>
>>>>> 2) Propose to MySQL developers a simple algorithm which would
>>>>> greatly improve MySQL support for social networking apps.
>>>>>
>>>>> Here is the situation. Let's say I have built a social
>>>>> networking application where people create and edit some item
>>>>> (article, photo, music mix, whatever). Now, a typical user logs
>>>>> in, and this user has 3000 friends. How can I quickly find all the
>>>>> articles written by this user's friends, and not just random
>>>>> articles?
>>>>>
>>>>> Ideally, I would want to write something like this:
>>>>>
>>>>> SELECT * FROM article WHERE user_id IN (345789, 324875, 398,
>>>>> ..., 349580)
>>>>>
>>>>> basically, execute a query with a huge IN ( ... ). Maybe if
>>>>> this would exceed the buffer size for the MySQL wire protocol, I
>>>>> would break up the list into several lists, and execute several
>>>>> queries, and union the results together myself.
>>>>>
>>>>> But my point is, this is very common for social networking
>>>>> apps. Every app wants to show "the X created by your friends", or
>>>>> "friends of yours (given some list from a social network) who have
>>>>> taken action X".
>>>>>
>>>>> Here is how I would do it if I had raw access to the MySQL
>>>>> index in memory:
>>>>>
>>>>> a) Sort the list of entries in the IN, in ascending order.
>>>>>
>>>>> b) Do *ONE* binary search through the index (assuming it's a
>>>>> BTREE index) and get them all in one pass. If it's a HASH index or
>>>>> something, I would have to look up each one individually.
>>>>>
>>>>> The benefits of this approach would be that this common
>>>>> operation would be done extremely quickly. If the index fits
>>>>> entirely in memory, and I just want to get the primary keys (i.e.
>>>>> get the list of friends who did X), the disk isn't even touched.
>>>>> In addition, for BTREE indexes, I would just need ONE binary
>>>>> search, because the entries have been sorted in ascending order.
>>>>>
>>>>> Does MySQL have something like this? And if not, perhaps you
>>>>> can add it in the next version? It would really boost MySQL's
>>>>> support for social networking apps tremendously. Alternative, how
>>>>> can I add this to my MySQL? Any advice would be appreciated.
>>>>>
>>>>> Sincerely,
>>>>> Gregory Magarshak
>>>>> Qbix
>>>>>
>>>
>>>
>