List:Commits« Previous MessageNext Message »
From:Alexander Barkov Date:January 14 2011 11:47am
Subject:Re: bzr commit into mysql-5.1-bugteam branch (martin.hansson:3534)
Bug#58165
View as plain text  
Hi Martin,

Martin Hansson wrote:
> Alexander Barkov skrev 2011-01-12 10.21:
>> Hi Martin,
>>
>> I'm sorry for being slow. We've been on holiday.
>>
>> Martin Hansson wrote:
>>> Hi again!
>>>
>>> Alexander Barkov wrote:
>>>> Hi Martin,
>>>>
>>>>
>>>> Martin Hansson wrote:
>>>>> Hi Alexander and Sergey!
>>>>>
>>>>> Alexander Barkov skrev 2010-12-29 11.14:
>>>>>> Hi Sergey, Martin,
>>>>>>
>>>>>> Sergey Glukhov wrote:
>>>>>>> Hi Martin,
>>>>>>>
>>>>>>> On 12/28/2010 11:33 AM, Martin Hansson wrote:
>>>>>>>> Hi Sergey,
>>>>>>>>
>>>>>>>> Sergey Glukhov skrev 2010-12-23 12.44:
>>>>>>>>> Hi Martin, see comments below
>>>>>>>>>
>>>>>>>>> On 12/22/2010 03:30 PM, Martin Hansson wrote:
>>>>>>>>>> #At
> file:///data0/martin/bzrroot/bug58165/5.1bt-lazy_copy/ 
>>>>>>>>>> based on 
>>>>>>>>>>
> revid:sergey.glukhov@stripped
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>      modified:
>>>>>>>>>>        mysql-test/r/func_str.result
>>>>>>>>>>        mysql-test/t/func_str.test
>>>>>>>>>>        sql/item_strfunc.cc
>>>>>>>>>>        sql/sql_string.cc
>>>>>>>>> ...
>>>>>>>>>> === modified file 'sql/item_strfunc.cc'
>>>>>>>>>> --- a/sql/item_strfunc.cc    2010-11-11 10:25:23
> +0000
>>>>>>>>>> +++ b/sql/item_strfunc.cc    2010-12-22 12:30:13
> +0000
>>>>>>>>>> @@ -1305,8 +1305,15 @@ String
> *Item_func_substr_index::val_str(
>>>>>>>>>>     null_value=0;
>>>>>>>>>>     uint delimiter_length=
> delimiter->length();
>>>>>>>>>>     if (!res->length() || !delimiter_length ||
> !count)
>>>>>>>>>> -    return&my_empty_string;        // Wrong
> parameters
>>>>>>>>>> -
>>>>>>>>>> +  {
>>>>>>>>>> +    if (str->copy("", 0,
> default_charset_info))        // 
>>>>>>>>>> Wrong parameters
>>>>>>>>>> +    {
>>>>>>>>>> +      /* Out of memory */
>>>>>>>>>> +      null_value= true;
>>>>>>>>>> +      return NULL;
>>>>>>>>>> +    }
>>>>>>>>>> +    return str;
>>>>>>>>>> +  }
>>>>>>>>>>     res->set_charset(collation.collation);
>>>>>>>>>>
>>>>>>>>>>   #ifdef USE_MB
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Using my_empty_string is not safe, the bug shows it.
>>>>>>>>> So we should remove my_empty_string from all places
>>>>>>>>> where it is used. 
>>>>>>>> Removing my_empty_string altogether is way too large to
> be done 
>>>>>>>> as an MRU bug fix (Believe me, I tried.) I suggest filing
> a 
>>>>>>>> separate report about it. Let's make this customer happy
> first 
>>>>>>>> and then we'll deal with the other problems we discovered
> while 
>>>>>>>> working on it.
>>>>>>> I was too optimistic with the request about removing
> my_empty_string
>>>>>>> from all places. As far as I can see problematic places are
>>>>>>> set_var.cc, sql_class.cc. However we could fix all places in
>>>>>>> item_strfunc.cc at least. it should be not a problem.
>>>>>>> It seems we should introduce new method, say
>>>>>>>
>>>>>>> Item::empty_string()
>>>>>>> {
>>>>>>>   str_value.set("", 0, collation.collation);
>>>>>>>   return &str_value;
>>>>>>> }
>>>>>>
>>>>>> Sergey, I like your idea about introducing this new method.
>>>>>> Please check if it can be moved to Item_str_func:
>>>>>>
>>>>>> String *Item_str_func::empty_string()
>>>>>> {
>>>>>>   str_value.set("", 0, collation.collation);
>>>>>>   return &str_value;
>>>>>> }
>>>>> Nice idea, will save us from cluttering up the code.
>>>>>>
>>>>>>>
>>>>>>> and replace &my_empty_string with empty_string().
>>>>>>>
>>>>>>
>>>>>> It should be safe to replace all &my_empty_string instances
>>>>>> to the new method calls in item_str_func.cc,
>>>>>> even under terms of a MRU bug. Looks like a fairly small change:
>>>>>> "grep my_empty_string item_strfunc.cc" reports about 22 lines.
>>>>> done. I've added a comment that it should be removed.
>>>>>>
>>>>>>
>>>>>> Note, my_empty_string is a reminder from 4.0 times,
>>>>>> when we had a single global character set for the entire server.
>>>>>> my_empty_string should be eventually removed everywhere from the
>>>>>> sources. Fixing item_strfunc.cc looks like a good start.
>>>>> Thank you for the background. It's really strange that the bug has 
>>>>> stayed under the radar for that long.
>>>>
>>>> I think this is because it shows up under very rare conditions,
>>>> e.g. wrong arguments, or end-of-memory.
>>> I can agree about out-of-memory, that doesn't happen all the time 
>>> :-), but wrong arguments should happen more often that every 10 years 
>>> ;-) I guess most people would discover the data type was wrong and 
>>> didn't file a report.
>>>>
>>>>>
>>>>> I also have a question for you: Why do we even return a *pointer* 
>>>>> to a String object from the Item_*::val_str() methods? I see no 
>>>>> good reason for this. The String class is really small, only 5 
>>>>> small members and we could almost always exploit the return value 
>>>>> optimization anyways. We lose all of the safety benefits of the 
>>>>> shared string buffers by passing pointers around. For instance this 
>>>>> bug would not exists if we did that. To a C++ programmer this is 
>>>>> completely insane.
>>>>>
>>>>> Some historical reason?
>>>>
>>>> This is done for performance reasons.
>>> But it doesn't boost performance. Quite on the contrary, see below.
>>>>
>>>> If we returned objects instead of pointers, then for every row
>>>> we'd have to do "new String(...)", then allocate Ptr,
>>> Nope. "new String()" is done only if you want to heap allocate 
>>> strings (or MEM_ROOT, to be precise). By returning String objects, 
>>> they are allocated on the stack instead, which is cheaper.
>>>
>>> This is how to receive a string on the stack:
>>>
>>> void foo() {
>>>  ...
>>>  String myString = getMeAString();
>>>
>>>  // Calls the copy constructor.
>>>  // myString.Ptr is now the same as myStringToo.Ptr.
>>>  // Nothing is allocated.
>>>  String myStringToo = myString;
>>>  ...
>>> }
>>>
>>> This is how to declare a function returning a string:
>>>
>>> String getMeAString() {
>>>
>>>  // This string will actually be allocated on the stack in foo's 
>>> stack frame
>>>  // since it is constructed as part of the return statement.
>>>  // This is called the return value optimization.
>>>  return String();
>>>
>>> }
>>>
>>>> then send result to the client side in protocol.cc, and then
>>>> free the memory allocated for the String and its Ptr.
>>> Stack memory is freed automatically. And note that I'm not talking 
>>> about the string buffer pointed to by String::Ptr. That buffer would 
>>> be allocated and deallocated exactly the same way. However, we 
>>> wouldn't have to allocate as many copies as we have to now because we 
>>> could benefit from the safety that the String class provides, see below.
>>>>
>>>> Now String is created only one time per Item, but what's more 
>>>> important,
>>>> Ptr() is allocated one time! So val_str() reuses the same buffer
>>>> for every row.
>>> The buffer has very little to do with the String object. Only when 
>>> realloc() is called is a new buffer allocated.
>>>
>>> Thus, the current implementation is actually *worse* for performance. 
>>> First and foremost: We have to care about allocating and freeing 
>>> String objects in various MEM_ROOTs. Secondly, since we reuse the 
>>> same String *object*, we cannot quite be sure whether the same String 
>>> object, and hence buffer, is used in different places.
>>>
>>> The String class is designed to be very efficient if several String 
>>> objects point to the same buffer. Every String object knows if it is 
>>> sharing the buffer (*Ptr) with another object. So unless you are 
>>> going to write to the buffer, you can just use it. And if you are 
>>> going to write to it, you only copy it if another String object owns 
>>> it. If you own it yourself, you don't have to copy it. That is the 
>>> purpose if copy_if_not_alloced().
>>>
>>> If you instead return pointers, all these nice things go out the 
>>> window. The String object keeps track of whether it owns the buffer 
>>> or if some other String object owns it. But if you use the same 
>>> String *object*, nothing can be taken for granted.
>>
>> Buffers are still shared. 
> That is the whole point. Buffers should be shared, but with the current 
> design we don't know whether they are shared or not. Hence when we're in 
> doubt we must always copy.
>> For example, if you do
>>
>>   SELECT LEFT(CONCAT(...))...;
>>
>> Item_func_left returns pointer to the same buffer with what
>> Item_func_concat has returned.
> Do you mean this?
> 
> String *Item_func_left::val_str(String *str)

yes.

> 
> It returns a pointer to a String object. Without analyzing the code, I 
> suspect that Item_func_concat::val_int() returns a pointer to a String 
> object, 

I guess you mean Item_func_concat::val_str() ?

> and then Item_func_left probably returns a pointer to the same 
> String object. Not good.

Item_func_left returns pointer to the same object *only* when
"length" argument is big enough to cover the entire argument string.
It's Ok to return a pointer to exactly the same String object when it's 
not modified.

In case when length is small enough and we need to shrink,
it uses tmp_value - different String object, different length,
but the same char* buffer.


>>
>> Also, String objects are also shared.
> This is the problem.
>> They are created only at fix_fields time and then reused for each row.
> I'm not familiar with this, but this sounds like a bad idea. The buffers 
> should be reused, not the String objects.

LEFT(CONCAT(),length) with big "length" argument is the case
when it's Ok to reuse the entire String object.

LEFT(CONCAT(),length) with small "length" argument is
an example where we reuse only the buffer, but don't reuse
the String object.

So both ways take place in the code. And I think both are correct.

The pieces of the code where we modify the String object came
from the argument are wrong. They should be fixed when found.

>>
>> If we returned objects instead of pointers, we'd have to allocate
>> objects per row, which I believe would slow down performance.
> Seriously, how much could this cost, given that they are always 
> allocated on the stack? The price would be the same as for 5 local 
> variables.

I don't fully understand your proposal yet.

Suppose we have:

   SELECT LEFT(CONCAT('a','b'),1);

Execution stack looks like this:


#0  Item_func_concat::val_str (this=0x7fffe8004d20, str=0x7ffff6d91010)
     at item_strfunc.cc:284
#1  0x00000000005366a0 in Item_func_left::val_str (this=0x7fffe8004e78,
     str=<value optimized out>) at item_strfunc.cc:1138
#2  0x00000000004e7b36 in Item::send (this=0x7fffe8004e78, 
protocol=0xc1cbc8,
     buffer=0x7ffff6d91010) at item.cc:5573
#3  0x0000000000585144 in select_send::send_data (this=0x7fffe8005028,
     items=<value optimized out>) at sql_class.cc:1691


The String object "buffer" is created on stack on level 3
and then is passed to Item_func_left::val_str and
Item_func_concat::val_str as an argument "str".

Item_func_concat::val_str populates and returns it.
Item_func_left::val_str does not use it, because it uses tmp_value,
to be able to modify "length" safely.

How would the same look using your approach?


Now suppose we have:
   SELECT LEFT(CONCAT('a','b',1000));

Again, "buffer" is created on stack level 3.
Item_func_concat::val_sr uses it.
Item_func_left::val_str reuses it, because no modifications needed.

How would the same look using the proposed approach?

Will return value optimization work in both cases?
Are you sure there will never be any extra copies comparing
to the current approach?


>>
>>
>>>
>>> For instance, using my_empty_string would be fine if we didn't return 
>>> pointers. A val_str() function would return a copy of the 
>>> my_empty_string object, all of its fields would be copies from the 
>>> original instance, hence pointing to the same buffer, except alloced 
>>> == false. Then if you needed to write to the buffer you would do 
>>> copy_if_not_alloced(). If we didn't have to write to it, no 
>>> allocation whatsoever would occur. What happens now is we _always_ 
>>> allocate a buffer, we _always_ allocate at least one String object, 
>>> then we write an empty string to the buffer.
>>>
>>> Besides, Instead of sharing a char*, there should also be a 
>>> String_buffer class that would be shared by String objects. Then the 
>>> fields Alloced_length, str_charset and alloced should be moved into 
>>> that class making the String class even smaller; the size of two 
>>> ints. This would give a substantial perfomance gain.
>>
>> This is a good idea. But I'm afraid we cannot afford this change now -
>> too much code is affected.
> Of course. The Eternal Argument. :-) But it is good to identify what the 
> real problem is. Then you can make better choices in the future. For 
> instance we could make sure that any new functions returning String's 
> will return objects rather than pointers.

We need to do some experimenting on performance gain we'll get
with this change, and further analysing if this performance gain
really worthy the risk, as this is a really big code change and
it will take some time to stabilize the new code.

If you volunteer to do such experimenting, it would be really great!


> 
> Best Regards
> 
> Martin

Thread
bzr commit into mysql-5.1-bugteam branch (martin.hansson:3534) Bug#58165Martin Hansson28 Dec
Re: bzr commit into mysql-5.1-bugteam branch (martin.hansson:3534)Bug#58165Martin Hansson28 Dec
Re: bzr commit into mysql-5.1-bugteam branch (martin.hansson:3534)Bug#58165Alexander Barkov29 Dec
  • Re: bzr commit into mysql-5.1-bugteam branch (martin.hansson:3534)Bug#58165Martin Hansson29 Dec
    • Re: bzr commit into mysql-5.1-bugteam branch (martin.hansson:3534)Bug#58165Alexander Barkov29 Dec
      • Re: bzr commit into mysql-5.1-bugteam branch (martin.hansson:3534)Bug#58165Martin Hansson30 Dec
        • Re: bzr commit into mysql-5.1-bugteam branch (martin.hansson:3534)Bug#58165Alexander Barkov12 Jan
          • Re: bzr commit into mysql-5.1-bugteam branch (martin.hansson:3534)Bug#58165Martin Hansson12 Jan
            • Re: bzr commit into mysql-5.1-bugteam branch (martin.hansson:3534)Bug#58165Alexander Barkov14 Jan