Hi again!
Alexander Barkov wrote:
> Hi Martin,
>
>
> Martin Hansson wrote:
>> Hi Alexander and Sergey!
>>
>> Alexander Barkov skrev 2010-12-29 11.14:
>>> Hi Sergey, Martin,
>>>
>>> Sergey Glukhov wrote:
>>>> Hi Martin,
>>>>
>>>> On 12/28/2010 11:33 AM, Martin Hansson wrote:
>>>>> Hi Sergey,
>>>>>
>>>>> Sergey Glukhov skrev 2010-12-23 12.44:
>>>>>> Hi Martin, see comments below
>>>>>>
>>>>>> On 12/22/2010 03:30 PM, Martin Hansson wrote:
>>>>>>> #At file:///data0/martin/bzrroot/bug58165/5.1bt-lazy_copy/
> based
>>>>>>> on
> revid:sergey.glukhov@stripped
>>>>>>>
>>>>>>>
>>>>>>> modified:
>>>>>>> mysql-test/r/func_str.result
>>>>>>> mysql-test/t/func_str.test
>>>>>>> sql/item_strfunc.cc
>>>>>>> sql/sql_string.cc
>>>>>> ...
>>>>>>> === modified file 'sql/item_strfunc.cc'
>>>>>>> --- a/sql/item_strfunc.cc 2010-11-11 10:25:23 +0000
>>>>>>> +++ b/sql/item_strfunc.cc 2010-12-22 12:30:13 +0000
>>>>>>> @@ -1305,8 +1305,15 @@ String
> *Item_func_substr_index::val_str(
>>>>>>> null_value=0;
>>>>>>> uint delimiter_length= delimiter->length();
>>>>>>> if (!res->length() || !delimiter_length || !count)
>>>>>>> - return&my_empty_string; // Wrong parameters
>>>>>>> -
>>>>>>> + {
>>>>>>> + if (str->copy("", 0, default_charset_info)) //
> Wrong
>>>>>>> parameters
>>>>>>> + {
>>>>>>> + /* Out of memory */
>>>>>>> + null_value= true;
>>>>>>> + return NULL;
>>>>>>> + }
>>>>>>> + return str;
>>>>>>> + }
>>>>>>> res->set_charset(collation.collation);
>>>>>>>
>>>>>>> #ifdef USE_MB
>>>>>>>
>>>>>>
>>>>>> Using my_empty_string is not safe, the bug shows it.
>>>>>> So we should remove my_empty_string from all places
>>>>>> where it is used.
>>>>> Removing my_empty_string altogether is way too large to be done as
>>>>> an MRU bug fix (Believe me, I tried.) I suggest filing a separate
>>>>> report about it. Let's make this customer happy first and then
>>>>> we'll deal with the other problems we discovered while working on
> it.
>>>> I was too optimistic with the request about removing my_empty_string
>>>> from all places. As far as I can see problematic places are
>>>> set_var.cc, sql_class.cc. However we could fix all places in
>>>> item_strfunc.cc at least. it should be not a problem.
>>>> It seems we should introduce new method, say
>>>>
>>>> Item::empty_string()
>>>> {
>>>> str_value.set("", 0, collation.collation);
>>>> return &str_value;
>>>> }
>>>
>>> Sergey, I like your idea about introducing this new method.
>>> Please check if it can be moved to Item_str_func:
>>>
>>> String *Item_str_func::empty_string()
>>> {
>>> str_value.set("", 0, collation.collation);
>>> return &str_value;
>>> }
>> Nice idea, will save us from cluttering up the code.
>>>
>>>>
>>>> and replace &my_empty_string with empty_string().
>>>>
>>>
>>> It should be safe to replace all &my_empty_string instances
>>> to the new method calls in item_str_func.cc,
>>> even under terms of a MRU bug. Looks like a fairly small change:
>>> "grep my_empty_string item_strfunc.cc" reports about 22 lines.
>> done. I've added a comment that it should be removed.
>>>
>>>
>>> Note, my_empty_string is a reminder from 4.0 times,
>>> when we had a single global character set for the entire server.
>>> my_empty_string should be eventually removed everywhere from the
>>> sources. Fixing item_strfunc.cc looks like a good start.
>> Thank you for the background. It's really strange that the bug has
>> stayed under the radar for that long.
>
> I think this is because it shows up under very rare conditions,
> e.g. wrong arguments, or end-of-memory.
I can agree about out-of-memory, that doesn't happen all the time :-),
but wrong arguments should happen more often that every 10 years ;-) I
guess most people would discover the data type was wrong and didn't file
a report.
>
>>
>> I also have a question for you: Why do we even return a *pointer* to
>> a String object from the Item_*::val_str() methods? I see no good
>> reason for this. The String class is really small, only 5 small
>> members and we could almost always exploit the return value
>> optimization anyways. We lose all of the safety benefits of the
>> shared string buffers by passing pointers around. For instance this
>> bug would not exists if we did that. To a C++ programmer this is
>> completely insane.
>>
>> Some historical reason?
>
> This is done for performance reasons.
But it doesn't boost performance. Quite on the contrary, see below.
>
> If we returned objects instead of pointers, then for every row
> we'd have to do "new String(...)", then allocate Ptr,
Nope. "new String()" is done only if you want to heap allocate strings
(or MEM_ROOT, to be precise). By returning String objects, they are
allocated on the stack instead, which is cheaper.
This is how to receive a string on the stack:
void foo() {
...
String myString = getMeAString();
// Calls the copy constructor.
// myString.Ptr is now the same as myStringToo.Ptr.
// Nothing is allocated.
String myStringToo = myString;
...
}
This is how to declare a function returning a string:
String getMeAString() {
// This string will actually be allocated on the stack in foo's stack
frame
// since it is constructed as part of the return statement.
// This is called the return value optimization.
return String();
}
> then send result to the client side in protocol.cc, and then
> free the memory allocated for the String and its Ptr.
Stack memory is freed automatically. And note that I'm not talking about
the string buffer pointed to by String::Ptr. That buffer would be
allocated and deallocated exactly the same way. However, we wouldn't
have to allocate as many copies as we have to now because we could
benefit from the safety that the String class provides, see below.
>
> Now String is created only one time per Item, but what's more important,
> Ptr() is allocated one time! So val_str() reuses the same buffer
> for every row.
The buffer has very little to do with the String object. Only when
realloc() is called is a new buffer allocated.
Thus, the current implementation is actually *worse* for performance.
First and foremost: We have to care about allocating and freeing String
objects in various MEM_ROOTs. Secondly, since we reuse the same String
*object*, we cannot quite be sure whether the same String object, and
hence buffer, is used in different places.
The String class is designed to be very efficient if several String
objects point to the same buffer. Every String object knows if it is
sharing the buffer (*Ptr) with another object. So unless you are going
to write to the buffer, you can just use it. And if you are going to
write to it, you only copy it if another String object owns it. If you
own it yourself, you don't have to copy it. That is the purpose if
copy_if_not_alloced().
If you instead return pointers, all these nice things go out the window.
The String object keeps track of whether it owns the buffer or if some
other String object owns it. But if you use the same String *object*,
nothing can be taken for granted.
For instance, using my_empty_string would be fine if we didn't return
pointers. A val_str() function would return a copy of the
my_empty_string object, all of its fields would be copies from the
original instance, hence pointing to the same buffer, except alloced ==
false. Then if you needed to write to the buffer you would do
copy_if_not_alloced(). If we didn't have to write to it, no allocation
whatsoever would occur. What happens now is we _always_ allocate a
buffer, we _always_ allocate at least one String object, then we write
an empty string to the buffer.
Besides, Instead of sharing a char*, there should also be a
String_buffer class that would be shared by String objects. Then the
fields Alloced_length, str_charset and alloced should be moved into that
class making the String class even smaller; the size of two ints. This
would give a substantial perfomance gain.
>
>>
>> Thank you for the review!
>
> Will you commit the new version soon?
Done.
Best Regards
Martin