Hi Martin,
I'm sorry for being slow. We've been on holiday.
Martin Hansson wrote:
> Hi again!
>
> Alexander Barkov wrote:
>> Hi Martin,
>>
>>
>> Martin Hansson wrote:
>>> Hi Alexander and Sergey!
>>>
>>> Alexander Barkov skrev 2010-12-29 11.14:
>>>> Hi Sergey, Martin,
>>>>
>>>> Sergey Glukhov wrote:
>>>>> Hi Martin,
>>>>>
>>>>> On 12/28/2010 11:33 AM, Martin Hansson wrote:
>>>>>> Hi Sergey,
>>>>>>
>>>>>> Sergey Glukhov skrev 2010-12-23 12.44:
>>>>>>> Hi Martin, see comments below
>>>>>>>
>>>>>>> On 12/22/2010 03:30 PM, Martin Hansson wrote:
>>>>>>>> #At
> file:///data0/martin/bzrroot/bug58165/5.1bt-lazy_copy/ based
>>>>>>>> on
> revid:sergey.glukhov@stripped
>>>>>>>>
>>>>>>>>
>>>>>>>> modified:
>>>>>>>> mysql-test/r/func_str.result
>>>>>>>> mysql-test/t/func_str.test
>>>>>>>> sql/item_strfunc.cc
>>>>>>>> sql/sql_string.cc
>>>>>>> ...
>>>>>>>> === modified file 'sql/item_strfunc.cc'
>>>>>>>> --- a/sql/item_strfunc.cc 2010-11-11 10:25:23 +0000
>>>>>>>> +++ b/sql/item_strfunc.cc 2010-12-22 12:30:13 +0000
>>>>>>>> @@ -1305,8 +1305,15 @@ String
> *Item_func_substr_index::val_str(
>>>>>>>> null_value=0;
>>>>>>>> uint delimiter_length= delimiter->length();
>>>>>>>> if (!res->length() || !delimiter_length ||
> !count)
>>>>>>>> - return&my_empty_string; // Wrong
> parameters
>>>>>>>> -
>>>>>>>> + {
>>>>>>>> + if (str->copy("", 0, default_charset_info))
> // Wrong
>>>>>>>> parameters
>>>>>>>> + {
>>>>>>>> + /* Out of memory */
>>>>>>>> + null_value= true;
>>>>>>>> + return NULL;
>>>>>>>> + }
>>>>>>>> + return str;
>>>>>>>> + }
>>>>>>>> res->set_charset(collation.collation);
>>>>>>>>
>>>>>>>> #ifdef USE_MB
>>>>>>>>
>>>>>>>
>>>>>>> Using my_empty_string is not safe, the bug shows it.
>>>>>>> So we should remove my_empty_string from all places
>>>>>>> where it is used.
>>>>>> Removing my_empty_string altogether is way too large to be done
> as
>>>>>> an MRU bug fix (Believe me, I tried.) I suggest filing a separate
>
>>>>>> report about it. Let's make this customer happy first and then
>>>>>> we'll deal with the other problems we discovered while working on
> it.
>>>>> I was too optimistic with the request about removing my_empty_string
>>>>> from all places. As far as I can see problematic places are
>>>>> set_var.cc, sql_class.cc. However we could fix all places in
>>>>> item_strfunc.cc at least. it should be not a problem.
>>>>> It seems we should introduce new method, say
>>>>>
>>>>> Item::empty_string()
>>>>> {
>>>>> str_value.set("", 0, collation.collation);
>>>>> return &str_value;
>>>>> }
>>>>
>>>> Sergey, I like your idea about introducing this new method.
>>>> Please check if it can be moved to Item_str_func:
>>>>
>>>> String *Item_str_func::empty_string()
>>>> {
>>>> str_value.set("", 0, collation.collation);
>>>> return &str_value;
>>>> }
>>> Nice idea, will save us from cluttering up the code.
>>>>
>>>>>
>>>>> and replace &my_empty_string with empty_string().
>>>>>
>>>>
>>>> It should be safe to replace all &my_empty_string instances
>>>> to the new method calls in item_str_func.cc,
>>>> even under terms of a MRU bug. Looks like a fairly small change:
>>>> "grep my_empty_string item_strfunc.cc" reports about 22 lines.
>>> done. I've added a comment that it should be removed.
>>>>
>>>>
>>>> Note, my_empty_string is a reminder from 4.0 times,
>>>> when we had a single global character set for the entire server.
>>>> my_empty_string should be eventually removed everywhere from the
>>>> sources. Fixing item_strfunc.cc looks like a good start.
>>> Thank you for the background. It's really strange that the bug has
>>> stayed under the radar for that long.
>>
>> I think this is because it shows up under very rare conditions,
>> e.g. wrong arguments, or end-of-memory.
> I can agree about out-of-memory, that doesn't happen all the time :-),
> but wrong arguments should happen more often that every 10 years ;-) I
> guess most people would discover the data type was wrong and didn't file
> a report.
>>
>>>
>>> I also have a question for you: Why do we even return a *pointer* to
>>> a String object from the Item_*::val_str() methods? I see no good
>>> reason for this. The String class is really small, only 5 small
>>> members and we could almost always exploit the return value
>>> optimization anyways. We lose all of the safety benefits of the
>>> shared string buffers by passing pointers around. For instance this
>>> bug would not exists if we did that. To a C++ programmer this is
>>> completely insane.
>>>
>>> Some historical reason?
>>
>> This is done for performance reasons.
> But it doesn't boost performance. Quite on the contrary, see below.
>>
>> If we returned objects instead of pointers, then for every row
>> we'd have to do "new String(...)", then allocate Ptr,
> Nope. "new String()" is done only if you want to heap allocate strings
> (or MEM_ROOT, to be precise). By returning String objects, they are
> allocated on the stack instead, which is cheaper.
>
> This is how to receive a string on the stack:
>
> void foo() {
> ...
> String myString = getMeAString();
>
> // Calls the copy constructor.
> // myString.Ptr is now the same as myStringToo.Ptr.
> // Nothing is allocated.
> String myStringToo = myString;
> ...
> }
>
> This is how to declare a function returning a string:
>
> String getMeAString() {
>
> // This string will actually be allocated on the stack in foo's stack
> frame
> // since it is constructed as part of the return statement.
> // This is called the return value optimization.
> return String();
>
> }
>
>> then send result to the client side in protocol.cc, and then
>> free the memory allocated for the String and its Ptr.
> Stack memory is freed automatically. And note that I'm not talking about
> the string buffer pointed to by String::Ptr. That buffer would be
> allocated and deallocated exactly the same way. However, we wouldn't
> have to allocate as many copies as we have to now because we could
> benefit from the safety that the String class provides, see below.
>>
>> Now String is created only one time per Item, but what's more important,
>> Ptr() is allocated one time! So val_str() reuses the same buffer
>> for every row.
> The buffer has very little to do with the String object. Only when
> realloc() is called is a new buffer allocated.
>
> Thus, the current implementation is actually *worse* for performance.
> First and foremost: We have to care about allocating and freeing String
> objects in various MEM_ROOTs. Secondly, since we reuse the same String
> *object*, we cannot quite be sure whether the same String object, and
> hence buffer, is used in different places.
>
> The String class is designed to be very efficient if several String
> objects point to the same buffer. Every String object knows if it is
> sharing the buffer (*Ptr) with another object. So unless you are going
> to write to the buffer, you can just use it. And if you are going to
> write to it, you only copy it if another String object owns it. If you
> own it yourself, you don't have to copy it. That is the purpose if
> copy_if_not_alloced().
>
> If you instead return pointers, all these nice things go out the window.
> The String object keeps track of whether it owns the buffer or if some
> other String object owns it. But if you use the same String *object*,
> nothing can be taken for granted.
Buffers are still shared. For example, if you do
SELECT LEFT(CONCAT(...))...;
Item_func_left returns pointer to the same buffer with what
Item_func_concat has returned.
Also, String objects are also shared.
They are created only at fix_fields time and then reused for each row.
If we returned objects instead of pointers, we'd have to allocate
objects per row, which I believe would slow down performance.
>
> For instance, using my_empty_string would be fine if we didn't return
> pointers. A val_str() function would return a copy of the
> my_empty_string object, all of its fields would be copies from the
> original instance, hence pointing to the same buffer, except alloced ==
> false. Then if you needed to write to the buffer you would do
> copy_if_not_alloced(). If we didn't have to write to it, no allocation
> whatsoever would occur. What happens now is we _always_ allocate a
> buffer, we _always_ allocate at least one String object, then we write
> an empty string to the buffer.
>
> Besides, Instead of sharing a char*, there should also be a
> String_buffer class that would be shared by String objects. Then the
> fields Alloced_length, str_charset and alloced should be moved into that
> class making the String class even smaller; the size of two ints. This
> would give a substantial perfomance gain.
This is a good idea. But I'm afraid we cannot afford this change now -
too much code is affected.
>>
>>>
>>> Thank you for the review!
>>
>> Will you commit the new version soon?
> Done.
>
> Best Regards
>
> Martin
>