Hi Bar,
Thanks for the explanations to my question/comments. I have some more
comments in-line. (This only contains replies to your first email. I
will get back to the detailed part tomorrow. I was almost done with
this email when I saw your latest reply.)
Alexander Barkov wrote:
> Hi Øystein,
>
>
> Thank you very much for review. I returned from UC on the last weekend.
> Now catching up with my emails. I will send a detailed answer soon.
> Brief answers are inline:
>
>
> Øystein Grøvlen wrote:
>> Bar,
>>
>> I have so far only looked at the C++-part of the patch, not the tests.
>> (I will come back to them later.) I also have not tried to apply the
>> patch yet, which I think is necessary in order to figure out if
>> something is missing. Do you have any recommendation on which version
>> of which branch to apply the patch?
>
>
> The patch was for "mysql-6.0" tree, which I cloned about three
> months ago.
OK. I try will try to apply it.
>
>
>>
>> Most of the changes looks good, but I have a few questions/comments:
>>
>> 1. I think I understood most of the worklog, but I am a bit puzzled
>> with the following statement: "Now this worklog has nothing to do
>> with ascii."
> >
>> Why then does this patch introduce methods called
>> val_ascii and change charsets from binary to latin1? Is this
>> related to what is said in the section "Other automatic
>> conversions"? I am not really able to tell what that section is
>> saying. I think the worklog needs to explain this better.
>
> Thanks for noticing this.
>
> This phrase needs to be either removed,
> or rephrased to something like this:
>
> "This patch does not introduce a build-in 'ascii' character set".
>
OK. So this means that there are no 'ascii' character set, but latin1
is used instead?
My problem is that I am not able to understand from the worklog where
ascii should be used instead of the character set/collation defined by
the connection. Is this so obvious that it does not need to be
mentioned, or is it mentioned in a way that I am not able to
understand?
>
>>
>> 2. Why do you rename methods from val_str to val_ascii? As far as I
>> understand, the val_xxx methods represents the four basic types
>> with operating on int, real, decimal, and string. You are not
>> adding another type here, just changing the representation of
>> strings. Why not just change val_str to return ascii where
>> appropriate? Do you ever need both an ascii and a non-ascii
>> representation of the same item value?
>
> This is done for performance optimization purposes,
> and using the following facts and ideas:
>
> 1. Result of some Items in string context
> depends on @@character_set_results.
> @@character_set_results can be set to a "real multibyte" character
> set like UCS2, UTF16, UTF32. I will use only UTF32 in the examples
> below for convenience.
>
> So the default string result of such functions
> in this circumstances is real multi-byte character set, like UTF32.
>
> For example, all numbers in string context
> return result in @@character_set_results:
>
> SELECT CONCAT(20010101); -> UTF32
>
> Similar, some Fields internally store data in ASCII format,
> like DECIMAL, TIMESTAMP, but when you do val_str(), they return
> "real multibyte" result because @@character_set_results wants so:
I am not sure I understand what you mean by "internally store data in
ASCII format". AFAICT, a TIMESTAMP field is stored as a longlong, and
Field_timestamp::val->str() returns a string with binary charset, but
maybe you are thinking at another level?
>
> SELECT CONCAT(date_column); -> UTF32
>
> So briefly, some data sources can use ASCII representation
> internally, but return multi-byte data only because
> @@character_set_results wants so.
> Therefore, conversion from ASCII to UTF32 is applied internally.
>
>
> 2. Some other functions need in fact ASCII input.
>
> For example,
> inet_aton(), GeometryFromText(), Convert_TZ(), GET_FORMAT().
>
> Similar, fields of certain type, like DATE, TIME,
> when you insert string data into them, expect in fact ASCII input.
> If they get non-ASCII input, for example UTF32, they
> convert input from ASCII to UTF32, and then use ASCII
> representation to do further processing.
>
> 3. Now imagine we pass result of a data source of the first type
> to a data destination of the second type.
>
> What happens:
> a. data source converts data from ASCII to UTF32, because
> @@character_set_results wants so and passes the result to
> data destination.
> b. data destination gets UTF32 string.
> c. data destination converts UTF32 string to ASCII,
> because it needs ASCII representation to be able to handle data
> correctly.
>
> As a result we get two steps of unnecessary conversion:
> From ASCII to UTF32, then from UTF32 to ASCII.
>
> A better way to handle these situations would be to pass ASCII
> representation directly from source to destination.
Thanks, now I understand what you want to achieve.
>
>
> This is what the patch does.
> It actually splits val_str() method into two parts when it is possible:
>
> - Getting ASCII representation, implemented as val_ascii() virtual
> method.
>
> - Converting ASCII representation to what @@character_set_results wants
> (to UTF32 using my example).
> This conversion is implemented in Item_str_ascii_func::val_str().
>
>
> So in other words, val_ascii() forces a data source to return ASCII
> representation where val_str() would return UTF32.
OK, now I think I understand. Item_str_ascii_func::val_str() is now
inherited by all Item classes for functions that should return ascii.
Item::val_ascii() provides a default implementation for classes that
does not implement their own and is based on converting the output of
val_str() to ASCII. By providing its own val_ascii(), a class can
avoid that the internal ascii representation is converted to
@@character_set_results by val_str() and back to ascii by
Field::val_ascii(). It would be nice if this could be explained in the
code.
I think you should consider another name that val_ascii() since that
kind of implies that you are introducing a fifth basic type, but
that's not actually what is the case here, I think..
> Also, the patch helps to get rid of redundant conversion code.
> the conversion step is now implemented only once: in
> Item_str_ascii_func::val_str(). Previously it was implemented
> in every single val_str() implementation of such "ASCII data source"
> functions.
>
I am not able to find removal of such redundant code in the patch.
Could you give an example.
>
>>
>> Note also that as part of the reenginering project, the val_xxx
>> methods will be disappear, and replaced with a value() method that
>> will return a Value object on which one can call type conversion
>> methods. (e.g., value().to_string()). See WL#4760.
>
> Right, and I know you're working on this task already.
>
> I don't think my change will be in hard conflict with WL#4760 very much.
>
> val_ascii() will be just transformed into some other way to ask
> an Item, or a Field to return ASCII representation, instead
> of the default representation dictated by @@character_set_results.
> For example, it can be a method of Value. Say,
>
> Value->I_want_ascii_rather_than_character_set_results();
>
> after that
>
> Value->value() will return ASCII representation by default.
>
> or a flag to value():
>
> Value->value(FLAG_WANT_ASCII_FOR_STRING_RESULT);
The current design is to have a separate metadata object associated
with Item and Field classes which contains metadata that is necessary
for type conversion, and passed to the Value constructor. The
ascii-requirement could be stored in this metadata object.
>
>
>>
>> 3. I feel the use of set_char_length_and_dec is a bit inconsistent.
>> In some classes, decimals is set explicitly and set_char_length
>> called.
>
> Let me try to explain what happens:
>
>
> set_char_length() is done for exact data types.
> "decimals" is set to 0 in constructor in such cases.
set_char_length is also set in some Item_decimal constructors after
decimals has been set.
>
>> In other places, set_char_length_and_dec is called with
>> two parameters.
>
> set_char_length_and_dec() sets both "max_length" and "decimals"
> (i.e. both width and precision).
>
> This is done for float, double and decimal data types.
>
>
>> In yet other places, set_char_length is not used
>> at all. Instead, the multiplication is done without a function
>> call. Some consistency would be good.
>
> This is done in rare cases where calculation of
> "max_length" and "decimal" is more tricky than just
> calling set_char_length_and_dec(). For example,
> in case of hybrid data types.
Since there are more cases where decimals is set than where
set_char_length_and_dec is called with two params, I still question
the value of introducing an extra method. (set_char_length_and_dec
with one param could be replaced with set_char_length since Item
constructor initializes decimals to 0.
>
>>
>> If it were up to me, I would drop the entire
>> set_char_length_and_dec, initialize decimals to 0 in Item
>> constructors, and set decimals explicitly when necessary in
>> fix_lenght_and_dec etc.
>
> Note, max_length depends on collation.
> set_char_length_and_dec() just allows to get rid of
> dozens lines of redundant code like this:
>
> max_length= xxx * collation.collation->mbmaxlen;
> decimals= yyy;
>
> I think this just looks nicer:
>
> set_char_length_and_dec(xxx, yyy);
>
> multiplication is hidden into the method.
The multiplication part is an argument for set_char_length, not
set_char_length_and_dec.
> Note, it will also help later to switch to "store max char_length"
> approach instead of "store max (byte) length",
> which you suggest below in N4.
>
>
>> I also think it is a bit confusing to have
>> both fix_lenght_and_dec and set_char_length_and_dec methods. Not
>> easy to tell the difference from the name.
>
> I agree with you. This is fix_length_and_dec() who has a misleading
> name. It actually sets much more than just "length" and "dec", for
> example "unsigned" flags, "not_null" flags,
> character sets, collations, and some other attributes.
> I guess fix_length_and_dec() should be renamed eventually.
> But I don't think it should be done under terms of WL#2649.
>
> Returning to my methods - set_char_length() and
> set_char_length_and_dec() - they do what their names tell:
> set "max_length" (using length-in-characters notation)
> and "decimals".
Your methods also have a side effect that it took me some time to
realize the consequences of: It sets collation to default_charset().
This means that you must be careful not to use set_char_length() when
you want another charset (e.g., ascii). I am afraid that this may not
be obvious to everyone and that could lead to errors being
introduced.
>
>
>>
>> 4. There is a lot of multiplication/division back and forth in
>> max_char_length and set_char_length. (By the way, should not the
>> latter be called set_max_char_length?). Did you consider storing
>> max_char_length, instead?
>
> Agree. And I did. But this would look like a huge refactoring task.
> Don't think this is something for WL#2649.
> Moreover, WL#2649 is considered now as a candidate for 5.4.
> So I'm trying to keep the patch as small as possible,
> to reduce the number of merge conflicts.
That seems reasonable.
>
>
>>
>> 5. Please add some documentation on new methods (e.g.,
>> Item::val_ascii).
>
> I could paste the above explanation somewhere in the code.
> Do you think it will work as such documentation?
Yes, please do.
>
>>
>> 6. It is probably not my business, since I am not the architectural
>> reviewer, but it would be easier to verify the behavior if the work
>> log had contained a table with all functions affected, what they
>> used to return, what they will return, and what charset will be
>> used. The way it is now, it is a bit hard to verify that all has
>> been covered, and, especially, what charset should be used.
>
> I proposed that to PeterG and Sergei, who are the author and
> architecture reviewer for WL#2649. Their verdict was "no needs",
As I said, it would be nice to know when ascii is supposed to be used
or not.
>
>>
>> More detailed comments can be found in-line,
>
> <skip>
>
>
> I will send more detailed answers later.
>
> Many thanks!
>
--
Øystein