Hi Øystein,
Thank you very much for review. I returned from UC on the last weekend.
Now catching up with my emails. I will send a detailed answer soon.
Brief answers are inline:
Øystein Grøvlen wrote:
> Bar,
>
> I have so far only looked at the C++-part of the patch, not the tests.
> (I will come back to them later.) I also have not tried to apply the
> patch yet, which I think is necessary in order to figure out if
> something is missing. Do you have any recommendation on which version
> of which branch to apply the patch?
The patch was for "mysql-6.0" tree, which I cloned about three
months ago.
>
> Most of the changes looks good, but I have a few questions/comments:
>
> 1. I think I understood most of the worklog, but I am a bit puzzled
> with the following statement: "Now this worklog has nothing to do
> with ascii."
>
> Why then does this patch introduce methods called
> val_ascii and change charsets from binary to latin1? Is this
> related to what is said in the section "Other automatic
> conversions"? I am not really able to tell what that section is
> saying. I think the worklog needs to explain this better.
Thanks for noticing this.
This phrase needs to be either removed,
or rephrased to something like this:
"This patch does not introduce a build-in 'ascii' character set".
>
> 2. Why do you rename methods from val_str to val_ascii? As far as I
> understand, the val_xxx methods represents the four basic types
> with operating on int, real, decimal, and string. You are not
> adding another type here, just changing the representation of
> strings. Why not just change val_str to return ascii where
> appropriate? Do you ever need both an ascii and a non-ascii
> representation of the same item value?
This is done for performance optimization purposes,
and using the following facts and ideas:
1. Result of some Items in string context
depends on @@character_set_results.
@@character_set_results can be set to a "real multibyte" character
set like UCS2, UTF16, UTF32. I will use only UTF32 in the examples
below for convenience.
So the default string result of such functions
in this circumstances is real multi-byte character set, like UTF32.
For example, all numbers in string context
return result in @@character_set_results:
SELECT CONCAT(20010101); -> UTF32
Similar, some Fields internally store data in ASCII format,
like DECIMAL, TIMESTAMP, but when you do val_str(), they return
"real multibyte" result because @@character_set_results wants so:
SELECT CONCAT(date_column); -> UTF32
So briefly, some data sources can use ASCII representation
internally, but return multi-byte data only because
@@character_set_results wants so.
Therefore, conversion from ASCII to UTF32 is applied internally.
2. Some other functions need in fact ASCII input.
For example,
inet_aton(), GeometryFromText(), Convert_TZ(), GET_FORMAT().
Similar, fields of certain type, like DATE, TIME,
when you insert string data into them, expect in fact ASCII input.
If they get non-ASCII input, for example UTF32, they
convert input from ASCII to UTF32, and then use ASCII
representation to do further processing.
3. Now imagine we pass result of a data source of the first type
to a data destination of the second type.
What happens:
a. data source converts data from ASCII to UTF32, because
@@character_set_results wants so and passes the result to
data destination.
b. data destination gets UTF32 string.
c. data destination converts UTF32 string to ASCII,
because it needs ASCII representation to be able to handle data
correctly.
As a result we get two steps of unnecessary conversion:
From ASCII to UTF32, then from UTF32 to ASCII.
A better way to handle these situations would be to pass ASCII
representation directly from source to destination.
This is what the patch does.
It actually splits val_str() method into two parts when it is possible:
- Getting ASCII representation, implemented as val_ascii() virtual
method.
- Converting ASCII representation to what @@character_set_results wants
(to UTF32 using my example).
This conversion is implemented in Item_str_ascii_func::val_str().
So in other words, val_ascii() forces a data source to return ASCII
representation where val_str() would return UTF32.
Also, the patch helps to get rid of redundant conversion code.
the conversion step is now implemented only once: in
Item_str_ascii_func::val_str(). Previously it was implemented
in every single val_str() implementation of such "ASCII data source"
functions.
>
> Note also that as part of the reenginering project, the val_xxx
> methods will be disappear, and replaced with a value() method that
> will return a Value object on which one can call type conversion
> methods. (e.g., value().to_string()). See WL#4760.
Right, and I know you're working on this task already.
I don't think my change will be in hard conflict with WL#4760 very much.
val_ascii() will be just transformed into some other way to ask
an Item, or a Field to return ASCII representation, instead
of the default representation dictated by @@character_set_results.
For example, it can be a method of Value. Say,
Value->I_want_ascii_rather_than_character_set_results();
after that
Value->value() will return ASCII representation by default.
or a flag to value():
Value->value(FLAG_WANT_ASCII_FOR_STRING_RESULT);
>
> 3. I feel the use of set_char_length_and_dec is a bit inconsistent.
> In some classes, decimals is set explicitly and set_char_length
> called.
Let me try to explain what happens:
set_char_length() is done for exact data types.
"decimals" is set to 0 in constructor in such cases.
> In other places, set_char_length_and_dec is called with
> two parameters.
set_char_length_and_dec() sets both "max_length" and "decimals"
(i.e. both width and precision).
This is done for float, double and decimal data types.
> In yet other places, set_char_length is not used
> at all. Instead, the multiplication is done without a function
> call. Some consistency would be good.
This is done in rare cases where calculation of
"max_length" and "decimal" is more tricky than just
calling set_char_length_and_dec(). For example,
in case of hybrid data types.
>
> If it were up to me, I would drop the entire
> set_char_length_and_dec, initialize decimals to 0 in Item
> constructors, and set decimals explicitly when necessary in
> fix_lenght_and_dec etc.
Note, max_length depends on collation.
set_char_length_and_dec() just allows to get rid of
dozens lines of redundant code like this:
max_length= xxx * collation.collation->mbmaxlen;
decimals= yyy;
I think this just looks nicer:
set_char_length_and_dec(xxx, yyy);
multiplication is hidden into the method.
Note, it will also help later to switch to "store max char_length"
approach instead of "store max (byte) length",
which you suggest below in N4.
> I also think it is a bit confusing to have
> both fix_lenght_and_dec and set_char_length_and_dec methods. Not
> easy to tell the difference from the name.
I agree with you. This is fix_length_and_dec() who has a misleading
name. It actually sets much more than just "length" and "dec", for
example "unsigned" flags, "not_null" flags,
character sets, collations, and some other attributes.
I guess fix_length_and_dec() should be renamed eventually.
But I don't think it should be done under terms of WL#2649.
Returning to my methods - set_char_length() and
set_char_length_and_dec() - they do what their names tell:
set "max_length" (using length-in-characters notation)
and "decimals".
>
> 4. There is a lot of multiplication/division back and forth in
> max_char_length and set_char_length. (By the way, should not the
> latter be called set_max_char_length?). Did you consider storing
> max_char_length, instead?
Agree. And I did. But this would look like a huge refactoring task.
Don't think this is something for WL#2649.
Moreover, WL#2649 is considered now as a candidate for 5.4.
So I'm trying to keep the patch as small as possible,
to reduce the number of merge conflicts.
>
> 5. Please add some documentation on new methods (e.g.,
> Item::val_ascii).
I could paste the above explanation somewhere in the code.
Do you think it will work as such documentation?
>
> 6. It is probably not my business, since I am not the architectural
> reviewer, but it would be easier to verify the behavior if the work
> log had contained a table with all functions affected, what they
> used to return, what they will return, and what charset will be
> used. The way it is now, it is a bit hard to verify that all has
> been covered, and, especially, what charset should be used.
I proposed that to PeterG and Sergei, who are the author and
architecture reviewer for WL#2649. Their verdict was "no needs",
>
> More detailed comments can be found in-line,
<skip>
I will send more detailed answers later.
Many thanks!