Hi Øystein, Sergey,
Øystein, Thank you very much for your review!
I previously send bried reply. This letter contains a detailed reply.
Sergey, there is a question for you below. Please have a look.
Øystein Grøvlen wrote:
> Bar,
>
> I have so far only looked at the C++-part of the patch, not the tests.
> (I will come back to them later.) I also have not tried to apply the
> patch yet, which I think is necessary in order to figure out if
> something is missing. Do you have any recommendation on which version
> of which branch to apply the patch?
The patch was for "mysql-6.0" tree, which I cloned about three
months ago.
>
> Most of the changes looks good, but I have a few questions/comments:
>
> 1. I think I understood most of the worklog, but I am a bit puzzled
> with the following statement: "Now this worklog has nothing to do
> with ascii."
>
> Why then does this patch introduce methods called
> val_ascii and change charsets from binary to latin1? Is this
> related to what is said in the section "Other automatic
> conversions"? I am not really able to tell what that section is
> saying. I think the worklog needs to explain this better.
Thanks for noticing this.
This phrase needs to be either removed,
or rephrased to something like this:
"This patch does not introduce a build-in 'ascii' character set".
>
> 2. Why do you rename methods from val_str to val_ascii? As far as I
> understand, the val_xxx methods represents the four basic types
> with operating on int, real, decimal, and string. You are not
> adding another type here, just changing the representation of
> strings. Why not just change val_str to return ascii where
> appropriate? Do you ever need both an ascii and a non-ascii
> representation of the same item value?
This is done for performance optimization purposes,
and using the following facts and ideas:
1. Result of some Items in string context
depends on @@character_set_results.
@@character_set_results can be set to a "real multibyte" character
set like UCS2, UTF16, UTF32. I will use only UTF32 in the examples
below for convenience.
So the default string result of such functions
in this circumstances is real multi-byte character set, like UTF32.
For example, all numbers in string context
return result in @@character_set_results:
SELECT CONCAT(20010101); -> UTF32
Similar, some Fields internally store data in ASCII format,
like DECIMAL, TIMESTAMP, but when you do val_str(), they return
"real multibyte" result because @@character_set_results wants so:
SELECT CONCAT(date_column); -> UTF32
So briefly, some data sources can use ASCII representation
internally, but return multi-byte data only because
@@character_set_results wants so.
Therefore, conversion from ASCII to UTF32 is applied internally.
2. Some other functions need in fact ASCII input.
For example,
inet_aton(), GeometryFromText(), Convert_TZ(), GET_FORMAT().
Similar, fields of certain type, like DATE, TIME,
when you insert string data into them, expect in fact ASCII input.
If they get non-ASCII input, for example UTF32, they
convert input from ASCII to UTF32, and then use ASCII
representation to do further processing.
3. Now imagine we pass result of a data source of the first type
to a data destination of the second type.
What happens:
a. data source converts data from ASCII to UTF32, because
@@character_set_results wants so and passes the result to
data destination.
b. data destination gets UTF32 string.
c. data destination converts UTF32 string to ASCII,
because it needs ASCII representation to be able to handle data
correctly.
As a result we get two steps of unnecessary conversion:
From ASCII to UTF32, then from UTF32 to ASCII.
A better way to handle these situations would be to pass ASCII
representation directly from source to destination.
This is what the patch does.
It actually splits val_str() method into two parts when it is possible:
- Getting ASCII representation, implemented as val_ascii() virtual
method.
- Converting ASCII representation to what @@character_set_results wants
(to UTF32 using my example).
This conversion is implemented in Item_str_ascii_func::val_str().
So in other words, val_ascii() forces a data source to return ASCII
representation where val_str() would return UTF32.
Also, the patch helps to get rid of redundant conversion code.
the conversion step is now implemented only once: in
Item_str_ascii_func::val_str(). Previously it was implemented
in every single val_str() implementation of such "ASCII data source"
functions.
>
> Note also that as part of the reenginering project, the val_xxx
> methods will be disappear, and replaced with a value() method that
> will return a Value object on which one can call type conversion
> methods. (e.g., value().to_string()). See WL#4760.
Right, and I know you're working on this task already.
I don't think my change will be in hard conflict with WL#4760 very much.
val_ascii() will be just transformed into some other way to ask
an Item, or a Field to return ASCII representation, instead
of the default representation dictated by @@character_set_results.
For example, it can be a method of Value. Say,
Value->I_want_ascii_rather_than_character_set_results();
after that
Value->value() will return ASCII representation by default.
or a flag to value():
Value->value(FLAG_WANT_ASCII_FOR_STRING_RESULT);
>
> 3. I feel the use of set_char_length_and_dec is a bit inconsistent.
> In some classes, decimals is set explicitly and set_char_length
> called.
Let me try to explain what happens:
set_char_length() is done for exact data types.
"decimals" is set to 0 in constructor in such cases.
> In other places, set_char_length_and_dec is called with
> two parameters.
set_char_length_and_dec() sets both "max_length" and "decimals"
(i.e. both width and precision).
This is done for float, double and decimal data types.
> In yet other places, set_char_length is not used
> at all. Instead, the multiplication is done without a function
> call. Some consistency would be good.
This is done in rare cases where calculation of
"max_length" and "decimal" is more tricky than just
calling set_char_length_and_dec(). For example,
in case of hybrid data types.
>
> If it were up to me, I would drop the entire
> set_char_length_and_dec, initialize decimals to 0 in Item
> constructors, and set decimals explicitly when necessary in
> fix_lenght_and_dec etc.
Note, max_length depends on collation.
set_char_length_and_dec() just allows to get rid of
dozens lines of redundant code like this:
max_length= xxx * collation.collation->mbmaxlen;
decimals= yyy;
I think this just looks nicer:
set_char_length_and_dec(xxx, yyy);
multiplication is hidden into the method.
Note, it will also help later to switch to "store max char_length"
approach instead of "store max (byte) length",
which you suggest below in N4.
> I also think it is a bit confusing to have
> both fix_lenght_and_dec and set_char_length_and_dec methods. Not
> easy to tell the difference from the name.
I agree with you. This is fix_length_and_dec() who has a misleading
name. It actually sets much more than just "length" and "dec", for
example "unsigned" flags, "not_null" flags,
character sets, collations, and some other attributes.
I guess fix_length_and_dec() should be renamed eventually.
But I don't think it should be done under terms of WL#2649.
Returning to my methods - set_char_length() and
set_char_length_and_dec() - they do what their names tell:
set "max_length" (using length-in-characters notation)
and "decimals".
>
> 4. There is a lot of multiplication/division back and forth in
> max_char_length and set_char_length. (By the way, should not the
> latter be called set_max_char_length?). Did you consider storing
> max_char_length, instead?
Agree. And I did. But this would look like a huge refactoring task.
Don't think this is something for WL#2649.
Moreover, WL#2649 is considered now as a candidate for 5.4.
So I'm trying to keep the patch as small as possible,
to reduce the number of merge conflicts.
>
> 5. Please add some documentation on new methods (e.g.,
> Item::val_ascii).
I could paste the above explanation somewhere in the code.
Do you think it will work as such documentation?
>
> 6. It is probably not my business, since I am not the architectural
> reviewer, but it would be easier to verify the behavior if the work
> log had contained a table with all functions affected, what they
> used to return, what they will return, and what charset will be
> used. The way it is now, it is a bit hard to verify that all has
> been covered, and, especially, what charset should be used.
I proposed that to PeterG and Sergei, who are the author and
architecture reviewer for WL#2649. Their verdict was "no needs",
>
> More detailed comments can be found in-line,
>
> --
> Øystein
>
> Alexander Barkov wrote:
> > #At file:///home/bar/mysql-bzr/mysql-6.0.w2649/
> >
> > 2701 Alexander Barkov 2009-03-04
> > WL#2649 Number-to-string conversions
> > added:
> > mysql-test/include/ctype_numconv.inc
> > mysql-test/r/ctype_binary.result
> > mysql-test/t/ctype_binary.test
> > modified:
> > mysql-test/r/bigint.result
> > mysql-test/r/case.result
> > mysql-test/r/ctype_latin1.result
> > mysql-test/r/ctype_ucs.result
> > mysql-test/r/explain.result
> > mysql-test/r/metadata.result
> > mysql-test/r/ps_1general.result
> > mysql-test/r/ps_2myisam.result
> > mysql-test/r/ps_3innodb.result
> > mysql-test/r/ps_4heap.result
> > mysql-test/r/ps_5merge.result
> > mysql-test/r/show_check.result
> > mysql-test/r/union.result
> > mysql-test/suite/ndb/r/ps_7ndb.result
> > mysql-test/t/ctype_latin1.test
> > mysql-test/t/ctype_ucs.test
> > sql/item.cc
> > sql/item.h
> > sql/item_cmpfunc.cc
> > sql/item_cmpfunc.h
> > sql/item_func.cc
> > sql/item_func.h
> > sql/item_geofunc.cc
> > sql/item_geofunc.h
> > sql/item_strfunc.cc
> > sql/item_strfunc.h
> > sql/item_timefunc.cc
> > sql/item_timefunc.h
> > sql/mysql_priv.h
> > sql/sql_select.cc
> > sql/time.cc
> >
>
> ...
>
> > === modified file 'sql/item.cc'
> > --- a/sql/item.cc 2009-02-05 07:49:32 +0000
> > +++ b/sql/item.cc 2009-03-04 06:59:35 +0000
> > @@ -201,6 +201,27 @@ bool Item::val_bool()
> > }
> >
> >
>
> Some documentation on the purpose of this method would be good.
>
> > +String *Item::val_ascii(String *str)
> > +{
> > + DBUG_ASSERT(fixed == 1);
> > +
> > + if (!(collation.collation->state & MY_CS_NONASCII))
> > + return val_str(str);
> > +
> > + uint errors;
> > + String *res= val_str(&str_value);
> > + if (!res)
> > + return 0;
> > +
> > + if ((null_value= str->copy(res->ptr(), res->length(),
> > + collation.collation, &my_charset_latin1,
> > + &errors)))
> > + return 0;
> > +
> > + return str;
> > +}
> > +
> > +
>
> ...
>
> > @@ -912,8 +933,8 @@ bool Item::get_date(MYSQL_TIME *ltime,ui
> > if (result_type() == STRING_RESULT)
> > {
> > char buff[40];
> > - String tmp(buff,sizeof(buff), &my_charset_bin),*res;
> > - if (!(res=val_str(&tmp)) ||
> > + String tmp(buff, sizeof(buff), &my_charset_latin1), *res;
> > + if (!(res= val_ascii(&tmp)) ||
> > str_to_datetime_with_warn(res->ptr(), res->length(),
> > ltime, fuzzydate) <=
> MYSQL_TIMESTAMP_ERROR)
> > goto err;
>
> What is the reasoning behind dates not being dependent on connection
> character set, but always be latin1?
This is to pass correct input to str_to_datetime_with_warn().
Previous code with my_charset_bin didn't work for real
multi-byte character sets, because in case of UCS2
str_to_date_time_with_warn() would get:
'\0' '1' '\0' '9' '\0' '9' '\0' '9' '\0' '-' '\0' '0' '\0' '1'
instead of
'1999-01'. I.t. extra 0x00 bytes would be in the string.
I use "latin1" here just because we don't have a special
built-in character set "ascii".
However, latin1 can perfectly work instead of ascii.
Note, any built-in character set which encodes digits as single
byte would work as well:
'0' = 0x30
'1' = 0x31
...
'9' = 0x39
>
> ...
>
> > @@ -2258,8 +2279,8 @@ Item_decimal::Item_decimal(const char *s
> > name= (char*) str_arg;
> > decimals= (uint8) decimal_value.frac;
> > fixed= 1;
> > - max_length= my_decimal_precision_to_length(decimal_value.intg +
> decimals,
> > - decimals, unsigned_flag);
> > + set_char_length(my_decimal_precision_to_length(decimal_value.intg
> + decimals,
> > + decimals,
> unsigned_flag));
> > }
> >
> > Item_decimal::Item_decimal(longlong val, bool unsig)
> > @@ -2267,8 +2288,8 @@ Item_decimal::Item_decimal(longlong val,
> > int2my_decimal(E_DEC_FATAL_ERROR, val, unsig, &decimal_value);
> > decimals= (uint8) decimal_value.frac;
> > fixed= 1;
> > - max_length= my_decimal_precision_to_length(decimal_value.intg +
> decimals,
> > - decimals, unsigned_flag);
> > + set_char_length(my_decimal_precision_to_length(decimal_value.intg
> + decimals,
> > + decimals,
> unsigned_flag));
> > }
> >
> >
> > @@ -2277,8 +2298,8 @@ Item_decimal::Item_decimal(double val, i
> > double2my_decimal(E_DEC_FATAL_ERROR, val, &decimal_value);
> > decimals= (uint8) decimal_value.frac;
> > fixed= 1;
> > - max_length= my_decimal_precision_to_length(decimal_value.intg +
> decimals,
> > - decimals, unsigned_flag);
> > + set_char_length(my_decimal_precision_to_length(decimal_value.intg
> + decimals,
> > + decimals,
> unsigned_flag));
> > }
> >
> >
> > @@ -2287,8 +2308,7 @@ Item_decimal::Item_decimal(const char *s
> > {
> > my_decimal2decimal(val_arg, &decimal_value);
> > name= (char*) str;
> > - decimals= (uint8) decimal_par;
> > - max_length= length;
> > + set_char_length_and_dec(length, decimal_par);
> > fixed= 1;
> > }
> >
> > @@ -2298,8 +2318,8 @@ Item_decimal::Item_decimal(my_decimal *v
> > my_decimal2decimal(value_par, &decimal_value);
> > decimals= (uint8) decimal_value.frac;
> > fixed= 1;
> > - max_length= my_decimal_precision_to_length(decimal_value.intg +
> decimals,
> > - decimals, unsigned_flag);
> > + set_char_length(my_decimal_precision_to_length(decimal_value.intg
> + decimals,
> > + decimals,
> unsigned_flag));
> > }
> >
> >
> > @@ -2309,8 +2329,8 @@ Item_decimal::Item_decimal(const uchar *
> > &decimal_value, precision, scale);
> > decimals= (uint8) decimal_value.frac;
> > fixed= 1;
> > - max_length= my_decimal_precision_to_length(precision, decimals,
> > - unsigned_flag);
> > + set_char_length(my_decimal_precision_to_length(precision, decimals,
> > + unsigned_flag));
> > }
>
> In the above Item_decimal constructors, you use two ways of setting
> decimals and max_length:
>
> 1. Set decimals and call set_char_length
> 2. Call set_char_length_and_dec with two params
>
> I think some consistency here would be good.
I use "decimal" also as a temporary variable here.
With set_char_length_and_dec() it will look more cumbersome:
str2my_decimal(E_DEC_FATAL_ERROR, str_arg, length, charset, &decimal_value);
name= (char*) str_arg;
fixed= 1;
set_char_length_and_dec(my_decimal_precision_to_length(decimal_value.intg +
decimal_value.frac,
decimal_value.frac,
unsigned_flag),
decimal_value.frac);
8 lines of code instead of 6.
What do you think?
>
> ...
>
> > @@ -531,19 +533,22 @@ void Item_func::fix_num_length_and_dec()
> > for (uint i=0 ; i < arg_count ; i++)
> > {
> > set_if_bigger(decimals,args[i]->decimals);
> > - set_if_bigger(fl_length, args[i]->max_length);
> > + set_if_bigger(fl_length, args[i]->max_char_length());
> > }
> > - max_length=float_length(decimals);
> > + max_length= float_length(decimals);
> > if (fl_length > max_length)
> > {
> > decimals= NOT_FIXED_DEC;
> > max_length= float_length(NOT_FIXED_DEC);
> > }
> > + set_char_length(max_length);
> > }
>
> I do not like this way of using max_length for the intermediate
> results. This way, the unit of max_length is changed from #chars to
> #bytes in the middle of this function. That is a bit confusing.
I think I will add a temporary variable, say char_length,
and will do this instead:
set_char_length(char_length);
Do you agree?
>
> ...
>
> > @@ -599,9 +605,9 @@ void Item_func::count_real_length()
> > if (decimals != NOT_FIXED_DEC)
> > {
> > set_if_bigger(decimals, args[i]->decimals);
> > - set_if_bigger(length, (args[i]->max_length - args[i]->decimals));
> > + set_if_bigger(length, (args[i]->max_char_length() -
> args[i]->decimals));
> > }
> > - set_if_bigger(max_length, args[i]->max_length);
> > + set_if_bigger(max_length, args[i]->max_char_length());
> > }
> > if (decimals != NOT_FIXED_DEC)
> > {
> > @@ -612,6 +618,7 @@ void Item_func::count_real_length()
> > else
> > max_length= length;
> > }
> > + max_length*= collation.collation->mbmaxlen;
> > }
>
> Why don't you use set_char_length here?
I think I will add a temporary variable, say "char_length",
the use this variable in the body of the function, and then
do this instead:
set_char_length(char_length);
Similar to the previous case.
>
> ...
> > @@ -686,7 +692,7 @@ void Item_num_op::find_num_type(void)
> > r0 == STRING_RESULT || r1 ==STRING_RESULT)
> > {
> > count_real_length();
> > - max_length= float_length(decimals);
> > + max_length= float_length(decimals) * collation.collation->mbmaxlen;
> > hybrid_type= REAL_RESULT;
> > }
> > else if (r0 == DECIMAL_RESULT || r1 == DECIMAL_RESULT)
> > @@ -727,7 +733,7 @@ void Item_func_num1::find_num_type()
> > case STRING_RESULT:
> > case REAL_RESULT:
> > hybrid_type= REAL_RESULT;
> > - max_length= float_length(decimals);
> > + max_length= float_length(decimals) * collation.collation->mbmaxlen;
> > break;
> > case DECIMAL_RESULT:
> > break;
>
> and here?
Again, I suggest to add a temporary variable "char_length"
and use set_char_length(char_length) instead of multiplication.
>
> ...
>
> > @@ -757,6 +762,30 @@ void Item_func_numhybrid::fix_length_and
> > }
> >
> >
> > +/*
> > + Would be greate to make it a method of the String class,
>
> Spelling: greate
>
> > + but this would need to include
> > + my_decimal.h from sql_string.h and sql_string.cc, which is not
> desirable.
>
> And don't think you would need to include it in sql_string.h. Forward
> declaration would be sufficient. (But I understand your point)
>
You're right, a forward declaration of my_decimal2string() and
type "my_decimal" in sql_string.h would do. In other hand,
keeping things separately is a good thing.
Not sure which is preferable though :)
> > +*/
> > +bool
> > +str_set_decimal(my_decimal *val, String *str, CHARSET_INFO *cs)
> > +{
> > + if (!(cs->state & MY_CS_NONASCII))
> > + {
> > + my_decimal2string(E_DEC_FATAL_ERROR, val, 0, 0, 0, str);
> > + return FALSE;
> > + }
> > + else
> > + {
> > + uint errors;
> > + char buf[DECIMAL_MAX_STR_LENGTH];
> > + String tmp(buf, sizeof(buf), &my_charset_latin1);
> > + my_decimal2string(E_DEC_FATAL_ERROR, val, 0, 0, 0, &tmp);
> > + return str->copy(tmp.ptr(), tmp.length(), &my_charset_latin1,
> cs, &errors);
> > + }
> > +}
>
> As far as I can tell, this does not handle that the decimal separator
> may different for different languages/locale (e.g., point vs. comma).
> Is this not supported?
set_set_decimal() always use dot as a decimal point.
Locale dependent decimal points and thousand separators are supported
only in the function format():
SELECT FORMAT(1234.5645656, 3, 'de_DE');
>
> ...
>
> > @@ -1835,20 +1868,20 @@ longlong Item_func_bit_neg::val_int()
> >
> > void Item_func_integer::fix_length_and_dec()
> > {
> > - max_length=args[0]->max_length - args[0]->decimals+1;
> > + max_length= args[0]->max_char_length() - args[0]->decimals + 1;
> > uint tmp=float_length(decimals);
> > set_if_smaller(max_length,tmp);
> > - decimals=0;
> > + set_char_length_and_dec(max_length);
> > }
> >
> > void Item_func_int_val::fix_num_length_and_dec()
> > {
> > - max_length= args[0]->max_length - (args[0]->decimals ?
> > - args[0]->decimals + 1 :
> > - 0) + 2;
> > + max_length= args[0]->max_char_length() - (args[0]->decimals ?
> > + args[0]->decimals + 1 :
> > + 0) + 2;
> > uint tmp= float_length(decimals);
> > set_if_smaller(max_length,tmp);
> > - decimals= 0;
> > + set_char_length_and_dec(max_length);
> > }
>
> Here it seems you are mixing char length and byte length. Is the
> above really correct?
I think all lines of the code (except the last one) use "char length"
notation.
The last line converts into byte notation:
set_char_length_and_dec(max_length);
Perhaps with adding a temporary variable "char_length", similar to the
above cases, would look less confusing.
What do you think?
>
> >
> >
> > @@ -1861,7 +1894,7 @@ void Item_func_int_val::find_num_type()
> > case STRING_RESULT:
> > case REAL_RESULT:
> > hybrid_type= REAL_RESULT;
> > - max_length= float_length(decimals);
> > + max_length= float_length(decimals) * collation.collation->mbmaxlen;
> > break;
> > case INT_RESULT:
> > case DECIMAL_RESULT:
> > @@ -1993,10 +2026,11 @@ void Item_func_round::fix_length_and_dec
> > longlong val1;
> > bool val1_unsigned;
> >
> > + collation.set(default_charset(), DERIVATION_COERCIBLE,
> MY_REPERTOIRE_ASCII);
>
> Why is this needed here? I see a lot of use of collation without
> having set it. I have not figured out how you know when it needs to
> be set and when it has been set before.
Item_func_round::fix_length_and_dec() has a number of "returns"
in the middle of the body.
Collation must be set before we calculate max_length.
Perhaps, to make it clearer, we should add a temporary
variable char_length again, replace all "return" to
"goto end" and then put this code:
end:
collation.set(default_charset(), DERIVATION_COERCIBLE,
MY_REPERTOIRE_ASCII);
set_char_length(char_length);
}
What do you think?
>
> > unsigned_flag= args[0]->unsigned_flag;
> > if (!args[1]->const_item())
> > {
> > - max_length= args[0]->max_length;
> > + max_length= args[0]->max_char_length();
> > decimals= args[0]->decimals;
> > if (args[0]->result_type() == DECIMAL_RESULT)
> > {
> > @@ -2005,6 +2039,7 @@ void Item_func_round::fix_length_and_dec
> > }
> > else
> > hybrid_type= REAL_RESULT;
> > + max_length*= collation.collation->mbmaxlen;
> > return;
> > }
>
> More inconsistency: max_char_length() is used to get max_length, but
> set_char_length() is not used to set it. I guess that explains why
> collation.set was needed. Another call for consistency.
The proposal about temporary variable "char_length" should fix it.
Do you agree?
>
> ...
>
>
> > @@ -3987,9 +4025,9 @@ void
> > Item_func_set_user_var::fix_length_and_dec()
> > {
> > maybe_null=args[0]->maybe_null;
> > + collation.set(args[0]->collation.collation, DERIVATION_IMPLICIT);
> > max_length=args[0]->max_length;
> > decimals=args[0]->decimals;
> > - collation.set(args[0]->collation.collation, DERIVATION_IMPLICIT);
> > }
>
> Why did you need to move this statement?
No needs. Reverted.
>
> By the way, why is not set_char_length needed here?
Item_func_set_user_var just copies all attributes from
the expression.
So:
collation.set(args[0]->collation.collation, DERIVATION_IMPLICIT);
max_length= args[0]->max_length;
is effectively the same to:
collation.set(args[0]->collation.collation, DERIVATION_IMPLICIT);
set_char_length(args[0]->max_char_length);
The first versrion requires less CPU (no extra devision and extra
multiplication).
>
> ...
>
> > @@ -5356,8 +5390,8 @@ longlong Item_func_inet_aton::val_int()
> > char buff[36];
> > int dot_count= 0;
> >
> > - String *s,tmp(buff,sizeof(buff),&my_charset_bin);
> > - if (!(s = args[0]->val_str(&tmp))) // If null value
> > + String *s, tmp(buff, sizeof(buff), &my_charset_latin1);
> > + if (!(s = args[0]->val_ascii(&tmp))) // If null value
> > goto err;
> > null_value=0;
>
> There is a comment "// Assume ascii" further down in the code that
> becomes a bit strange now that we know that it is ascii.
You're right, I removed the comment.
>
> >
> >
> > === modified file 'sql/item_func.h'
> > --- a/sql/item_func.h 2009-01-23 19:37:08 +0000
> > +++ b/sql/item_func.h 2009-03-04 06:59:35 +0000
> > @@ -215,7 +215,10 @@ public:
> > { DBUG_ASSERT(fixed == 1); return (longlong) rint(val_real()); }
> > enum Item_result result_type () const { return REAL_RESULT; }
> > void fix_length_and_dec()
> > - { decimals= NOT_FIXED_DEC; max_length= float_length(decimals); }
> > + {
> > + decimals= NOT_FIXED_DEC;
> > + set_char_length(float_length(NOT_FIXED_DEC));
> > + }
> > };
> >
> >
> > @@ -324,7 +327,7 @@ public:
> > double val_real();
> > String *val_str(String*str);
> > enum Item_result result_type () const { return INT_RESULT; }
> > - void fix_length_and_dec() {}
> > + void fix_length_and_dec() { set_char_length_and_dec(21); }
> > };
> >
>
> Why do you need to set this when it is set in the constructor of the
> parent class?
>
> By the way, lot's of fix_length_and_dec functions uses hard-coded
> constants. Why not use the constructor instead?
Thank for noticing this. This change is not needed. Removed.
>
> >
> > @@ -349,7 +352,7 @@ public:
> > longlong val_int();
> > longlong val_int_from_str(int *error);
> > void fix_length_and_dec()
> > - { max_length=args[0]->max_length; unsigned_flag=0; }
> > + { set_char_length_and_dec(args[0]->max_char_length());
> unsigned_flag=0; }
>
> Is this necessary because the charset of this and args[0] might be
> different? Otherwise, the only thing this change does, is add one
> extra multiplication and one extra division.
Yes, charsets of this and args[0] can be different.
CAST(xxx AS SIGNED) returns @@character_set_connection in string
context, while xxx can be of any other character sets.
For example, xxx can be a table column,
or a string literal with character set introducer,
and so on:
mysql> set @@character_set_connection=ucs2;
Query OK, 0 rows affected (0.00 sec)
mysql> select charset(cast(_latin1'1' as signed));
+-------------------------------------+
| charset(cast(_latin1'1' as signed)) |
+-------------------------------------+
| ucs2 |
+-------------------------------------+
1 row in set (0.00 sec)
In this example:
- charset of args[0] is latin1
- charset of this is ucs2
>
> > virtual void print(String *str, enum_query_type query_type);
> > uint decimal_precision() const { return
> args[0]->decimal_precision(); }
> > };
> > @@ -362,7 +365,8 @@ public:
> > const char *func_name() const { return "cast_as_unsigned"; }
> > void fix_length_and_dec()
> > {
> > - max_length= min(args[0]->max_length, DECIMAL_MAX_PRECISION + 2);
> > + set_char_length_and_dec(min(args[0]->max_char_length(),
> > + DECIMAL_MAX_PRECISION + 2));
> > unsigned_flag=1;
> > }
> > longlong val_int();
> > @@ -385,7 +389,7 @@ public:
> > my_decimal *val_decimal(my_decimal*);
> > enum Item_result result_type () const { return DECIMAL_RESULT; }
> > enum_field_types field_type() const { return MYSQL_TYPE_NEWDECIMAL; }
> > - void fix_length_and_dec() {};
> > + void fix_length_and_dec() { set_char_length(max_length); };
>
> I do not understand this. This says that max_length is off by a
> mbmaxlen factor. Why could it not be set correctly in the first
> place?
max_length is initialized in constructor, in this line:
max_length= my_decimal_precision_to_length(len, dec, unsigned_flag);
and means "number of characters".
charset is not know at this point yet.
When we do fix_length_and_dec(), charset is already known.
So we just convert from "number of characters" to "number of bytes".
It gives some inconsistency in what "max_length" means though
in different points of time.
What do you suggest?
>
> ...
> > @@ -714,6 +715,7 @@ public:
> > Item_func_sign(Item *a) :Item_int_func(a) {}
> > const char *func_name() const { return "sign"; }
> > longlong val_int();
> > + void fix_length_and_dec() { set_char_length_and_dec(2); }
> > };
>
> This seems like a change of behavior. Is this fixing a bug?
Yes, this is a bug.
MySQL creates a minimal possible data type.
sign() didn't follow this rule. In 5.0 you get:
mysql> create table t1 as select concat(sign(-1)) as c;
Query OK, 1 row affected (0.01 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> explain t1;
+-------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------------+------+-----+---------+-------+
| c | varbinary(21) | NO | | | |
+-------+---------------+------+-----+---------+-------+
1 row in set (0.04 sec)
With WL#2649 it will work as follows:
mysql> create table t1 as select concat(sign(-1)) as c;
Query OK, 1 row affected (0.04 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> explain t1;
+-------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| c | varchar(2) | NO | | | |
+-------+------------+------+-----+---------+-------+
1 row in set (0.01 sec)
>
> ...
>
> > @@ -877,7 +877,7 @@ public:
> > Item_func_ascii(Item *a) :Item_int_func(a) {}
> > longlong val_int();
> > const char *func_name() const { return "ascii"; }
> > - void fix_length_and_dec() { max_length=3; }
> > + void fix_length_and_dec() { set_char_length_and_dec(3); }
> > };
>
> This was not set before either. Is it some of your other changes that
> makes this change necessary, or is it an old bug?
The same problem here. ascii() has been one of the rare functions
which didn't create minimal possible data type.
>
> ...
>
> > @@ -1695,7 +1694,8 @@ public:
> > Item_func_found_rows() :Item_int_func() {}
> > longlong val_int();
> > const char *func_name() const { return "found_rows"; }
> > - void fix_length_and_dec() { decimals= 0; maybe_null=0; }
> > + void fix_length_and_dec()
> > + { Item_int_func::fix_length_and_dec(); maybe_null= 0; }
> > };
> >
> >
> > @@ -1707,8 +1707,7 @@ public:
> > Item_func_uuid_short() :Item_int_func() {}
> > const char *func_name() const { return "uuid_short"; }
> > longlong val_int();
> > - void fix_length_and_dec()
> > - { max_length= 21; unsigned_flag=1; }
> > + void fix_length_and_dec() { set_char_length_and_dec(21);
> unsigned_flag= 1; }
>
> Why is not Item_int_func::fix_length_and_dec() called here like for
> the classes above?
Oops. This is a leftover from a previous patch version.
Item_int_func::fix_length_and_dec() is empty now.
There's no need to call it at all.
I will remove all calls for Item_int_func::fix_length_and_dec().
Do you agree?
>
> > bool check_partition_func_processor(uchar *int_arg) {return FALSE;}
> > };
> >
> >
> > === modified file 'sql/item_geofunc.cc'
> > --- a/sql/item_geofunc.cc 2007-12-21 20:10:05 +0000
> > +++ b/sql/item_geofunc.cc 2009-03-04 06:59:35 +0000
>
> ...
>
> > -String *Item_func_as_wkt::val_str(String *str)
> > +String *Item_func_as_wkt::val_ascii(String *str)
> > {
> > DBUG_ASSERT(fixed == 1);
> > String arg_val;
> > @@ -122,6 +122,7 @@ String *Item_func_as_wkt::val_str(String
> >
> > void Item_func_as_wkt::fix_length_and_dec()
> > {
> > + collation.set(default_charset(), DERIVATION_COERCIBLE,
> MY_REPERTOIRE_ASCII);
> > max_length=MAX_BLOB_WIDTH;
> > maybe_null= 1;
> > }
>
> Why is collation set to default_charset() when val_ascii is to be used
> instead of val_str?
val_ascii() is used only in cases when ASCII input is expected.
On other cases, @@character_set_connection will be used:
mysql> set @@character_set_connection=ucs2; select
charset(AsWKT(PointFromText('Point(1 1)')));
Query OK, 0 rows affected (0.00 sec)
+---------------------------------------------+
| charset(AsWKT(PointFromText('Point(1 1)'))) |
+---------------------------------------------+
| ucs2 |
+---------------------------------------------+
1 row in set (0.00 sec)
> And why do you not use set_char_length() to set the
> collation and length?
This is because we have the TEXT data type, not VARCHAR.
TEXT is limited in *bytes*, not in characters.
>
>
> > @@ -145,7 +146,7 @@ String *Item_func_as_wkb::val_str(String
> > }
> >
> >
> > -String *Item_func_geometry_type::val_str(String *str)
> > +String *Item_func_geometry_type::val_ascii(String *str)
> > {
> > DBUG_ASSERT(fixed == 1);
> > String *swkb= args[0]->val_str(str);
> > @@ -158,8 +159,7 @@ String *Item_func_geometry_type::val_str
> > return 0;
> > /* String will not move */
> > str->copy(geom->get_class_info()->m_name.str,
> > - geom->get_class_info()->m_name.length,
> > - default_charset());
> > + geom->get_class_info()->m_name.length,
> &my_charset_latin1);
> > return str;
> > }
>
> I cannot find anything in the worklog that says that the character set
> for geometry_type should change. Where is that covered?
Currently these functions return BINARY, which is wrong:
mysql> select charset(AsWKT(PointFromText('Point(1 1)')));
+---------------------------------------------+
| charset(AsWKT(PointFromText('Point(1 1)'))) |
+---------------------------------------------+
| binary |
+---------------------------------------------+
1 row in set (0.01 sec)
mysql> select charset(GeometryType(PointFromText('Point(1 1)')));
+----------------------------------------------------+
| charset(GeometryType(PointFromText('Point(1 1)'))) |
+----------------------------------------------------+
| binary |
+----------------------------------------------------+
1 row in set (0.00 sec)
Output from functions AsWKT()/AsText() and GeometryType()
are definitely character strings, and they must report
themself as character strings, not as binary arrays.
They were forgotten in WL#2649 description.
Sergey: I'm going to add these functions into WL#2649. Do you agree?
>
> >
> >
> > === modified file 'sql/item_geofunc.h'
> > --- a/sql/item_geofunc.h 2008-02-22 10:30:33 +0000
> > +++ b/sql/item_geofunc.h 2009-03-04 06:59:35 +0000
>
> ...
>
> > @@ -231,48 +231,60 @@ public:
> > Item_func::print(str, query_type);
> > }
> >
> > - void fix_length_and_dec() { maybe_null= 1; }
> > + void fix_length_and_dec() { set_char_length_and_dec(2);
> maybe_null= 1; }
> > bool is_null() { (void) val_int(); return null_value; }
> > };
> >
> > -class Item_func_isempty: public Item_bool_func
> > +
> > +class Item_geom_bool_func: public Item_bool_func
> > {
> > public:
> > - Item_func_isempty(Item *a): Item_bool_func(a) {}
> > - longlong val_int();
> > + Item_geom_bool_func(Item *a): Item_bool_func(a) {}
> > optimize_type select_optimize() const { return OPTIMIZE_NONE; }
> > + void fix_length_and_dec() { set_char_length_and_dec(1);
> maybe_null= 1; }
> > +};
>
> maybe_null was not set before. Is this a functional change?
Diff doesn't look nice here.
What happes here is:
I added a new intermediate class Item_geom_bool_func,
and incorporated all common features of these three classes:
- Item_func_isempty
- Item_func_issimple
- Item_func_isclosed
These classes all used to set maybe_null to 1
in their respected fix_length_and_dec() methods.
I just moved this code into this new parent class.
No behaviour change.
>
> ...
>
> > @@ -282,11 +294,8 @@ public:
> > Item_func_x(Item *a): Item_real_func(a) {}
> > double val_real();
> > const char *func_name() const { return "x"; }
> > - void fix_length_and_dec()
> > - {
> > - Item_real_func::fix_length_and_dec();
> > - maybe_null= 1;
> > - }
> > + void fix_length_and_dec()
> > + { Item_real_func::fix_length_and_dec(); maybe_null= 1; }
> > };
> >
>
> Personally, I do not think this is a change to the better, but that's
> a matter of personal taste.
>
> ...
>
> > === modified file 'sql/item_strfunc.cc'
> > --- a/sql/item_strfunc.cc 2009-01-09 18:43:23 +0000
> > +++ b/sql/item_strfunc.cc 2009-03-04 06:59:35 +0000
> > @@ -100,7 +100,28 @@ longlong Item_str_func::val_int()
> > }
> >
> >
> > -String *Item_func_md5::val_str(String *str)
> > +String *Item_str_ascii_func::val_str(String *str)
> > +{
> > + DBUG_ASSERT(fixed == 1);
> > +
> > + if (!(collation.collation->state & MY_CS_NONASCII))
> > + return val_ascii(str);
> > +
> > + uint errors;
> > + String *res= val_ascii(&ascii_buf);
> > + if (!res)
> > + return 0;
> > +
> > + if ((null_value= str->copy(res->ptr(), res->length(),
> > + &my_charset_latin1, collation.collation,
> > + &errors)))
> > + return 0;
> > +
> > + return str;
> > +}
>
> This is very identical to Item::val_ascii, only difference is that
> val_ascii is called instead of val_str. This code duplication
> indicates that there is something fishy with this design ...
You're absolutely right they are identical - both do conversion.
However, they do absolutely different conversion:
- From ASCII to collation.collation
vs
- From collation.collation to ASCII.
If an Item is ASCII by its nature, then I define val_ascii() for it,
and do conversion from ASCII to collation.collation in
Item_str_ascii_func::val_str() when we need val_str().
If an Item is not ASCII by its nature, then I define val_str(),
and do conversion from @@character_set_connection to ASCII in
Item::val_ascii() when we need val_ascii().
I don't think anything which really simplifies the code
can be done here.
What do you suggest?
>
> ...
>
> > @@ -3307,11 +3332,11 @@ String* Item_func_inet_ntoa::val_str(Str
> > num[0]=(char) n1+'0';
> > num[1]=(char) n2+'0';
> > num[2]=(char) c+'0';
> > - uint length=(n1 ? 4 : n2 ? 3 : 2); // Remove pre-zero
> > -
> > - (void) str->append(num+4-length,length);
> > + uint length=(n1 ? 4 : n2 ? 3 : 2); // Remove pre-zero
> > + uint dot_length= (p <= buf) ? 1 : 0;
> > + (void) str->append(num + 4 - length, length - dot_length,
> > + &my_charset_latin1);
> > }
> > - str->length(str->length()-1); // Remove last '.';
> > return str;
> > }
>
> The new version seems a bit different from the previous one. Is there
> a bug fix hidden here?
Yes, there was a bug - this functions didn't work well
with real-multi-byte character sets UCS2, UTF16, UTF32.
It removed trailing extra dot *after* result is copied to "str",
which broke multi-byte character in case of UCS2/UTF16/UTF32.
Now it removed trailing extra dot *before* result is copied to "str",
which garantees we remove the dot correctly.
>
> ...
>
> > === modified file 'sql/item_timefunc.h'
> > --- a/sql/item_timefunc.h 2009-02-07 16:00:57 +0000
> > +++ b/sql/item_timefunc.h 2009-03-04 06:59:35 +0000
> > @@ -708,43 +619,7 @@ class Item_extract :public Item_int_func
> > };
> >
> >
> > -class Item_typecast :public Item_str_func
> > -{
> > -public:
> > - Item_typecast(Item *a) :Item_str_func(a) {}
> > - String *val_str(String *a)
> > - {
> > - DBUG_ASSERT(fixed == 1);
> > - String *tmp=args[0]->val_str(a);
> > - null_value=args[0]->null_value;
> > - if (tmp)
> > - tmp->set_charset(collation.collation);
> > - return tmp;
> > - }
> > - void fix_length_and_dec()
> > - {
> > - collation.set(&my_charset_bin);
> > - max_length=args[0]->max_length;
> > - }
> > - virtual const char* cast_type() const= 0;
> > - virtual void print(String *str, enum_query_type query_type);
> > -};
> > -
> > -
> > -class Item_typecast_maybe_null :public Item_typecast
> > -{
> > -public:
> > - Item_typecast_maybe_null(Item *a) :Item_typecast(a) {}
> > - void fix_length_and_dec()
> > - {
> > - collation.set(&my_charset_bin);
> > - max_length=args[0]->max_length;
> > - maybe_null= 1;
> > - }
> > -};
> > -
> > -
> > -class Item_char_typecast :public Item_typecast
> > +class Item_char_typecast :public Item_str_func
> > {
> > int cast_length;
> > CHARSET_INFO *cast_cs, *from_cs;
> > @@ -752,23 +627,31 @@ class Item_char_typecast :public Item_ty
> > String tmp_value;
> > public:
> > Item_char_typecast(Item *a, int length_arg, CHARSET_INFO *cs_arg)
> > - :Item_typecast(a), cast_length(length_arg), cast_cs(cs_arg) {}
> > + :Item_str_func(a), cast_length(length_arg), cast_cs(cs_arg) {}
> > enum Functype functype() const { return CHAR_TYPECAST_FUNC; }
> > bool eq(const Item *item, bool binary_cmp) const;
> > const char *func_name() const { return "cast_as_char"; }
> > - const char* cast_type() const { return "char"; };
> > String *val_str(String *a);
> > void fix_length_and_dec();
> > virtual void print(String *str, enum_query_type query_type);
> > };
> >
> >
> > -class Item_date_typecast :public Item_typecast_maybe_null
> > +class Item_typecast :public Item_str_ascii_func
> > +{
> > +public:
> > + Item_typecast(Item *a) :Item_str_ascii_func(a) {}
> > + virtual const char* cast_type() const= 0;
> > + virtual void print(String *str, enum_query_type query_type);
> > +};
> > +
> > +
> > +class Item_date_typecast :public Item_typecast
>
> Why have you changed the order of the class definitions here? Makes
> the diff harder to read.
Sorry for that. Restored the original order of the classes.
>
> ...
>
> > === modified file 'sql/sql_select.cc'
> > --- a/sql/sql_select.cc 2009-02-13 17:44:21 +0000
> > +++ b/sql/sql_select.cc 2009-03-04 06:59:35 +0000
> > @@ -13660,10 +13660,10 @@ static Field *create_tmp_field_from_item
> > Field_long : make them Field_longlong.
> > */
> > if (item->max_length >= (MY_INT32_NUM_DECIMAL_DIGITS - 1))
> > - new_field=new Field_longlong(item->max_length, maybe_null,
> > + new_field=new Field_longlong(item->max_char_length(), maybe_null,
> > item->name, item->unsigned_flag);
> > else
> > - new_field=new Field_long(item->max_length, maybe_null,
> > + new_field=new Field_long(item->max_char_length(), maybe_null,
> > item->name, item->unsigned_flag);
>
> Why should max_char_length() be used in the constructors and not in the
> if-test? And why is not the same change done for REAL_RESULT?
>
> ...
You're right, it should be max_char_length() in all cases. Fixed.
Please find the fixed version of the patch attached.
Attachment: [text/x-patch] wl2649.diff