List:Commits« Previous MessageNext Message »
From:Øystein Grøvlen Date:May 5 2009 12:44pm
Subject:Re: bzr commit into mysql-6.0 branch (bar:2701) WL#2649
View as plain text  
Hi Bar,

Thanks for the explanations to my question/comments.  I have some more
comments in-line.  (This only contains replies to your first email.  I
will get back to the detailed part tomorrow.  I was almost done with
this email when I saw your latest reply.)


Alexander Barkov wrote:
 > Hi Øystein,
 >
 >
 > Thank you very much for review. I returned from UC on the last weekend.
 > Now catching up with my emails. I will send a detailed answer soon.
 > Brief answers are inline:
 >
 >
 > Øystein Grøvlen wrote:
 >> Bar,
 >>
 >> I have so far only looked at the C++-part of the patch, not the tests.
 >> (I will come back to them later.)  I also have not tried to apply the
 >> patch yet, which I think is necessary in order to figure out if
 >> something is missing.  Do you have any recommendation on which version
 >> of which branch to apply the patch?
 >
 >
 > The patch was for "mysql-6.0" tree, which I cloned about three
 > months ago.

OK.  I try will try to apply it.

 >
 >
 >>
 >> Most of the changes looks good, but I have a few questions/comments:
 >>
 >> 1. I think I understood most of the worklog, but I am a bit puzzled
 >>    with the following statement: "Now this worklog has nothing to do
 >>    with ascii."
 >  >
 >> Why then does this patch introduce methods called
 >>    val_ascii and change charsets from binary to latin1?  Is this
 >>    related to what is said in the section "Other automatic
 >>    conversions"?  I am not really able to tell what that section is
 >>    saying.  I think the worklog needs to explain this better.
 >
 > Thanks for noticing this.
 >
 > This phrase needs to be either removed,
 > or rephrased to something like this:
 >
 > "This patch does not introduce a build-in 'ascii' character set".
 >

OK.  So this means that there are no 'ascii' character set, but latin1
is used instead?

My problem is that I am not able to understand from the worklog where
ascii should be used instead of the character set/collation defined by
the connection.  Is this so obvious that it does not need to be
mentioned, or is it mentioned in a way that I am not able to
understand?

 >
 >>
 >> 2. Why do you rename methods from val_str to val_ascii? As far as I
 >>    understand, the val_xxx methods represents the four basic types
 >>    with operating on int, real, decimal, and string.  You are not
 >>    adding another type here, just changing the representation of
 >>    strings. Why not just change val_str to return ascii where
 >>    appropriate?  Do you ever need both an ascii and a non-ascii
 >>    representation of the same item value?
 >
 > This is done for performance optimization purposes,
 > and using the following facts and ideas:
 >
 > 1. Result of some Items in string context
 >    depends on @@character_set_results.
 >    @@character_set_results can be set to a "real multibyte" character
 >    set like UCS2, UTF16, UTF32. I will use only UTF32 in the examples
 >    below for convenience.
 >
 >    So the default string result of such functions
 >    in this circumstances is real multi-byte character set, like UTF32.
 >
 >    For example, all numbers in string context
 >    return result in @@character_set_results:
 >
 >    SELECT CONCAT(20010101); -> UTF32
 >
 >    Similar, some Fields internally store data in ASCII format,
 >    like DECIMAL, TIMESTAMP, but when you do val_str(), they return
 >    "real multibyte" result because @@character_set_results wants so:

I am not sure I understand what you mean by "internally store data in
ASCII format". AFAICT, a TIMESTAMP field is stored as a longlong, and
Field_timestamp::val->str() returns a string with binary charset, but
maybe you are thinking at another level?

 >
 >    SELECT CONCAT(date_column); -> UTF32
 >
 >    So briefly, some data sources can use ASCII representation
 >    internally, but return multi-byte data only because
 >    @@character_set_results wants so.
 >    Therefore, conversion from ASCII to UTF32 is applied internally.
 >
 >
 > 2. Some other functions need in fact ASCII input.
 >
 >    For example,
 >      inet_aton(), GeometryFromText(), Convert_TZ(), GET_FORMAT().
 >
 >    Similar, fields of certain type, like DATE, TIME,
 >    when you insert string data into them, expect in fact ASCII input.
 >    If they get non-ASCII input, for example UTF32, they
 >    convert input from ASCII to UTF32, and then use ASCII
 >    representation to do further processing.
 >
 > 3. Now imagine we pass result of a data source of the first type
 >    to a data destination of the second type.
 >
 >    What happens:
 >    a. data source converts data from ASCII to UTF32, because
 >       @@character_set_results wants so and passes the result to
 >       data destination.
 >    b. data destination gets UTF32 string.
 >    c. data destination converts UTF32 string to ASCII,
 >       because it needs ASCII representation to be able to handle data
 >       correctly.
 >
 >    As a result we get two steps of unnecessary conversion:
 >    From ASCII to UTF32, then from UTF32 to ASCII.
 >
 >    A better way to handle these situations would be to pass ASCII
 >    representation directly from source to destination.

Thanks, now I understand what you want to achieve.

 >
 >
 > This is what the patch does.
 > It actually splits val_str() method into two parts when it is possible:
 >
 > - Getting ASCII representation, implemented as val_ascii() virtual
 >   method.
 >
 > - Converting ASCII representation to what @@character_set_results wants
 >   (to UTF32 using my example).
 >   This conversion is implemented  in Item_str_ascii_func::val_str().
 >
 >
 > So in other words, val_ascii() forces a data source to return ASCII
 > representation where val_str() would return UTF32.

OK, now I think I understand.  Item_str_ascii_func::val_str() is now
inherited by all Item classes for functions that should return ascii.
Item::val_ascii() provides a default implementation for classes that
does not implement their own and is based on converting the output of
val_str() to ASCII.  By providing its own val_ascii(), a class can
avoid that the internal ascii representation is converted to
@@character_set_results by val_str() and back to ascii by
Field::val_ascii().  It would be nice if this could be explained in the 
code.

I think you should consider another name that val_ascii() since that
kind of implies that you are introducing a fifth basic type, but
that's not actually what is the case here, I think..

 > Also, the patch helps to get rid of redundant conversion code.
 > the conversion step is now implemented only once: in
 > Item_str_ascii_func::val_str(). Previously it was implemented
 > in every single val_str() implementation of such "ASCII data source"
 > functions.
 >

I am not able to find removal of such redundant code in the patch.
Could you give an example.

 >
 >>
 >>    Note also that as part of the reenginering project, the val_xxx
 >>    methods will be disappear, and replaced with a value() method that
 >>    will return a Value object on which one can call type conversion
 >>    methods.  (e.g., value().to_string()).  See WL#4760.
 >
 > Right, and I know you're working on this task already.
 >
 > I don't think my change will be in hard conflict with WL#4760 very much.
 >
 > val_ascii() will be just transformed into some other way to ask
 > an Item, or a Field to return ASCII representation, instead
 > of the default representation dictated by @@character_set_results.
 > For example, it can be a method of Value. Say,
 >
 >   Value->I_want_ascii_rather_than_character_set_results();
 >
 > after that
 >
 >   Value->value() will return ASCII representation by default.
 >
 > or a flag to value():
 >
 >   Value->value(FLAG_WANT_ASCII_FOR_STRING_RESULT);

The current design is to have a separate metadata object associated
with Item and Field classes which contains metadata that is necessary
for type conversion, and passed to the Value constructor.  The
ascii-requirement could be stored in this metadata object.

 >
 >
 >>
 >> 3. I feel the use of set_char_length_and_dec is a bit inconsistent.
 >>    In some classes, decimals is set explicitly and set_char_length
 >>    called.
 >
 > Let me try to explain what happens:
 >
 >
 > set_char_length() is done for exact data types.
 > "decimals" is set to 0 in constructor in such cases.

set_char_length is also set in some Item_decimal constructors after
decimals has been set.

 >
 >>  In other places, set_char_length_and_dec is called with
 >>    two parameters.
 >
 > set_char_length_and_dec() sets both "max_length" and "decimals"
 > (i.e. both width and precision).
 >
 > This is done for float, double and decimal data types.
 >
 >
 >>  In yet other places, set_char_length is not used
 >>    at all.  Instead, the multiplication is done without a function
 >>    call.  Some consistency would be good.
 >
 > This is done in rare cases where calculation of
 > "max_length" and "decimal" is more tricky than just
 > calling set_char_length_and_dec(). For example,
 > in case of hybrid data types.

Since there are more cases where decimals is set than where
set_char_length_and_dec is called with two params, I still question
the value of introducing an extra method.  (set_char_length_and_dec
with one param could be replaced with set_char_length since Item
constructor initializes decimals to 0.

 >
 >>
 >>    If it were up to me, I would drop the entire
 >>    set_char_length_and_dec, initialize decimals to 0 in Item
 >>    constructors, and set decimals explicitly when necessary in
 >>    fix_lenght_and_dec etc.
 >
 > Note, max_length depends on collation.
 > set_char_length_and_dec() just allows to get rid of
 > dozens lines of redundant code like this:
 >
 >   max_length= xxx * collation.collation->mbmaxlen;
 >   decimals= yyy;
 >
 > I think this just looks nicer:
 >
 >   set_char_length_and_dec(xxx, yyy);
 >
 > multiplication is hidden into the method.

The multiplication part is an argument for set_char_length, not
set_char_length_and_dec.

 > Note, it will also help later to switch to "store max char_length"
 > approach instead of "store max (byte) length",
 > which you suggest below in N4.
 >
 >
 >> I also think it is a bit confusing to have
 >>    both fix_lenght_and_dec and set_char_length_and_dec methods.  Not
 >>    easy to tell the difference from the name.
 >
 > I agree with you. This is fix_length_and_dec() who has a misleading
 > name. It actually sets much more than just "length" and "dec", for
 > example "unsigned" flags, "not_null" flags,
 > character sets, collations, and some other attributes.
 > I guess fix_length_and_dec() should be renamed eventually.
 > But I don't think it should be done under terms of WL#2649.
 >
 > Returning to my methods - set_char_length() and
 > set_char_length_and_dec() - they do what their names tell:
 > set "max_length" (using length-in-characters notation)
 > and "decimals".

Your methods also have a side effect that it took me some time to
realize the consequences of: It sets collation to default_charset().
This means that you must be careful not to use set_char_length() when
you want another charset (e.g., ascii).  I am afraid that this may not
be obvious to everyone and that could lead to errors being
introduced.

 >
 >
 >>
 >> 4. There is a lot of multiplication/division back and forth in
 >>    max_char_length and set_char_length. (By the way, should not the
 >>    latter be called set_max_char_length?).  Did you consider storing
 >>    max_char_length, instead?
 >
 > Agree. And I did. But this would look like a huge refactoring task.
 > Don't think this is something for WL#2649.
 > Moreover, WL#2649 is considered now as a candidate for 5.4.
 > So I'm trying to keep the patch as small as possible,
 > to reduce the number of merge conflicts.

That seems reasonable.

 >
 >
 >>
 >> 5. Please add some documentation on new methods (e.g.,
 >>    Item::val_ascii).
 >
 > I could paste the above explanation somewhere in the code.
 > Do you think it will work as such documentation?

Yes, please do.

 >
 >>
 >> 6. It is probably not my business, since I am not the architectural
 >>    reviewer, but it would be easier to verify the behavior if the work
 >>    log had contained a table with all functions affected, what they
 >>    used to return, what they will return, and what charset will be
 >>    used. The way it is now, it is a bit hard to verify that all has
 >>    been covered, and, especially, what charset should be used.
 >
 > I proposed that to PeterG and Sergei, who are the author and
 > architecture reviewer for WL#2649. Their verdict was "no needs",

As I said, it would be nice to know when ascii is supposed to be used
or not.

 >
 >>
 >> More detailed comments can be found in-line,
 >
 > <skip>
 >
 >
 > I will send more detailed answers later.
 >
 > Many thanks!
 >


--
Øystein
Thread
bzr commit into mysql-6.0 branch (bar:2701) WL#2649Alexander Barkov4 Mar
  • Re: bzr commit into mysql-6.0 branch (bar:2701) WL#2649Øystein Grøvlen24 Apr
    • Re: bzr commit into mysql-6.0 branch (bar:2701) WL#2649Alexander Barkov29 Apr
      • Re: bzr commit into mysql-6.0 branch (bar:2701) WL#2649Øystein Grøvlen5 May
        • WL#2649Alexander Barkov19 Jun
          • Re: WL#2649Øystein Grøvlen25 Jun
            • Re: WL#2649Alexander Barkov25 Jun
              • Re: WL#2649Øystein Grøvlen25 Jun
                • Re: WL#2649Peter Gulutzan29 Jun
                  • Re: WL#2649Øystein Grøvlen1 Jul
                    • Re: WL#2649Peter Gulutzan2 Jul
                      • Re: WL#2649Øystein Grøvlen3 Jul
                        • Re: WL#2649Peter Gulutzan15 Jul
                        • Re: WL#2649Alexander Barkov20 Aug
                        • Re: WL#2649Alexander Barkov20 Aug
                          • Re: WL#2649Øystein Grøvlen21 Aug
    • Re: bzr commit into mysql-6.0 branch (bar:2701) WL#2649Alexander Barkov5 May
      • Re: bzr commit into mysql-6.0 branch (bar:2701) WL#2649Alexander Barkov5 May
        • Re: bzr commit into mysql-6.0 branch (bar:2701) WL#2649Øystein Grøvlen6 May
      • Re: bzr commit into mysql-6.0 branch (bar:2701) WL#2649Øystein Grøvlen6 May