List: Commits
From: Georgi Kodinov
Date: May 28 2008 2:19pm
Subject: commit into mysql-5.1 branch (kgeorge:2656) Bug#36676
#At file:///home/kgeorge/mysql/bzr/merge-5.1-bugteam/

 2656 Georgi Kodinov	2008-05-28 [merge]
      merged 5.0-bugteam to 5.1-bugteam
modified:
  sql/share/charsets/README
  strings/CHARSET_INFO.txt

per-file comments:
  sql/share/charsets/README
    merged bug 36676 to 5.1-bugteam
  strings/CHARSET_INFO.txt
    merged bug 36676 to 5.1-bugteam
=== modified file 'sql/share/charsets/README'
--- a/sql/share/charsets/README	2000-08-22 20:08:34 +0000
+++ b/sql/share/charsets/README	2008-05-28 10:03:47 +0000
@@ -1,28 +1,31 @@
-This directory holds configuration files which allow MySQL to work with
+This directory holds configuration files that enable MySQL to work with
 different character sets.  It contains:
 
-*.conf
-    Each conf file contains four tables which describe character types,
+charset_name.xml
+    Each charset_name.xml file contains information for a simple character
+    set.  The information in the file describes character types,
     lower- and upper-case equivalencies and sorting orders for the
     character values in the set.
 
-Index
-    The Index file lists all of the available charset configurations.
-
-    Each charset is paired with a number.  The number is stored
-    IN THE DATABASE TABLE FILES and must not be changed.  Always
-    add new character sets to the end of the list, so that the
-    numbers of the other character sets will not be changed.
+Index.xml
+    The Index.xml file lists all of the available charset configurations,
+    including collations.
+
+    Each collation must have a unique number.  The number is stored
+    IN THE DATABASE TABLE FILES and must not be changed.
+
+    The max-id attribute of the <charsets> element must be set to
+    the largest collation number.
 
 Compiled in or configuration file?
     When should a character set be compiled in to MySQL's string library
-    (libmystrings), and when should it be placed in a configuration
-    file?
+    (libmystrings), and when should it be placed in a charset_name.xml
+    configuration file?
 
     If the character set requires the strcoll functions or is a
     multi-byte character set, it MUST be compiled in to the string
     library.  If it does not require these functions, it should be
-    placed in a configuration file.
+    placed in a charset_name.xml configuration file.
 
     If the character set uses any one of the strcoll functions, it
     must define all of them.  Likewise, if the set uses one of the
@@ -30,11 +33,7 @@
     more information on how to add a complex character set to MySQL.
 
 Syntax of configuration files
-    The syntax is very simple.  Comments start with a '#' character and
-    proceed to the end of the line.  Words are separated by arbitrary
-    amounts of whitespace.
-
-    For the character set configuration files, every word must be a
-    number in hexadecimal format.  The ctype array takes up the first
-    257 words; the to_lower, to_upper and sort_order arrays take up 256
-    words each after that.
+    The syntax is very simple.  Words in <map> array elements are
+    separated by arbitrary amounts of whitespace. Each word must be a
+    number in hexadecimal format.  The ctype array has 257 words; the
+    other arrays (lower, upper, etc.) take up 256 words each after that.
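
Reviewer note (not part of the patch): the word syntax described above
can be sketched in C as follows. This is a minimal illustration, not
MySQL's actual loader; parse_hex_words() is a hypothetical helper, and it
assumes the text between <map> and </map> has already been extracted
into a 0-terminated string.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: parse whitespace-separated hexadecimal words
   (the body of one <map> element) into out[].
   Returns the number of words stored, or (size_t) -1 if a word is not
   valid hexadecimal, does not fit in one byte, or exceeds max words. */
size_t parse_hex_words(const char *text, unsigned char *out, size_t max)
{
  size_t n = 0;
  assert(text != NULL && out != NULL);
  for (;;)
  {
    char *end;
    unsigned long v;
    /* Skip an arbitrary amount of whitespace between words. */
    while (isspace((unsigned char) *text))
      text++;
    if (*text == '\0')
      return n;
    v = strtoul(text, &end, 16);
    if (end == text || v > 0xFF || n == max)
      return (size_t) -1;
    /* A word must be followed by whitespace or the end of the text. */
    if (*end != '\0' && !isspace((unsigned char) *end))
      return (size_t) -1;
    out[n++] = (unsigned char) v;
    text = end;
  }
}
```

Under this sketch, a caller would expect 257 words for the ctype array
and 256 words for each of the other arrays, and could reject a file
whose counts differ.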

=== modified file 'strings/CHARSET_INFO.txt'
--- a/strings/CHARSET_INFO.txt	2006-10-12 10:42:05 +0000
+++ b/strings/CHARSET_INFO.txt	2008-05-28 14:18:24 +0000
@@ -3,9 +3,8 @@
 ============
 A structure containing data for charset+collation pair implementation. 
 
-Virtual functions which use this data are collected
-into separate structures MY_CHARSET_HANDLER and
-MY_COLLATION_HANDLER.
+Virtual functions that use this data are collected into separate
+structures, MY_CHARSET_HANDLER and MY_COLLATION_HANDLER.
 
 
 typedef struct charset_info_st
@@ -56,7 +55,7 @@
 parts of the code where we need to find the default collation
 using its non-default counterpart for the given character set.
 
-binary_numner - ID of a charset+collation pair, which consists
+binary_number - ID of a charset+collation pair, which consists
 of the same character set and the binary collation of this
 character set. Not really used now. 
 
@@ -65,15 +64,15 @@
 
   csname  - name of the character set for this charset+collation pair.
   name    - name of the collation for this charset+collation pair.
-  comment - a text comment, dysplayed in "Description" column of
+  comment - a text comment, displayed in "Description" column of
             SHOW CHARACTER SET output.
 
 Conversion tables
 -----------------
   
   ctype      - pointer to array[257] of "type of characters"
-               bit mask for each chatacter, e.g. if a 
-               character is a digit or a letter or a separator, etc.
+               bit mask for each character, e.g., whether a 
+               character is a digit, letter, separator, etc.
 
                Monty 2004-10-21:
                  If you look at the macros, we use ctype[(char)+1].
@@ -87,17 +86,64 @@
   to_upper   - pointer to array[256] used in UCASE()
   sort_order - pointer to array[256] used for strings comparison
 
+In all Asian charsets these arrays are set up as follows:
+
+- All bytes in the range 0x80..0xFF are marked as letters in the
+  ctype array.
+
+- The to_lower and to_upper arrays map only ASCII letters.
+  UPPER() and LOWER() don't really work for multi-byte characters.
+  Most of the characters in Asian character sets are ideograms
+  anyway and they don't have case mapping. However, there are
+  still some characters from European alphabets.
+  For example:
+  _ujis 0x8FAAF2 - LATIN CAPITAL LETTER Y WITH ACUTE
+  _ujis 0x8FABF2 - LATIN SMALL LETTER Y WITH ACUTE
+
+  But they don't map to each other with UPPER and LOWER operations.
+
+- The sort_order array is filled case insensitively for the
+  ASCII range 0x00..0x7F, and in "binary" fashion for the multi-byte
+  range 0x80..0xFF for these collations:
+
+  cp932_japanese_ci,
+  euckr_korean_ci,
+  eucjpms_japanese_ci,
+  gb2312_chinese_ci,
+  sjis_japanese_ci,
+  ujis_japanese_ci.
+
+  So multi-byte characters are sorted just according to their codes.
+
+
+- Two collations are still case insensitive for the ASCII characters,
+  but have special sorting order for multi-byte characters
+  (something more complex than just according to codes):
+
+  big5_chinese_ci
+  gbk_chinese_ci
+
+  So handlers for these collations use only the 0x00..0x7F part
+  of their sort_order arrays, and apply the special functions
+  for multi-byte characters.
+
+In Unicode character sets we have full support for UPPER/LOWER mapping,
+sorting order, and character type detection.
+"utf8_general_ci" still has the "old-fashioned" arrays
+like to_upper, to_lower, sort_order and ctype, but they are
+not really used (maybe only in some rare legacy functions).
+
 
 
 Unicode conversion data
 -----------------------
-For 8bit character sets:
+For 8-bit character sets:
 
 tab_to_uni  : array[256] of charset->Unicode translation
 tab_from_uni: a structure for Unicode->charset translation
 
-Non-8 bit charsets have their own structures per charset
-hidden in correspondent ctype-xxx.c file and don't use
+Non-8-bit charsets have their own structures per charset
+hidden in the corresponding ctype-xxx.c file and don't use
 tab_to_uni and tab_from_uni tables.
 
 
@@ -106,9 +152,9 @@
 state_map[]
 ident_map[]
 
- These maps are to quickly identify if a character is
-an identificator part, a digit, a special character, 
-or a part of other SQL language lexical item.
+These maps are used to quickly identify whether a character is an
+identifier part, a digit, a special character, or a part of another
+SQL language lexical item.
 
 Probably can be combined with ctype array in the future.
 But for some reasons these two arrays are used in the parser,
@@ -116,32 +162,32 @@
 code, like fulltext, etc.
 
 
-Misc fields
------------
+Miscellaneous fields
+--------------------
 
-  strxfrm_multiply - how many times a sort key (i.e. a string
-                     which can be passed into memcmp() for comparison)
+  strxfrm_multiply - how many times a sort key (that is, a string
+                     that can be passed into memcmp() for comparison)
                      can be longer than the original string. 
                      Usually it is 1. For some complex
-                     collations it can be bigger. For example
+                     collations it can be bigger. For example,
                      in latin1_german2_ci, a sort key is up to
-                     twice longer than the original string.
+                     two times longer than the original string.
                      e.g. Letter 'A' with two dots above is
                      substituted with 'AE'. 
-  mbminlen         - mininum multibyte sequence length.
-                     Now always 1 except ucs2. For ucs2
+  mbminlen         - minimum multi-byte sequence length.
+                     Now always 1 except for ucs2. For ucs2,
                      it is 2.
-  mbmaxlen         - maximum multibyte sequence length.
-                     1 for 8bit charsets. Can be also 2 or 3.
+  mbmaxlen         - maximum multi-byte sequence length.
+                     1 for 8-bit charsets. Can also be 2 or 3.
 
   max_sort_char    - for LIKE range
-                     in case of 8bit character sets - native code
+                     in case of 8-bit character sets - native code
 		     of maximum character (max_str pad byte);
                      in case of UTF8 and UCS2 - Unicode code of the maximum
 		     possible character (usually U+FFFF). This code is
-		     converted to multibyte representation (usually 0xEFBFBF)
+		     converted to multi-byte representation (usually 0xEFBFBF)
 		     and then used as a pad sequence for max_str.
-		     in case of other multibyte character sets -
+		     in case of other multi-byte character sets -
 		     max_str pad byte (usually 0xFF).
 
 MY_CHARSET_HANDLER
@@ -151,10 +197,10 @@
 related routines. Defined in m_ctype.h. Have the 
 following set of functions:
 
-Multibyte routines
+Multi-byte routines
 ------------------
-ismbchar()  - detects if the given string is a multibyte sequence
-mbcharlen() - returns length of multibyte sequence starting with
+ismbchar()  - detects whether the given string is a multi-byte sequence
+mbcharlen() - returns length of multi-byte sequence starting with
               the given character
 numchars()  - returns number of characters in the given string, e.g.
               in SQL function CHAR_LENGTH().
@@ -163,29 +209,29 @@
               INSERT()
 
 well_formed_length()
-            - finds the length of correctly formed multybyte beginning.
+            - finds the length of correctly formed multi-byte beginning.
               Used in INSERTs to cut a beginning of the given string
               which is
               a) "well formed" according to the given character set.
-              b)  can fit into the given data type
+              b) can fit into the given data type
               Terminates the string in the good position, taking in account
-              multibyte character boundaries.
+              multi-byte character boundaries.
 
-lengthsp()  - returns the length of the given string without traling spaces.
+lengthsp()  - returns the length of the given string without trailing spaces.
 
 
 Unicode conversion routines
 ---------------------------
-mb_wc       - converts the left multibyte sequence into it Unicode code.
-mc_mb       - converts the given Unicode code into multibyte sequence.
+mb_wc       - converts the left multi-byte sequence into its Unicode code.
+wc_mb       - converts the given Unicode code into a multi-byte sequence.
 
 
 Case and sort conversion
 ------------------------
-caseup_str  - converts the given 0-terminated string into the upper case
-casedn_str  - converts the given 0-terminated string into the lower case
-caseup      - converts the given string into the lower case using length
-casedn      - converts the given string into the lower case using length
+caseup_str  - converts the given 0-terminated string to uppercase
+casedn_str  - converts the given 0-terminated string to lowercase
+caseup      - converts the given string to uppercase using length
+casedn      - converts the given string to lowercase using length
 
 Number-to-string conversion routines
 ------------------------------------
@@ -193,7 +239,7 @@
 long10_to_str()
 longlong10_to_str()
 
-The names are pretty self-descripting.
+The names are pretty self-describing.
 
 String padding routines
 -----------------------
@@ -201,7 +247,7 @@
              with the given length. Used to pad the string, usually
              with space character, according to the given charset.
 
-String-to-numner conversion routines
+String-to-number conversion routines
 ------------------------------------
 strntol()
 strntoul()
@@ -209,10 +255,10 @@
 strntoull()
 strntod()
 
-These functions are almost for the same thing with their
-STDLIB counterparts, but also:
+These functions are almost the same as their STDLIB counterparts,
+but also:
   - accept length instead of 0-terminator
-  - and are character set dependant
+  - are character set dependent
 
 Simple scanner routines
 -----------------------
@@ -230,8 +276,8 @@
 like_range()  - creates a LIKE range, for optimizer
 wildcmp()     - wildcard comparison, for LIKE
 strcasecmp()  - 0-terminated string comparison
-instr()       - finds the first substring appearence in the string
-hash_sort()   - calculates hash value taking in account
+instr()       - finds the first substring appearance in the string
+hash_sort()   - calculates hash value taking into account
                 the collation rules, e.g. case-insensitivity, 
                 accent sensitivity, etc.
 
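
Reviewer note (not part of the patch): the conversion tables described
in CHARSET_INFO.txt can be illustrated with a small C sketch. Everything
named sketch_* or SK_* below is hypothetical, including the flag values;
it only mirrors the layout in the text (ctype indexed as ctype[c + 1],
sort_order case-insensitive for ASCII and by code for 0x80..0xFF). It
assumes an ASCII-compatible execution character set and ignores any
padding rules a real collation may apply.

```c
#include <stddef.h>

/* Illustrative bit flags; names and values are assumptions. */
enum { SK_UPPER = 1, SK_LOWER = 2, SK_DIGIT = 4 };

typedef struct
{
  unsigned char ctype[257];      /* looked up as ctype[c + 1] */
  unsigned char to_lower[256];
  unsigned char to_upper[256];
  unsigned char sort_order[256];
} sketch_charset;

/* Fill the tables: only ASCII letters have case mappings, and bytes
   0x80..0xFF sort by their own code ("binary" fashion). */
void sketch_init(sketch_charset *cs)
{
  int c;
  cs->ctype[0] = 0;              /* extra slot before character 0 */
  for (c = 0; c < 256; c++)
  {
    int upper = (c >= 'A' && c <= 'Z');
    int lower = (c >= 'a' && c <= 'z');
    unsigned char flags = 0;
    if (upper) flags |= SK_UPPER;
    if (lower) flags |= SK_LOWER;
    if (c >= '0' && c <= '9') flags |= SK_DIGIT;
    cs->ctype[c + 1] = flags;
    cs->to_lower[c] = (unsigned char) (upper ? c - 'A' + 'a' : c);
    cs->to_upper[c] = (unsigned char) (lower ? c - 'a' + 'A' : c);
    cs->sort_order[c] = cs->to_upper[c];
  }
}

int sketch_is_digit(const sketch_charset *cs, unsigned char c)
{
  return (cs->ctype[c + 1] & SK_DIGIT) != 0;
}

/* Compare two byte strings by sort_order weight, then by length. */
int sketch_compare(const sketch_charset *cs,
                   const unsigned char *a, size_t alen,
                   const unsigned char *b, size_t blen)
{
  size_t n = alen < blen ? alen : blen, i;
  for (i = 0; i < n; i++)
  {
    int d = cs->sort_order[a[i]] - cs->sort_order[b[i]];
    if (d != 0)
      return d;
  }
  return (alen > blen) - (alen < blen);
}
```

With these tables, "abc" and "ABC" compare as equal, while two bytes in
the 0x80..0xFF range compare by their codes.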

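Reviewer note (not part of the patch): the strxfrm_multiply description
can be made concrete with a toy sort-key routine. This is a sketch under
stated assumptions, not the latin1_german2_ci implementation:
sketch_strxfrm() and SKETCH_STRXFRM_MULTIPLY are hypothetical names, the
input is assumed to be latin1 (where 0xC4 and 0xE4 are 'A' and 'a' with
two dots above), and only that one expansion from the text is modelled.

```c
#include <stddef.h>

/* Worst-case growth of a sort key relative to the source string. */
#define SKETCH_STRXFRM_MULTIPLY 2

/* Write a memcmp()-comparable key for src into dst.
   ASCII letters are folded to uppercase; 'A'/'a' with two dots above
   becomes the two bytes "AE".  Stops early if dst is too small.
   Returns the number of key bytes written. */
size_t sketch_strxfrm(unsigned char *dst, size_t dstlen,
                      const unsigned char *src, size_t srclen)
{
  size_t i, n = 0;
  for (i = 0; i < srclen; i++)
  {
    unsigned char c = src[i];
    if (c == 0xC4 || c == 0xE4)
    {
      if (n + 2 > dstlen)
        break;
      dst[n++] = 'A';
      dst[n++] = 'E';
    }
    else
    {
      if (n + 1 > dstlen)
        break;
      dst[n++] = (c >= 'a' && c <= 'z') ? (unsigned char) (c - 'a' + 'A') : c;
    }
  }
  return n;
}
```

A caller that allocates srclen * SKETCH_STRXFRM_MULTIPLY bytes for dst
never hits the early exit, which is the guarantee the multiplier is
meant to express.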