Rechercher dans le manuel MySQL
10.10.1 Unicode Character Sets
MySQL supports multiple Unicode character sets:
utf8mb4
: A UTF-8 encoding of the Unicode character set using one to four bytes per character.utf8mb3
: A UTF-8 encoding of the Unicode character set using one to three bytes per character.utf8
: An alias forutf8mb3
.ucs2
: The UCS-2 encoding of the Unicode character set using two bytes per character.utf16
: The UTF-16 encoding for the Unicode character set using two or four bytes per character. Likeucs2
but with an extension for supplementary characters.utf16le
: The UTF-16LE encoding for the Unicode character set. Likeutf16
but little-endian rather than big-endian.utf32
: The UTF-32 encoding for the Unicode character set using four bytes per character.
The utf8mb3
character set is deprecated and
will be removed in a future MySQL release. Please use
utf8mb4
instead. Although
utf8
is currently an alias for
utf8mb3
, at that point
utf8
will become a reference to
utf8mb4
. To avoid ambiguity about the
meaning of utf8
, consider specifying
utf8mb4
explicitly for character set
references instead of utf8
.
utf8
and ucs2
support
Basic Multilingual Plane (BMP) characters.
utf8mb4
, utf16
,
utf16le
, and utf32
support
BMP and supplementary characters.
This section describes the collations available for Unicode character sets and their differentiating properties. For general information about Unicode, see Section 10.9, “Unicode Support”.
Most Unicode character sets have a general collation (indicated
by _general
in the name or by the absence of
a language specifier), a binary collation (indicated by
_bin
in the name), and several
language-specific collations (indicated by language specifiers).
For example, for utf8
,
utf8_general_ci
and
utf8_bin
are its general and binary
collations, and utf8_danish_ci
is one of its
language-specific collations.
Collation support for utf16le
is limited. The
only collations available are
utf16le_general_ci
and
utf16le_bin
. These are similar to
utf16_general_ci
and
utf16_bin
.
A locale code or language name shown in the following table indicates a language-specific collation. Unicode character sets may include collations for one or more of these languages.
Table 10.3 Unicode Collation Language Specifiers
Language | Language Specifier |
---|---|
Chinese | zh |
Classical Latin | la or roman |
Croatian | hr or croatian |
Czech | cs or czech |
Danish | da or danish |
Esperanto | eo or esperanto |
Estonian | et or estonian |
German phone book order | de_pb or german2 |
Hungarian | hu or hungarian |
Icelandic | is or icelandic |
Japanese | ja |
Latvian | lv or latvian |
Lithuanian | lt or lithuanian |
Persian | persian |
Polish | pl or polish |
Romanian | ro or romanian |
Russian | ru |
Sinhala | sinhala |
Slovak | sk or slovak |
Slovenian | sl or slovenian |
Modern Spanish | es or spanish |
Traditional Spanish | es_trad or spanish2 |
Swedish | sv or swedish |
Turkish | tr or turkish |
Vietnamese | vi or vietnamese |
Croatian collations are tailored for these Croatian letters:
Č
, Ć
,
Dž
, Đ
,
Lj
, Nj
,
Š
, Ž
.
Danish collations may also be used for Norwegian.
For Japanese, the utf8mb4
character set
includes utf8mb4_ja_0900_as_cs
and
utf8mb4_ja_0900_as_cs_ks
collations. Both
collations are accent sensitive and case-sensitive.
utf8mb4_ja_0900_as_cs_ks
is also kana
sensitive and distinguishes Katakana characters from Hiragana
characters, whereas utf8mb4_ja_0900_as_cs
treats Katakana and Hiragana characters as equal for sorting.
Applications that require a Japanese collation but not kana
sensitivity may use utf8mb4_ja_0900_as_cs
for
better sort performance.
utf8mb4_ja_0900_as_cs
uses three weight
levels for sorting; utf8mb4_ja_0900_as_cs_ks
uses four.
For Classical Latin collations that are accent insensitive,
I
and J
compare as equal,
and U
and V
compare as
equal. I
and J
, and
U
and V
compare as equal
on the base letter level. In other words, J
is regarded as an accented I
, and
U
is regarded as an accented
V
.
Spanish collations are available for modern and traditional
Spanish. For both, ñ
(n-tilde) is a separate
letter between n
and o
. In
addition, for traditional Spanish, ch
is a
separate letter between c
and
d
, and ll
is a separate
letter between l
and m
.
Traditional Spanish collations may also be used for Asturian and Galician.
Swedish collations include Swedish rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:
Ü = Y < Ö
For questions about particular language orderings, unicode.org provides Common Locale Data Repository (CLDR) collation charts at http://www.unicode.org/cldr/charts/30/collation/index.html.
The
collations preserve the pre-5.1.24 ordering of the original
xxx
_general_mysql500_ci
collations and permit upgrades for tables created before MySQL
5.1.24 (Bug #27877).
xxx
_general_ci
MySQL implements the
collations according to the Unicode Collation Algorithm (UCA)
described at
http://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
The xxx
_unicode_ci
collations have only partial support for the Unicode Collation
Algorithm. Some characters are not supported, and combining
marks are not fully supported. This affects primarily
Vietnamese, Yoruba, and some smaller languages such as Navajo. A
combined character is considered different from the same
character written with a single unicode character in string
comparisons, and the two characters are considered to have a
different length (for example, as returned by the
xxx
_unicode_ciCHAR_LENGTH()
function or in
result set metadata).
Unicode collations based on UCA versions later than 4.0.0
include the version in the collation name. Thus,
utf8mb4_unicode_520_ci
is based on UCA 5.2.0
weight keys
(http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt),
whereas utf8mb4_0900_ai_ci
is based on UCA
9.0.0 weight keys
(http://www.unicode.org/Public/UCA/9.0.0/allkeys.txt).
Collations based on UCA 9.0.0 and higher are faster than collations based on UCA versions below 9.0.0. They also have a pad attribute of NO PAD, in contrast to PAD SPACE as used in collations based on UCA versions below 9.0.0. NO PAD collations treat spaces at the end of strings like any other character.
To determine the pad attribute for a collation, use the
INFORMATION_SCHEMA
COLLATIONS
table, which has a
PAD_ATTRIBUTE
column.
Comparisons of VARCHAR
columns
that have a NO PAD collation differ from other collations with
respect to trailing spaces. For example, 'a'
and 'a '
compare as different strings,
not the same string.
MySQL implements language-specific Unicode collations if the ordering based only on UCA does not work well for a language. Language-specific collations are UCA-based, with additional language tailoring rules.
For example, the nonlanguage-specific
utf8mb4_0900_ai_ci
and language-specific
utf8mb4_
Unicode collations each have these characteristics:
LOCALE
_0900_ai_ci
The collation is based on Unicode Collation Algorithm (UCA) 9.0.0 and Common Locale Data Repository (CLDR) v30, is accent insensitive, and case insensitive. These characteristics are indicated by
_0900
,_ai
, and_ci
in the collation name. Exception:utf8mb4_la_0900_ai_ci
is not based on CLDR because Classical Latin is not defined in CLDR.The collation works for all characters in the range [U+0, U+10FFFF].
If the collation is not language specific, it sorts all characters, including supplementary characters, in default order (described following). If the collation is language specific, it sorts characters of the language correctly according to language-specific rules, and characters not in the language in default order.
By default, the collation sorts characters having a code point listed in the DUCET table (Default Unicode Collation Element Table) according to the weight value assigned in the table. The collation sorts characters not having a code point listed in the DUCET table using their implicit weight value, which is constructed according to the UCA.
For non-language-specific collations, characters in contraction sequences are treated as separate characters. For language-specific collations, contractions might change character sorting order.
LOWER()
and
UPPER()
perform case folding
according to the collation of their argument. A character that
has uppercase and lowercase versions only in a Unicode version
more recent than 4.0.0 is converted by these functions only if
the argument has a collation that uses a recent enough UCA
version.
For any Unicode character set, operations performed using the
collation are faster than those for the
xxx
_general_ci
collation. For example, comparisons for the
xxx
_unicode_ciutf8_general_ci
collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci
. The reason for this is that
utf8_unicode_ci
supports mappings such as
expansions; that is, when one character compares as equal to
combinations of other characters. For example, in German and
some other languages ß
is equal to
ss
. utf8_unicode_ci
also
supports contractions and ignorable characters.
utf8_general_ci
is a legacy collation that
does not support expansions, contractions, or ignorable
characters. It can make only one-to-one comparisons between
characters.
To further illustrate, the following equalities hold in both
utf8_general_ci
and
utf8_unicode_ci
(for the effect of this in
comparisons or searches, see
Section 10.8.6, “Examples of the Effect of Collation”):
Ä = A
Ö = O
Ü = U
A difference between the collations is that this is true for
utf8_general_ci
:
ß = s
Whereas this is true for utf8_unicode_ci
,
which supports the German DIN-1 ordering (also known as
dictionary order):
ß = ss
MySQL implements utf8
language-specific
collations if the ordering with
utf8_unicode_ci
does not work well for a
language. For example, utf8_unicode_ci
works
fine for German dictionary order and French, so there is no need
to create special utf8
collations.
utf8_general_ci
also is satisfactory for both
German and French, except that ß
is equal to
s
, and not to ss
. If this
is acceptable for your application, you should use
utf8_general_ci
because it is faster. If this
is not acceptable (for example, if you require German dictionary
order), use utf8_unicode_ci
because it is
more accurate.
If you require German DIN-2 (phone book) ordering, use the
utf8_german2_ci
collation, which compares the
following sets of characters equal:
Ä = Æ = AE
Ö = Œ = OE
Ü = UE
ß = ss
utf8_german2_ci
is similar to
latin1_german2_ci
, but the latter does not
compare Æ
equal to AE
or
Œ
equal to OE
. There is
no utf8_german_ci
corresponding to
latin1_german_ci
for German dictionary order
because utf8_general_ci
suffices.
For all Unicode collations except the binary
(_bin
) collations, MySQL performs a table
lookup to find a character's collating weight. This weight can
be displayed using the
WEIGHT_STRING()
function. (See
Section 12.5, “String Functions”.) If a character is not in
the table (for example, because it is a “new”
character), collating weight determination becomes more complex:
For BMP characters in general collations (
), weight = code point.xxx
_general_ciFor BMP characters in UCA collations (for example,
and language-specific collations), the following algorithm applies:xxx
_unicode_ciif (code >= 0x3400 && code <= 0x4DB5) base= 0xFB80; /* CJK Ideograph Extension */ else if (code >= 0x4E00 && code <= 0x9FA5) base= 0xFB40; /* CJK Ideograph */ else base= 0xFBC0; /* All other characters */ aaaa= base + (code >> 15); bbbb= (code & 0x7FFF) | 0x8000;
The result is a sequence of two collating elements,
aaaa
followed bybbbb
. For example:- +----------------------------------------------------------+
- +----------------------------------------------------------+
- | FBC084CF |
- +----------------------------------------------------------+
Thus,
U+04cf CYRILLIC SMALL LETTER PALOCHKA
is, with all UCA 4.0.0 collations, greater thanU+04c0 CYRILLIC LETTER PALOCHKA
. With UCA 5.2.0 collations, all palochkas sort together.For supplementary characters in general collations, the weight is the weight for
0xfffd REPLACEMENT CHARACTER
. For supplementary characters in UCA 4.0.0 collations, their collating weight is0xfffd
. That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.An example with Deseret characters and
COUNT(DISTINCT)
:The result is 2 because in the MySQL
collations, the replacement character has a weight ofxxx
_unicode_ci0x0dc6
, whereas Deseret Bee and Deseret Tee both have a weight of0xfffd
. (Were theutf32_general_ci
collation used instead, the result is 1 because all three characters have a weight of0xfffd
in that collation.)An example with cuneiform characters and
WEIGHT_STRING()
:- /*
- The four characters in the INSERT string are
- 00000041 # LATIN CAPITAL LETTER A
- 0001218F # CUNEIFORM SIGN KAB
- 000121A7 # CUNEIFORM SIGN KISH
- 00000042 # LATIN CAPITAL LETTER B
- */
The result is:
0E33 FFFD FFFD 0E4A
0E33
and0E4A
are primary weights as in UCA 4.0.0.FFFD
is the weight for KAB and also for KISH.The rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it is very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user does not care what order they are in, anyway. If you really want rows sorted by the MySQL rule and secondarily by code point value, it is easy:
For supplementary characters based on UCA versions higher than 4.0.0 (for example,
), supplementary characters do not necessarily all have the same collation weight. Some have explicit weights from the UCAxxx
_unicode_520_ciallkeys.txt
file. Others have weights calculated from this algorithm:aaaa= base + (code >> 15); bbbb= (code & 0x7FFF) | 0x8000;
There is a difference between “ordering by the character's
code value” and “ordering by the character's binary
representation,” a difference that appears only with
utf16_bin
, because of surrogates.
Suppose that utf16_bin
(the binary collation
for utf16
) was a binary comparison
“byte by byte” rather than “character by
character.” If that were so, the order of characters in
utf16_bin
would differ from the order in
utf8_bin
. For example, the following chart
shows two rare characters. The first character is in the range
E000
-FFFF
, so it is
greater than a surrogate but less than a supplementary. The
second character is a supplementary.
Code point Character utf8 utf16
---------- --------- ---- -----
0FF9D HALFWIDTH KATAKANA LETTER N EF BE 9D FF 9D
10384 UGARITIC LETTER DELTA F0 90 8E 84 D8 00 DF 84
The two characters in the chart are in order by code point value
because 0xff9d
<
0x10384
. And they are in order by
utf8
value because 0xef
< 0xf0
. But they are not in order by
utf16
value, if we use byte-by-byte
comparison, because 0xff
>
0xd8
.
So MySQL's utf16_bin
collation is not
“byte by byte.” It is “by code point.”
When MySQL sees a supplementary-character encoding in
utf16
, it converts to the character's
code-point value, and then compares. Therefore,
utf8_bin
and utf16_bin
are
the same ordering. This is consistent with the SQL:2008 standard
requirement for a UCS_BASIC collation: “UCS_BASIC is a
collation in which the ordering is determined entirely by the
Unicode scalar values of the characters in the strings being
sorted. It is applicable to the UCS character repertoire. Since
every character repertoire is a subset of the UCS repertoire,
the UCS_BASIC collation is potentially applicable to every
character set. NOTE 11: The Unicode scalar value of a character
is its code point treated as an unsigned integer.”
If the character set is ucs2
, comparison is
byte-by-byte, but ucs2
strings should not
contain surrogates, anyway.
Traduction non disponible
Le manuel MySQL n'est pas encore traduit en français sur l'infobrol. Seule la version anglaise est disponible pour l'instant.
Document créé le 26/06/2006, dernière modification le 26/10/2018
Source du document imprimé : https://www.gaudry.be/mysql-rf-charset-unicode-sets.html
L'infobrol est un site personnel dont le contenu n'engage que moi. Le texte est mis à disposition sous licence CreativeCommons(BY-NC-SA). Plus d'info sur les conditions d'utilisation et sur l'auteur.
Références
Ces références et liens indiquent des documents consultés lors de la rédaction de cette page, ou qui peuvent apporter un complément d'information, mais les auteurs de ces sources ne peuvent être tenus responsables du contenu de cette page.
L'auteur de ce site est seul responsable de la manière dont sont présentés ici les différents concepts, et des libertés qui sont prises avec les ouvrages de référence. N'oubliez pas que vous devez croiser les informations de sources multiples afin de diminuer les risques d'erreurs.