10.9 Unicode Support
- 10.9.1 The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)
- 10.9.2 The utf8mb3 Character Set (3-Byte UTF-8 Unicode Encoding)
- 10.9.3 The utf8 Character Set (Alias for utf8mb3)
- 10.9.4 The ucs2 Character Set (UCS-2 Unicode Encoding)
- 10.9.5 The utf16 Character Set (UTF-16 Unicode Encoding)
- 10.9.6 The utf16le Character Set (UTF-16LE Unicode Encoding)
- 10.9.7 The utf32 Character Set (UTF-32 Unicode Encoding)
- 10.9.8 Converting Between 3-Byte and 4-Byte Unicode Character Sets
The Unicode Standard includes characters from the Basic Multilingual Plane (BMP) and supplementary characters that lie outside the BMP. This section describes support for Unicode in MySQL. For information about the Unicode Standard itself, visit the Unicode Consortium website.
BMP characters have these characteristics:
Their code point values are between 0 and 65535 (or U+0000 and U+FFFF).
They can be encoded in a variable-length encoding using 8, 16, or 24 bits (1 to 3 bytes).
They can be encoded in a fixed-length encoding using 16 bits (2 bytes).
They are sufficient for almost all characters in major languages.
Supplementary characters lie outside the BMP:
Their code point values are between U+10000 and U+10FFFF.
Unicode support for supplementary characters requires character sets that have a range outside BMP characters and therefore take more space than BMP characters (up to 4 bytes per character).
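The BMP/supplementary split above can be checked directly from a character's code point. The following is a minimal Python sketch, not MySQL code: the helper name `describe` is hypothetical, and Python's `utf-8` and `utf-16-be` codecs are used only to illustrate the byte counts mentioned in the two lists.

```python
BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane (65535)


def describe(ch: str) -> dict:
    """Report the plane and encoded sizes of a single character."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "plane": "BMP" if cp <= BMP_MAX else "supplementary",
        # Variable-length encoding: 1 to 3 bytes for BMP, 4 for supplementary.
        "utf8_bytes": len(ch.encode("utf-8")),
        # 16-bit encoding: 2 bytes for BMP, a 4-byte surrogate pair otherwise.
        "utf16_bytes": len(ch.encode("utf-16-be")),
    }


if __name__ == "__main__":
    for ch in ("A", "\u20ac", "\U0001F600"):
        print(describe(ch))
```

A BMP character such as U+20AC needs 3 bytes in UTF-8 but only 2 in a 16-bit encoding, while a supplementary character such as U+1F600 needs 4 bytes in both.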
The UTF-8 (Unicode Transformation Format with 8-bit units) method for encoding Unicode data is implemented according to RFC 3629, which describes encoding sequences that take from one to four bytes. The idea of UTF-8 is that various Unicode characters are encoded using byte sequences of different lengths:
Basic Latin letters, digits, and punctuation signs use one byte.
Most European and Middle East script letters fit into a 2-byte sequence: extended Latin letters (with tilde, macron, acute, grave and other accents), Cyrillic, Greek, Armenian, Hebrew, Arabic, Syriac, and others.
Korean, Chinese, and Japanese ideographs use 3-byte or 4-byte sequences.
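As a rough illustration of the sequence lengths listed above, the sketch below derives the UTF-8 length from the code point ranges defined in RFC 3629 and compares it with what Python's own encoder produces. The function name `utf8_length` and the sample characters are chosen for this example only.

```python
def utf8_length(cp: int) -> int:
    """Number of bytes RFC 3629 assigns to a code point."""
    if cp < 0x80:        # U+0000..U+007F: Basic Latin
        return 1
    if cp < 0x800:       # U+0080..U+07FF: most European and Middle East scripts
        return 2
    if cp < 0x10000:     # U+0800..U+FFFF: rest of the BMP
        return 3
    return 4             # U+10000..U+10FFFF: supplementary characters


SAMPLES = {
    "A": "Basic Latin",
    "\u00f1": "Latin letter with tilde",
    "\u044f": "Cyrillic",
    "\u03b1": "Greek",
    "\u0531": "Armenian",
    "\u05d0": "Hebrew",
    "\u0639": "Arabic",
    "\u4e2d": "CJK ideograph (BMP)",
    "\ud55c": "Hangul syllable",
    "\U00020000": "CJK ideograph (supplementary)",
}

if __name__ == "__main__":
    for ch, label in SAMPLES.items():
        computed = utf8_length(ord(ch))
        actual = len(ch.encode("utf-8"))
        print(f"U+{ord(ch):04X} {label}: {computed} byte(s), encoder agrees: {computed == actual}")
```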
MySQL supports these Unicode character sets:
utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.
utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.
utf8: An alias for utf8mb3.
ucs2: The UCS-2 encoding of the Unicode character set using two bytes per character.
utf16: The UTF-16 encoding for the Unicode character set using two or four bytes per character. Like ucs2 but with an extension for supplementary characters.
utf16le: The UTF-16LE encoding for the Unicode character set. Like utf16 but little-endian rather than big-endian.
utf32: The UTF-32 encoding for the Unicode character set using four bytes per character.
The utf8mb3 character set is deprecated and will be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at some point utf8 will become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.
Table 10.2, “Unicode Character Set General Characteristics”, summarizes the general characteristics of Unicode character sets supported by MySQL.
Table 10.2 Unicode Character Set General Characteristics
Character Set | Supported Characters | Required Storage Per Character |
---|---|---|
utf8mb3, utf8 | BMP only | 1, 2, or 3 bytes |
ucs2 | BMP only | 2 bytes |
utf8mb4 | BMP and supplementary | 1, 2, 3, or 4 bytes |
utf16 | BMP and supplementary | 2 or 4 bytes |
utf16le | BMP and supplementary | 2 or 4 bytes |
utf32 | BMP and supplementary | 4 bytes |
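The storage figures in Table 10.2 can be reproduced outside the server. The sketch below is an approximation under stated assumptions: the `CODECS` mapping from MySQL character set names to Python codec names is this example's own, and the BMP-only restriction of utf8mb3 and ucs2 is modeled with a simple flag rather than taken from MySQL's implementation.

```python
# Assumed mapping: MySQL character set -> (Python codec, supports supplementary?)
CODECS = {
    "utf8mb3": ("utf-8", False),
    "ucs2": ("utf-16-be", False),
    "utf8mb4": ("utf-8", True),
    "utf16": ("utf-16-be", True),
    "utf16le": ("utf-16-le", True),
    "utf32": ("utf-32-be", True),
}


def storage_bytes(charset: str, ch: str):
    """Bytes needed for one character, or None if the set cannot represent it."""
    codec, supports_supplementary = CODECS[charset]
    if ord(ch) > 0xFFFF and not supports_supplementary:
        return None  # BMP-only character set
    return len(ch.encode(codec))


if __name__ == "__main__":
    for charset in CODECS:
        sizes = [storage_bytes(charset, ch) for ch in ("A", "\u4e2d", "\U0001F600")]
        print(f"{charset:8} {sizes}")
```

The explicit `-be` and `-le` codec names are used because Python's plain `utf-16` and `utf-32` codecs prepend a byte order mark, which would inflate the counts.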
Characters outside the BMP compare as REPLACEMENT CHARACTER and convert to '?' when converted to a Unicode character set that supports only BMP characters (utf8mb3 or ucs2).
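To see which values would be affected by such a conversion, the documented outcome can be simulated before touching any data. This is a sketch of the result described above, not of MySQL's conversion routine; `to_bmp_only` and `has_supplementary` are hypothetical helper names.

```python
def has_supplementary(text: str) -> bool:
    """True if the text contains any character outside the BMP."""
    return any(ord(ch) > 0xFFFF for ch in text)


def to_bmp_only(text: str) -> str:
    """Replace every supplementary character with '?', keeping BMP characters."""
    return "".join(ch if ord(ch) <= 0xFFFF else "?" for ch in text)


if __name__ == "__main__":
    value = "caf\u00e9 \U0001F600"
    print(has_supplementary(value))
    print(to_bmp_only(value))
```

Running a check like `has_supplementary` over exported values is one way to estimate how much data would be lost before converting a column to a BMP-only character set.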
If you use character sets that support supplementary characters and thus are “wider” than the BMP-only utf8mb3 and ucs2 character sets, there are potential incompatibility issues for your applications; see Section 10.9.8, “Converting Between 3-Byte and 4-Byte Unicode Character Sets”. That section also describes how to convert tables from the (3-byte) utf8mb3 to the (4-byte) utf8mb4, and what constraints may apply in doing so.
A similar set of collations is available for most Unicode character sets. For example, each has a Danish collation, the names of which are utf8mb4_danish_ci, utf8mb3_danish_ci, utf8_danish_ci, ucs2_danish_ci, utf16_danish_ci, and utf32_danish_ci. The exception is utf16le, which has only two collations. For information about Unicode collations and their differentiating properties, including collation properties for supplementary characters, see Section 10.10.1, “Unicode Character Sets”.
The MySQL implementation of UCS-2, UTF-16, and UTF-32 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of values. Other database systems might use little-endian byte order or a BOM. In such cases, conversion of values will need to be performed when transferring data between those systems and MySQL. The implementation of UTF-16LE is little-endian.
MySQL uses no BOM for UTF-8 values.
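When data arrives from a system that uses little-endian order or a BOM, the conversion mentioned above might look like the following Python sketch. The function names and the `default` parameter (the byte order assumed when no BOM is present) are assumptions of this example; only the `codecs` BOM constants come from the Python standard library.

```python
import codecs


def utf16_to_big_endian(data: bytes, default: str = "utf-16-le") -> bytes:
    """Return BOM-less big-endian UTF-16 from UTF-16 bytes of either order."""
    if data.startswith(codecs.BOM_UTF16_BE):
        text = data[2:].decode("utf-16-be")
    elif data.startswith(codecs.BOM_UTF16_LE):
        text = data[2:].decode("utf-16-le")
    else:
        text = data.decode(default)  # no BOM: rely on the caller's assumption
    return text.encode("utf-16-be")


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    euro_le_with_bom = codecs.BOM_UTF16_LE + "\u20ac".encode("utf-16-le")
    print(utf16_to_big_endian(euro_le_with_bom).hex())
    print(strip_utf8_bom(codecs.BOM_UTF8 + b"abc"))
```

Without a BOM the byte order cannot be detected reliably, so the caller has to know how the source system writes its values.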
Client applications that communicate with the server using Unicode should set the client character set accordingly (for example, by issuing a SET NAMES 'utf8mb4' statement). Some character sets cannot be used as the client character set. Attempting to use them with SET NAMES or SET CHARACTER SET produces an error. See Impermissible Client Character Sets.
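A client could guard against this error before sending the statement. The sketch below only builds the statement text and does not connect to a server. The helper name is hypothetical, and the set of rejected names (ucs2, utf16, utf16le, utf32) is an assumption based on the Impermissible Client Character Sets section of the manual, so it should be checked against the server version in use.

```python
# Assumed list; verify against the manual for your MySQL version.
IMPERMISSIBLE_CLIENT_CHARSETS = {"ucs2", "utf16", "utf16le", "utf32"}


def set_names_statement(charset: str) -> str:
    """Build a SET NAMES statement, rejecting unusable client character sets."""
    name = charset.lower()
    if not name.isalnum():
        # Character set names are plain identifiers; refuse anything else
        # rather than interpolating it into SQL.
        raise ValueError(f"invalid character set name: {charset!r}")
    if name in IMPERMISSIBLE_CLIENT_CHARSETS:
        raise ValueError(f"{name} cannot be used as the client character set")
    return f"SET NAMES '{name}'"


if __name__ == "__main__":
    print(set_names_statement("utf8mb4"))
```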
The following sections provide additional detail on the Unicode character sets in MySQL.
Document created on 26/06/2006, last modified on 26/10/2018
Source of the printed document: https://www.gaudry.be/nl/mysql-rf-charset-unicode.html