Rechercher dans le manuel MySQL

12.9.9 MeCab Full-Text Parser Plugin

The built-in MySQL full-text parser uses the white space between words as a delimiter to determine where words begin and end, which is a limitation when working with ideographic languages that do not use word delimiters. To address this limitation for Japanese, MySQL provides a MeCab full-text parser plugin. The MeCab full-text parser plugin is supported for use with InnoDB and MyISAM.

Note

MySQL also provides an ngram full-text parser plugin that supports Japanese. For more information, see Section 12.9.8, “ngram Full-Text Parser”.

The MeCab full-text parser plugin is a full-text parser plugin for Japanese that tokenizes a sequence of text into meaningful words. For example, MeCab tokenizes データベース管理 (Database Management) into データベース (Database) and 管理 (Management). By comparison, the ngram full-text parser tokenizes text into a contiguous sequence of n characters, where n represents a number between 1 and 10.

In addition to tokenizing text into meaningful words, MeCab indexes are typically smaller than ngram indexes, and MeCab full-text searches are generally faster. One drawback is that it may take longer for the MeCab full-text parser to tokenize documents, compared to the ngram full-text parser.

The full-text search syntax described in Section 12.9, “Full-Text Search Functions” applies to the MeCab parser plugin. Differences in parsing behavior are described in this section. Full-text related configuration options are also applicable.

For additional information about the MeCab parser, refer to the MeCab: Yet Another Part-of-Speech and Morphological Analyzer project on Github.

Installing the MeCab Parser Plugin

The MeCab parser plugin requires mecab and mecab-ipadic.

On supported Fedora, Debian and Ubuntu platforms (except Ubuntu 12.04 where the system mecab version is too old), MySQL dynamically links to the system mecab installation if it is installed to the default location. On other supported Unix-like platforms, libmecab.so is statically linked in libpluginmecab.so, which is located in the MySQL plugin directory. mecab-ipadic is included in MySQL binaries and is located in MYSQL_HOME\lib\mecab.

You can install mecab and mecab-ipadic using a native package management utility (on Fedora, Debian, and Ubuntu), or you can build mecab and mecab-ipadic from source. For information about installing mecab and mecab-ipadic using a native package management utility, see Installing MeCab From a Binary Distribution (Optional). If you want to build mecab and mecab-ipadic from source, see Building MeCab From Source (Optional).

On Windows, libmecab.dll is found in the MySQL bin directory. mecab-ipadic is located in MYSQL_HOME/lib/mecab.

To install and configure the MeCab parser plugin, perform the following steps:

  1. In the MySQL configuration file, set the mecab_rc_file configuration option to the location of the mecabrc configuration file, which is the configuration file for MeCab. If you are using the MeCab package distributed with MySQL, the mecabrc file is located in MYSQL_HOME/lib/mecab/etc/.

    [mysqld]
    loose-mecab-rc-file=MYSQL_HOME/lib/mecab/etc/mecabrc

    The loose prefix is an option modifier. The mecab_rc_file option is not recognized by MySQL until the MeCaB parser plugin is installed but it must be set before attempting to install the MeCaB parser plugin. The loose prefix allows you restart MySQL without encountering an error due to an unrecognized variable.

    If you use your own MeCab installation, or build MeCab from source, the location of the mecabrc configuration file may differ.

    For information about the MySQL configuration file and its location, see Section 4.2.2.2, “Using Option Files”.

  2. Also in the MySQL configuration file, set the minimum token size to 1 or 2, which are the values recommended for use with the MeCab parser. For InnoDB tables, minimum token size is defined by the innodb_ft_min_token_size configuration option, which has a default value of 3. For MyISAM tables, minimum token size is defined by ft_min_word_len, which has a default value of 4.

    [mysqld]
    innodb_ft_min_token_size=1
  3. Modify the mecabrc configuration file to specify the dictionary you want to use. The mecab-ipadic package distributed with MySQL binaries includes three dictionaries (ipadic_euc-jp, ipadic_sjis, and ipadic_utf-8). The mecabrc configuration file packaged with MySQL contains and entry similar to the following:

    dicdir =  /path/to/mysql/lib/mecab/lib/mecab/dic/ipadic_euc-jp

    To use the ipadic_utf-8 dictionary, for example, modify the entry as follows:

    dicdir=MYSQL_HOME/lib/mecab/dic/ipadic_utf-8

    If you are using your own MeCab installation or have built MeCab from source, the default dicdir entry in the mecabrc file will differ, as will the dictionaries and their location.

    Note

    After the MeCab parser plugin is installed, you can use the mecab_charset status variable to view the character set used with MeCab. The three MeCab dictionaries provided with the MySQL binary support the following character sets.

    • The ipadic_euc-jp dictionary supports the ujis and eucjpms character sets.

    • The ipadic_sjis dictionary supports the sjis and cp932 character sets.

    • The ipadic_utf-8 dictionary supports the utf8 and utf8mb4 character sets.

    mecab_charset only reports the first supported character set. For example, the ipadic_utf-8 dictionary supports both utf8 and utf8mb4. mecab_charset always reports utf8 when this dictionary is in use.

  4. Restart MySQL.

  5. Install the MeCab parser plugin:

    The MeCab parser plugin is installed using INSTALL PLUGIN syntax. The plugin name is mecab, and the shared library name is libpluginmecab.so. For additional information about installing plugins, see Section 5.6.1, “Installing and Uninstalling Plugins”.

    1. INSTALL PLUGIN mecab SONAME 'libpluginmecab.so';

    Once installed, the MeCab parser plugin loads at every normal MySQL restart.

  6. Verify that the MeCab parser plugin is loaded using the SHOW PLUGINS statement.

    1. mysql> SHOW PLUGINS;

    A mecab plugin should appear in the list of plugins.

Contents Haut

Creating a FULLTEXT Index that uses the MeCab Parser

To create a FULLTEXT index that uses the mecab parser, specify WITH PARSER ngram with CREATE TABLE, ALTER TABLE, or CREATE INDEX.

This example demonstrates creating a table with a mecab FULLTEXT index, inserting sample data, and viewing tokenized data in the INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE table:

mysql> USE test;

mysql> CREATE TABLE articles (
      id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
      title VARCHAR(200),
      body TEXT,
      FULLTEXT (title,body) WITH PARSER mecab
    ) ENGINE=InnoDB CHARACTER SET utf8;

mysql> SET NAMES utf8;
    
mysql> INSERT INTO articles (title,body) VALUES
    ('データベース管理','このチュートリアルでは、私はどのようにデータベースを管理する方法を紹介します'),
    ('データベースアプリケーション開発','データベースアプリケーションを開発することを学ぶ');

mysql> SET GLOBAL innodb_ft_aux_table="test/articles";

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position;

To add a FULLTEXT index to an existing table, you can use ALTER TABLE or CREATE INDEX. For example:

  1. CREATE TABLE articles (
  2.       title VARCHAR(200),
  3.       body TEXT
  4.      ) ENGINE=InnoDB CHARACTER SET utf8;
  5.  
  6. ALTER TABLE articles ADD FULLTEXT INDEX ft_index (title,body) WITH PARSER mecab;
  7.  
  8. # Or:
  9.  
  10. CREATE FULLTEXT INDEX ft_index ON articles (title,body) WITH PARSER mecab;

Contents Haut

MeCab Parser Space Handling

The MeCab parser uses spaces as separators in query strings. For example, the MeCab parser tokenizes データベース管理 as データベース and 管理.

Contents Haut

MeCab Parser Stopword Handling

By default, the MeCab parser uses the default stopword list, which contains a short list of English stopwords. For a stopword list applicable to Japanese, you must create your own. For information about creating stopword lists, see Section 12.9.4, “Full-Text Stopwords”.

Contents Haut

For natural language mode search, the search term is converted to a union of tokens. For example, データベース管理 is converted to データベース 管理.

SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('データベース管理' IN NATURAL LANGUAGE MODE);

For boolean mode search, the search term is converted to a search phrase. For example, データベース管理 is converted to データベース 管理.

SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('データベース管理' IN BOOLEAN MODE);

Contents Haut

Wildcard search terms are not tokenized. A search on データベース管理* is performed on the prefix, データベース管理.

SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('データベース*' IN BOOLEAN MODE);

Contents Haut

Phrases are tokenized. For example, データベース管理 is tokenized as データベース 管理.

SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('"データベース管理"' IN BOOLEAN MODE);

Contents Haut

Installing MeCab From a Binary Distribution (Optional)

This section describes how to install mecab and mecab-ipadic from a binary distribution using a native package management utility. For example, on Fedora, you can use Yum to perform the installation:

yum mecab-devel

On Debian or Ubuntu, you can perform an APT installation:

apt-get install mecab
apt-get install mecab-ipadic

Contents Haut

Installing MeCab From Source (Optional)

If you want to build mecab and mecab-ipadic from source, basic installation steps are provided below. For additional information, refer to the MeCab documentation.

  1. Download the tar.gz packages for mecab and mecab-ipadic from http://taku910.github.io/mecab/#download. As of February, 2016, the latest available packages are mecab-0.996.tar.gz and mecab-ipadic-2.7.0-20070801.tar.gz.

  2. Install mecab:

    tar zxfv mecab-0.996.tar
    cd mecab-0.996
    ./configure
    make
    make check
    su
    make install
  3. Install mecab-ipadic:

    tar zxfv mecab-ipadic-2.7.0-20070801.tar
    cd mecab-ipadic-2.7.0-20070801
    ./configure
    make
    su
    make install
  4. Compile MySQL using the WITH_MECAB CMake option. Set the WITH_MECAB option to system if you have installed mecab and mecab-ipadic to the default location.

    -DWITH_MECAB=system

    If you defined a custom installation directory, set WITH_MECAB to the custom directory. For example:

    -DWITH_MECAB=/path/to/mecab

Find a PHP function

Document created the 26/06/2006, last modified the 26/10/2018
Source of the printed document:https://www.gaudry.be/en/mysql-rf-fulltext-search-mecab.html

The infobrol is a personal site whose content is my sole responsibility. The text is available under CreativeCommons license (BY-NC-SA). More info on the terms of use and the author.

References

  1. View the html document Language of the document:en Manuel MySQL : https://dev.mysql.com/

These references and links indicate documents consulted during the writing of this page, or which may provide additional information, but the authors of these sources can not be held responsible for the content of this page.
The author This site is solely responsible for the way in which the various concepts, and the freedoms that are taken with the reference works, are presented here. Remember that you must cross multiple source information to reduce the risk of errors.

Contents Haut