Rechercher dans le manuel MySQL
29.2.4.4 Writing Full-Text Parser Plugins
MySQL supports server-side full-text parser plugins with
MyISAM
and
InnoDB
. For introductory
information about full-text parser plugins, see
Full-Text Parser Plugins.
A full-text parser plugin can be used to replace or modify the
built-in full-text parser. This section describes how to write
a full-text parser plugin named
simple_parser
. This plugin performs parsing
based on simpler rules than those used by the MySQL built-in
full-text parser: Words are nonempty runs of whitespace
characters.
The instructions use the source code in the
plugin/fulltext
directory of MySQL source
distributions, so change location into that directory. The
following procedure describes how the plugin library is
created:
To write a full-text parser plugin, include the following header file in the plugin source file. Other MySQL or general header files might also be needed, depending on the plugin capabilities and requirements.
#include <mysql/plugin.h>
plugin.h
defines theMYSQL_FTPARSER_PLUGIN
server plugin type and the data structures needed to declare the plugin.Set up the library descriptor for the plugin library file.
This descriptor contains the general plugin descriptor for the server plugin. For a full-text parser plugin, the type must be
MYSQL_FTPARSER_PLUGIN
. This is the value that identifies the plugin as being legal for use in aWITH PARSER
clause when creating aFULLTEXT
index. (No other plugin type is legal for this clause.)For example, the library descriptor for a library that contains a single full-text parser plugin named
simple_parser
looks like this:mysql_declare_plugin(ftexample) { MYSQL_FTPARSER_PLUGIN, /* type */ &simple_parser_descriptor, /* descriptor */ "simple_parser", /* name */ "Oracle Corporation", /* author */ "Simple Full-Text Parser", /* description */ PLUGIN_LICENSE_GPL, /* plugin license */ simple_parser_plugin_init, /* init function (when loaded) */ simple_parser_plugin_deinit,/* deinit function (when unloaded) */ 0x0001, /* version */ simple_status, /* status variables */ simple_system_variables, /* system variables */ NULL, 0 } mysql_declare_plugin_end;
The
name
member (simple_parser
) indicates the name to use for references to the plugin in statements such asINSTALL PLUGIN
orUNINSTALL PLUGIN
. This is also the name displayed bySHOW PLUGINS
orINFORMATION_SCHEMA.PLUGINS
.For more information, see Section 29.2.4.2.1, “Server Plugin Library and Plugin Descriptors”.
Set up the type-specific plugin descriptor.
Each general plugin descriptor in the library descriptor points to a type-specific descriptor. For a full-text parser plugin, the type-specific descriptor is an instance of the
st_mysql_ftparser
structure in theplugin.h
file:struct st_mysql_ftparser { int interface_version; int (*parse)(MYSQL_FTPARSER_PARAM *param); int (*init)(MYSQL_FTPARSER_PARAM *param); int (*deinit)(MYSQL_FTPARSER_PARAM *param); };
As shown by the structure definition, the descriptor has an interface version number and contains pointers to three functions.
The interface version number is specified using a symbol, which is in the form:
MYSQL_
. For full-text parser plugins, the symbol isxxx
_INTERFACE_VERSIONMYSQL_FTPARSER_INTERFACE_VERSION
. In the source code, you will find the actual interface version number for the full-text parser plugin defined ininclude/mysql/plugin_ftparser.h
. The current interface version number is0x0101
.The
init
anddeinit
members should point to a function or be set to 0 if the function is not needed. Theparse
member must point to the function that performs the parsing.In the
simple_parser
declaration, that descriptor is indicated by&simple_parser_descriptor
. The descriptor specifies the version number for the full-text plugin interface (as given byMYSQL_FTPARSER_INTERFACE_VERSION
), and the plugin's parsing, initialization, and deinitialization functions:static struct st_mysql_ftparser simple_parser_descriptor= { MYSQL_FTPARSER_INTERFACE_VERSION, /* interface version */ simple_parser_parse, /* parsing function */ simple_parser_init, /* parser init function */ simple_parser_deinit /* parser deinit function */ };
A full-text parser plugin is used in two different contexts, indexing and searching. In both contexts, the server calls the initialization and deinitialization functions at the beginning and end of processing each SQL statement that causes the plugin to be invoked. However, during statement processing, the server calls the main parsing function in context-specific fashion:
For indexing, the server calls the parser for each column value to be indexed.
For searching, the server calls the parser to parse the search string. The parser might also be called for rows processed by the statement. In natural language mode, there is no need for the server to call the parser. For boolean mode phrase searches or natural language searches with query expansion, the parser is used to parse column values for information that is not in the index. Also, if a boolean mode search is done for a column that has no
FULLTEXT
index, the built-in parser will be called. (Plugins are associated with specific indexes. If there is no index, no plugin is used.)
The plugin declaration in the general plugin descriptor has
init
anddeinit
members that point initialization and deinitialization functions, and so does the type-specific plugin descriptor to which it points. However, these pairs of functions have different purposes and are invoked for different reasons:For the plugin declaration in the general plugin descriptor, the initialization and deinitialization functions are invoked when the plugin is loaded and unloaded.
For the type-specific plugin descriptor, the initialization and deinitialization functions are invoked per SQL statement for which the plugin is used.
Each interface function named in the plugin descriptor should return zero for success or nonzero for failure, and each of them receives an argument that points to a
MYSQL_FTPARSER_PARAM
structure containing the parsing context. The structure has this definition:typedef struct st_mysql_ftparser_param { int (*mysql_parse)(struct st_mysql_ftparser_param *, char *doc, int doc_len); int (*mysql_add_word)(struct st_mysql_ftparser_param *, char *word, int word_len, MYSQL_FTPARSER_BOOLEAN_INFO *boolean_info); void *ftparser_state; void *mysql_ftparam; struct charset_info_st *cs; char *doc; int length; int flags; enum enum_ftparser_mode mode; } MYSQL_FTPARSER_PARAM;
The structure members are used as follows:
mysql_parse
: A pointer to a callback function that invokes the server's built-in parser. Use this callback when the plugin acts as a front end to the built-in parser. That is, when the plugin parsing function is called, it should process the input to extract the text and pass the text to themysql_parse
callback.The first parameter for this callback function should be the
param
value itself:param->mysql_parse(param, ...);
A front end plugin can extract text and pass it all at once to the built-in parser, or it can extract and pass text to the built-in parser a piece at a time. However, in this case, the built-in parser treats the pieces of text as though there are implicit word breaks between them.
mysql_add_word
: A pointer to a callback function that adds a word to a full-text index or to the list of search terms. Use this callback when the parser plugin replaces the built-in parser. That is, when the plugin parsing function is called, it should parse the input into words and invoke themysql_add_word
callback for each word.The first parameter for this callback function should be the
param
value itself:param->mysql_add_word(param, ...);
ftparser_state
: This is a generic pointer. The plugin can set it to point to information to be used internally for its own purposes.mysql_ftparam
: This is set by the server. It is passed as the first argument to themysql_parse
ormysql_add_word
callback.cs
: A pointer to information about the character set of the text, or 0 if no information is available.doc
: A pointer to the text to be parsed.length
: The length of the text to be parsed, in bytes.flags
: Parser flags. This is zero if there are no special flags. The only nonzero flag isMYSQL_FTFLAGS_NEED_COPY
, which means thatmysql_add_word()
must save a copy of the word (that is, it cannot use a pointer to the word because the word is in a buffer that will be overwritten.)This flag might be set or reset by MySQL before calling the parser plugin, by the parser plugin itself, or by the
mysql_parse()
function.mode
: The parsing mode. This value will be one of the following constants:MYSQL_FTPARSER_SIMPLE_MODE
: Parse in fast and simple mode, which is used for indexing and for natural language queries. The parser should pass to the server only those words that should be indexed. If the parser uses length limits or a stopword list to determine which words to ignore, it should not pass such words to the server.MYSQL_FTPARSER_WITH_STOPWORDS
: Parse in stopword mode. This is used in boolean searches for phrase matching. The parser should pass all words to the server, even stopwords or words that are outside any normal length limits.MYSQL_FTPARSER_FULL_BOOLEAN_INFO
: Parse in boolean mode. This is used for parsing boolean query strings. The parser should recognize not only words but also boolean-mode operators and pass them to the server as tokens using themysql_add_word
callback. To tell the server what kind of token is being passed, the plugin needs to fill in aMYSQL_FTPARSER_BOOLEAN_INFO
structure and pass a pointer to it.
NoteFor
MyISAM
, the stopword list andft_min_word_len
andft_max_word_len
are checked inside the tokenizer. ForInnoDB
, the stopword list and equivalent word length variable settings (innodb_ft_min_token_size
andinnodb_ft_max_token_size
) are checked outside of the tokenizer. As a result,InnoDB
plugin parsers do not need to check the stopword list,innodb_ft_min_token_size
, orinnodb_ft_max_token_size
. Instead, it is recommended that all words be returned toInnoDB
. However, if you want to check stopwords within your plugin parser, useMYSQL_FTPARSER_SIMPLE_MODE
, which is for full-text search index and natural language search. ForMYSQL_FTPARSER_WITH_STOPWORDS
andMYSQL_FTPARSER_FULL_BOOLEAN_INFO
modes, it is recommended that all words be returned toInnoDB
including stopwords, in case of phrase searches.If the parser is called in boolean mode, the
param->mode
value will beMYSQL_FTPARSER_FULL_BOOLEAN_INFO
. TheMYSQL_FTPARSER_BOOLEAN_INFO
structure that the parser uses for passing token information to the server looks like this:typedef struct st_mysql_ftparser_boolean_info { enum enum_ft_token_type type; int yesno; int weight_adjust; char wasign; char trunc; int position; /* These are parser state and must be removed. */ char prev; char *quot; } MYSQL_FTPARSER_BOOLEAN_INFO;
The parser should fill in the structure members as follows:
type
: The token type. The following table shows the permissible types.Table 29.3 Full-Text Parser Token Types
Token Value Meaning FT_TOKEN_EOF
End of data FT_TOKEN_WORD
A regular word FT_TOKEN_LEFT_PAREN
The beginning of a group or subexpression FT_TOKEN_RIGHT_PAREN
The end of a group or subexpression FT_TOKEN_STOPWORD
A stopword yesno
: Whether the word must be present for a match to occur. 0 means that the word is optional but increases the match relevance if it is present. Values larger than 0 mean that the word must be present. Values smaller than 0 mean that the word must not be present.weight_adjust
: A weighting factor that determines how much a match for the word counts. It can be used to increase or decrease the word's importance in relevance calculations. A value of zero indicates no weight adjustment. Values greater than or less than zero mean higher or lower weight, respectively. The examples at Section 12.9.2, “Boolean Full-Text Searches”, that use the<
and>
operators illustrate how weighting works.wasign
: The sign of the weighting factor. A negative value acts like the~
boolean-search operator, which causes the word's contribution to the relevance to be negative.trunc
: Whether matching should be done as if the boolean-mode*
truncation operator had been given.position
: Start position of the word in the document, in bytes. Used byInnoDB
full-text search. For existing plugins that are called in boolean mode, support must be added for the position member.
Plugins should not use the
prev
andquot
members of theMYSQL_FTPARSER_BOOLEAN_INFO
structure.NoteThe plugin parser framework does not support:
The
@distance
boolean operator.A leading plus sign (
+
) or minus sign (-
) boolean operator followed by a space and then a word ('+ apple'
or'- apple'
). The leading plus or minus sign must be directly adjacent to the word, for example:'+apple'
or'-apple'
.
For information about boolean full-text search operators, see Section 12.9.2, “Boolean Full-Text Searches”.
Set up the plugin interface functions.
The general plugin descriptor in the library descriptor names the initialization and deinitialization functions that the server should invoke when it loads and unloads the plugin. For
simple_parser
, these functions do nothing but return zero to indicate that they succeeded:static int simple_parser_plugin_init(void *arg __attribute__((unused))) { return(0); } static int simple_parser_plugin_deinit(void *arg __attribute__((unused))) { return(0); }
Because those functions do not actually do anything, you could omit them and specify 0 for each of them in the plugin declaration.
The type-specific plugin descriptor for
simple_parser
names the initialization, deinitialization, and parsing functions that the server invokes when the plugin is used. Forsimple_parser
, the initialization and deinitialization functions do nothing:static int simple_parser_init(MYSQL_FTPARSER_PARAM *param __attribute__((unused))) { return(0); } static int simple_parser_deinit(MYSQL_FTPARSER_PARAM *param __attribute__((unused))) { return(0); }
Here too, because those functions do nothing, you could omit them and specify 0 for each of them in the plugin descriptor.
The main parsing function,
simple_parser_parse()
, acts as a replacement for the built-in full-text parser, so it needs to split text into words and pass each word to the server. The parsing function's first argument is a pointer to a structure that contains the parsing context. This structure has adoc
member that points to the text to be parsed, and alength
member that indicates how long the text is. The simple parsing done by the plugin considers nonempty runs of whitespace characters to be words, so it identifies words like this:static int simple_parser_parse(MYSQL_FTPARSER_PARAM *param) { char *end, *start, *docend= param->doc + param->length; for (end= start= param->doc;; end++) { if (end == docend) { if (end > start) add_word(param, start, end - start); break; } else if (isspace(*end)) { if (end > start) add_word(param, start, end - start); start= end + 1; } } return(0); }
As the parser finds each word, it invokes a function
add_word()
to pass the word to the server.add_word()
is a helper function only; it is not part of the plugin interface. The parser passes the parsing context pointer toadd_word()
, as well as a pointer to the word and a length value:static void add_word(MYSQL_FTPARSER_PARAM *param, char *word, size_t len) { MYSQL_FTPARSER_BOOLEAN_INFO bool_info= { FT_TOKEN_WORD, 0, 0, 0, 0, 0, ' ', 0 }; param->mysql_add_word(param, word, len, &bool_info); }
For boolean-mode parsing,
add_word()
fills in the members of thebool_info
structure as described earlier in the discussion of thest_mysql_ftparser_boolean_info
structure.Set up the status variables. For the
simple_parser
plugin, the following status variable array sets up one status variable with a value that is static text, and another with a value that is stored in a long integer variable:long number_of_calls= 0; struct st_mysql_show_var simple_status[]= { {"simple_parser_static", (char *)"just a static text", SHOW_CHAR}, {"simple_parser_called", (char *)&number_of_calls, SHOW_LONG}, {0,0,0} };
By using status variable names that begin with the plugin name, you can easily display the variables for a plugin with
SHOW STATUS
:To compile and install a plugin library file, use the instructions in Section 29.2.4.3, “Compiling and Installing Plugin Libraries”. To make the library file available for use, install it in the plugin directory (the directory named by the
plugin_dir
system variable). For thesimple_parser
plugin, it is compiled and installed when you build MySQL from source. It is also included in binary distributions. The build process produces a shared object library with a name ofmypluglib.so
(the.so
suffix might differ depending on your platform).To use the plugin, register it with the server. For example, to register the plugin at runtime, use this statement (adjust the
.so
suffix for your platform as necessary):For additional information about plugin loading, see Section 5.6.1, “Installing and Uninstalling Plugins”.
To verify plugin installation, examine the
INFORMATION_SCHEMA.PLUGINS
table or use theSHOW PLUGINS
statement. See Section 5.6.2, “Obtaining Server Plugin Information”.Test the plugin to verify that it works properly.
Create a table that contains a string column and associate the parser plugin with a
FULLTEXT
index on the column:- Query OK, 0 rows affected (0.01 sec)
Insert some text into the table and try some searches. These should verify that the parser plugin treats all nonwhitespace characters as word characters:
- -> ('utf8mb4_0900_as_cs is a case-sensitive collation'),
- -> ('I\'d like a case of oranges'),
- -> ('this is sensitive information'),
- -> ('another row'),
- -> ('yet another row');
- Query OK, 5 rows affected (0.02 sec)
- +--------------------------------------------------+
- | c |
- +--------------------------------------------------+
- | I'd like a case of oranges |
- | this is sensitive information |
- | another row |
- | yet another row |
- +--------------------------------------------------+
- 5 rows in set (0.00 sec)
- +--------------------------+
- +--------------------------+
- | 0 |
- | 1.2968142032623 |
- | 0 |
- | 0 |
- | 0 |
- +--------------------------+
- 5 rows in set (0.00 sec)
- mysql> SELECT MATCH(c) AGAINST('sensitive') FROM t;
- +-------------------------------+
- | MATCH(c) AGAINST('sensitive') |
- +-------------------------------+
- | 0 |
- | 0 |
- | 1.3253291845322 |
- | 0 |
- | 0 |
- +-------------------------------+
- 5 rows in set (0.01 sec)
- mysql> SELECT MATCH(c) AGAINST('case-sensitive') FROM t;
- +------------------------------------+
- | MATCH(c) AGAINST('case-sensitive') |
- +------------------------------------+
- | 1.3109166622162 |
- | 0 |
- | 0 |
- | 0 |
- | 0 |
- +------------------------------------+
- 5 rows in set (0.01 sec)
- +--------------------------+
- +--------------------------+
- | 0 |
- | 1.2968142032623 |
- | 0 |
- | 0 |
- | 0 |
- +--------------------------+
Neither “case” nor “insensitive” match “case-insensitive” the way that they would for the built-in parser.
Document created the 26/06/2006, last modified the 26/10/2018
Source of the printed document:https://www.gaudry.be/en/mysql-rf-writing-full-text-plugins.html
The infobrol is a personal site whose content is my sole responsibility. The text is available under CreativeCommons license (BY-NC-SA). More info on the terms of use and the author.
References
These references and links indicate documents consulted during the writing of this page, or which may provide additional information, but the authors of these sources can not be held responsible for the content of this page.
The author This site is solely responsible for the way in which the various concepts, and the freedoms that are taken with the reference works, are presented here. Remember that you must cross multiple source information to reduce the risk of errors.