Rechercher dans le manuel MySQL
12.9.2 Boolean Full-Text Searches
MySQL can perform boolean full-text searches using the
IN BOOLEAN MODE
modifier. With this modifier,
certain characters have special meaning at the beginning or end
of words in the search string. In the following query, the
+
and -
operators indicate
that a word must be present or absent, respectively, for a match
to occur. Thus, the query retrieves all the rows that contain
the word “MySQL” but that do
not contain the word
“YourSQL”:
- +----+-----------------------+-------------------------------------+
- | id | title | body |
- +----+-----------------------+-------------------------------------+
- +----+-----------------------+-------------------------------------+
In implementing this feature, MySQL uses what is sometimes referred to as implied Boolean logic, in which
+
stands forAND
-
stands forNOT
[no operator] implies
OR
Boolean full-text searches have these characteristics:
They do not automatically sort rows in order of decreasing relevance.
InnoDB
tables require aFULLTEXT
index on all columns of theMATCH()
expression to perform boolean queries. Boolean queries against aMyISAM
search index can work even without aFULLTEXT
index, although a search executed in this fashion would be quite slow.The minimum and maximum word length full-text parameters apply to
FULLTEXT
indexes created using the built-inFULLTEXT
parser and MeCab parser plugin.innodb_ft_min_token_size
andinnodb_ft_max_token_size
are used forInnoDB
search indexes.ft_min_word_len
andft_max_word_len
are used forMyISAM
search indexes.Minimum and maximum word length full-text parameters do not apply to
FULLTEXT
indexes created using the ngram parser. ngram token size is defined by thengram_token_size
option.The stopword list applies, controlled by
innodb_ft_enable_stopword
,innodb_ft_server_stopword_table
, andinnodb_ft_user_stopword_table
forInnoDB
search indexes, andft_stopword_file
forMyISAM
ones.InnoDB
full-text search does not support the use of multiple operators on a single search word, as in this example:'++apple'
. Use of multiple operators on a single search word returns a syntax error to standard out. MyISAM full-text search will successfully process the same search ignoring all operators except for the operator immediately adjacent to the search word.InnoDB
full-text search only supports leading plus or minus signs. For example,InnoDB
supports'+apple'
but does not support'apple+'
. Specifying a trailing plus or minus sign causesInnoDB
to report a syntax error.InnoDB
full-text search does not support the use of a leading plus sign with wildcard ('+*'
), a plus and minus sign combination ('+-'
), or leading a plus and minus sign combination ('+-apple'
). These invalid queries return a syntax error.InnoDB
full-text search does not support the use of the@
symbol in boolean full-text searches. The@
symbol is reserved for use by the@distance
proximity search operator.They do not use the 50% threshold that applies to
MyISAM
search indexes.
The boolean full-text search capability supports the following operators:
+
A leading or trailing plus sign indicates that this word must be present in each row that is returned.
InnoDB
only supports leading plus signs.-
A leading or trailing minus sign indicates that this word must not be present in any of the rows that are returned.
InnoDB
only supports leading minus signs.Note: The
-
operator acts only to exclude rows that are otherwise matched by other search terms. Thus, a boolean-mode search that contains only terms preceded by-
returns an empty result. It does not return “all rows except those containing any of the excluded terms.”(no operator)
By default (when neither
+
nor-
is specified), the word is optional, but the rows that contain it are rated higher. This mimics the behavior ofMATCH() ... AGAINST()
without theIN BOOLEAN MODE
modifier.@
distance
This operator works on
InnoDB
tables only. It tests whether two or more words all start within a specified distance from each other, measured in words. Specify the search words within a double-quoted string immediately before the@
operator, for example,distance
MATCH(col1) AGAINST('"word1 word2 word3" @8' IN BOOLEAN MODE)
> <
These two operators are used to change a word's contribution to the relevance value that is assigned to a row. The
>
operator increases the contribution and the<
operator decreases it. See the example following this list.( )
Parentheses group words into subexpressions. Parenthesized groups can be nested.
~
A leading tilde acts as a negation operator, causing the word's contribution to the row's relevance to be negative. This is useful for marking “noise” words. A row containing such a word is rated lower than others, but is not excluded altogether, as it would be with the
-
operator.*
The asterisk serves as the truncation (or wildcard) operator. Unlike the other operators, it is appended to the word to be affected. Words match if they begin with the word preceding the
*
operator.If a word is specified with the truncation operator, it is not stripped from a boolean query, even if it is too short or a stopword. Whether a word is too short is determined from the
innodb_ft_min_token_size
setting forInnoDB
tables, orft_min_word_len
forMyISAM
tables. These options are not applicable toFULLTEXT
indexes that use the ngram parser.The wildcarded word is considered as a prefix that must be present at the start of one or more words. If the minimum word length is 4, a search for
'+
could return fewer rows than a search forword
+the*''+
, because the second query ignores the too-short search termword
+the'the
."
A phrase that is enclosed within double quote (
"
) characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and performs a search in theFULLTEXT
index for the words. Nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example,"test phrase"
matches"test, phrase"
.If the phrase contains no words that are in the index, the result is empty. The words might not be in the index because of a combination of factors: if they do not exist in the text, are stopwords, or are shorter than the minimum length of indexed words.
The following examples demonstrate some search strings that use boolean full-text operators:
'apple banana'
Find rows that contain at least one of the two words.
'+apple +juice'
Find rows that contain both words.
'+apple macintosh'
Find rows that contain the word “apple”, but rank rows higher if they also contain “macintosh”.
'+apple -macintosh'
Find rows that contain the word “apple” but not “macintosh”.
'+apple ~macintosh'
Find rows that contain the word “apple”, but if the row also contains the word “macintosh”, rate it lower than if row does not. This is “softer” than a search for
'+apple -macintosh'
, for which the presence of “macintosh” causes the row not to be returned at all.'+apple +(>turnover <strudel)'
Find rows that contain the words “apple” and “turnover”, or “apple” and “strudel” (in any order), but rank “apple turnover” higher than “apple strudel”.
'apple*'
Find rows that contain words such as “apple”, “apples”, “applesauce”, or “applet”.
'"some words"'
Find rows that contain the exact phrase “some words” (for example, rows that contain “some words of wisdom” but not “some noise words”). Note that the
"
characters that enclose the phrase are operator characters that delimit the phrase. They are not the quotation marks that enclose the search string itself.
Relevancy Rankings for InnoDB Boolean Mode Search
InnoDB
full-text search is
modeled on the
Sphinx full-text
search engine, and the algorithms used are based on
BM25
and
TF-IDF
ranking algorithms. For these reasons, relevancy rankings for
InnoDB
boolean full-text search may differ
from MyISAM
relevancy rankings.
InnoDB
uses a variation of the “term
frequency-inverse document frequency”
(TF-IDF
) weighting system to rank a
document's relevance for a given full-text search query. The
TF-IDF
weighting is based on how frequently
a word appears in a document, offset by how frequently the
word appears in all documents in the collection. In other
words, the more frequently a word appears in a document, and
the less frequently the word appears in the document
collection, the higher the document is ranked.
How Relevancy Ranking is Calculated
The term frequency (TF
) value is the number
of times that a word appears in a document. The inverse
document frequency (IDF
) value of a word is
calculated using the following formula, where
total_records
is the number of records in
the collection, and matching_records
is the
number of records that the search term appears in.
${IDF} = log10( ${total_records} / ${matching_records} )
When a document contains a word multiple times, the IDF value is multiplied by the TF value:
${TF} * ${IDF}
Using the TF
and IDF
values, the relevancy ranking for a document is calculated
using this formula:
${rank} = ${TF} * ${IDF} * ${IDF}
The formula is demonstrated in the following examples.
Relevancy Ranking for a Single Word Search
This example demonstrates the relevancy ranking calculation for a single-word search.
- body TEXT,
- Query OK, 0 rows affected (1.04 sec)
- ('MySQL Tutorial','This database tutorial ...'),
- ("How To Use MySQL",'After you went through a ...'),
- ('Optimizing Your Database','In this database tutorial ...'),
- ('MySQL vs. YourSQL','When comparing databases ...'),
- ('MySQL Security','When configured properly, MySQL ...'),
- ('Database, Database, Database','database database database'),
- ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
- ('MySQL Full-Text Indexes', 'MySQL fulltext indexes use a ..');
- Query OK, 8 rows affected (0.06 sec)
- +----+------------------------------+-------------------------------------+---------------------+
- | id | title | body | score |
- +----+------------------------------+-------------------------------------+---------------------+
- +----+------------------------------+-------------------------------------+---------------------+
There are 8 records in total, with 3 that match the
“database” search term. The first record
(id 6
) contains the search term 6 times and
has a relevancy ranking of
1.0886961221694946
. This ranking value is
calculated using a TF
value of 6 (the
“database” search term appears 6 times in record
id 6
) and an IDF
value
of 0.42596873216370745, which is calculated as follows (where
8 is the total number of records and 3 is the number of
records that the search term appears in):
${IDF} = log10( 8 / 3 ) = 0.42596873216370745
The TF
and IDF
values
are then entered into the ranking formula:
${rank} = ${TF} * ${IDF} * ${IDF}
Performing the calculation in the MySQL command-line client returns a ranking value of 1.088696164686938.
- +-------------------------+
- +-------------------------+
- | 1.088696164686938 |
- +-------------------------+
You may notice a slight difference in the ranking values
returned by the SELECT ... MATCH ...
AGAINST
statement and the MySQL command-line
client (1.0886961221694946
versus
1.088696164686938
). The difference is due
to how the casts between integers and floats/doubles are
performed internally by InnoDB
(along
with related precision and rounding decisions), and how they
are performed elsewhere, such as in the MySQL command-line
client or other types of calculators.
Relevancy Ranking for a Multiple Word Search
This example demonstrates the relevancy ranking calculation
for a multiple-word full-text search based on the
articles
table and data used in the
previous example.
If you search on more than one word, the relevancy ranking value is a sum of the relevancy ranking value for each word, as shown in this formula:
${rank} = ${TF} * ${IDF} * ${IDF} + ${TF} * ${IDF} * ${IDF}
Performing a search on two terms ('mysql tutorial') returns the following results:
- +----+------------------------------+-------------------------------------+----------------------+
- | id | title | body | score |
- +----+------------------------------+-------------------------------------+----------------------+
- +----+------------------------------+-------------------------------------+----------------------+
In the first record (id 8
), 'mysql' appears
once and 'tutorial' appears twice. There are six matching
records for 'mysql' and two matching records for 'tutorial'.
The MySQL command-line client returns the expected ranking
value when inserting these values into the ranking formula for
a multiple word search:
- +-------------------------------------------------------+
- +-------------------------------------------------------+
- | 0.7405621541938003 |
- +-------------------------------------------------------+
The slight difference in the ranking values returned by the
SELECT ... MATCH ... AGAINST
statement
and the MySQL command-line client is explained in the
preceding example.
Nederlandse vertaling
U hebt gevraagd om deze site in het Nederlands te bezoeken. Voor nu wordt alleen de interface vertaald, maar nog niet alle inhoud.Als je me wilt helpen met vertalingen, is je bijdrage welkom. Het enige dat u hoeft te doen, is u op de site registreren en mij een bericht sturen waarin u wordt gevraagd om u toe te voegen aan de groep vertalers, zodat u de gewenste pagina's kunt vertalen. Een link onderaan elke vertaalde pagina geeft aan dat u de vertaler bent en heeft een link naar uw profiel.
Bij voorbaat dank.
Document heeft de 26/06/2006 gemaakt, de laatste keer de 26/10/2018 gewijzigd
Bron van het afgedrukte document:https://www.gaudry.be/nl/mysql-rf-fulltext-boolean.html
De infobrol is een persoonlijke site waarvan de inhoud uitsluitend mijn verantwoordelijkheid is. De tekst is beschikbaar onder CreativeCommons-licentie (BY-NC-SA). Meer info op de gebruiksvoorwaarden en de auteur.
Referenties
Deze verwijzingen en links verwijzen naar documenten die geraadpleegd zijn tijdens het schrijven van deze pagina, of die aanvullende informatie kunnen geven, maar de auteurs van deze bronnen kunnen niet verantwoordelijk worden gehouden voor de inhoud van deze pagina.
De auteur Deze site is als enige verantwoordelijk voor de manier waarop de verschillende concepten, en de vrijheden die met de referentiewerken worden genomen, hier worden gepresenteerd. Vergeet niet dat u meerdere broninformatie moet doorgeven om het risico op fouten te verkleinen.