A Brief Discussion on MySQL Full-Text Indexing

Recently, I've encountered quite a few pitfalls with MySQL's full-text search, so I decided to write a detailed article summarizing some knowledge points about MySQL's full-text search.

Fuzzy Search#

During our use of MySQL, there are always some fuzzy search requirements. For example, we have a table like this:

create table if not exists `user`
(
    `id`         bigint(20)   not null auto_increment,
    `name`       varchar(255) not null,
    `age`        int          not null,
    `update_time` timestamp    not null,
    `create_time` timestamp    not null,
    index (`name`),
    primary key (`id`)
) engine = InnoDB
  charset = 'utf8mb4';

Now we need to perform some fuzzy matching on name, for instance, we want to match names that contain the character 草. After some thought, we write the following SQL:

select * from user where name like '%草%'

Great! When you excitedly deploy this code, you find that it crashes online. Why? Because of the pitfalls in MySQL. The like query in MySQL has two limitations:

Only prefix matching ' 草 %' and suffix matching '% 草 ' will use the index, while arbitrary matching will not.
When the index cannot be used, MySQL will scan the entire table to query data.

When the data scale of a table is large, a brute-force table scan will inevitably incur significant overhead.

However, in our actual work, there are definitely many arbitrary matching requirements. So what should we do? Perhaps we can try full-text search.

Full-Text Search#

A Brief Discussion on Full-Text Search#

Full-text search is no longer unfamiliar to everyone. In simple terms, it is a way to find matches in a pile of text data using a set of keywords. Currently, the more mainstream full-text search solutions in the industry include:

Relational databases that support full-text search
Apache Lucene
ElasticSearch based on Apache Lucene
Apache Solr

The latter two are the main solutions in the industry today, and many full-text search requirements will consider using ES or Solr. However, this approach is not without cost. There are several practical issues:

The operational issues of ES/Solr when dealing with large data volumes; ensuring the HA of the cluster will be a significant test of the team's capabilities.
How to perform real-time/offline ETL of data from MySQL or other data sources to the Search Engine.
The additional learning and maintenance costs of the codebase.
After adding a dependency, ensuring the overall HA of the system.

In technical decision-making, we often need to weigh the ROI of an option to assist in decision-making. If we are faced with a relatively simple search scenario, the overhead brought by choosing ES/Solr will make its ROI relatively low. Therefore, in some simple scenarios, we may prefer to utilize the capabilities of the database itself to meet our needs.

Fortunately, starting from MySQL 5.5, it supports certain full-text search capabilities.

MySQL Full-Text Search#

The prerequisite for MySQL full-text search is to create a Full Text Index on the table.

alter table `user`
    ADD FULLTEXT INDEX name_index (`name`);

Note that full-text indexes only take effect on fields of type CHAR/VARCHAR/TEXT.

Next, we insert two records:

insert into `user` (name, age, createTime, updateTime)
values ('Jeff.S.Wang', 18, current_timestamp, current_timestamp);

insert into `user` (name, age, createTime, updateTime)
values ('Jeff.Li', 18, current_timestamp, current_timestamp);

Now, let's see how MySQL performs full-text queries.

First, according to the official definition,

MATCH (col1,col2,...) AGAINST (expr [search_modifier])

The search_modifier is the selected matching mode, and there are four types in MySQL:

IN NATURAL LANGUAGE MODE
IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
IN BOOLEAN MODE
WITH QUERY EXPANSION

The commonly used modes are Natural Language Mode and Boolean Mode.

First, let's talk about Natural Language Mode. It's quite simple; as the name suggests, MySQL directly calculates the keywords to be matched and returns the corresponding values. Here’s a quote from the official explanation:

By default or with the IN NATURAL LANGUAGE MODE modifier, the MATCH() function performs a natural language search for a string against a text collection. A collection is a set of one or more columns included in a FULLTEXT index. The search string is given as the argument to AGAINST(). For each row in the table, MATCH() returns a relevance value; that is, a similarity measure between the search string and the text in that row in the columns named in the MATCH() list.

Let's write an SQL query:

select * 
from `user` 
where MATCH(name) AGAINST('Jeff' IN NATURAL LANGUAGE MODE)

We find that we get the following results:

id	name	age	updateTime	createTime
1	Jeff Li	18	2020-03-01 15:38:07	2020-03-01 15:38:07
2	Jeff.S.Wang	18	2020-03-01 15:42:28	2020-03-01 15:42:28

Next, we try to match the user's last name, for example, we want to find a user with the last name Wang.

We write the following SQL:

select * 
from `user` 
where MATCH(name) AGAINST('Wang' IN NATURAL LANGUAGE MODE)

We get the following result:

id	name	age	updateTime	createTime
2	Jeff.S.Wang	18	2020-03-01 15:42:28	2020-03-01 15:42:28

Now we try to search for a user with the last name Li, and we write the following SQL:

select * 
from `user` 
where MATCH(name) AGAINST('Li' IN NATURAL LANGUAGE MODE)

And we find that there are no results????? WTF? Why?

The reason lies in the granularity of the word segmentation. When we enter new data, MySQL segments the data in our indexed fields based on a certain word segmentation length and stores it for querying. There are four parameters that control the word segmentation length:

innodb_ft_min_token_size
innodb_ft_max_token_size
ft_min_word_len (the same function, but for the MyISAM engine)
ft_max_word_len

For InnoDB, the default value of innodb_ft_min_token_size is 3. In other words, in the data we previously entered, the segmented units stored in our data are:

Jeff
Wang

Thus, we had no results for our second search. Now, after modifying MySQL's parameters, let's execute it again:

select * 
from `user` 
where MATCH(name) AGAINST('Li' IN NATURAL LANGUAGE MODE)

Still not working?????

After checking the official documentation, we find the following description:

Some variable changes require that you rebuild the FULLTEXT indexes in your tables. Instructions for doing so are given later in this section.

The index word segmentation granularity is included in this, so we need to delete/rebuild the index and then execute again (a bit of a pitfall...).

select * 
from `user` 
where MATCH(name) AGAINST('Li' IN NATURAL LANGUAGE MODE)

Now we get the normal return results:

id	name	age	updateTime	createTime
1	Jeff Li	18	2020-03-01 15:38:07	2020-03-01 15:38:07

Now let's talk about another matching mode, BOOLEAN MODE.

Boolean mode allows us to use some operators to retrieve data. Here are some common examples; the rest can be found in the MySQL official documentation:

AGAINST('Jeff Li' IN BOOLEAN MODE) means that either Jeff or Li must exist.
AGAINST('+Jeff' IN BOOLEAN MODE) means that Jeff must exist.
AGAINST('+Jeff -Li' IN BOOLEAN MODE) means that Jeff must exist and Li must not exist.

Let's execute these SQL queries:

select * 
from `user` 
where MATCH(name) AGAINST('Jeff Li' IN BOOLEAN MODE)

Result:

id	name	age	updateTime	createTime
1	Jeff Li	18	2020-03-01 15:38:07	2020-03-01 15:38:07
2	Jeff.S.Wang	18	2020-03-01 15:42:28	2020-03-01 15:42:28

select *
from `user` 
where MATCH(name) AGAINST('+Jeff' IN BOOLEAN MODE)

Result:

id	name	age	updateTime	createTime
1	Jeff Li	18	2020-03-01 15:38:07	2020-03-01 15:38:07
2	Jeff.S.Wang	18	2020-03-01 15:42:28	2020-03-01 15:42:28

select * 
from `user` 
where MATCH(name) AGAINST('+Jeff -Li' IN BOOLEAN MODE)

Result:

id	name	age	updateTime	createTime
2	Jeff.S.Wang	18	2020-03-01 15:42:28	2020-03-01 15:42:28

Now, we have some Chinese search requirements. Let's first insert data:

insert into `user` (name, age, createTime, updateTime)
values ('奥特曼', 18, current_timestamp, current_timestamp);

Now we want to search for users with the last name 奥. Following the previous guide, we write the following SQL:

select *
from `user`
where MATCH(name) AGAINST('+奥' IN BOOLEAN MODE)

To our surprise, we find that there are still no results???? Why???

Actually, it’s the same issue we mentioned earlier: word segmentation. MySQL's default word segmentation engine only supports English word segmentation and does not support Chinese word segmentation. Without word segmentation, there can be no search. What to do?

Starting from MySQL 5.7, MySQL provides the ngram component to help us perform Chinese word segmentation, and it's very simple to use:

alter table `user`
    add fulltext index name_index (`name`) with parser ngram;

Here are a few points to note:

ngram is not only suitable for Chinese; according to the official documentation, it also supports Korean and Japanese.
A single field can only have one full-text index, so the original full-text index needs to be deleted.

At the same time, like the default word segmentation, ngram is also subject to word segmentation granularity limitations, but the settings for ngram are:

ngram_token_size

We can set it according to our needs.

Summary#

Full-text search is a common requirement in daily development. When our infrastructure does not allow us to confidently use external components, leveraging the capabilities provided by the database may be a good option. However, there are still many pitfalls to navigate and many parameters to optimize. By the way, the settings for Alibaba Cloud's RDS are really difficult to use (just a little complaint).

Alright... my procrastination really has no cure... and I've been suffering from a toothache these past few days, which is really frustrating.