
Tokenization of Chinese texts

Difficulty: Beginner
Estimated Time: 3 minutes

Manticore Search: ICU-based Chinese text tokenization

In this tutorial we will show you how our ICU-based tokenization of Chinese texts works.


Step 1 of 3

Preparing

First, we make sure there's a running Manticore instance on the machine:

searchd --status
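
If the daemon turns out not to be running, it can be started first. For example, assuming a standard package installation with the default configuration path (adjust the path to your setup):

searchd --config /etc/manticoresearch/manticore.conf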

Everything is OK, so now we connect to the Manticore search daemon:

mysql -P 9306 -h0
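
This tutorial assumes a real-time table named testrt with ICU-based Chinese segmentation already exists. If it doesn't, a table like that can be created with a statement along these lines (the field names here are an assumption chosen to match the rest of the tutorial):

CREATE TABLE testrt (title text, content text, gid int) morphology='icu_chinese' charset_table='non_cjk';

The morphology='icu_chinese' option enables the ICU segmenter, and charset_table='non_cjk' keeps the default rules for non-CJK characters.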

Basic usage

Let's insert a simple mixed Chinese-English sentence '买新的Apple电脑' ('buy a new Apple computer'):

INSERT INTO testrt VALUES ( 1, 'first record', '买新的Apple电脑', 1 );

And now let's search against our content field with the query phrase 'Apple电脑' ('Apple computer'):

SELECT * FROM testrt WHERE MATCH ('@content Apple电脑');

As we can see, the search was executed successfully and returned the expected result, even though neither the original sentence nor the query phrase had any separators between the words.
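
Since the sentence was segmented at indexing time, a query for just one of the Chinese tokens should also match the document. For example, assuming the same testrt table:

SELECT * FROM testrt WHERE MATCH ('@content 电脑');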

Verifying the segmentation

We can use the SHOW META command to see information about the last query and make sure that our query phrase was properly segmented into separate words:

SHOW META;

We see that our search phrase was indeed divided into the words 'apple' and '电脑' ('computer'), just as we expected.
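
Another way to inspect how a phrase is tokenized against the table's settings is the CALL KEYWORDS statement, which lists the tokens a given text is split into; for example:

CALL KEYWORDS ('Apple电脑', 'testrt');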