
Tokenization of Chinese texts

Difficulty: Beginner
Estimated Time: 3 minutes

Manticore Search - ICU Chinese text tokenization

In this tutorial we will show you how our ICU-based tokenization of Chinese texts works.


Index configuration

To set up an index for proper Chinese segmentation, there are 2 index settings we need to touch: 'charset_table' and 'morphology'.

For the charset table, there is the built-in alias 'chinese', which contains all Chinese characters.

For morphology processing, Manticore has built-in support for the ICU library, which performs proper Chinese word segmentation.

First, let's connect to Manticore:

mysql -P 9306 -h0

And create our table:

CREATE TABLE testrt (title TEXT, content TEXT, gid INT) charset_table = 'chinese' morphology = 'icu_chinese';
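As a quick sanity check, the applied settings can be inspected right after creation. This is a minimal sketch, assuming the 'testrt' table from the statement above exists and that your Manticore version supports both statements; the output layout varies between versions, so none is shown here.

```sql
-- Show the full table definition, including charset_table and morphology.
SHOW CREATE TABLE testrt;

-- Alternatively, list only the settings of the table.
SHOW TABLE testrt SETTINGS;
```

If 'charset_table' or 'morphology' is missing from the output, the table was probably created without them and should be recreated before continuing.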

Basic usage

Let's insert into the index a simple mixed Chinese-English sentence '买新的Apple电脑' ('buy a new Apple computer'):

INSERT INTO testrt (title, content, gid) VALUES ( 'first record', '买新的Apple电脑', 1 );

To check if our Chinese content is tokenized as expected we can run CALL KEYWORDS against our index:

CALL KEYWORDS('买新的Apple电脑', 'testrt');

We can see that the Chinese text is broken into 4 words, as expected.
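CALL KEYWORDS also accepts optional named arguments. The sketch below, assuming the 'testrt' table and the record inserted above, enables the 'stats' option so that each token is listed together with its document and hit counts. This can help confirm that the tokens are actually present in the index, not just produced by the tokenizer.

```sql
-- Same call as before, with the optional stats flag enabled
-- to add per-keyword document and hit counts to the result.
CALL KEYWORDS('买新的Apple电脑', 'testrt', 1 AS stats);
```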

Now let's search the content field with the query phrase 'Apple电脑' ('Apple computers'):

SELECT * FROM testrt WHERE MATCH ('@content Apple电脑');

As we can see, the search ran successfully and returned the expected result, even though neither the original sentence nor the query phrase had any separators between the words.
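The same idea can be tried with other query shapes. The statements below are a sketch, assuming the record inserted earlier and the segmentation reported by CALL KEYWORDS; if the sentence is split as described, both queries should find that record.

```sql
-- Search for a single word taken from the stored sentence.
SELECT * FROM testrt WHERE MATCH('@content 电脑');

-- Phrase search: the quoted words must appear next to each other, in order.
SELECT * FROM testrt WHERE MATCH('@content "Apple电脑"');
```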


We can use the 'SHOW META' command to see information about the last query and make sure that our search phrase was properly segmented into separate words:

SHOW META;

We see that our search phrase was indeed divided into 'apple' and '电脑' ('computers'), just as we expected.
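SHOW META reports on the most recent query in the current session, so it has to be run right after the search. The sketch below, assuming the same 'testrt' table, repeats the search and then uses the optional LIKE clause to narrow the output down to the keyword rows.

```sql
-- Run the search first; SHOW META describes the latest query in this session.
SELECT * FROM testrt WHERE MATCH('@content Apple电脑');

-- Show only the rows whose variable name starts with "keyword".
SHOW META LIKE 'keyword%';
```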