
Tokenization of Chinese texts

Difficulty: Beginner
Estimated Time: 3 minutes

Manticore Search - Chinese text tokenization

In this tutorial we will show you three methods of Chinese text tokenization available in Manticore Search:

  1. ICU-based segmentation - precise dictionary-based segmentation using the ICU library
  2. Jieba-based segmentation - advanced Chinese segmentation with search-optimized modes
  3. N-gram segmentation - simple character-based approach for basic support


Method 1: ICU-based segmentation

ICU (International Components for Unicode) provides precise dictionary-based segmentation for Chinese text.

Configuration:

  • charset_table = 'non_cjk,chinese'
  • morphology = 'icu_chinese'

First, let's connect to Manticore:

mysql -P 9306 -h0

And create our table with ICU:

CREATE TABLE testrt_icu (title TEXT, content TEXT, gid INT) charset_table = 'non_cjk,chinese' morphology = 'icu_chinese';

This method provides high accuracy and good performance for Chinese text segmentation.
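To double-check that the tokenization settings were applied, you can inspect the table definition (SHOW CREATE TABLE is supported in recent Manticore versions):

SHOW CREATE TABLE testrt_icu;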

Testing ICU segmentation

Let's insert a mixed Chinese-English sentence '买新的Apple电脑' ('Buy a new Apple computer'):

INSERT INTO testrt_icu (title, content, gid) VALUES ( 'first record', '买新的Apple电脑', 1 );

To check how the text is tokenized, use CALL KEYWORDS:

CALL KEYWORDS('买新的Apple电脑', 'testrt_icu');

You'll see the Chinese text is broken into meaningful words, not just characters.

Now search for the phrase 'Apple电脑' ('Apple computer'):

SELECT * FROM testrt_icu WHERE MATCH ('@content Apple电脑');

The search successfully finds the document even though there were no spaces between words in the original text. This demonstrates ICU's ability to understand Chinese word boundaries.
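If you want to require that the two words appear next to each other in this exact order, you can also run a strict phrase search by quoting the query inside MATCH (standard Manticore full-text syntax):

SELECT * FROM testrt_icu WHERE MATCH ('@content "Apple电脑"');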

Verifying ICU segmentation

Use SHOW META to see how the query was segmented:

SHOW META;

You'll see the search phrase was divided into 'apple' and '电脑' ('computer'), confirming proper word segmentation.

ICU correctly identifies word boundaries in continuous Chinese script, making phrase searches work naturally without manual word separation.
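To see exactly which words matched inside the document text, you can additionally use the HIGHLIGHT() function (this assumes the text fields are stored, which is the default for RT tables):

SELECT HIGHLIGHT() FROM testrt_icu WHERE MATCH ('@content Apple电脑');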

Method 2: Jieba-based segmentation

Jieba is a popular Chinese text segmentation library that offers search-optimized segmentation modes.

Configuration:

  • charset_table = 'non_cjk,chinese'
  • morphology = 'jieba_chinese'
  • Requires package: manticore-language-packs

Key advantage: Jieba's search mode breaks words into sub-words for better search recall. For example, "清华大学" (Tsinghua University) will be indexed as both "清华大学" and "清华", so users can find it by searching for either term.

Let's create a table with Jieba segmentation:

CREATE TABLE testrt_jieba (title TEXT, content TEXT, gid INT) charset_table = 'non_cjk,chinese' morphology = 'jieba_chinese';

Now insert the same test data:

INSERT INTO testrt_jieba (title, content, gid) VALUES ( 'second record', '买新的Apple电脑', 2 );

Test the segmentation with CALL KEYWORDS:

CALL KEYWORDS('买新的Apple电脑', 'testrt_jieba');
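To see the search-optimized sub-word behavior mentioned earlier, you can also check how Jieba segments '清华大学' (Tsinghua University); you should see both the full word and the shorter '清华' among the produced tokens:

CALL KEYWORDS('清华大学', 'testrt_jieba');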

Search for the data:

SELECT * FROM testrt_jieba WHERE MATCH('@content Apple电脑');
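As with ICU, you can verify how the query itself was segmented:

SHOW META;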

Method 3: N-gram segmentation

N-gram is a simpler, character-based approach that doesn't require external libraries or dictionaries.

Configuration:

  • charset_table = 'non_cjk' (non-CJK characters)
  • ngram_len = '1'
  • ngram_chars = 'chinese' (or 'cont' for all CJK)

Characteristics:

  • Faster indexing
  • No external dependencies
  • Less accurate for complex queries
  • Good for basic Chinese text support

Let's create a table with N-gram segmentation:

CREATE TABLE testrt_ngram (title TEXT, content TEXT, gid INT) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'chinese';

Insert test data:

INSERT INTO testrt_ngram (title, content, gid) VALUES ( 'third record', '买新的Apple电脑', 3 );

Test with CALL KEYWORDS:

CALL KEYWORDS('买新的Apple电脑', 'testrt_ngram');

Search:

SELECT * FROM testrt_ngram WHERE MATCH('@content Apple电脑');

Notice how with N-grams each Chinese character is indexed separately, providing basic but functional search capabilities. You can confirm this with SHOW META:

SHOW META;
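If you need the same character-based approach to cover Japanese and Korean text as well, the 'cont' value mentioned above includes all continuous-script CJK characters. A variant table definition (the table name here is just illustrative) could look like this:

CREATE TABLE testrt_ngram_cjk (title TEXT, content TEXT, gid INT) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cont';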

Comparison of Chinese segmentation methods

Let's compare the three methods we've explored:

Feature | ICU | Jieba | N-grams
Accuracy | High | Very High | Basic
Indexing Speed | Good | Good | Very Fast
Search Quality | Precise | Excellent (search-optimized) | Functional
Index Size | Medium | Medium | Larger
Dependencies | Built-in ICU library | Requires manticore-language-packs | None
Dictionary Support | Yes | Yes (customizable) | No
Best For | General purpose Chinese search | Production Chinese search with high recall requirements | Quick setup, basic Chinese support

When to use each method:

ICU - Good default choice for Chinese text search with balanced performance and accuracy

Jieba - Best for:

  • Production Chinese search applications
  • When you need high search recall (users can find results using partial terms)
  • Industry-specific terminology (supports custom dictionaries)
  • When "清华" should match "清华大学"

N-grams - Good for:

  • Quick prototyping
  • Mixed multilingual content where Chinese is secondary
  • When you want to avoid external dependencies
  • Systems with limited resources
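Finally, if you followed along and want to remove the test tables created in this tutorial:

DROP TABLE testrt_icu;
DROP TABLE testrt_jieba;
DROP TABLE testrt_ngram;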