Tokenization of Chinese texts
Method 1: ICU-based segmentation
ICU (International Components for Unicode) provides precise dictionary-based segmentation for Chinese text.
Configuration:
- charset_table = 'non_cjk,chinese'
- morphology = 'icu_chinese'
First, let's connect to Manticore:
mysql -P 9306 -h0
And create our table with ICU:
CREATE TABLE testrt_icu (title TEXT, content TEXT, gid INT) charset_table = 'non_cjk,chinese' morphology = 'icu_chinese';
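To confirm the tokenization settings took effect, you can inspect the table definition:
SHOW CREATE TABLE testrt_icu;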
This method provides high accuracy and good performance for Chinese text segmentation.
Testing ICU segmentation
Let's insert a mixed Chinese-English sentence '买新的Apple电脑' ('Buy new Apple computer'):
INSERT INTO testrt_icu (title, content, gid) VALUES ( 'first record', '买新的Apple电脑', 1 );
To check how the text is tokenized, use CALL KEYWORDS:
CALL KEYWORDS('买新的Apple电脑', 'testrt_icu');
You'll see the Chinese text is broken into meaningful words, not just characters.
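On a typical installation the output looks roughly like this (exact word boundaries depend on the ICU version and its built-in dictionary); the important part is that 电脑 ('computer') is kept together as one token rather than split into characters:
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | 买        | 买         |
| 2    | 新        | 新         |
| 3    | 的        | 的         |
| 4    | apple     | apple      |
| 5    | 电脑      | 电脑       |
+------+-----------+------------+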
Now search with a phrase 'Apple电脑' ('Apple computer'):
SELECT * FROM testrt_icu WHERE MATCH('@content Apple电脑');
The search successfully finds the document even though the original text contains no spaces between words, demonstrating ICU's ability to recognize Chinese word boundaries.
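The result should look something like this (the document id is auto-generated, so yours will differ):
+---------------------+--------------+-----------------+------+
| id                  | title        | content         | gid  |
+---------------------+--------------+-----------------+------+
| 1515697460763557889 | first record | 买新的Apple电脑 | 1    |
+---------------------+--------------+-----------------+------+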
Verifying ICU segmentation
Use SHOW META to see how the query was segmented:
SHOW META;
You'll see the search phrase was divided into 'apple' and '电脑' ('computer'), confirming proper word segmentation.
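A typical SHOW META result (timings and counters will vary):
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| total          | 1     |
| total_found    | 1     |
| time           | 0.000 |
| keyword[0]     | apple |
| docs[0]        | 1     |
| hits[0]        | 1     |
| keyword[1]     | 电脑  |
| docs[1]        | 1     |
| hits[1]        | 1     |
+----------------+-------+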
ICU correctly identifies word boundaries in continuous Chinese script, making phrase searches work naturally without manual word separation.
Method 2: Jieba-based segmentation
Jieba is a popular Chinese text segmentation library that offers search-optimized segmentation modes.
Configuration:
- charset_table = 'non_cjk,chinese'
- morphology = 'jieba_chinese'
- Requires package: manticore-language-packs
Key advantage: Jieba's search mode breaks words into sub-words for better search recall. For example, "清华大学" (Tsinghua University) will be indexed as both "清华大学" and "清华", so users can find it by searching for either term; we'll verify this behavior below.
Let's create a table with Jieba segmentation:
CREATE TABLE testrt_jieba (title TEXT, content TEXT, gid INT) charset_table = 'non_cjk,chinese' morphology = 'jieba_chinese';
Now insert the same test data:
INSERT INTO testrt_jieba (title, content, gid) VALUES ( 'second record', '买新的Apple电脑', 2 );
Test the segmentation with CALL KEYWORDS:
CALL KEYWORDS('买新的Apple电脑', 'testrt_jieba');
Search for the data:
SELECT * FROM testrt_jieba WHERE MATCH('@content Apple电脑');
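To verify the sub-word behavior mentioned earlier, tokenize 清华大学 against this table (the exact sub-words depend on Jieba's dictionary, but 清华 should appear alongside the full word):
CALL KEYWORDS('清华大学', 'testrt_jieba');
Because both 清华大学 and sub-words like 清华 are produced at indexing time, a search for either term will match the same document.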
Method 3: N-gram segmentation
N-gram is a simpler, character-based approach that doesn't require external libraries or dictionaries.
Configuration:
- charset_table = 'non_cjk' (Chinese characters are handled by ngram_chars instead)
- ngram_len = '1'
- ngram_chars = 'chinese' (or 'cont' for all CJK)
Characteristics:
- Faster indexing
- No external dependencies
- Less accurate for complex queries
- Good for basic Chinese text support
Let's create a table with N-gram segmentation:
CREATE TABLE testrt_ngram (title TEXT, content TEXT, gid INT) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'chinese';
Insert test data:
INSERT INTO testrt_ngram (title, content, gid) VALUES ( 'third record', '买新的Apple电脑', 3 );
Test with CALL KEYWORDS:
CALL KEYWORDS('买新的Apple电脑', 'testrt_ngram');
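With ngram_len = '1', every Chinese character becomes its own token, while the Latin word is tokenized normally. The output should look roughly like this:
+------+-----------+------------+
| qpos | tokenized | normalized |
+------+-----------+------------+
| 1    | 买        | 买         |
| 2    | 新        | 新         |
| 3    | 的        | 的         |
| 4    | apple     | apple      |
| 5    | 电        | 电         |
| 6    | 脑        | 脑         |
+------+-----------+------------+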
Search:
SELECT * FROM testrt_ngram WHERE MATCH('@content Apple电脑');
As SHOW META confirms, the query is likewise split into individual characters, which is basic but functional:
SHOW META;
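Expected output along these lines (note the separate 电 and 脑 keywords):
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| total          | 1     |
| total_found    | 1     |
| time           | 0.000 |
| keyword[0]     | apple |
| docs[0]        | 1     |
| hits[0]        | 1     |
| keyword[1]     | 电    |
| docs[1]        | 1     |
| hits[1]        | 1     |
| keyword[2]     | 脑    |
| docs[2]        | 1     |
| hits[2]        | 1     |
+----------------+-------+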
Comparison of Chinese segmentation methods
Let's compare the three methods we've explored:
| Feature | ICU | Jieba | N-grams |
|---|---|---|---|
| Accuracy | High | Very High | Basic |
| Indexing Speed | Good | Good | Very Fast |
| Search Quality | Precise | Excellent (search-optimized) | Functional |
| Index Size | Medium | Medium | Larger |
| Dependencies | Built-in ICU library | Requires manticore-language-packs | None |
| Dictionary Support | Yes | Yes (customizable) | No |
| Best For | General purpose Chinese search | Production Chinese search with high recall requirements | Quick setup, basic Chinese support |
When to use each method:
✅ ICU - Good default choice for Chinese text search with balanced performance and accuracy
✅ Jieba - Best for:
- Production Chinese search applications
- When you need high search recall (users can find results using partial terms)
- Industry-specific terminology (supports custom dictionaries)
- When "清华" should match "清华大学"
✅ N-grams - Good for:
- Quick prototyping
- Mixed multilingual content where Chinese is secondary
- When you want to avoid external dependencies
- Systems with limited resources