Morphology and Lemmatization in Manticore Search
Introduction to Morphology
Morphology allows search engines to normalize words to their base forms. For example, searching for "run" can also match "running" and "runs", and with a lemmatizer it can match irregular forms such as "ran".
Manticore Search supports two types of morphology processors:
- Stemmers reduce words by removing suffixes. Fast but may produce invalid words (e.g., "business" → "busi")
- Lemmatizers reduce words to valid dictionary forms (e.g., "running" → "run"). More accurate but require dictionary files
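To make the difference concrete, here is a minimal Python sketch. It is a toy model only: the suffix list and the tiny dictionary are assumptions made up for this illustration, and neither function reproduces Manticore's actual processors.

```python
# Toy illustration only: neither function reproduces Manticore's processors.

SUFFIXES = ("ness", "ing", "s")  # hypothetical rule set, checked in this order

def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical miniature dictionary mapping inflected forms to lemmas.
LEMMAS = {"running": "run", "runs": "run", "ran": "run"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMAS.get(word, word)

for w in ("business", "runs", "ran"):
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based function turns "business" into the non-word "busi" and cannot relate "ran" to "run", while the dictionary lookup handles "ran" but leaves any word missing from its dictionary unchanged.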
Let's connect to Manticore and try a basic example:
mysql -P9306 -h0
Create a table with English stemming enabled:
CREATE TABLE products(title text, description text, price float) morphology='stem_en';
Insert some documents:
INSERT INTO products(title, description, price) VALUES('Running shoes', 'Best shoes for runners and jogging', 29.99);
INSERT INTO products(title, description, price) VALUES('Programming books', 'A collection of books for developers', 15.00);
INSERT INTO products(title, description, price) VALUES('Garden tools', 'Professional gardening equipment', 45.50);
Now search for "run". It also matches "running", which stems to "run" (note that "runners" stems to "runner", not "run"):
SELECT * FROM products WHERE MATCH('run');
Search for "book" — it will match "books":
SELECT * FROM products WHERE MATCH('book');
Search for "garden" — it will match "gardening":
SELECT * FROM products WHERE MATCH('garden');
Use CALL KEYWORDS to see how the stemmer normalizes words:
CALL KEYWORDS('running books gardening', 'products');
Notice how each word is reduced to its stem form.
Stemmers for Different Languages
Manticore provides built-in stemmers for English, Russian, Czech, and Arabic. The Snowball (libstemmer) library adds stemmers for 22 languages, listed at the end of this section.
Built-in stemmers:
- stem_en — English (Porter's stemmer)
- stem_ru — Russian
- stem_enru — English and Russian combined
- stem_cz — Czech
- stem_ar — Arabic
Let's create a table with the Russian stemmer:
DROP TABLE IF EXISTS products;
CREATE TABLE products(title text, description text, price float) morphology='stem_ru';
INSERT INTO products(title, description, price) VALUES('Беговые кроссовки', 'Лучшие кроссовки для бегунов', 29.99);
INSERT INTO products(title, description, price) VALUES('Книги по программированию', 'Коллекция книг для разработчиков', 15.00);
Search for "книга" (singular) — it will match "книги" and "книг" (different forms):
SELECT * FROM products WHERE MATCH('книга');
Search for "кроссовка" — it will match "кроссовки" (plural form):
SELECT * FROM products WHERE MATCH('кроссовка');
CALL KEYWORDS('кроссовки кроссовка книги книга', 'products');
For other languages, use libstemmer. Let's try German:
DROP TABLE IF EXISTS products;
CREATE TABLE products(title text, description text, price float) morphology='libstemmer_de';
INSERT INTO products(title, description, price) VALUES('Laufschuhe', 'Die besten Schuhe zum Laufen', 29.99);
INSERT INTO products(title, description, price) VALUES('Bücher', 'Eine Sammlung von Büchern für Entwickler', 15.00);
CALL KEYWORDS('Laufschuhe Büchern Entwickler', 'products');
Available libstemmer languages: ca, da, nl, en, fi, fr, de, el, hi, hu, id, ga, it, lt, no, pt, ro, ru, es, sv, ta, tr.
Combining Multiple Morphology Processors
You can combine several morphology processors for multilingual content. Processors are applied in order and processing stops once a processor modifies the word.
Create a table with English and Russian stemmers:
DROP TABLE IF EXISTS products;
CREATE TABLE products(title text, description text, code text, price float) morphology='stem_en, stem_ru';
INSERT INTO products(title, description, code, price) VALUES('Running shoes', 'Лучшие кроссовки для бегунов', 'SHOE-001', 29.99);
INSERT INTO products(title, description, code, price) VALUES('Programming books', 'Книги по программированию', 'BOOK-002', 15.00);
Both English and Russian searches work:
SELECT * FROM products WHERE MATCH('run');
SELECT * FROM products WHERE MATCH('бегун');
CALL KEYWORDS('running бегунов books книги', 'products');
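The "apply in order, stop at the first change" rule described above can be sketched in Python. The two stemmer functions are hypothetical stand-ins with a handful of made-up rules; only the chaining logic in apply_chain is the point of the example.

```python
def stem_en_toy(word):
    # Hypothetical stand-in for an English stemmer: handles ASCII words only.
    if word.isascii() and word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.isascii() and word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def stem_ru_toy(word):
    # Hypothetical stand-in for a Russian stemmer: strips a few endings.
    for ending in ("ов", "и", "а"):
        if word.endswith(ending) and len(word) > len(ending) + 2:
            return word[: -len(ending)]
    return word

def apply_chain(word, processors):
    """Try each processor in order; stop at the first one that changes the word."""
    for process in processors:
        result = process(word)
        if result != word:
            return result
    return word

chain = [stem_en_toy, stem_ru_toy]
for w in ("books", "бегунов", "книги", "usb"):
    print(w, "->", apply_chain(w, chain))
```

In this sketch "books" is changed by the first processor, so the second never sees it. The Russian words pass through the first processor untouched and are handled by the second, and "usb" is left as-is by both.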
morphology_skip_fields
Sometimes you don't want morphology applied to certain fields. For example, product codes should stay as-is. Use morphology_skip_fields:
DROP TABLE IF EXISTS products;
CREATE TABLE products(title text, description text, code text, price float) morphology='stem_en' morphology_skip_fields='code';
INSERT INTO products(title, description, code, price) VALUES('Running shoes', 'Best shoes for runners', 'SHOES-RUN', 29.99);
Compare how the "code" field is processed versus "title":
CALL KEYWORDS('SHOES-RUN', 'products');
CALL KEYWORDS('running', 'products');
A search in the code field uses the exact form:
SELECT * FROM products WHERE MATCH('@code SHOES-RUN');
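Conceptually, skipping a field means its tokens are indexed without passing through the stemmer. A rough Python model of that idea follows. The one-rule stemmer, the tokenizer that splits on hyphens and spaces, and the index_document helper are all assumptions of this sketch, not Manticore internals.

```python
def toy_stem(word):
    # Hypothetical stemmer: strips a single trailing "s" from longer words.
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def index_document(doc, skip_fields):
    """Return {field: [indexed tokens]}, stemming every field not in skip_fields."""
    indexed = {}
    for field, text in doc.items():
        # Simplified tokenizer (an assumption): lowercase, split on "-" and spaces.
        tokens = text.lower().replace("-", " ").split()
        if field not in skip_fields:
            tokens = [toy_stem(t) for t in tokens]
        indexed[field] = tokens
    return indexed

doc = {"title": "Running shoes", "code": "SHOES-RUN"}
print(index_document(doc, skip_fields={"code"}))
```

With "code" skipped, "shoes" is stored unchanged in that field but becomes "shoe" in the title. With an empty skip set, both fields would store "shoe".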
Advanced Morphology Options
min_stemming_len
Short words can sometimes be stemmed incorrectly. For example, "gps" might become "gp". Use min_stemming_len to prevent stemming of short words:
DROP TABLE IF EXISTS products;
CREATE TABLE products(title text, price float) morphology='stem_en' min_stemming_len='4';
INSERT INTO products(title, price) VALUES('GPS tracker for runners', 10.00);
CALL KEYWORDS('GPS runners', 'products');
Notice that "GPS" (3 chars) is not stemmed, but "runners" (7 chars) is reduced to "runner".
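The length check can be pictured as a guard in front of the stemmer. The following Python sketch uses a hypothetical one-rule stemmer (strip a trailing "s"), which happens to agree with the two words in this example; it is not the Porter algorithm.

```python
def toy_stem(word):
    # Hypothetical stemmer: strips a single trailing "s".
    return word[:-1] if word.endswith("s") else word

def stem_with_min_len(word, min_stemming_len=1):
    """Stem only words that are at least min_stemming_len characters long."""
    if len(word) < min_stemming_len:
        return word
    return toy_stem(word)

for w in ("gps", "runners"):
    print(w, "->", stem_with_min_len(w), "|", stem_with_min_len(w, min_stemming_len=4))
```

With the threshold at 1, "gps" becomes "gp". With the threshold at 4, "gps" is left alone while "runners" is still reduced to "runner".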
index_exact_words
When enabled, Manticore indexes both the original and the stemmed form of each word. This allows using the exact form operator = in queries:
DROP TABLE IF EXISTS products;
CREATE TABLE products(title text, price float) morphology='stem_en' index_exact_words='1';
INSERT INTO products(title, price) VALUES('Running in the park', 10.00);
INSERT INTO products(title, price) VALUES('Run the test suite', 20.00);
Without the exact form operator, both documents match:
SELECT * FROM products WHERE MATCH('running');
With the exact form operator =, only the exact word matches:
SELECT * FROM products WHERE MATCH('=running');
SELECT * FROM products WHERE MATCH('=run');
This is useful when you need both broad (stemmed) and precise (exact) search capabilities.
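One way to picture this is an inverted index that stores each word twice: once under its stem and once under an exact-form key. The Python sketch below is a toy model; the "=" key prefix, the stemmer rule, and the helper names are assumptions of the sketch, not a description of Manticore's storage format.

```python
from collections import defaultdict

def toy_stem(word):
    # Hypothetical stemmer: strips "ing" and undoubles a final consonant.
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        return stem[:-1] if stem[-1] == stem[-2] else stem
    return word

def build_index(docs, index_exact_words):
    """Map each stem (and, optionally, each '='-prefixed exact form) to doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[toy_stem(token)].add(doc_id)
            if index_exact_words:
                index["=" + token].add(doc_id)
    return index

def search(index, term):
    """Look up '=word' by exact form, anything else by its stem."""
    if term.startswith("="):
        key = "=" + term[1:].lower()
    else:
        key = toy_stem(term.lower())
    return sorted(index.get(key, set()))

docs = {1: "Running in the park", 2: "Run the test suite"}
index = build_index(docs, index_exact_words=True)
print(search(index, "running"))   # [1, 2]
print(search(index, "=running"))  # [1]
print(search(index, "=run"))      # [2]
```

The cost of the extra precision is visible in the model: every token produces two index entries instead of one.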
Phonetic algorithms
Manticore also supports phonetic processors for sound-based matching:
DROP TABLE IF EXISTS products;
CREATE TABLE products(title text, price float) morphology='metaphone';
INSERT INTO products(title, price) VALUES('Smith Electronics', 10.00);
INSERT INTO products(title, price) VALUES('Smyth Solutions', 20.00);
SELECT * FROM products WHERE MATCH('Smith');
Both "Smith" and "Smyth" match because they sound alike.
CALL KEYWORDS('Smith Smyth', 'products');
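To see why sound-alike spellings collapse to one key, here is a short Python sketch of Soundex, a simpler and older phonetic algorithm. It is swapped in for illustration because it fits in a few lines. It is not Metaphone, and the keys it produces will differ from what CALL KEYWORDS shows for a metaphone table.

```python
# Classic American Soundex: consonants with similar sounds share a digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the 4-character Soundex key: first letter plus three digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate repeated codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return (result + "000")[:4]

print(soundex("Smith"), soundex("Smyth"))  # S530 S530
```

Because both spellings map to the same key, a query normalized the same way matches either one.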
CALL KEYWORDS: Debugging Morphology
CALL KEYWORDS is the essential tool for understanding how morphology processes your text. It shows the tokenized and normalized forms and can also return document statistics.
DROP TABLE IF EXISTS products;
CREATE TABLE products(title text, price float) morphology='stem_en';
INSERT INTO products(title, price) VALUES('Running shoes for professional runners', 10.00);
INSERT INTO products(title, price) VALUES('Books about programming and development', 20.00);
INSERT INTO products(title, price) VALUES('Running a small business', 30.00);
Basic CALL KEYWORDS:
CALL KEYWORDS('running professionally businesses', 'products');
With document statistics (1 as stats):
CALL KEYWORDS('running shoes books', 'products', 1 as stats);
The docs column shows how many documents contain the keyword, and hits shows the total number of occurrences.
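The difference between docs and hits is easiest to see when a keyword occurs more than once in a document. The Python sketch below computes both counts over a small made-up document list. The documents, the stemmer rules, and the keyword_stats helper are hypothetical and exist only for this illustration.

```python
def toy_stem(word):
    # Hypothetical stemmer: strips "ing" (undoubling a final consonant) or "s".
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        return stem[:-1] if stem[-1] == stem[-2] else stem
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def keyword_stats(keyword, documents):
    """Count documents containing the normalized keyword (docs) and total occurrences (hits)."""
    stem = toy_stem(keyword.lower())
    docs = hits = 0
    for text in documents:
        count = sum(1 for t in text.lower().split() if toy_stem(t) == stem)
        if count:
            docs += 1
        hits += count
    return {"normalized": stem, "docs": docs, "hits": hits}

documents = [
    "running shoes and running socks",
    "books about running",
    "garden tools",
]
print(keyword_stats("running", documents))
```

Here the keyword appears in two documents but three times in total, so docs is 2 and hits is 3.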
Comparing different processors
You can create tables with different morphology settings and compare how they process the same text:
DROP TABLE IF EXISTS t_stem;
DROP TABLE IF EXISTS t_meta;
CREATE TABLE t_stem(title text) morphology='stem_en';
CREATE TABLE t_meta(title text) morphology='metaphone';
CALL KEYWORDS('running development Smith', 't_stem');
CALL KEYWORDS('running development Smith', 't_meta');
Notice how stem_en reduces "running" to "run" and "development" to "develop", while metaphone converts them to phonetic codes.
Summary of morphology options
- morphology - list of processors (e.g. 'stem_en, stem_ru')
- morphology_skip_fields - fields to exclude from morphology
- min_stemming_len - minimum word length for stemming
- index_exact_words - index both original and stemmed forms
For full documentation, see Morphology.