
Morphology and Lemmatization in Manticore Search

Difficulty: Beginner
Estimated Time: 15 minutes

Morphology and Lemmatization

In this course, you will learn how to use morphology processors in Manticore Search to improve full-text search by normalizing word forms.

Introduction to Morphology

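Morphology processing lets a search engine normalize words to a base form, so that different forms of the same word match each other. For example, a search for "run" can also match "running" and "runs" and, with a lemmatizer, the irregular form "ran".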

Manticore Search supports two types of morphology processors:

  • Stemmers reduce words by stripping suffixes according to fixed rules. They are fast but may produce stems that are not real words (e.g., "business" → "busi").
  • Lemmatizers reduce words to valid dictionary forms (e.g., "running" → "run"). They are more accurate but require dictionary files.
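
The difference between the two approaches can be illustrated with a toy sketch in Python. It is not how Manticore implements stem_en or its lemmatizers: the suffix list, the consonant-undoubling rule, and the tiny LEMMAS dictionary are assumptions made up for this illustration.

```python
# Toy contrast between rule-based stemming and dictionary-based lemmatization.
# All rules and data here are illustrative assumptions, not Manticore internals.

SUFFIXES = ("ness", "ing", "s")  # hypothetical suffix list, checked in order

# Hypothetical mini-dictionary mapping word forms to their lemma.
LEMMAS = {"running": "run", "runs": "run", "ran": "run"}


def toy_stem(word):
    """Strip the first matching suffix; no dictionary is consulted."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            stem = w[: -len(suffix)]
            # Collapse a doubled final consonant: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return w


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are left unchanged."""
    w = word.lower()
    return LEMMAS.get(w, w)


for word in ("business", "running", "ran"):
    print(word, "->", toy_stem(word), "/", toy_lemmatize(word))
```

The rule-based function turns "business" into the non-word "busi" and cannot relate "ran" to "run". The dictionary lookup handles the irregular form, but only for words it has been given.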

Let's connect to Manticore and try a basic example:

mysql -P9306 -h0

Create a table with English stemming enabled:

CREATE TABLE products(title text, description text, price float) morphology='stem_en';

Insert some documents:

INSERT INTO products(title, description, price) VALUES('Running shoes', 'Best shoes for runners and jogging', 29.99);

INSERT INTO products(title, description, price) VALUES('Programming books', 'A collection of books for developers', 15.00);

INSERT INTO products(title, description, price) VALUES('Garden tools', 'Professional gardening equipment', 45.50);

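Now search for "run": it matches the first document because "Running" in the title is stemmed to "run" (note that "runners" is stemmed to "runner", which is a different stem):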

SELECT * FROM products WHERE MATCH('run');

Search for "book" — it will match "books":

SELECT * FROM products WHERE MATCH('book');

Search for "garden" — it will match "gardening":

SELECT * FROM products WHERE MATCH('garden');

Use CALL KEYWORDS to see how the stemmer normalizes words:

CALL KEYWORDS('running books gardening', 'products');

Notice how each word is reduced to its stem form.

Stemmers for Different Languages

Manticore provides built-in stemmers for English, Russian, Czech, and Arabic. The Snowball (libstemmer) library adds stemmers for 22 languages, including alternative stemmers for English and Russian.

Built-in stemmers:

  • stem_en — English (Porter's stemmer)
  • stem_ru — Russian
  • stem_enru — English and Russian combined
  • stem_cz — Czech
  • stem_ar — Arabic

Let's create a table with the Russian stemmer:

DROP TABLE IF EXISTS products;

CREATE TABLE products(title text, description text, price float) morphology='stem_ru';

INSERT INTO products(title, description, price) VALUES('Беговые кроссовки', 'Лучшие кроссовки для бегунов', 29.99);

INSERT INTO products(title, description, price) VALUES('Книги по программированию', 'Коллекция книг для разработчиков', 15.00);

Search for "книга" (singular) — it will match "книги" and "книг" (different forms):

SELECT * FROM products WHERE MATCH('книга');

Search for "кроссовка" — it will match "кроссовки" (plural form):

SELECT * FROM products WHERE MATCH('кроссовка');

Use CALL KEYWORDS to see the stems the Russian stemmer produces:

CALL KEYWORDS('кроссовки кроссовка книги книга', 'products');

For other languages, use libstemmer. Let's try German:

DROP TABLE IF EXISTS products;

CREATE TABLE products(title text, description text, price float) morphology='libstemmer_de';

INSERT INTO products(title, description, price) VALUES('Laufschuhe', 'Die besten Schuhe zum Laufen', 29.99);

INSERT INTO products(title, description, price) VALUES('Bücher', 'Eine Sammlung von Büchern für Entwickler', 15.00);

CALL KEYWORDS('Laufschuhe Büchern Entwickler', 'products');

Available libstemmer languages: ca, da, nl, en, fi, fr, de, el, hi, hu, id, ga, it, lt, no, pt, ro, ru, es, sv, ta, tr.

Combining Multiple Morphology Processors

You can combine several morphology processors for multilingual content. Processors are applied in order and processing stops once a processor modifies the word.
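
The "first processor that modifies the word wins" rule can be sketched as a toy model in Python. The two processors below are hypothetical single-rule stand-ins, not the real stem_en and stem_ru algorithms.

```python
# Toy model of a morphology chain: processors run in order and the first one
# that changes the word wins. The two processors are hypothetical stand-ins.

def toy_english(word):
    """Hypothetical English rule: drop a trailing "s" from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def toy_russian(word):
    """Hypothetical Russian rule: drop the ending "ов" from longer words."""
    return word[:-2] if word.endswith("ов") and len(word) > 4 else word


def apply_chain(word, processors):
    for process in processors:
        result = process(word)
        if result != word:
            return result  # stop at the first processor that modified the word
    return word  # no processor changed the word; keep it as is


chain = [toy_english, toy_russian]
for word in ("books", "бегунов", "gps"):
    print(word, "->", apply_chain(word, chain))
```

In this sketch an English word is handled by the first processor and never reaches the second, while a Russian word passes through the first processor unchanged and is handled by the second.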

Create a table with English and Russian stemmers:

DROP TABLE IF EXISTS products;

CREATE TABLE products(title text, description text, code text, price float) morphology='stem_en, stem_ru';

INSERT INTO products(title, description, code, price) VALUES('Running shoes', 'Лучшие кроссовки для бегунов', 'SHOE-001', 29.99);

INSERT INTO products(title, description, code, price) VALUES('Programming books', 'Книги по программированию', 'BOOK-002', 15.00);

Both English and Russian searches work:

SELECT * FROM products WHERE MATCH('run');

SELECT * FROM products WHERE MATCH('бегун');

CALL KEYWORDS('running бегунов books книги', 'products');

morphology_skip_fields

Sometimes you don't want morphology applied to certain fields. For example, product codes should stay as-is. Use morphology_skip_fields:

DROP TABLE IF EXISTS products;

CREATE TABLE products(title text, description text, code text, price float) morphology='stem_en' morphology_skip_fields='code';

INSERT INTO products(title, description, code, price) VALUES('Running shoes', 'Best shoes for runners', 'SHOES-RUN', 29.99);

Compare how the "code" field is processed versus "title":

CALL KEYWORDS('SHOES-RUN', 'products');

CALL KEYWORDS('running', 'products');

A search in the code field uses the exact form:

SELECT * FROM products WHERE MATCH('@code SHOES-RUN');
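
As a rough mental model, skipping morphology for a field means its tokens are indexed without the stemming step. The Python sketch below illustrates that idea only; the TOY_STEMS table and the tokenization rule are assumptions, not Manticore's tokenizer.

```python
import re

# Hypothetical stem table standing in for a real stemmer.
TOY_STEMS = {"running": "run", "shoes": "shoe", "runners": "runner"}


def toy_stem(token):
    return TOY_STEMS.get(token, token)


def index_tokens(doc, skip_fields=()):
    """Return the tokens that would be indexed for each field of one document."""
    indexed = {}
    for field, text in doc.items():
        # Assumption: lowercase the text and split on non-word characters.
        tokens = re.findall(r"\w+", text.lower())
        if field not in skip_fields:
            tokens = [toy_stem(t) for t in tokens]  # morphology applies here only
        indexed[field] = tokens
    return indexed


doc = {"title": "Running shoes", "code": "SHOES-RUN"}
print(index_tokens(doc, skip_fields=("code",)))
```

In this sketch the title tokens are stored as "run" and "shoe", while the code tokens stay as "shoes" and "run".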

Advanced Morphology Options

min_stemming_len

Short words can sometimes be stemmed incorrectly. For example, "gps" might become "gp". Use min_stemming_len to prevent stemming of short words:

DROP TABLE IF EXISTS products;

CREATE TABLE products(title text, price float) morphology='stem_en' min_stemming_len='4';

INSERT INTO products(title, price) VALUES('GPS tracker for runners', 10.00);

CALL KEYWORDS('GPS runners', 'products');

Notice that "GPS" (3 chars) is not stemmed, but "runners" (7 chars) is reduced to "runner".
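
The length check can be pictured as a simple gate in front of the stemmer. This Python sketch uses a hypothetical one-rule stemmer and assumes that a word is stemmed when its length is at least min_stemming_len.

```python
def toy_stem(word):
    """Hypothetical rule: drop a trailing "s"."""
    return word[:-1] if word.endswith("s") else word


def stem_with_min_len(word, min_stemming_len=1):
    # Assumption: words at least min_stemming_len characters long are stemmed.
    w = word.lower()
    return toy_stem(w) if len(w) >= min_stemming_len else w


for limit in (1, 4):
    print(limit, [stem_with_min_len(w, limit) for w in ("GPS", "runners")])
```

With a limit of 1 the toy stemmer damages "gps"; with a limit of 4 it leaves "gps" alone and still reduces "runners" to "runner".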

index_exact_words

When enabled, Manticore indexes both the original and the stemmed form of each word. This allows using the exact form operator = in queries:

DROP TABLE IF EXISTS products;

CREATE TABLE products(title text, price float) morphology='stem_en' index_exact_words='1';

INSERT INTO products(title, price) VALUES('Running in the park', 10.00);

INSERT INTO products(title, price) VALUES('Run the test suite', 20.00);

Without the exact form operator, both documents match:

SELECT * FROM products WHERE MATCH('running');

With the exact form operator =, only the exact word matches:

SELECT * FROM products WHERE MATCH('=running');

SELECT * FROM products WHERE MATCH('=run');

This is useful when you need both broad (stemmed) and precise (exact) search capabilities.
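
A minimal Python sketch of the idea: each token is indexed twice, once stemmed and once in exact form. The "=" marker used as a dictionary key and the one-entry stem table are assumptions for illustration, not a description of Manticore's storage format.

```python
from collections import defaultdict

TOY_STEMS = {"running": "run"}  # hypothetical stand-in for stem_en


def toy_stem(token):
    return TOY_STEMS.get(token, token)


def build_index(docs):
    """Index every token twice: stemmed, and exact with a "=" marker."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[toy_stem(token)].add(doc_id)
            index["=" + token].add(doc_id)  # the marker format is an assumption
    return index


def search(index, keyword):
    keyword = keyword.lower()
    if keyword.startswith("="):
        return sorted(index.get(keyword, set()))  # exact form only
    return sorted(index.get(toy_stem(keyword), set()))  # stemmed form


docs = {1: "Running in the park", 2: "Run the test suite"}
index = build_index(docs)
for query in ("running", "=running", "=run"):
    print(query, "->", search(index, query))
```

The cost of this design is a larger index, since every word form is stored in addition to its stem.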

Phonetic algorithms

Manticore also supports phonetic processors for sound-based matching:

DROP TABLE IF EXISTS products;

CREATE TABLE products(title text, price float) morphology='metaphone';

INSERT INTO products(title, price) VALUES('Smith Electronics', 10.00);

INSERT INTO products(title, price) VALUES('Smyth Solutions', 20.00);

SELECT * FROM products WHERE MATCH('Smith');

Both "Smith" and "Smyth" match because they sound alike.

CALL KEYWORDS('Smith Smyth', 'products');
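
To show how sound-based codes make different spellings collide, here is a short Python sketch. Metaphone is too long to reproduce here, so the sketch swaps in the simpler classic American Soundex code instead. It only illustrates the principle, and the codes it prints are not the ones CALL KEYWORDS returns for a metaphone table.

```python
# Classic American Soundex, used here only as a stand-in for metaphone.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so the same code may repeat after it
    return (out + "000")[:4]  # pad with zeros and cut to four characters


for name in ("Smith", "Smyth", "Robert"):
    print(name, "->", soundex(name))
```

"Smith" and "Smyth" receive the same code, so an index keyed on the code treats them as the same keyword.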

CALL KEYWORDS: Debugging Morphology

CALL KEYWORDS is the essential tool for understanding how morphology processes your text. It shows the tokenized and normalized forms and can also return document statistics.

DROP TABLE IF EXISTS products;

CREATE TABLE products(title text, price float) morphology='stem_en';

INSERT INTO products(title, price) VALUES('Running shoes for professional runners', 10.00);

INSERT INTO products(title, price) VALUES('Books about programming and development', 20.00);

INSERT INTO products(title, price) VALUES('Running a small business', 30.00);

Basic CALL KEYWORDS:

CALL KEYWORDS('running professionally businesses', 'products');

With document statistics (1 as stats):

CALL KEYWORDS('running shoes books', 'products', 1 as stats);

The docs column shows how many documents contain the keyword, and hits shows the total number of occurrences.
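
The difference between the two counters is easiest to see on a document that repeats a word. The Python sketch below computes both numbers for a few made-up texts; the sample data is hypothetical and no stemming is applied.

```python
from collections import Counter


def keyword_stats(texts, keyword):
    """Return (docs, hits) for one keyword over a list of texts."""
    keyword = keyword.lower()
    # Occurrences of the keyword in each text.
    per_doc = [Counter(text.lower().split())[keyword] for text in texts]
    docs_count = sum(1 for n in per_doc if n > 0)  # texts containing the keyword
    hits_count = sum(per_doc)  # total occurrences across all texts
    return docs_count, hits_count


sample = ["run run walk", "run fast", "walk slowly"]  # hypothetical documents
print("run:", keyword_stats(sample, "run"))
print("walk:", keyword_stats(sample, "walk"))
```

Here "run" appears in two texts but three times in total, so its hits count is higher than its docs count.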

Comparing different processors

You can create tables with different morphology settings and compare how they process the same text:

DROP TABLE IF EXISTS t_stem;

DROP TABLE IF EXISTS t_meta;

CREATE TABLE t_stem(title text) morphology='stem_en';

CREATE TABLE t_meta(title text) morphology='metaphone';

CALL KEYWORDS('running development Smith', 't_stem');

CALL KEYWORDS('running development Smith', 't_meta');

Notice how stem_en reduces "running" to "run" and "development" to "develop", while metaphone converts them to phonetic codes.

Summary of morphology options

morphology             - list of processors (e.g. 'stem_en, stem_ru')
morphology_skip_fields - fields to exclude from morphology
min_stemming_len       - minimum word length for stemming
index_exact_words      - index both original and stemmed forms

For full documentation, see Morphology.