Terminal Terminal | Web Web
Home  //  Play

Manticore Search - Main+delta index

Difficulty: Beginner
Estimated Time: 15 minutes

Manticore Search - Main+delta indexing schema

In this tutorial you will learn how to create a main+delta indexing schema for offline indexes.

You should be already familiar with a real-time index (watch this course https://play.manticoresearch.com/rtintro/ if you're not) which allows you to insert documents to an index and they become searchable immediately.

On the contrary to that, an offline (also called "plain") index gets data only once when it's created using a defined source. This allows a quick search implementation as the indexing part is decoupled from the application code and the only thing needed to be implemented in an application is search functionality. It's especially useful when you already have your data in an external storage: MySQL, PostgreSQL, other database, CSV, TSV, XML, JSON other text file etc. But the only data available for search is the data indexed at indexing time. Any changes in the place from where you indexed the data after that are not reflected in the index. In order to have latest changes in the source data you need to rebuild the index. Usually a cron job or a scheduled task is used to refresh the index at regular intervals. The problem comes when reindexing starts taking long: seconds, minutes, hours. Finally you don't want to wait so long for a new document to become searchable.

This can be solved by using a secondary index that will capture changes made between the indexing time of the primary index and current time. This index is called a delta index and it's also a regular offline index, and it normally has same settings as the primary (called "main") one. This schema is called "main+delta". In this course you will learn how to deal with that.

Manticore Search - Main+delta index

Step 1 of 5

Initial setup

For this course we are going to use a MySQL source (https://docs.manticoresearch.com/latest/html/indexing/sql_data_sources_mysql,_postgresql.html?highlight=mysql#sql-data-sources-mysql-postgresql)

Beside the table holding documents we are going to index we are also going to use a helper table which will keep track of the moments of the main and delta indexing times.

The delta index must capture any changes in the documents table:

  • new documents
  • modified documents
  • deleted documents

For the first two it's enough to have a timestamp column in the documents table. For properly syncing the deleted documents from the table to the index we'll use an additional column, telling us whether the document is marked as deleted. It means the application must not delete a document, but should mark it as deleted (otherwise it will be impossible to figure out that the doc is not available any more and it will remain in the index until it's fully rebuilt). If it's a problem in your case - consider real time indexes instead, they're much more suited to cases when the data changes intensively. Also the documents we want to make searchable will be filtered by the "deleted" flag (we want only those that are not marked as deleted).

Here's an example of the main table from which we want to build a Manticore Search index:

CREATE TABLE documents (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  title VARCHAR(50) NOT NULL,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  deleted TINYINT UNSIGNED NOT NULL,
  PRIMARY KEY  (id)
);
Note the updated_at and deleted columns that are important in our case.

The additional helper table will have 2 columns: one for index name, another for a timestamp:

CREATE TABLE deltabreaker (
 index_name VARCHAR(50) NOT NULL,
 created_at TIMESTAMP NOT NULL  DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
 PRIMARY KEY (index_name)
 );

Preparing the Main Index

What's next? We could do just

SELECT * FROM documents
in the source config, but it's not good as it misses some important things:
  • we don't want deleted documents in the main index, so we want to add "WHERE deleted=0"
  • we want to make sure we know the time until which we can say that all the documents before that time are for sure in the index. To do that even before we fetch anything from the table we can save the current time in the helper table (note sql_query_pre before) and then use that time in the main query (see sql_query)
  • it's possible without sql_query_range (which is primarily used for splitting too big query into multiple smaller ones, read the doc https://docs.manticoresearch.com/latest/html/conf_options_reference/data_source_configuration_options.html#sql-query-range for details), but sql_query_range mostly doesn't hurt and makes things easier

So what we have in the end is:

    sql_query_pre = REPLACE INTO deltabreaker SET index_name = 'main', created_at = NOW()
    sql_query =  SELECT id, title, UNIX_TIMESTAMP(updated_at) AS updated FROM documents WHERE deleted=0 AND  updated_at  >=FROM_UNIXTIME($start) AND updated_at  <= FROM_UNIXTIME($end)
    sql_query_range  = SELECT ( SELECT UNIX_TIMESTAMP(MIN(updated_at)) FROM documents) min, ( SELECT UNIX_TIMESTAMP(created_at)-1 FROM deltabreaker WHERE index_name='main') max
    sql_query_post_index = REPLACE INTO deltabreaker set index_name = 'delta', created_at = (SELECT created_at FROM deltabreaker t WHERE index_name='main')

The pre query will mark the time of indexing. All documents having at most the timestamp of current time will be indexed by the Main index. In the post index query we're changing the timestamp of the delta index.

Let's build the Main index:

indexer main

Enter Manticore:

mysql -P9306 -h0

Get the new index online:

RELOAD INDEXES;

And check the main index:

SELECT * FROM main;

We see we have 9 documents.

Making changes in the source

First, let's exit Manticore and go to MySQL

exit;

mysql test -u user --password=pass123

And check the documents table:

SELECT * FROM documents;

We see we have 9, same as in the Manticore index.

Let's perform some changes: we are going to add a new row, modify an existing one and mark one as deleted:

INSERT INTO documents VALUES(10,'Document Ten',NOW(),0);

UPDATE documents SET title='Altered document Six' WHERE id=6;

UPDATE documents SET deleted=1 WHERE id=5;exit;

Creating the Delta index

The source of the delta will inherit the Main source (Note "source delta : main") with a few overrides:

source delta : main
{
    sql_query_pre =
    sql_query_range = SELECT ( SELECT UNIX_TIMESTAMP(created_at) FROM deltabreaker WHERE index_name='delta') min, UNIX_TIMESTAMP() max
    sql_query_killlist = SELECT id FROM documents WHERE updated_at >=  (SELECT created_at FROM deltabreaker WHERE index_name='delta')
}

The 'sql_query_pre=' means to suppress the inherited pre query from main, as we don't want to perform any update to the helper table (as we already prepared all for the delta index in sql_query_post_index in the Main source, see step 2). We see the range will get rows with the timestamp between the marked time for the delta and current time. You may ask yourself why 2 entries are used in the helper as right now they will hold the same timestamp. The reason is that for the Main index we set a timestamp before it starts indexing. In case the Main indexation failed the next run of the Delta would use the new timestamp as a starting point, missing changes between last good indexing and the failed one. That's why we have a separate entry for the delta index and populate it after the Main index is successfully built (remember sql_query_post_index - https://docs.manticoresearch.com/latest/html/conf_options_reference/data_source_configuration_options.html#sql-query-post-index).

The delta index configuration also needs to include the killlist_target, which will tell to use the killlist to suppress affected ids in the Main index:

index delta
{
    path = /var/lib/manticore/delta
    source = delta
    killlist_target = main:kl
}

Read more about killlist_target - https://docs.manticoresearch.com/latest/html/conf_options_reference/index_configuration_options.html#killlist-target

Let's now index the Delta:

indexer delta

Enter Manticore, reload the indexes:

mysql -P9306 -h0

RELOAD INDEXES;

All together

Let's look at the Delta index:

SELECT * FROM delta;

The Delta includes the new and updated documents.

And the Main:

SELECT * FROM main;

We see in the Main index both the changed and deleted documents are missing.

To search on both indexes we can use a local distributed index over the both:

index myindex
{
    type = distributed
    local = main
    local = delta
}

SELECT * FROM myindex;

Searching on the both give us the new and updated documents as well as not showing the deleted document (due to the sql_query_killlist and killlist_target, see previous step).