Terminal Terminal | Web Web
Home  //  Play

Manticore Search - Main+delta schema

Difficulty: Beginner
Estimated Time: 15 minutes
Note: this course contains examples of how to work with plain(not real-time) tables, which cannot be reproduced with a default Manticore configuration file. For more details about the operating modes and the corresponding configuration settings, see our Manual

Manticore Search - Main+delta indexing schema

In this tutorial you will learn how to create a main+delta indexing schema for offline tables.

You should be already familiar with a real-time table (watch this course https://play.manticoresearch.com/rtintro/ if you're not) which allows you to insert documents to a table and they become searchable immediately.

On the contrary to that, an offline (also called "plain") table gets data only once when it's created using a defined source. This allows a quick search implementation as the indexing part is decoupled from the application code and the only thing needed to be implemented in an application is search functionality. It's especially useful when you already have your data in an external storage: MySQL, PostgreSQL, other database, CSV, TSV, XML, JSON other text file etc. But the only data available for search is the data indexed at indexing time. Any changes in the place from where you indexed the data after that are not reflected in the table. In order to have latest changes in the source data you need to rebuild the table. Usually a cron job or a scheduled task is used to refresh the table at regular intervals. The problem comes when reindexing starts taking long: seconds, minutes, hours. Finally you don't want to wait so long for a new document to become searchable.

This can be solved by using a secondary table that will capture changes made between the indexing time of the primary table and current time. This table is called a delta table and it's also a regular offline table, and it normally has same settings as the primary (called "main") one. This schema is called "main+delta". In this course you will learn how to deal with that.

Manticore Search - Main+delta schema

Step 1 of 5

Initial setup

For this course we are going to use a MySQL source (https://manual.manticoresearch.com/Adding_data_from_external_storages/Fetching_from_databases/Introduction)

Beside the table holding documents we are going to index we are also going to use a helper table which will keep track of the moments of the main and delta indexing times.

The delta table must capture any changes in the documents table:

  • new documents
  • modified documents
  • deleted documents

For the first two it's enough to have a timestamp column in the documents table. For properly syncing the deleted documents from the table to the table we'll use an additional column, telling us whether the document is marked as deleted. It means the application must not delete a document, but should mark it as deleted (otherwise it will be impossible to figure out that the doc is not available any more and it will remain in the table until it's fully rebuilt). If it's a problem in your case - consider real time tables instead, they're much more suited to cases when the data changes intensively. Also the documents we want to make searchable will be filtered by the "deleted" flag (we want only those that are not marked as deleted).

Here's an example of the main table from which we want to build a Manticore Search table:

CREATE TABLE documents (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  title VARCHAR(50) NOT NULL,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  deleted TINYINT UNSIGNED NOT NULL,
  PRIMARY KEY  (id)
);
Note the updated_at and deleted columns that are important in our case.

The additional helper table will have 2 columns: one for a table name, another for a timestamp:

CREATE TABLE deltabreaker (
 index_name VARCHAR(50) NOT NULL,
 created_at TIMESTAMP NOT NULL  DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
 PRIMARY KEY (index_name)
 );

Preparing the Main Table

What's next? We could just run the following query in the source config:

SELECT * FROM documents
However, this is not good as it misses some important things:
  • we don't want deleted documents in the main table, so we want to add "WHERE deleted=0"
  • we want to make sure we know the time until which we can say that all the documents before that time are for sure in the table. To do that even before we fetch anything from the table, we can save the current time in the helper table (note sql_query_pre before) and then use that time in the main query (see sql_query)
  • it's possible without sql_query_range (which is primarily used for splitting too big query into multiple smaller ones, read the doc https://manual.manticoresearch.com/Adding_data_from_external_storages/Fetching_from_databases/Ranged_queries#sql_query_range for details), but sql_query_range mostly doesn't hurt and makes things easier

So what we have in the end is:

    sql_query_pre = REPLACE INTO deltabreaker SET index_name = 'main', created_at = NOW()
    sql_query =  SELECT id, title, UNIX_TIMESTAMP(updated_at) AS updated FROM documents WHERE deleted=0 AND  updated_at  >=FROM_UNIXTIME($start) AND updated_at  <= FROM_UNIXTIME($end)
    sql_query_range  = SELECT ( SELECT UNIX_TIMESTAMP(MIN(updated_at)) FROM documents) min, ( SELECT UNIX_TIMESTAMP(created_at)-1 FROM deltabreaker WHERE index_name='main') max
    sql_query_post_index = REPLACE INTO deltabreaker set index_name = 'delta', created_at = (SELECT created_at FROM deltabreaker t WHERE index_name='main')

The pre query will mark the time of indexing. All documents having at most the timestamp of current time will be indexed by the Main table. In the post table query we're changing the timestamp of the delta table.

Let's build the Main table. To populate it, we use the indexer tool. When using indexer, the generated table files must be made with permissions that allow searchd to access them. In case of Linux official packages searchd runs under manticore user, so indexer must also run under manticore user:

sudo -u manticore indexer main

If you are running 'searchd' differently, you might be able to omit the sudo -u manticore part, just make sure that the user under which your searchd instance is running has read/write permissions to the tables you generate.

Enter Manticore:

mysql -P9306 -h0

Get the new table online:

RELOAD TABLES;

And check the main table:

SELECT * FROM main;

We see that we have 9 documents.

Making changes in the source

First, let's exit Manticore and go to MySQL

exit;

mysql test -u user --password=pass123

And check the documents table:

SELECT * FROM documents;

We see we have 9, same as in the Manticore table.

Let's perform some changes: we are going to add a new row, modify an existing one and mark one as deleted:

INSERT INTO documents VALUES(10,'Document Ten',NOW(),0);

UPDATE documents SET title='Altered document Six' WHERE id=6;

UPDATE documents SET deleted=1 WHERE id=5;exit;

Creating the Delta table

The source of the delta will inherit the Main source (Note "source delta : main") with a few overrides:

source delta : main
{
    sql_query_pre =
    sql_query_range = SELECT ( SELECT UNIX_TIMESTAMP(created_at) FROM deltabreaker WHERE index_name='delta') min, UNIX_TIMESTAMP() max
    sql_query_killlist = SELECT id FROM documents WHERE updated_at >=  (SELECT created_at FROM deltabreaker WHERE index_name='delta')
}

The 'sql_query_pre=' means to suppress the inherited pre query from main, as we don't want to perform any update to the helper table (as we already prepared all for the delta table in sql_query_post_index in the Main source, see step 2). We see the range will get rows with the timestamp between the marked time for the delta and current time. You may ask yourself why 2 entries are used in the helper as right now they will hold the same timestamp. The reason is that for the Main table we set a timestamp before it starts indexing. In case the Main indexation failed the next run of the Delta would use the new timestamp as a starting point, missing changes between last good indexing and the failed one. That's why we have a separate entry for the delta table and populate it after the Main table is successfully built (remember sql_query_post_index - https://manual.manticoresearch.com/Adding_data_from_external_storages/Fetching_from_databases/Execution_of_fetch_queries#sql_query_post_index).

The delta table configuration also needs to include the killlist_target, which will tell to use the killlist to suppress affected ids in the Main table:

table delta
{
    path = /var/lib/manticore/delta
    source = delta
    killlist_target = main:kl
}

Read more about killlist_target - https://manual.manticoresearch.com/Creating_an_index/Local_indexes/Plain_and_real-time_index_settings#killlist_target

Let's now index the Delta. To populate the table we use the indexer tool. When using indexer, the generated table files must be made with permissions that allow searchd(Manticore Search server) to access them. In case of Linux official packages searchd runs under manticore user, so indexer must also run under manticore user:

sudo -u manticore indexer delta

If you are running searchd differently you might be able to omit the sudo -u manticore part, just make sure that the user under which your searchd instance is running has read/write permissions to the tables you generate.

Enter Manticore, reload the tables:

mysql -P9306 -h0

RELOAD TABLES;

All together

Let's look at the Delta table:

SELECT * FROM delta;

The Delta includes the new and updated documents.

And the Main:

SELECT * FROM main;

We see in the Main table both the changed and deleted documents are missing.

To search on both table we can use a local distributed table over the both:

table mytable
{
    type = distributed
    local = main
    local = delta
}

SELECT * FROM mytable;

Searching on the both give us the new and updated documents as well as not showing the deleted document (due to the sql_query_killlist and killlist_target, see previous step).