Terminal Terminal | Web Web
Home  //  Play

Manticore CSV Import Example

Difficulty: Beginner
Estimated Time: 10 minutes

Manticoresearch - Manticore CSV import

In this tutorial you will learn how to import in Manticore a CSV file.

The enviroment will launch a small docker instance with Manticore Search.

The dataset used in this scenario is https://data.world/data-society/imdb-5000-movie-dataset.

Manticore CSV Import Example

Step 1 of 3

Importing from CSV

One great thing about Manticore is that it can grab data from several sources in an easy way.

One of the data sources can be a CSV or TSV file. There are mainly 2 things we need to do in order to import a CSV:

  1. A small preparation of the CSV file.

Manticore requires the first column to represent the document id, which needs to be an unique integer number. If our CSV doesn't have this, we can add it easy with an awk command:

awk -v OFS=, 'NR ==1 {print "ID", $0; next} {print (NR-1), $0}' movie_metadata.csv > movie.csv

Manticore also requires to have no header in the CSV. In our sample CSV we have header, so we're going to remove it:

sed -i '1d' movie.csv

  1. Create source and index

In the source we define the source type, in this case 'csvpipe'. The data is read from the output of the 'csvpipe_command'. In our case we simply read a static file, but it can be very well a script written in PHP or Java that builds the data by getting it from a database/warehouse software, files or even an external API service.

Next we declare the each time of the CSV columns. Note that the first column - the unique id - is not declared as it's implicit.

    source movies
    {
      type = csvpipe
      csvpipe_command = cat /tutorial/movie.csv
      csvpipe_attr_string = color
      csvpipe_field_string = director_name
      csvpipe_attr_uint = num_critic_for_reviews
      csvpipe_attr_uint = duration
      csvpipe_attr_uint = director_facebook_likes
      csvpipe_attr_uint = actor_3_facebook_likes
      csvpipe_field_string = actor_2_name
      csvpipe_attr_uint = actor_1_facebook_likes
      csvpipe_attr_uint = gross
      csvpipe_field = genres
      csvpipe_field_string = actor_1_name
      csvpipe_field_string = movie_title
      csvpipe_attr_uint = num_voted_users
      csvpipe_attr_uint = cast_total_facebook_likes
      csvpipe_field_string = actor_3_name
      csvpipe_attr_uint = facenumber_in_poster
      csvpipe_field = plot_keywords
      csvpipe_attr_string = movie_imdb_link
      csvpipe_attr_uint = num_user_for_reviews
      csvpipe_attr_string = language
      csvpipe_attr_string = country
      csvpipe_attr_string = content_rating
      csvpipe_attr_uint = budget
      csvpipe_attr_uint = title_year
      csvpipe_attr_uint = actor_2_facebook_likes
      csvpipe_attr_float = imdb_score
      csvpipe_attr_float = aspect_ration
      csvpipe_attr_uint = movie_facebook_likes
    }

The index declaration is simple. We specify the location of the index and tell the index to use the above source.

index movies
 {
    type            = plain
    path            = /var/lib/manticore/data/movies
    source          = movies
 }

Indexing

Plain indexes are build with the 'indexer' tool:

indexer movies

Search daemon loads a new index if a config change is detected. In our case, the new index was already in the configuration when searchd was started. To load our index, we need to issue 'RELOAD INDEXES' command:

mysql -P9306 -h0

RELOAD INDEXES

If the index is loaded, it will appear in the list of active indexes:

SHOW TABLES;exit;

If the CSV receives new data, we need to reindex the index. As the index is already loaded, we will pass '--rotate' parameter to 'indexer' tool which will inform the daemon to pick the new index version, without daemon restart:

indexer movies --rotate

Using the new index

Now we can start running queries :

mysql -P9306 -h0

SElECT movie_title FROM movies WHERE MATCH('Tom Hanks');

SELECT * FROM movies WHERE match('Tom Hanks') LIMIT 40 FACET content_rating FACET INTERVAL(gross,0,10000000,100000000,1000000000) FACET title_year ORDER BY title_year DESC FACET imdb_score ORDER BY imdb_score DESC;