Manticore More Like This
"More like this" is a search functionality which aims to find documents that are similar to a given document. This is used to show similar posts, related news threads, products that have similar descriptions and so on. It can also be used to detect duplicate content.
At its core, "More like this" is a search using words from a document. Since the body of a document can be large and contain hundreds of words, a raw search using all of them can be quite slow. Search quality can also suffer from very common words (a, the, what, when etc.), which can influence the scoring too much.
To improve both the quality and the speed of the search, the given document is first analyzed and only some of its words are kept for the actual search. Each word receives a score and only the top N words are kept. Optionally, the words can also be filtered against a list of common words (also known as stop words).
The resulting bag of words is then run against the index. A strict search that requires every word to match may return few results, or even only the given document itself, so a quorum search can be used instead to broaden the result set (that is, to improve recall). The fuzziness of the search affects both the size of the result set and its scoring.
Ordering the result set for high precision can be tricky and may require extensive tweaking.
For this tutorial we've built a web example which allows tweaking some aspects of the "More like this" search.
The first step is to extract the words from our input and keep only the most important ones. The importance of a word is given by its TF-IDF (Term Frequency - Inverse Document Frequency, read this https://manticoresearch.com/2019/04/09/tf-idf-in-a-nutshell/ if you want to really understand it).
To compute the IDF we first need to know the number of documents in our collection. This can be easily obtained with the SHOW INDEX STATUS call:
mysql -P9306 -h0
SHOW INDEX news STATUS;
The 'indexed_documents' variable contains the number of documents in the index.
The next step is to extract the words. This is done with the 'CALL KEYWORDS' command, for which we also enable the 'stats' option because we want to know in how many documents each word appears.
Let's take an example:
CALL KEYWORDS('Google Faces New Round of Antitrust Charges in Europe', 'news', 1 AS stats);exit;
CALL KEYWORDS breaks the input string into words. If a word appears multiple times in the string, it is also present multiple times in the CALL KEYWORDS result set. We count these occurrences because they are used in the TF-IDF calculation. From the resulting list of unique words, we can optionally take out stop words.
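The counting step can be sketched in a few lines of Python. This is only an illustration, not the code behind the web example: it assumes the CALL KEYWORDS rows have already been fetched as dictionaries with the 'normalized' and 'docs' columns, and the function name count_keywords and the sample numbers in the usage note are made up.

```python
from collections import Counter


def count_keywords(rows, stop_words=frozenset()):
    """Collapse CALL KEYWORDS rows into {word: (occurrences, docs)}.

    `rows` is assumed to be a list of dicts holding the 'normalized'
    and 'docs' columns of a CALL KEYWORDS ... 1 AS stats result.
    """
    occurrences = Counter()
    docs = {}
    for row in rows:
        word = row["normalized"]
        if word in stop_words:
            continue  # optional stop word filtering
        occurrences[word] += 1          # repeated rows mean repeated words
        docs[word] = int(row["docs"])   # clients may return numbers as strings
    return {word: (occurrences[word], docs[word]) for word in occurrences}
```

For example, with made-up rows for the input "new york new rules", count_keywords(rows, stop_words={"rules"}) keeps "new" with 2 occurrences and "york" with 1, and drops "rules".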
For each word we calculate the IDF with the formula "1 + log(TOTAL_DOCS/WORD_DOCS+1)" and the TF-IDF as "square(OCCURRENCES_IN_DOCUMENT)*IDF". The word collection can now be sorted by TF-IDF score and the top N words extracted. Manticore Search allows searching a maximum of 256 words in a query at a time, so make sure your final word collection doesn't go above this limit.
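A minimal Python sketch of the scoring and the top N selection follows. It rests on several assumptions that the text does not confirm: the IDF formula is read literally with standard operator precedence, i.e. 1 + log((TOTAL_DOCS/WORD_DOCS) + 1), "log" is taken to be the natural logarithm, and "square" is taken to mean raising to the power of two. The input is assumed to be a dictionary mapping each word to an (occurrences, docs) pair, and all function names are hypothetical.

```python
import math

MAX_QUERY_WORDS = 256  # query word limit mentioned in the text


def idf(total_docs, word_docs):
    # Literal reading of "1 + log(TOTAL_DOCS/WORD_DOCS+1)"; natural log assumed.
    return 1 + math.log(total_docs / word_docs + 1)


def tf_idf(occurrences, total_docs, word_docs):
    # "square(OCCURRENCES_IN_DOCUMENT)*IDF", with square read as power of two.
    return occurrences ** 2 * idf(total_docs, word_docs)


def top_words(stats, total_docs, n):
    """Return up to n (word, score) pairs, highest TF-IDF first.

    `stats` maps word -> (occurrences, docs). Words found in zero
    documents are skipped: they cannot match and would divide by zero.
    """
    scored = [
        (word, tf_idf(occ, total_docs, docs))
        for word, (occ, docs) in stats.items()
        if docs > 0
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[: min(n, MAX_QUERY_WORDS)]
```

Capping the slice at MAX_QUERY_WORDS keeps the final word collection within the limit even if a larger N is requested.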
All the above can be seen in the PHP code that runs the web example:
With the bag of words extracted from the given input we perform a search. There are two parts here that need attention:
- How we are going to use the bag of words
- How we rank the findings
A simple way is to use a quorum operator with a fractional threshold (a float between 0.0 and 1.0) for the quorum. On top of this we can use the words' TF-IDF values as boosts for the words. This way, more important words carry a bigger weight in the calculated score.
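As a rough illustration, the boosted quorum expression could be assembled like this in Python. The helper name build_match and its parameters are hypothetical, and the sketch assumes the words are already normalized keywords that contain no characters needing escaping in the query syntax.

```python
def build_match(scored_words, quorum=0.2, field=None):
    """Build a quorum expression such as '"word1^2.5 word2^1.3"/0.2'.

    `scored_words` is a list of (word, tf_idf_score) pairs; each score
    is attached to its word as a boost. `quorum` is the fraction of
    words that must match, and `field` optionally limits the search
    to a single field.
    """
    terms = " ".join(f"{word}^{score}" for word, score in scored_words)
    expression = f'"{terms}"/{quorum}'
    return f"@{field} {expression}" if field else expression
```

For instance, build_match([("antitrust", 7.4), ("google", 4.8)], 0.2, "title") yields the string @title "antitrust^7.4 google^4.8"/0.2, which can then be placed inside MATCH().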
The second part is the ranking. In our example we use by default a custom ranker based on the 'atc' field-level factor, a complex proximity measure that can provide very good results but is also computationally intensive. Read more about 'atc' here https://manual.manticoresearch.com/Searching/Sorting_and_ranking#Field-level-ranking-factors
The resulting query looks like this:
mysql -P9306 -h0
SELECT id,title,WEIGHT() as w FROM news WHERE MATCH('@title "antitrust^7.413059037146 google^4.8103693517016 round^4.5226872792499 faces^4.2550212715046 charges^3.803790473383 europe^3.7751520046429 new^1.6217944291147"/0.2') LIMIT 0,20 OPTION ranker=expr('sum(atc*1000)'), idf='plain,tfidf_unnormalized';