Highlighting in Manticore

Difficulty: Beginner
Estimated Time: 10 minutes

Manticoresearch - Highlighting

In this tutorial you will learn how to highlight keywords in text in Manticore Search.

Introduction

You can highlight keywords in text in Manticore Search by several means:

  • CALL SNIPPETS statement
  • SNIPPET() function
  • HIGHLIGHT() function

The CALL SNIPPETS statement can be used separately from a search query to highlight a string or a list of strings. Here is an example:

CALL SNIPPETS('my text with keyword', 'idx', 'keyword');

The SNIPPET() function is mostly used in a SELECT statement to highlight a given text, a field value, or text fetched from another source using a UDF. It can highlight the same query as in the MATCH clause or a different one; that is up to you. Like this:

SELECT SNIPPET(content,'camera') FROM testrtstore WHERE MATCH('camera');

Manticore 3.2.2 added a new function, HIGHLIGHT(), which makes it easier to highlight keywords in your documents when you store them in Manticore rather than just index them.

All three share the same highlighting options, which we'll discuss in the next steps. In this tutorial we'll show examples using HIGHLIGHT().

Let's assume you have an index called 'highlight' with the following settings:

index highlight
{
        type = rt
        path = highlight
        rt_field = title
        rt_field = content
        rt_attr_uint = gid
        stored_fields = title, content
        index_sp = 1
        html_strip = 1
}

First, let's connect to the searchd daemon via the SQL interface:

mysql -P9306 -h0

Basic usage

Here is a quick example. First, add a document:

INSERT INTO highlight(title,content,gid) VALUES('Syntax highlighting','Syntax highlighting is a feature of text editors that are used for programming, scripting, or markup languages, such as HTML. The feature displays text, especially source code, in different colors and fonts according to the category of terms.[1] This feature facilitates writing in a structured language such as a programming language or a markup language as both structures and syntax errors are visually distinct. Highlighting does not affect the meaning of the text itself; it is intended only for human readers.',1);

Then run a search and highlight the matches:

SELECT HIGHLIGHT() AS h FROM highlight WHERE MATCH('text feature')\G

By default, every matching word is wrapped in a <b> (bold) tag, and at most 5 words are picked around each match to form a passage. Passages are separated by '...'.

An HTML tag is the default because snippets are often displayed in HTML content, but you can customize the behaviour with the "before_match", "after_match", "around" and "chunk_separator" settings. For example:

SELECT HIGHLIGHT({before_match='*',after_match='*',around=1,chunk_separator='###'}) AS h FROM highlight WHERE MATCH('text feature')\G
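
To get a feel for what these four settings control, here is a toy model in Python. It is only a sketch for illustration and not Manticore's actual algorithm: the function name highlight(), the whitespace tokenization and the window-merging rule are all assumptions made for this example.

```python
import re

def highlight(text, keywords, before_match="<b>", after_match="</b>",
              around=5, chunk_separator=" ... "):
    """Toy model of passage building (hypothetical, not Manticore's code)."""
    words = text.split()
    wanted = {k.lower() for k in keywords}
    # Positions of words that equal a keyword once punctuation is ignored.
    hits = [i for i, w in enumerate(words)
            if re.sub(r"\W", "", w).lower() in wanted]
    if not hits:
        return ""
    # One [start, end) window of `around` words per hit; touching windows merge.
    windows = []
    for i in hits:
        start, end = max(0, i - around), min(len(words), i + around + 1)
        if windows and start <= windows[-1][1]:
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])
    hit_set = set(hits)
    passages = []
    for start, end in windows:
        # Note: attached punctuation ends up inside the markers in this sketch.
        out = [f"{before_match}{words[i]}{after_match}" if i in hit_set else words[i]
               for i in range(start, end)]
        passages.append(" ".join(out))
    return chunk_separator.join(passages)

sample = "Syntax highlighting is a feature of text editors that are used for programming"
print(highlight(sample, ["text", "feature"], before_match="*", after_match="*",
                around=1, chunk_separator="###"))
```

In this model, a smaller "around" gives tighter passages, and the separator only appears when two matches are too far apart to share one passage. The real output of HIGHLIGHT() may differ in details such as punctuation handling.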

Control the size of the snippet

By default, the "limit" setting caps the snippet size at 256 codepoints (characters and symbols). You can change it like this:

SELECT HIGHLIGHT({limit=10}) AS h FROM highlight WHERE MATCH('text feature')\G

Another limit is the total number of words included in the snippet, which is set with "limit_words":

SELECT HIGHLIGHT({limit_words=5},'content') AS h FROM highlight WHERE MATCH('text feature')\G

It is also possible to limit the number of passages, for example if we want just one passage instead of one for every match:

SELECT HIGHLIGHT({limit_passages=1}) AS h FROM highlight WHERE MATCH('text feature')\G

By default, HIGHLIGHT() returns the passages it finds, joined by the configured separator, within the space allowed by the limit. If the limit is too small to hold all the passages, we get only some of them.

Let's first add a document with a longer text:

INSERT INTO highlight(title,content) values('wikipedia','Syntax highlighting is a feature of text editors that are used for programming, scripting, or markup languages, such as HTML. The feature displays text, especially source code, in different colors and fonts according to the category of terms.[1] This feature facilitates writing in a structured language such as a programming language or a markup language as both structures and syntax errors are visually distinct. Highlighting does not affect the meaning of the text itself; it is intended only for human readers. Syntax highlighting is a form of secondary notation, since the highlights are not part of the text meaning, but serve to reinforce it. Some editors also integrate syntax highlighting with other features, such as spell checking or code folding, as aids to editing which are external to the language. Contents 1Practical benefits 2Support in text editors 3Syntax elements 3.1Examples 4History and limitations 5See also 6References Practical benefits Highlighting the effect of missing delimiter (after watch=false) in Javascript Syntax highlighting is one strategy to improve the readability and context of the text; especially for code that spans several pages. The reader can easily ignore large sections of comments or code, depending on what they are looking for. Syntax highlighting also helps programmers find errors in their program. For example, most editors highlight string literals in a different color. Consequently, spotting a missing delimiter is much easier because of the contrasting color of the text. Brace MATCHing is another important feature with many popular editors. This makes it simple to see if a brace has been left out or locate the MATCH of the brace the cursor is on by highlighting the pair in a different color. 
A study published in the conference PPIG evaluated the effects of syntax highlighting on the comprehension of short programs, finding that the presence of syntax highlighting significantly reduces the time taken for a programmer to internalise the semantics of a program.[2] Additionally, data gathered FROM an eye-tracker during the study suggested that syntax highlighting enables programmers to pay less attention to standard syntactic components such as keywords. Support in text editors gedit supports syntax highlighting Some text editors can also export the coloured markup in a format that is suitable for printing or for importing into word-processing and other kinds of text-formatting software; for instance as a HTML, colorized LaTeX, PostScript or RTF version of its syntax highlighting. There are several syntax highlighting libraries or "engines" that can be used in other applications, but are not complete programs in themselves, for example the Generic Syntax Highlighter (GeSHi) extension for PHP. For editors that support more than one language, the user can usually specify the language of the text, such as C, LaTeX, HTML, or the text editor can automatically recognize it based on the file extension or by scanning contents of the file. This automatic language detection presents potential problems. For example, a user may want to edit a document containing: more than one language (for example when editing an HTML file that contains embedded Javascript code), a language that is not recognized (for example when editing source code for an obscure or relatively new programming language), a language that differs FROM the file type (for example when editing source code in an extension-less file in an editor that uses file extensions to detect the language). In these cases, it is not clear what language to use, and a document may not be highlighted or be highlighted incorrectly. 
Syntax elements Most editors with syntax highlighting allow different colors and text styles to be given to dozens of different lexical sub-elements of syntax. These include keywords, comments, control-flow statements, variables, and other elements. Programmers often heavily customize their settings in an attempt to show as much useful information as possible without making the code difficult to read. ');

Let's run a highlighting query again:

SELECT HIGHLIGHT({},'content') AS h FROM highlight WHERE MATCH('syntax')\G

For the newly added document, HIGHLIGHT() doesn't give us all the passages. We can increase the limit to overcome that; the question is by how much. If the value is too big, HIGHLIGHT() returns the full body of the content (with the highlights applied):

SELECT HIGHLIGHT({limit=10000},'content') AS h FROM highlight WHERE MATCH('syntax')\G

If we want just the passages, we need to use the force_passages option:

SELECT HIGHLIGHT({limit=10000,force_passages=1},'content') AS h FROM highlight WHERE MATCH('syntax')\G

Another possible way to get the whole text with highlights applied is to simply use limit=0:

SELECT HIGHLIGHT({limit=0},'content') AS h FROM highlight WHERE MATCH('text feature')\G
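
When these options come from application code, it can help to build the HIGHLIGHT() expression from a dictionary instead of concatenating strings by hand. The sketch below is one way to do it in Python; highlight_expr() is a hypothetical helper written for this tutorial, not part of any Manticore client library, and it simply refuses values containing quotes or backslashes rather than guessing at escaping rules.

```python
def highlight_expr(options=None, field=None):
    """Build a HIGHLIGHT() expression string (hypothetical helper)."""
    parts = []
    for name, value in (options or {}).items():
        if isinstance(value, bool):
            value = int(value)  # e.g. force_passages=True becomes 1
        if isinstance(value, int):
            parts.append(f"{name}={value}")
        else:
            text = str(value)
            # Assumption: reject risky characters instead of escaping them.
            if "'" in text or "\\" in text:
                raise ValueError(f"unsupported character in option {name!r}")
            parts.append(f"{name}='{text}'")
    if not parts and field is None:
        return "HIGHLIGHT()"
    expr = "HIGHLIGHT({" + ",".join(parts) + "}"
    if field is not None:
        if not field.replace("_", "").isalnum():
            raise ValueError(f"unexpected field name {field!r}")
        expr += f",'{field}'"
    return expr + ")"

expr = highlight_expr({"limit": 10000, "force_passages": 1}, "content")
print(f"SELECT {expr} AS h FROM highlight WHERE MATCH('syntax')")
```

The printed statement has the same shape as the force_passages query above, so it can be sent over the SQL connection as is.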

HTML Stripping and boundaries

If our index has sentence detection enabled (index_sp = 1 in the settings above), we can configure highlighting so that passages do not cross sentence boundaries. First, let's look at the default behaviour:

SELECT HIGHLIGHT({},'content') AS h FROM highlight WHERE MATCH('html text')\G

In this example we see the passage '... markup languages, such as HTML. The feature displays text, especially source code ...', which crosses a sentence boundary.

With passage_boundary='sentence' this passage will be split into two:

SELECT HIGHLIGHT({passage_boundary='sentence'},'content') AS h FROM highlight WHERE MATCH('html text')\G
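
The effect on the passage above can be pictured with a few lines of Python. This is only a rough illustration: split_at_sentences() is a hypothetical helper, and the regular expression is a crude stand-in for the sentence detection that Manticore performs at indexing time.

```python
import re

def split_at_sentences(passage):
    """Split a passage after '.', '!' or '?' followed by whitespace (crude sketch)."""
    return [part for part in re.split(r"(?<=[.!?])\s+", passage) if part]

passage = "markup languages, such as HTML. The feature displays text, especially source code"
for part in split_at_sentences(passage):
    print(part)
```

The single passage becomes two, one per sentence, which is the kind of split the passage_boundary option asks for.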

Let's add a document with HTML content.

INSERT INTO highlight(title,content) values('html content','<p>The ideas of syntax highlighting overlap significantly with those of <a href="/wiki/Structure_editor" title="Structure editor">syntax-directed editors</a>. One of the first such class of editors for code was Wilfred Hansens 1969 code editor, Emily.<sup id="cite_ref-hansen_3-0" class="reference"><a href="#cite_note-hansen-3">[3]</a></sup><sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[4]</a></sup> It provided advanced language-independent <a href="/wiki/Autocomplete" title="Autocomplete">code completion</a> facilities, and unlike modern editors with syntax highlighting, actually made it impossible to create syntactically incorrect programs.</p>');

By default highlighting will process HTML content depending on the index settings. If HTML stripping is enabled in the index, then the HIGHLIGHT() result will also be HTML stripped.

SELECT HIGHLIGHT({},'content') AS h FROM highlight WHERE MATCH('code class')\G

If we want the highlight to include the HTML tags as well, we need to set 'html_strip_mode=none':

SELECT HIGHLIGHT({html_strip_mode='none'},'content') AS h FROM highlight WHERE MATCH('code class')\G

Please note that html_strip_mode=none can highlight words that are part of the HTML syntax, like 'class'. To protect the HTML markup from being highlighted, the retain mode can be used, but it requires that the snippet has no size limit (limit=0):

SELECT HIGHLIGHT({html_strip_mode='retain',limit=0},'content') AS h FROM highlight WHERE MATCH('code class')\G
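
The difference between the three behaviours is easier to see on a tiny input. The Python sketch below is a simplified model and not Manticore's implementation: highlight_html() and mark() are hypothetical names, the tag regex is naive, and the mode name 'strip' is used here just to stand for the stripped result.

```python
import re

TAG = re.compile(r"<[^>]*>")

def mark(text, keyword):
    """Wrap whole-word, case-insensitive matches in <b>...</b>."""
    return re.sub(rf"\b({re.escape(keyword)})\b", r"<b>\1</b>", text,
                  flags=re.IGNORECASE)

def highlight_html(html, keyword, mode="strip"):
    """Toy model of the HTML handling modes (hypothetical helper)."""
    if mode == "strip":
        # Drop the tags first, then highlight what is left.
        return mark(TAG.sub("", html), keyword)
    if mode == "none":
        # Treat the markup as plain text, so attribute names can match too.
        return mark(html, keyword)
    if mode == "retain":
        # Keep the tags untouched and highlight only the text between them.
        pieces = re.split(r"(<[^>]*>)", html)
        return "".join(p if p.startswith("<") else mark(p, keyword) for p in pieces)
    raise ValueError(f"unknown mode {mode!r}")

html = '<sup class="reference">first class of editors</sup>'
for mode in ("strip", "none", "retain"):
    print(mode, "->", highlight_html(html, "class", mode))
```

In the 'none' case the attribute name inside the <sup> tag is wrapped as well, which would break the markup if rendered; the 'retain' case leaves the tag intact and marks only the visible word.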