Manticore Highlighting

Difficulty: Beginner
Estimated Time: 10 minutes

In this tutorial you will learn how to implement the highlighting feature using Manticore Search.

Introduction

In this course we assume an index called 'highlight' with the following settings:

index highlight
{
        type = rt
        path = highlight
        rt_field = title
        rt_field = content
        rt_attr_uint = gid
        stored_fields = title, content
        index_sp = 1
        html_strip = 1
}

Highlighting can be implemented in Manticore Search in several ways:

  • CALL SNIPPETS statement
  • SNIPPETS() function
  • HIGHLIGHT() function

The CALL SNIPPETS statement can be used separately from a search query to highlight a string or a list of strings.

The SNIPPETS() function is used in a SELECT statement to return the highlight of a given text, a field value, or text fetched from another source using a UDF. It can highlight with the same query as in the MATCH clause or with a different one that you supply.

The HIGHLIGHT() function was added to allow highlighting of stored fields.

All three share the same highlighting options, which we'll discuss in the next steps.

First, connect to the SQL interface:

mysql -P9306 -h0
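
With the connection open, the standalone CALL SNIPPETS statement described above can be tried before any document is inserted. The following is a minimal sketch, assuming the 'highlight' index from the introduction exists. The sample strings and the 'text feature' query are made up for illustration, and options are passed in the "value AS option_name" form.

```sql
/* Highlight a single string, using the text-processing settings of the 'highlight' index. */
CALL SNIPPETS('Syntax highlighting is a feature of text editors.', 'highlight', 'text feature');

/* Highlight a list of strings in one call, overriding a few options. */
CALL SNIPPETS(('The feature displays text in different colors.', 'It is intended only for human readers.'), 'highlight', 'text feature', 3 AS around, '*' AS before_match, '*' AS after_match);
```

The statement returns one snippet row per input string.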

Basic usage

A quick example:

First add a document:

INSERT INTO highlight(title,content,gid) VALUES('Syntax highlighting','Syntax highlighting is a feature of text editors that are used for programming, scripting, or markup languages, such as HTML. The feature displays text, especially source code, in different colors and fonts according to the category of terms.[1] This feature facilitates writing in a structured language such as a programming language or a markup language as both structures and syntax errors are visually distinct. Highlighting does not affect the meaning of the text itself; it is intended only for human readers.',1);

Then run a query and highlight the matches:

SELECT HIGHLIGHT() AS h FROM highlight WHERE MATCH('text feature')\G

By default, each matched word is wrapped in a <b> (bold) tag, and at most 5 words are picked around each match to form a passage. Passages are separated by '...'.

In general an HTML tag is used to highlight the match, since the snippet is usually displayed to the user inside HTML content.

These defaults can be customized with the before_match, after_match, around and chunk_separator options:

SELECT HIGHLIGHT({before_match='*',after_match='*',around=1,chunk_separator='###'}) AS h FROM highlight WHERE MATCH('text feature')\G
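
Since the snippet usually ends up inside an HTML page, the same options can carry markup other than the default bold tag. A small sketch, assuming the page styles the standard HTML <mark> element; the tag choice is only an illustration, not a Manticore default.

```sql
/* Wrap each matched word in <mark>...</mark> instead of the default bold tag. */
SELECT HIGHLIGHT({before_match='<mark>',after_match='</mark>'}) AS h FROM highlight WHERE MATCH('text feature')\G
```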

Control the size of the snippet

By default, the limit option caps the snippet size at 256 codepoints (characters and symbols). A smaller value shortens the snippet:

SELECT HIGHLIGHT({limit=10}) AS h FROM highlight WHERE MATCH('text feature')\G

The limit_words option caps the total number of words included in the snippet. Here the second argument, 'content', restricts highlighting to the content field:

SELECT HIGHLIGHT({limit_words=5},'content') AS h FROM highlight WHERE MATCH('text feature')\G

It is also possible to limit the number of passages with limit_passages, for example if we want just one passage rather than all the matches:

SELECT HIGHLIGHT({limit_passages=1}) AS h FROM highlight WHERE MATCH('text feature')\G

By default, the highlight result shows the found passages, separated by the configured separator, within the space allowed by limit. Since the limit may not be enough for all passages, we may get only some of the possible passages.

Let's first add a document with a longer text.

INSERT INTO highlight(title,content) values('wikipedia','Syntax highlighting is a feature of text editors that are used for programming, scripting, or markup languages, such as HTML. The feature displays text, especially source code, in different colors and fonts according to the category of terms.[1] This feature facilitates writing in a structured language such as a programming language or a markup language as both structures and syntax errors are visually distinct. Highlighting does not affect the meaning of the text itself; it is intended only for human readers. Syntax highlighting is a form of secondary notation, since the highlights are not part of the text meaning, but serve to reinforce it. Some editors also integrate syntax highlighting with other features, such as spell checking or code folding, as aids to editing which are external to the language. Contents 1Practical benefits 2Support in text editors 3Syntax elements 3.1Examples 4History and limitations 5See also 6References Practical benefits Highlighting the effect of missing delimiter (after watch=false) in Javascript Syntax highlighting is one strategy to improve the readability and context of the text; especially for code that spans several pages. The reader can easily ignore large sections of comments or code, depending on what they are looking for. Syntax highlighting also helps programmers find errors in their program. For example, most editors highlight string literals in a different color. Consequently, spotting a missing delimiter is much easier because of the contrasting color of the text. Brace MATCHing is another important feature with many popular editors. This makes it simple to see if a brace has been left out or locate the MATCH of the brace the cursor is on by highlighting the pair in a different color. 
A study published in the conference PPIG evaluated the effects of syntax highlighting on the comprehension of short programs, finding that the presence of syntax highlighting significantly reduces the time taken for a programmer to internalise the semantics of a program.[2] Additionally, data gathered FROM an eye-tracker during the study suggested that syntax highlighting enables programmers to pay less attention to standard syntactic components such as keywords. Support in text editors gedit supports syntax highlighting Some text editors can also export the coloured markup in a format that is suitable for printing or for importing into word-processing and other kinds of text-formatting software; for instance as a HTML, colorized LaTeX, PostScript or RTF version of its syntax highlighting. There are several syntax highlighting libraries or "engines" that can be used in other applications, but are not complete programs in themselves, for example the Generic Syntax Highlighter (GeSHi) extension for PHP. For editors that support more than one language, the user can usually specify the language of the text, such as C, LaTeX, HTML, or the text editor can automatically recognize it based on the file extension or by scanning contents of the file. This automatic language detection presents potential problems. For example, a user may want to edit a document containing: more than one language (for example when editing an HTML file that contains embedded Javascript code), a language that is not recognized (for example when editing source code for an obscure or relatively new programming language), a language that differs FROM the file type (for example when editing source code in an extension-less file in an editor that uses file extensions to detect the language). In these cases, it is not clear what language to use, and a document may not be highlighted or be highlighted incorrectly. 
Syntax elements Most editors with syntax highlighting allow different colors and text styles to be given to dozens of different lexical sub-elements of syntax. These include keywords, comments, control-flow statements, variables, and other elements. Programmers often heavily customize their settings in an attempt to show as much useful information as possible without making the code difficult to read. ');

Let's run the highlight again:

SELECT HIGHLIGHT({},'content') AS h FROM highlight WHERE MATCH('syntax')\G

For the newly added document, the highlight does not give us all the passages. We can increase the limit, but the question is by how much. If the value is too big, the highlight returns the full body of the content (with the highlights applied):

SELECT HIGHLIGHT({limit=10000},'content') AS h FROM highlight WHERE MATCH('syntax')\G

If we want just the passages, we need to use the force_passages option:

SELECT HIGHLIGHT({limit=10000,force_passages=1},'content') AS h FROM highlight WHERE MATCH('syntax')\G

Another way to get the whole text with highlights applied is to simply use limit=0:

SELECT HIGHLIGHT({limit=0},'content') AS h FROM highlight WHERE MATCH('text feature')\G

HTML Stripping and boundaries

If our index has sentence detection enabled (index_sp = 1 in the settings above), we can tell highlighting not to create passages that cross sentence boundaries. First, look at the default behavior:

SELECT HIGHLIGHT({},'content') AS h FROM highlight WHERE MATCH('html text')\G

In the example we see the passage '... markup languages, such as HTML. The feature displays text, especially source code ...', which crosses a sentence boundary.

With passage_boundary='sentence', this passage is split in two:

SELECT HIGHLIGHT({passage_boundary='sentence'},'content') AS h FROM highlight WHERE MATCH('html text')\G

Let's add a document with HTML content.

INSERT INTO highlight(title,content) values('html content','<p>The ideas of syntax highlighting overlap significantly with those of <a href="/wiki/Structure_editor" title="Structure editor">syntax-directed editors</a>. One of the first such class of editors for code was Wilfred Hansens 1969 code editor, Emily.<sup id="cite_ref-hansen_3-0" class="reference"><a href="#cite_note-hansen-3">[3]</a></sup><sup id="cite_ref-4" class="reference"><a href="#cite_note-4">[4]</a></sup> It provided advanced language-independent <a href="/wiki/Autocomplete" title="Autocomplete">code completion</a> facilities, and unlike modern editors with syntax highlighting, actually made it impossible to create syntactically incorrect programs.</p>');

By default, highlighting processes HTML content according to the index settings. If HTML stripping is enabled in the index, the highlight result will also have its HTML stripped.

SELECT HIGHLIGHT({},'content') AS h FROM highlight WHERE MATCH('code class')\G

If we want the highlight to include the HTML tags as well, we need to set html_strip_mode='none':

SELECT HIGHLIGHT({html_strip_mode='none'},'content') AS h FROM highlight WHERE MATCH('code class')\G

Please note that html_strip_mode='none' can highlight words that are part of the HTML syntax, like 'class'. To protect the HTML markup from highlighting, the retain mode can be used, but it requires that no size limit is set for the snippet (limit=0):

SELECT HIGHLIGHT({html_strip_mode='retain',limit=0},'content') AS h FROM highlight WHERE MATCH('code class')\G