Beginning Elasticsearch – mappings and analyzers

Elasticsearch is one of the most recent and promising full-text search engines. It’s built on Lucene, just like Solr, but it comes with a nice RESTful JSON API and was developed with scalability and the cloud in mind.

The documentation on the official website is complete and detailed, but for a newcomer it can be overwhelming, so I’ve decided to put together this introductory article.

First, you need to install Elasticsearch. It’s fairly easy on OS X: brew install elasticsearch will do the job. If you’re on Linux, there’s probably a package ready for you.

Once installation is done, you can find out where the executable is located with which elasticsearch. On my machine it’s /usr/local/bin/elasticsearch, and that’s exactly what I have to type to start the server manually. Before doing that, let’s install Marvel, the official dashboard, which includes a development console called Sense. Sense is a very nice web interface with some autocomplete features, so it’s highly recommended for getting familiar with Elasticsearch. From the Elasticsearch home directory, run bin/plugin -i elasticsearch/marvel/latest. Now you’re ready to rock.

In your browser, go to http://localhost:9200/_plugin/marvel/sense/. That’s the development console. Let’s start by indexing some data. Since Elasticsearch can be schemaless, you can just throw data at it:

image
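
Since the screenshot may not be visible, here is a rough Python sketch of what such an indexing request could look like. It only builds the HTTP request and does not send it. The base URL assumes a default local node, and the document title is a sample value I picked for illustration, not necessarily the one in the original screenshot.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: default local node


def index_request(index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one document."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumed sample document: index "examples", type "movies", id 1.
req = index_request("examples", "movies", 1, {"title": "Saving Private Ryan"})
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With a node running, passing req to urllib.request.urlopen would actually send it. In Sense you would type the same thing as a PUT line followed by the JSON body.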

Move the cursor over each code block and click the green arrow (next to the wrench) to execute the code. The right panel shows the result.

If you want to follow along by typing the examples on your own machine, you can find the code here.

While SQL databases have databases and tables, Elasticsearch has indexes and types. Just like a table belongs to a database, a type belongs to an index. In our example the index is “examples” and the type is “movies”.

Looking at the right side of the screenshot, you can see that the first record was saved, along with some further details about the outcome of the operation.

Mappings

I know you’re looking forward to running some queries, but the best way to get good matches out of your data is to index its contents properly. That’s what mappings are about.

We didn’t need to write any schema upfront, but internally Elasticsearch generated one starting from our data. Let’s see it:

image

Elasticsearch can infer the field type from the data you insert. If we had numbers and dates they would have been mapped accordingly.
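
To make the idea of type inference concrete, here is a toy Python sketch. It is a simplified illustration and not Elasticsearch’s actual implementation: the function names are hypothetical, the date detection only accepts one format, and the “year” and “released” fields are made-up sample data.

```python
from datetime import datetime


def guess_field_type(value):
    """Guess a mapping type for a single JSON value (toy rules only)."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            # Assumption: only plain ISO dates are treated as dates here.
            datetime.strptime(value, "%Y-%m-%d")
            return "date"
        except ValueError:
            return "string"
    raise TypeError(f"unsupported value: {value!r}")


def guess_mapping(document):
    """Build a mapping-like dict with one guessed type per field."""
    return {name: {"type": guess_field_type(v)} for name, v in document.items()}


sample = {"title": "Saving Private Ryan", "year": 1998, "released": "1998-07-24"}
print(guess_mapping(sample))
```

The point is only that the schema is derived from the first values Elasticsearch sees, so it pays to check the generated mapping before indexing a lot of data.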

Analyzers

What you need to know is that fields of type “string” are analyzed before being inserted into the index. An analyzer takes the original text and turns it into a list of tokens. Records are indexed and retrieved according to this tokenized form, not the original text.

If no analyzer is explicitly set, Elasticsearch uses the default one, the standard analyzer. If you’re curious, you can see what your text will look like after analysis:
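
As a sketch, the analysis requests could be built like this in Python. I’m assuming the query-string form of the _analyze endpoint (analyzer, field and text parameters) as I understand it for the Elasticsearch versions that ship the bin/plugin command; newer versions expect a JSON body instead. The helper name and the sample text are my own.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200"  # assumption: default local node


def analyze_url(text, analyzer=None, index=None, field=None):
    """Build the URL of an _analyze request (query-string form, assumed)."""
    path = f"/{index}/_analyze" if index else "/_analyze"
    params = {}
    if analyzer:
        params["analyzer"] = analyzer
    if field:
        params["field"] = field
    params["text"] = text
    return f"{BASE_URL}{path}?{urlencode(params)}"


text = "Saving Private Ryan"  # assumed sample text
# 1. Explicit analyzer, outside the context of any index
print(analyze_url(text, analyzer="standard"))
# 2. Whatever analyzer the "title" field of the "examples" index uses
print(analyze_url(text, index="examples", field="title"))
# 3. A language analyzer
print(analyze_url(text, analyzer="english"))
```

Each URL can be fetched with a GET request, and the response lists the tokens the analyzer produced.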

With the standard analyzer, text is basically split into single words and lowercased. The first example analyzes the text outside the context of any index, using the explicitly named “standard” analyzer. The second one shows how text is analyzed in the context of the “examples” index and its “title” field. The results are exactly the same, confirming that the same analysis was applied in both cases.

Elasticsearch comes with many built-in analyzers. Some of the most useful ones are the language analyzers, which can stem words according to grammar rules. Try the third example in the screenshot and see that “Saving” is stemmed to “save”.
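
The “split into words and lowercase” behaviour can be approximated in a few lines of Python. This is only a toy stand-in: the real standard analyzer follows Unicode text segmentation rules and handles many more cases, and the function name is hypothetical.

```python
import re


def toy_standard_analyze(text):
    """Rough approximation: split on non-word characters, then lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


print(toy_standard_analyze("Saving Private Ryan"))
# ['saving', 'private', 'ryan']
```

Note that “saving” stays “saving” here. Only a language analyzer would reduce it to its stem.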

image

Why is this useful? The first example in the next image looks for a title that contains the word “saving”. The record will be found.

image

The second example looks for the more generic “save”. Because the title field was analyzed with the standard analyzer, the index only contains the token “saving”, so no record is found.
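
A toy inverted index makes the reason visible. The sketch below is plain Python, not Elasticsearch: it indexes one assumed sample title with a rough imitation of the standard analyzer and then looks up both query words. All names are hypothetical.

```python
import re
from collections import defaultdict


def toy_standard_analyze(text):
    """Rough approximation of the standard analyzer: words, lowercased."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for token in toy_standard_analyze(title):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing any token of the analyzed query."""
    hits = set()
    for token in toy_standard_analyze(query):
        hits |= index.get(token, set())
    return sorted(hits)


index = build_index({1: "Saving Private Ryan"})  # assumed sample record
print(search(index, "saving"))  # the token "saving" is in the index
print(search(index, "save"))    # the token "save" was never indexed
```

The lookup is an exact match on tokens, so “save” can only be found if the analyzer stored “save” in the first place.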

If we want it to be found, then we need to update the mapping. Note that I’m going to add a new field “title.en” instead of changing the existing generic “title” field.

Changing existing fields requires reindexing (reinserting) all the data, while adding new fields doesn’t. You just need to update the existing records when necessary. Besides, if in the future we need to index German titles as well, we can just add a new “title.de” field and we will be able to index both English and German titles.

Let’s update the mapping of the movies type with a new field, “title.en”, that will be used for English titles:

image
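
In case the screenshot is not available, this is roughly what I would expect such a mapping update to look like, written as a Python dict and printed as JSON. It assumes the multi-field syntax (a “fields” block under “title”) and the built-in “english” analyzer. The exact request in the original screenshot may differ.

```python
import json


def title_en_mapping():
    """Mapping body that adds an English-analyzed sub-field to "title"."""
    return {
        "movies": {
            "properties": {
                "title": {
                    "type": "string",
                    # "title.en" is declared as a sub-field of "title"
                    "fields": {
                        "en": {"type": "string", "analyzer": "english"}
                    },
                }
            }
        }
    }


# Assumed endpoint for updating the mapping of the "movies" type
print("PUT /examples/movies/_mapping")
print(json.dumps(title_en_mapping(), indent=2))
```

The existing “title” field keeps its standard analysis. The new sub-field simply indexes the same text a second time with a different analyzer.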

Now we need to update the existing record so that it can also be searched through the “title.en” field:

image

You’re now ready to query the title.en field and get the expected result:

image

The second example uses the “multi_match” keyword to search within multiple fields. This comes in very handy, as we can now query all title fields in one shot by adding an asterisk (“title*”), without having to specify which language we’re interested in. If we had added “title.de”, it would have been searched as well, making our search cross-language.
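
Here is a sketch of the two query bodies as Python dicts, plus a small helper that imitates how a pattern like “title*” selects fields. The helper uses Python’s glob matching purely as an illustration and is not how Elasticsearch resolves field names internally. All function names are hypothetical.

```python
import json
from fnmatch import fnmatchcase


def match_query(field, text):
    """Query body that searches a single field."""
    return {"query": {"match": {field: text}}}


def multi_match_query(text, fields):
    """Query body that searches several fields (wildcards allowed)."""
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


def expand_fields(pattern, known_fields):
    """Illustration only: which known fields a wildcard pattern would cover."""
    return [name for name in known_fields if fnmatchcase(name, pattern)]


print(json.dumps(match_query("title.en", "save")))
print(json.dumps(multi_match_query("save", ["title*"])))
print(expand_fields("title*", ["title", "title.en", "title.de", "year"]))
```

Either body would be sent to the search endpoint of the index, for example /examples/movies/_search.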

I hope this introduction to Elasticsearch mappings and analyzers was useful!
