Configure Elasticsearch in an efficient way

This post is about Elasticsearch, which is a great search engine.
The biggest difficulty we met was working out how to configure Elasticsearch to get relevant search results. Another difficulty (sorry to say) is that the documentation is not very well organized. OK, that is my opinion, and I can’t deny that we found useful information in it, but that information is sometimes hard to find.
So we dug deep into Elasticsearch to understand how it works. We think we finally understand a lot of things we didn’t know before, and we have built the perfect (maybe?) configuration.

But let’s go back to the beginning. As I said, our main problem was getting relevant results (relevant not for the search engine, but for the user who runs the search). And sometimes we got weird results.

For example, the search “funny pony” would return these results, in this order:

  Title : This pony is cool
  Description: Who wants a bun with lot of jam?

  Title: Everybody loves horses
  Description: Horses are sort of big ponies with more furry.

  Title: Funny ponies are the best ones
  Description: Lorem ipsum dolor sit amet consectetur adipiscing elit

With a “basic” configuration (no specific configuration, or even a bad one), the first result is the most relevant one for Elasticsearch; we will see why later.
But for a user, the third one is probably more relevant, and it can be disturbing not to see it in first place.

Understanding how Elasticsearch works means understanding how it indexes documents and how it retrieves them.

The basic filters

The most basic thing you should (actually, I should say must) do is to define some basic filters.
Their role is to “normalize” the searched string (and the indexed ones) rather than to improve the relevance of the results directly. Without this normalization, obvious matches are missed, which is why these filters are so important.

For example, if you do not use these filters, the French string “Mon prénom est Grégory” (my first name is Grégory) will be indexed as is.
So, if someone searches for “gregory”, you cannot be sure the string will match (because of the capital letter and the accent).

This is the role of the lowercase filter, which converts all capital letters to lowercase, and of the asciifolding filter, which converts each Unicode character outside the Basic Latin Unicode block to its ASCII equivalent, when one exists. Put simply, it transforms characters like é, à, ç, ð, æ, å, etc. into plain ASCII letters.
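
To make the effect of these two filters concrete, here is a minimal Python sketch. It is only an approximation built on the standard library: the function name normalize is made up for this example, and unlike the real asciifolding filter it does not handle letters such as ð or æ, which have no Unicode decomposition.

```python
import unicodedata

def normalize(text):
    """Lowercase the text and strip accents (rough stand-in for lowercase + asciifolding)."""
    lowered = text.lower()
    # NFKD splits "é" into "e" plus a combining accent; dropping the combining marks leaves "e".
    decomposed = unicodedata.normalize("NFKD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(normalize("Mon prénom est Grégory"))  # mon prenom est gregory
```

Once both the indexed string and the query go through the same normalization, “gregory” and “Grégory” end up as the same term.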

The worddelimiter filter

The worddelimiter filter is used to split a “word” into several words. A small example: imagine you made a typo in the sentence “To be or not to be.That is the question”. Notice that the space after the dot is missing. Without this filter, Elasticsearch will index “be.That” as a single word: “bethat”. With the filter, it understands that it has to index “be” and “that” separately.
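
The splitting idea can be sketched in a few lines of Python. This is a simplified illustration, not the filter’s actual implementation: split_word is a hypothetical helper, it only knows ASCII letters and digits, and it covers just two rules (split on punctuation, split when a lowercase letter is followed by an uppercase one).

```python
import re

def split_word(token):
    """Split a token on non-alphanumeric characters and on lowercase-to-uppercase changes."""
    parts = []
    for chunk in re.split(r"[^0-9A-Za-z]+", token):
        # Cut tokens such as "PowerShot" where a lowercase letter meets an uppercase one.
        parts.extend(p for p in re.split(r"(?<=[a-z])(?=[A-Z])", chunk) if p)
    return parts

print(split_word("be.That"))    # ['be', 'That']
print(split_word("PowerShot"))  # ['Power', 'Shot']
```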

The stopword filter

The stopword filter consists of a list of non-significant words that are removed from the document before the indexing process begins. It is used to avoid indexing words like “and”, “a”, “the”, “to”, etc. Of course, each list is specific to a language, but for some languages several lists (more or less comprehensive) exist, and you can choose the one you prefer.
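
As a rough illustration, removing stopwords amounts to filtering tokens against a set. The four-word list below is just the one quoted above, not a real language list, and remove_stopwords is a name invented for this sketch. Comparing in lowercase mimics an ignore_case option.

```python
STOPWORDS = {"and", "a", "the", "to"}  # tiny illustrative list, not a real language list

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]

print(remove_stopwords("To be or not to be".split()))  # ['be', 'or', 'not', 'be']
```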

The snowball filter

The snowball filter is used to stem words with a specific stemmer. A stemmer applies a set of rules to reduce a word to its stem, which means that different stemmers may return different results.
For example, the words “indexing”, “indexable”, “indexes”, “indexation”, etc. will be stemmed to “index”. This is particularly useful for retrieving a document titled “Make my string indexable” when you search for “Indexing a string”.
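
To show what “rules” means here, the following Python sketch strips a handful of suffixes. It is deliberately naive and is not the Snowball algorithm: the suffix list and the minimum stem length of 4 are arbitrary choices made for this example, and a real stemmer has many more rules and exceptions.

```python
SUFFIXES = ("ation", "able", "ing", "es", "s")  # checked in this order, longest first

def naive_stem(word):
    """Strip the first matching suffix, provided at least 4 characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

for w in ["indexing", "indexable", "indexes", "indexation"]:
    print(w, "->", naive_stem(w))  # every word is reduced to "index"
```

Because the query and the document are both reduced to the same stem, “Indexing” and “indexable” meet on the term “index”.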

The elision filter

The elision filter can be important for some languages (like French) and a bit less so for others (like English). It removes non-significant elided “words” before indexing. For example, “j’attends que tu m’appelles” (I’m waiting for your call) will be indexed as “attends que tu appelles” (in the end, “que” and “tu” will probably be removed by the stopword filter). As you can see, “j’” (I) and “m’” (me) have been removed because they are part of the elision filter’s article list (see the configuration example below).
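
Here is a small Python sketch of the same idea, using a regular expression on the whole sentence. The real filter works token by token, so treat this as an approximation; remove_elisions is a hypothetical helper, and the article list is the one from the configuration at the end of this post.

```python
import re

ARTICLES = ["l", "m", "t", "qu", "n", "s", "j", "d"]  # list taken from the post's configuration
# An article at the start of a word, followed by a straight or curly apostrophe.
ELISION = re.compile(r"\b(?:%s)['’]" % "|".join(ARTICLES), re.IGNORECASE)

def remove_elisions(text):
    """Drop elided articles such as j’ or m’ from the text."""
    return ELISION.sub("", text)

print(remove_elisions("j’attends que tu m’appelles"))  # attends que tu appelles
```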

Define your own filters

You can and should define your own filters. If you look at the end of this post, you’ll see an example configuration. You will also see that the “stopwords” filter we use in our analyzers is a custom filter of type stop, configured for French. If you need another one for English, you can add another custom filter named, for example, “stopwords_en”.

The nGram tokenizer

We looked for example configurations on the web, and the mistake we made at the beginning was to use these configurations directly without understanding them.
This includes the nGram tokenizer, whose role is very important.

For example, our search string “funny pony” will be split into different parts. Here are two examples:

1st example, with the configuration:

 "min_gram" : "2",
 "max_gram" : "3"

The result will be fu, fun, un, unn, nn, nny, ny, po, pon, on, ony, ny

2nd example, with the (better) configuration:

 "min_gram" : "3",
 "max_gram" : "20"

The result will be fun, funn, funny, unn, unny, nny, pon, pony, ony

To explain a bit more, the nGram tokenizer splits each word into every substring whose length is between min_gram and max_gram characters. With min_gram = 3 and max_gram = 20, “elasticsearch” will be transformed into “ela, elas, elast, …, elasticsearc, elasticsearch”. The same process is then repeated from the second letter: “las, last, lasti, lastic, …, lasticsearch”, then from the third letter, and so on. In this case, max_gram is never reached because “elasticsearch” contains 13 characters, so any max_gram of 13 or more gives the same result.
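
The process described above fits in a short Python sketch. One assumption to keep in mind: grams are built separately for each whitespace-separated word, as in the examples of this post, whereas the real tokenizer’s treatment of spaces depends on its token_chars setting. The function names are made up for the illustration.

```python
def ngrams(word, min_gram, max_gram):
    """Return every substring of word with a length between min_gram and max_gram."""
    grams = []
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(word):
                break  # longer grams would run past the end of the word
            grams.append(word[start:start + size])
    return grams

def tokenize(text, min_gram, max_gram):
    # Assumption: grams are computed word by word, as in the post's examples.
    return [g for w in text.split() for g in ngrams(w, min_gram, max_gram)]

print(tokenize("funny pony", 3, 20))
# ['fun', 'funn', 'funny', 'unn', 'unny', 'nny', 'pon', 'pony', 'ony']
```

Calling ngrams("elasticsearch", 3, 13) and ngrams("elasticsearch", 3, 20) returns the same list, which is the “max_gram is never reached” remark in code form.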

This tokenizer is cool because from “funny”, we can match any result that contains “fun”. But with the first example, there are too many very short grams, and they appear in lots of unrelated words. For example, “un” (from “funny”) also appears in “bun”, “on” (from “pony”) also appears in “month”, “ny” also appears in “any”, etc.

Now we understand better why the first result is relevant for Elasticsearch. It found an interesting word like “pony” in the title (expected), but also “bun” in the description, which shares the gram “un” with “funny” (not expected).
Defining a higher min_gram and max_gram is a good way to avoid these side effects, but it’s even better to combine this with the filters seen above. This is where analyzers come into play.
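
The side effect can be checked with a few lines of Python that compare the grams of a query word with those of a document word. This is only a sketch of the overlap, not of how Elasticsearch scores documents, and the helper name grams is invented for it.

```python
def grams(word, min_gram, max_gram):
    """Set of all substrings of word with a length between min_gram and max_gram."""
    return {word[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(word) - n + 1)}

for min_gram, max_gram in [(2, 3), (3, 20)]:
    shared = grams("funny", min_gram, max_gram) & grams("bun", min_gram, max_gram)
    print(min_gram, max_gram, sorted(shared))
# 2 3 ['un']
# 3 20 []
```

With min_gram = 2, “funny” and “bun” have a term in common; with min_gram = 3 they no longer do.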

Define your own analyzer(s)

You are encouraged to define and use your own analyzers. An analyzer is used to “clean” a document before indexing it, and also to clean a query string before searching the index. In fact, filters are not used directly; they are used through analyzers. In our example below, we have defined two analyzers: one to analyze the query text (custom_search_analyzer) and another to index documents with the nGram tokenizer (custom_analyzer).
The nGram tokenizer is not applied to the query because we do not want to break up the searched string. If the user wants to retrieve “keyboard” and the query went through the nGram tokenizer, we would also search for “key”, “board”, “oar” and so on. That could lead to strange results (like “boar”, “boardwalk”, “keynote”, …).
We think this configuration is fine and works well. You are obviously free to define other analyzers, but our tests showed that this configuration is sufficient (though it can still be improved).
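
The difference between the two analyzers can be sketched in Python as follows. Everything here is a simplification made for the illustration: the analyzers only lowercase and split on whitespace (none of the filters are applied), min_gram = 3 and max_gram = 20 are assumed, and a “match” simply means that the query and the document share at least one term.

```python
def ngrams(word, min_gram=3, max_gram=20):
    """Set of all substrings of word with a length between min_gram and max_gram."""
    return {word[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(word) - n + 1)}

def index_analyzer(text):
    """Index side: lowercase, split on whitespace, then emit every n-gram."""
    return {g for w in text.lower().split() for g in ngrams(w)}

def search_analyzer(text):
    """Search side: lowercase and split on whitespace only, no n-grams."""
    return set(text.lower().split())

def matches(query, document):
    """True when at least one query term is among the indexed terms."""
    return bool(search_analyzer(query) & index_analyzer(document))

print(matches("key", "Mechanical keyboard"))  # True: "key" is one of the indexed grams
print(matches("keyboard", "Boar hunting"))    # False: the query is kept whole
# If the query were n-grammed too, "boa", "oar" and "boar" would match:
print(bool(index_analyzer("keyboard") & index_analyzer("Boar hunting")))  # True
```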

Functional (and optimal?) full example

This configuration is the one we use most of the time in our projects. It’s a working basis for indexing documents and retrieving relevant results. But that is only half of the work… Another big and important part is to define a good mapping and to build the right query to retrieve the results you want.
We’ll write another post to talk about that.

In the meantime, have fun indexing with this configuration:

    # FOSElasticaBundle (Symfony) configuration. Some keys were missing from this
    # listing; the names marked "assumed" below are placeholders.
    fos_elastica:
      indexes:
        blog:                          # assumed index name
          client: default
          settings:
            index:
              analysis:
                analyzer:
                  custom_analyzer:
                    type: custom
                    tokenizer: nGram
                    filter: [stopwords, asciifolding, lowercase, snowball, elision, worddelimiter]
                  custom_search_analyzer:
                    type: custom
                    tokenizer: standard
                    filter: [stopwords, asciifolding, lowercase, snowball, elision, worddelimiter]
                tokenizer:
                  nGram:
                    type: nGram
                    min_gram: 2
                    max_gram: 20
                filter:
                  snowball:
                    type: snowball
                    language: French
                  elision:
                    type: elision
                    articles: [l, m, t, qu, n, s, j, d]
                  stopwords:
                    type: stop
                    stopwords: [_french_]
                    ignore_case: true
                  worddelimiter:
                    type: word_delimiter
          # now you can use your analyzer to index and search items
          types:
            article:
              mappings:
                title:                 # assumed field name
                  boost: 6
                  index_analyzer: custom_analyzer
                  search_analyzer: custom_search_analyzer
                content:               # assumed field name
                  index_analyzer: custom_analyzer
                  search_analyzer: custom_search_analyzer
                createdAt:
                  type: "date"
                categories:
                  type: "object"
                  properties:
                    id: ~
                author:
                  type: "object"
                  properties:
                    id: ~
                    name:
                      index_analyzer: analyzer_troc   # not defined in this listing
                      search_analyzer: custom_search_analyzer
              persistence:
                driver: orm
                model: Obtao\BlogBundle\Entity\Article
                finder: ~
                provider: ~
                listener: ~
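
If you do not use the bundle, the analysis part of this configuration can also be written as the “settings” body sent when creating an index. The Python sketch below only builds and prints that body; it does not talk to a server. It assumes that the unnamed tokenizer and filter blocks are called nGram, snowball, elision and stopwords (the names used in the filter lists), and it copies the type names as they appear in this post, which other Elasticsearch versions may spell differently.

```python
import json

FILTER_CHAIN = ["stopwords", "asciifolding", "lowercase", "snowball", "elision", "worddelimiter"]

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Index side: n-grams plus the filter chain.
                "custom_analyzer": {"type": "custom", "tokenizer": "nGram", "filter": FILTER_CHAIN},
                # Search side: same filters, but the standard tokenizer (no n-grams).
                "custom_search_analyzer": {"type": "custom", "tokenizer": "standard", "filter": FILTER_CHAIN},
            },
            "tokenizer": {"nGram": {"type": "nGram", "min_gram": 2, "max_gram": 20}},
            "filter": {
                "snowball": {"type": "snowball", "language": "French"},
                "elision": {"type": "elision", "articles": ["l", "m", "t", "qu", "n", "s", "j", "d"]},
                "stopwords": {"type": "stop", "stopwords": ["_french_"], "ignore_case": True},
                "worddelimiter": {"type": "word_delimiter"},
            },
        }
    }
}

print(json.dumps(index_body, indent=2))
```

The filters asciifolding and lowercase need no definition of their own because they are used with their default settings.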

In a future post, we’ll explain how to build a query with Elastica and how to configure your mapping in a proper way.


7 thoughts on “Configure Elasticsearch in an efficient way”

  1. Thanks! I’m resuming my study of Elasticsearch, and I’m almost tempted to use plain Elastica in Symfony and write my own listeners, because FOSElasticaBundle relies on a 0.90 version of Elastica…

    Great article! I will use it for search in Brazilian Portuguese, which has many of the same problems as French.

  2. Héhé, thank you! I hope we can help you create the perfect configuration.
    Concerning FOSElasticaBundle, I think it will be compatible with version 1.0.0 soon.

  3. Hi,
    Adding your own analyzers is explained at the bottom of the post (see, for example, index.analysis.analyzer.custom_analyzer).

    And I have updated the configuration to show an example of how to use the analyzers you defined.

  4. Does it perform better than the default analyzer for French? Is it more fine-tuned, e.g. with ngram and asciifolding (oe)?

  5. Hi, the goal of this post is to explain how analyzers work and how to customize them. It is more than likely that the native French analyzer is better than ours for indexing French text.
