Use aggregations for statistics with Symfony and Elasticsearch

Elasticsearch can index huge sets of data, both documents and numeric data.
In versions prior to 1.0.0, facets made it possible to compute statistics over a list of indexed documents (tag distribution, mean, standard deviation, …).
Over time, the use of facets evolved: developers wanted to use them to compute more and more complex statistics.

To meet this need, the Elasticsearch team introduced aggregations, which come in two families:
- Metrics: sum, minimum, maximum, mean, …
- Buckets: term, date or value distributions (and many others)

Aggregations can contain sub-aggregations, so we can compute statistics inside a term distribution. How crazy is that?

The goal of this post is to describe one of the many uses of aggregations: statistics.

You can find more information about aggregations in this post:
https://www.found.no/foundation/elasticsearch-aggregations/
Do not hesitate to read the official documentation as well.

This post relies on the sources available in the blog repository on GitHub.

Our objective is to provide:

  • a list of tags (and the number of articles for each of them)
  • the distribution of the articles by publication date (broken down by category).

Indexing articles and categories is not explained here, nor are the search and the basic filters. You can find this information in our previous posts.

About aggregations

There are many possible uses for aggregations. Here are the most interesting ones, in our opinion:

Statistics / Metrics:

  • stats : Returns min/max/sum/avg/count
  • min/max : Returns the minimum/maximum value of a field
  • sum/avg : Returns the sum/mean of a field

Distribution of the number of indexed documents:

  • terms : By value of a field. Looks like a SQL group by + count, but more powerful (it can also count embedded elements)
  • range : By ranges of values given by the user (see the documentation)
  • date range : By ranges of dates given by the user
  • histogram : By a fixed interval of values defined by the user (e.g. the distribution of products per price range of 50€)
  • date histogram : By a fixed interval of dates defined by the user

Others:

  • filter : Applies specific filters to an aggregation (or to a group of aggregations)
  • nested : Allows aggregating on nested objects
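
To make a couple of these types concrete, here is a minimal sketch that combines a metric aggregation (stats) with a bucket aggregation (histogram). It assumes a hypothetical product index with a numeric “price” field, which is not part of the blog sources; the aggregation names “price_stats” and “price_distribution” are arbitrary.

```json
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "price_stats": {
      "stats": {
        "field": "price"
      }
    },
    "price_distribution": {
      "histogram": {
        "field": "price",
        "interval": 50
      }
    }
  },
  "size": 0
}
```

The stats aggregation should return a single object with count, min, max, avg and sum, while the histogram returns one bucket per price interval of 50.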

Calculation

Careful: aggregations, like facets, are computed on the results of the query part of the search; the top-level filter is not taken into account.

For example, with this Elasticsearch query:
-> aggregations will be computed on the whole index (the only query is match_all).
-> results will be filtered on the publication date and on the boolean published field.

{
  "query": {
    "match_all": {}
  },
  "filter": {
    "bool": {
      "must": [
        {
          "range": {
            "publishedAt": {
              "gte": "2014-04-07T00:00:00Z",
              "lte": "2014-04-08T23:59:59Z"
            }
          }
        },
        {
          "terms": {
            "published": [
              "true"
            ]
          }
        }
      ]
    }
  }
}

With this second query, on the other hand, both the aggregations AND the results are affected by the conditions on the publishedAt and published fields.

{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "publishedAt": {
              "gte": "2013-04-07T00:00:00Z",
              "lte": "2014-04-08T23:59:59Z"
            }
          }
        },
        {
          "terms": {
            "published": [
              "true"
            ]
          }
        }
      ]
    }
  }
}

Some use cases:

  • You want the filters to apply to everything: use a filtered query. When the filters are part of the query section of your search, the aggregations/facets are affected too
  • You want general aggregations but filtered search results: add top-level filters to your search
  • You want one of your aggregations to cover the whole index: use the “global” aggregation (see the documentation)
  • You want several aggregations, each filtered differently: use filter aggregations (see the documentation)
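
As an illustration, the cases above could be combined in a single search along the following lines. This is only a sketch, assuming the Elasticsearch 1.x query syntax used throughout this post and the tags, publishedAt and published fields of our articles; the aggregation names (“tags_in_results”, “whole_index”, “published_only”, …) are made up for the example.

```json
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "range": {
          "publishedAt": {
            "gte": "2014-04-07T00:00:00Z"
          }
        }
      }
    }
  },
  "aggs": {
    "tags_in_results": {
      "terms": {
        "field": "tags"
      }
    },
    "whole_index": {
      "global": {},
      "aggs": {
        "all_tags": {
          "terms": {
            "field": "tags"
          }
        }
      }
    },
    "published_only": {
      "filter": {
        "term": {
          "published": true
        }
      },
      "aggs": {
        "published_tags": {
          "terms": {
            "field": "tags"
          }
        }
      }
    }
  },
  "size": 0
}
```

Here “tags_in_results” follows the filtered query, “whole_index” ignores it thanks to the global aggregation, and “published_only” narrows the query results further with its own filter.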

Get a tags list and the number of associated articles

First concrete case in this post: run a search to get the number of articles associated with each tag.
We need a “terms” aggregation. It gets every distinct value of “tags” and counts the number of results for each of them.

The query in JSON
To understand our search query, we show it in JSON, as it will be sent to Elasticsearch (and as you can write it in the head plugin or on the command line).

{
  "query": {
    "match_all" : {}
  },
  "aggs": {
    "tag": {
      "terms": {
        "field": "tags"
      }
    }
  },
  "size": 0
}

  • "aggs" : Sub-part of the query that contains the aggregations
  • "tag" : The aggregation name. Defined by the user
  • "terms" : The aggregation type
  • "field" : "tags" : The field the aggregation applies to. Here, we want to aggregate on the “tags” field

Note that we don’t want any search results, only the aggregations. We tell Elasticsearch not to return the hits with "size" : 0.

The result on our blog articles :

{
  took: 2
  timed_out: false
  _shards: {
    total: 5
    successful: 3
    failed: 0
  }
  hits: {
    total: 10
    max_score: 0
    hits: [ ]
  }
  aggregations: {
    tag: {
      buckets: [
        {
          key: symfony2
          doc_count: 9
        }
        {
          key: wsse
          doc_count: 3
        }
        {
          key: rest
          doc_count: 2
        }
        {
          key: android
          doc_count: 1
        }
        {
          key: csv
          doc_count: 1
        }
        {
          key: currency
          doc_count: 1
        }
        {
          key: jquery
          doc_count: 1
        }
        {
          key: knpmenubundle
          doc_count: 1
        }
        {
          key: twig
          doc_count: 1
        }
      ]
    }
  }
}

The structure is quite simple:

  • aggregations : Your aggregations, keyed by the aggregation names given in the query (“tag” in our case)
  • buckets : Our aggregation being of the “bucket” kind, we receive an array of objects as the result
  • key : The key of the current bucket (a tag in our case)
  • doc_count : The number of results (in our case, the number of articles for each tag)

We could imagine a more complex structure with sub-aggregations. This point is discussed below.

The search with ElasticaBundle and Symfony

Let’s translate this query for use with ElasticaBundle and Symfony. In the article search repository:

//Obtao\BlogBundle\Entity\SearchRepository\ArticleRepository.php

    public function getStatsQuery()
    {
        $query = new \Elastica\Query(new \Elastica\Query\MatchAll());

        // Simple aggregation (based on tags, we get the doc_count for each tag)
        $tagsAggregation = new \Elastica\Aggregation\Terms('tag');
        $tagsAggregation->setField('tags');

        $query->addAggregation($tagsAggregation);

        // we don't need the search results, only statistics
        $query->setSize(0);

        return $query;
    }

Note that our method returns a query, not search results. The reason is that we are not going to use the Finder (which transforms the Elasticsearch results into Doctrine objects) but the index directly (the raw search layer).

Now, let’s create our search in the controller :

//Obtao\BlogBundle\Controller\ArticleController
    public function statsAction(Request $request)
    {
        // getStatsQuery() takes no argument here: it only builds the aggregation query
        $query = $this->container->get('fos_elastica.manager')->getRepository('ObtaoBlogBundle:Article')->getStatsQuery();
        // Search the index directly to get the raw response (with its aggregations)
        $results = $this->get('fos_elastica.index.obtao_blog.article')->search($query);

        return $this->render('ObtaoBlogBundle:Article:stats.html.twig', array(
            'aggs' => $results->getAggregations()
        ));
    }

We simply run our query on the Article index. This returns a raw response rather than Doctrine objects.

The template:

{# ObtaoBlogBundle:Article:stats.html.twig #}
{% extends 'ObtaoBlogBundle::layout.html.twig' %}

{% block body %}
    {% for tagAgg in aggs.tag.buckets %}
        {{ tagAgg.key }} ({{ tagAgg.doc_count }}){% if not loop.last %}, {% endif %}
    {% endfor %}
{% endblock %}

That’s it! The result: symfony2 (9), wsse (3), rest (2), android (1), csv (1), currency (1), jquery (1), knpmenubundle (1), twig (1)

Get a distribution of articles per date, and for each date the categories which were used

You now understand the basics of Elasticsearch aggregations. So far, you could have done the same thing with facets.
Let’s now look at a more complex case.

A client (any client) would like to get the distribution of articles published per month and, for each month, the categories used.
This is possible thanks to sub-aggregations!

Here is an implementation in the repository:

    public function getStatsQuery(ArticleSearch $articleSearch = null)
    {
        $query = new \Elastica\Query(new \Elastica\Query\MatchAll());

        // More complex aggregation, we get categories for each month
        $dateAggregation = new \Elastica\Aggregation\DateHistogram('dateHistogram','publishedAt','month');
        $dateAggregation->setFormat("dd-MM-YYYY");
        $categoryAggregation = new \Elastica\Aggregation\Terms('category');
        $categoryAggregation->setField("category");

        $dateAggregation->addAggregation($categoryAggregation);

        $query->addAggregation($dateAggregation);

        // we don't need any results, only the statistics
        $query->setSize(0);

        return $query;
    }

We first create a DateHistogram aggregation named “dateHistogram”. This aggregation is based on the publishedAt field, with an interval of one month.
The key format in the result set will be "dd-MM-YYYY".

Then, we create a second aggregation, of type Terms, named “category”. But instead of adding this aggregation to the $query, we add it to our previous aggregation.
This creates a sub-aggregation, and our results will be aggregated by month and then by category.
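
For reference, the JSON sent to Elasticsearch for this query should look roughly like the sketch below. It is written by hand from the repository code above rather than captured from Elastica, so the exact serialization may differ slightly.

```json
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "dateHistogram": {
      "date_histogram": {
        "field": "publishedAt",
        "interval": "month",
        "format": "dd-MM-YYYY"
      },
      "aggs": {
        "category": {
          "terms": {
            "field": "category"
          }
        }
      }
    }
  },
  "size": 0
}
```

The nested "aggs" key inside “dateHistogram” is what makes “category” a sub-aggregation: it is computed separately for each monthly bucket.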

Once the query is executed, here are the results:

aggregations: {
  dateHistogram: {
    buckets: [
      {
        key_as_string: 01-11-2012
        key: 1351728000000
        doc_count: 1
        category: {
          buckets: [
              {
                  key: Symfony2
                  doc_count: 1
              }
          ]
        }
      }
      {
        key_as_string: 01-04-2013
        key: 1364774400000
        doc_count: 1
        category: {
          buckets: [
            {
              key: Symfony2
              doc_count: 1
            }
          ]
        }
      }
      {
        key_as_string: 01-05-2013
        key: 1367366400000
        doc_count: 2
        category: {
          buckets: [
            {
              key: Android and API
              doc_count: 1
            }
            {
              key: Symfony2
              doc_count: 1
            }
          ]
        }
      }
      {
        key_as_string: 01-06-2013
        key: 1370044800000
        doc_count: 1
        category: {
          buckets: [
            {
              key: Symfony2
              doc_count: 1
            }
          ]
        }
      }
      {
        key_as_string: 01-09-2013
        key: 1377993600000
        doc_count: 2
        category: {
          buckets: [
            {
              key: Android and API
              doc_count: 1
            }
            {
              key: Symfony2
              doc_count: 1
            }
          ]
        }
      }
    ]
  }
}

You can notice several things:

  • A “key_as_string” entry has appeared for the date aggregation: this is the date formatted according to the format we defined in the query (dd-MM-YYYY)
  • A “category” sub-aggregation appears inside each bucket of the dateHistogram aggregation. It contains all the categories used during the period (the sub-aggregation is “filtered” by the parent aggregation)

The template:

{% extends 'ObtaoBlogBundle::layout.html.twig' %}

{% block body %}
    <ul>
    {% for dateAgg in aggs.dateHistogram.buckets %}
        <li>{{ dateAgg.key_as_string }} ({{ dateAgg.doc_count }})
            <ul>
            {% for category in dateAgg.category.buckets %}
                <li>{{ category.key }}</li>
            {% endfor %}
            </ul>
        </li>
    {% endfor %}
    </ul>
{% endblock %}

You can do a lot with aggregations: the computation and Elasticsearch response times are impressive. With a few lines of code, it’s possible to feed the aggregations into a Google Chart, to customize the aggregation through a form, … Use your imagination.

Good luck!

3 thoughts on “Use aggregations for statistics with Symfony and Elasticsearch”

    1. Great article! It helped me a lot, but I would like to ask a question.

      I’m creating a GeoHash aggregation on users, and I would like to add a GeoBoundingBox filter, but after adding it the search method ignores it completely. Any idea why? My code looks like this:

      $query = new \Elastica\Query();
      $agg = new \Elastica\Aggregation\GeoHashGrid('zoom2', 'location');
      $agg->setPrecision(3);

      $query->addAggregation($agg);
      $filter = new \Elastica\Filter\GeoBoundingBox('location', array('52.288213,4.6026469', '51.8861425,4.4610554'));
      $query->setFilter($filter);
      $query->setSize(0);
      $userGroupedResults = $this->get('fos_elastica.index.search.user')->search($query);

      Thanks!

      • If someone needs it in the future, I managed to create the GeoHashGrid aggregation with a GeoBoundingBox filter in Symfony2, mixing ElasticaBundle and the Elastica PHP library:

        $query = new \Elastica\Query();
        $filter = new \Elastica\Filter\GeoBoundingBox('location', array('52.288213,4.6026469', '51.8861425,4.4610554'));
        $agg = new \Elastica\Aggregation\GeoHashGrid('zoom1', 'location');
        $agg->setPrecision(12);
        $aggFilter = new \Elastica\Aggregation\Filter('zoomedArea');
        $aggFilter->setFilter($filter);
        $aggFilter->addAggregation($agg);
        // Attach the filter aggregation to the query, otherwise it is never sent
        $query->addAggregation($aggFilter);
        $userGroupedResults = $this->get('fos_elastica.index.search.user')->search($query);

        return $this->render('XimMotortourerBundle:ElasticSearch:search.html.twig', array(
            'userGroupedResults' => $userGroupedResults->getAggregations()
        ));
