
Surfacing popular keywords or tags is what Twitter’s trends are about (other than being a means to manipulate public opinion and introduce artificial trends in exchange for good cash). While I think a “trends” feature tends to introduce more problems than the discovery value it provides, I wanted to figure out how I’d build it before deciding not to.

There are many ways it could be achieved. It’s probably possible to write some pretty baroque SQL queries and/or do massive application-side aggregation to deliver the desired results. I’m sure it’s also possible to compute the trends incrementally as posts trickle in. This time, however, I decided to see whether a single Elasticsearch query could get me what I want.

The way I want to think of a trend is the set of most popular keywords in a given timeframe, where “popular” is measured by the number of people using the keyword. Counting people instead of posts avoids the situation where a small number of very active (spamming) people take over the trends.

In a data structure where a post has an author and may have multiple (or no) tags, I’d need an aggregation that gives me the tags with the highest number of distinct authors for each timeframe. I thought this would be trivial to implement in Elasticsearch, but it turned out to be a bit trickier than that.
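
For illustration, a post index along those lines could be mapped roughly like this (the index and field names here are made up for the example, not a real schema):

PUT posts
{
  "mappings": {
    "properties": {
      "created_at": { "type": "date" },
      "author_id":  { "type": "keyword" },
      "tags":       { "type": "keyword" }
    }
  }
}

In Elasticsearch any field can hold zero or more values, so a plain keyword field is enough for the tags.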


Since I don’t have access to a sufficiently large dataset (yet), I decided to use the Elasticsearch demo environment for testing instead. The logs-* indices have enough data, and I used the geo continent name in place of tags and the country name in place of authors. This means that even if a tag (continent) has the most documents, it’s gonna “lose” to tags (continents) with more authors (countries) if it only has a few authors (countries). In this dataset that’s the case with North America: while it has by far the most documents, it only spans a few countries, so it hardly ever makes it to the top.

The way to achieve this is by nesting a cardinality aggregation inside a terms aggregation. If you want the trends over time (e.g. daily) instead of for a fixed period (e.g. the last n hours), also wrap a date histogram around all of this. The important thing is that the terms aggregation has to be ordered by the nested cardinality (unique count) aggregation. Using the demo logs-* index, the aggregation would be something like this:

"aggs": {
  "over_time": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "1d"
    },
    "aggs": {
      "trends": {
        "terms": {
          "field": "client.geo.continent_name",
          "size": 5,
          "order": {
            "over_countries": "desc"
          }
        },
        "aggs": {
          "over_countries": {
            "cardinality": {
              "field": "client.geo.country_iso_code"
            }
          }
        }
      }
    }
  }
}

Also, don’t forget to add "size": 0 to the query unless you actually want all the matching documents in the given timeframe too (and I definitely wouldn’t want to get millions of docs in one response).
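
Put together for a single fixed window instead of a histogram, a complete request against the demo data might look something like this — the last-24-hours range filter is just an illustrative way of restricting the timeframe:

GET logs-*/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": { "gte": "now-24h" }
    }
  },
  "aggs": {
    "trends": {
      "terms": {
        "field": "client.geo.continent_name",
        "size": 5,
        "order": { "over_countries": "desc" }
      },
      "aggs": {
        "over_countries": {
          "cardinality": {
            "field": "client.geo.country_iso_code"
          }
        }
      }
    }
  }
}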

One thing to pay attention to: this approach is not suitable if you need results that are guaranteed to be exact. The cardinality aggregation gives “an approximate count of distinct values”, so it’s only as accurate as the underlying HyperLogLog++ algorithm and its precision configuration. For my use case (calculating trends on a social network) it’s definitely good enough, though. Just don’t organize a hashtag war competition in your fandom, please.
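
If you need the counts to be closer to exact, the cardinality aggregation accepts a precision_threshold setting (at the cost of more memory per bucket); something like this inside the aggregation above:

"over_countries": {
  "cardinality": {
    "field": "client.geo.country_iso_code",
    "precision_threshold": 10000
  }
}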