Sorting by date proximity in Elasticsearch

data-analysis elasticsearch painless

Using a painless script we can sort documents in Elasticsearch by the proximity of a certain date. For example, sorting {1,2,3,4,5} by the proximity of the 3 gives [3,2,4,1,5], so each number is equally or more distant from 3 than its previous one.

The simplest way to achieve this is sorting by the absolute value of the difference of the date field of the document and the required reference date:

abs(<doc_timestamp> - <ref_timestamp>)

Let's create some data to test:

POST /test_index/test_type/_bulk
{ "index": {} }
{ "date": "2016-12-29" }
{ "index": {} }
{ "date": "2016-12-30" }
{ "index": {} }
{ "date": "2016-12-31" }
{ "index": {} }
{ "date": "2017-01-01" }
{ "index": {} }
{ "date": "2017-01-02" }
{ "index": {} }
{ "date": "2017-01-03" }
{ "index": {} }
{ "date": "2017-01-04" }
{ "index": {} }
{ "date": "2017-01-05" }

To sort these documents by proximity to January 2, 2017 first we need to calculate the timestamp for that date, we can get it easily in Groovy (and similarly in Painless and Java). The timestamp used is in milliseconds, but rounded to seconds for simplicity.

groovy:000> import java.util.Calendar
groovy:000> cal = Calendar.getInstance()
groovy:000> cal.set(2017, Calendar.JANUARY, 2, 0, 0, 0)
groovy:000> 1000L*(cal.getTimeInMillis()/1000 as int)
===> 1483315200000

The Elasticsearch query with the sort script is then as follows.

GET /test_index/test_type/_search
{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script": {
      "script": {
        "lang": "painless",
        "inline": "Math.abs(params.timestamp-doc['date'].value)",
        "params": {
          "timestamp": 1483315200000
        }
      },
      "type": "number",
      "order": "asc"
    }
  }
}

There are different possible combinations since some dates are equally distant to the reference date, for example the first and third of January are both one day away. The following output is reduced for simplicity.

"hits": [
  {
    "_source": {
      "date": "2017-01-02"
    },
    "sort": [
      0
    ]
  },
  {
    "_source": {
      "date": "2017-01-01"
    },
    "sort": [
      86400000
    ]
  },
  {
    "_source": {
      "date": "2017-01-03"
    },
    "sort": [
      86400000
    ]
  },
  {
    "_source": {
      "date": "2016-12-31"
    },
    "sort": [
      172800000
    ]
  },
  {
    "_source": {
      "date": "2017-01-04"
    },
    "sort": [
      172800000
    ]
  },
  {
    "_source": {
      "date": "2017-01-05"
    },
    "sort": [
      259200000
    ]
  },
  {
    "_source": {
      "date": "2016-12-30"
    },
    "sort": [
      259200000
    ]
  },
  {
    "_source": {
      "date": "2016-12-29"
    },
    "sort": [
      345600000
    ]
  }
]

Sorting just by proximity of a given month and date is just a bit more complex. Let's remove the previous data and create new documents with dates with different years.

POST /test_index/test_type/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}

POST /test_index/test_type/_bulk
{ "index": {} }
{ "date": "2013-06-19" }
{ "index": {} }
{ "date": "2013-06-20" }
{ "index": {} }
{ "date": "2014-06-22" }
{ "index": {} }
{ "date": "2015-06-21" }
{ "index": {} }
{ "date": "2016-06-21" }
{ "index": {} }
{ "date": "2017-06-23" }

We are going to sort by the difference of days between the document date field and the reference month and day. Taking into account the periodicity of months in a year (12) and days in a month (30.5, approx.), this is the minimum value of the absolute difference of days and the number of days in a year minus the absolute difference of days:

diff = abs((30.5*<doc_month> + <doc_day>) - (30.5*<ref_month> + <ref_day>))
min(diff, 365-diff)

Taking the number of days in a month as 30.5 will produce an error of 0.5 in many values, this will produce mistakes in the sort order only in documents where February is between the document and reference date, so an adjust of the script might be required. For the rest of the documents the error won't affect the sort order because the documents with that error will appear in the proper position since their neighbour will have the same real value.

Then, the Elasticsearch query with the script to sort by proximity to June 21 is:

GET /test_index/test_type/_search
{
  "query": {
    "match_all": {}
  },
  "sort": {
    "_script": {
      "script": {
        "lang": "painless",
        "inline": "double diff=Math.abs(30.5*doc['date'].date.monthOfYear+doc['date'].date.dayOfMonth-30.5*params.month-params.day);return Math.min(diff,365-diff)",
        "params": {
          "month": 6,
          "day": 21
        }
      },
      "type": "number",
      "order": "asc"
    }
  }
}

And the simplified output:

"hits": [
  {
    "_source": {
      "date": "2015-06-21"
    },
    "sort": [
      0
    ]
  },
  {
    "_source": {
      "date": "2016-06-21"
    },
    "sort": [
      0
    ]
  },
  {
    "_source": {
      "date": "2014-06-22"
    },
    "sort": [
      1
    ]
  },
  {
    "_source": {
      "date": "2013-06-20"
    },
    "sort": [
      1
    ]
  },
  {
    "_source": {
      "date": "2017-06-23"
    },
    "sort": [
      2
    ]
  },
  {
    "_source": {
      "date": "2013-06-19"
    },
    "sort": [
      2
    ]
  }
]