Natural sort for Elasticsearch

1. Example for Elasticsearch 5.x

The following is an introduction on how to use natural sort plugin in Elasticsearch.

Let’s assume we have a list with mixed textual und numeric content, like Bob’s points he received from three teachers, and we want to sort the statements with regard to the points.

The teachers gave him these statements: "Bob: 2 points", "Bob: 3 points", "Bob: 10 points".

Ordinary sort using the canonical string order would give us the list "Bob: 10 points", "Bob: 2 points", "Bob: 3 points". That is obviously not the order we want.

Natural sort tokenizes the content, and reformats the numeric parts to a string with leading zeroes before sorting, so the result will be different: "Bob: 2 points", "Bob: 3 points", "Bob: 10 points".

How does look like using Elasticsearch? The commands are encoded in Sense syntax.

PUT /test
{
   "settings": {
      "index": {
         "analysis": {
            "analyzer": {
               "natural": {
                  "type": "naturalsort",
                  "locale": "en",
                  "digit": 5,
                  "maxTokens": 5
               }
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "points": {
               "type": "text",
               "store": true,
               "fields": {
                  "encoded": {
                     "type": "text",
                     "fielddata": true,
                     "analyzer": "natural"
                  }
               }
            }
         }
      }
   }
}

PUT /test/doc/1
{
  "points" :  "Bob: 2 points"
}

PUT /test/doc/2
{
  "points" : "Bob: 3 points"
}

PUT /test/doc/3
{
  "points" : "Bob: 10 points"
}

POST /test/_search
{
   "query": {
       "match_all" : {}
    },
   "stored_fields" : "points",
    "sort" : {
            "points.encoded" : {
                 "order" : "asc"
            }
    }
}

Response

{
   "took": 15,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": null,
      "hits": [
         {
            "_index": "test",
            "_type": "doc",
            "_id": "1",
            "_score": null,
            "fields": {
               "points": [
                  "Bob: 2 points"
               ]
            },
            "sort": [
               "\u0000T\u0000b\u0000T\u0000\u0006\u0000G\u0000H\u0000c\u0000b\u0000\\\u0000a\u0000g\u0000f\u0000\u0000\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000w\u0000\u0001\u0000\u0001\u0000w\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0000\u0000\u0002\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001"
            ]
         },
         {
            "_index": "test",
            "_type": "doc",
            "_id": "2",
            "_score": null,
            "fields": {
               "points": [
                  "Bob: 3 points"
               ]
            },
            "sort": [
               "\u0000T\u0000b\u0000T\u0000\u0006\u0000G\u0000I\u0000c\u0000b\u0000\\\u0000a\u0000g\u0000f\u0000\u0000\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000w\u0000\u0001\u0000\u0001\u0000w\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0000\u0000\u0002\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001"
            ]
         },
         {
            "_index": "test",
            "_type": "doc",
            "_id": "3",
            "_score": null,
            "fields": {
               "points": [
                  "Bob: 10 points"
               ]
            },
            "sort": [
               "\u0000T\u0000b\u0000T\u0000\u0006\u0000H\u0000G\u0000F\u0000c\u0000b\u0000\\\u0000a\u0000g\u0000f\u0000\u0000\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000w\u0000\u0001\u0000\u0001\u0000\u0001\u0000w\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0000\u0000\u0002\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001\u0000\u0001"
            ]
         }
      ]
   }
}

2. Collation keys

Internally, the natural sort plugin uses a collation key, the same as returned from java.text.Collator#getCollationKey(String), for defining the sorting order. The collation keys are encoded in binary form, can be compared bitwise and work with Elasticsearch sort operation. They look a bit weird in the returned result, but that can be ignored.

Do not upgrade the Java VM or change the collation key implementation during the lifetime of the index. The comparison algorithm of java.text.Collator is not portable and is only guaranteed to work on the same version of the Java VM.

3. Options

The natural sort plugin offers the following options which can be passed to the analyzer definition in the mapping:

locale	the locale for the collator
digits	the number of digits used for numeric parts in the field. Can be in the range from `1` to `9`. Default is `9`
maxTokens	the number of tokens which are used while text/numeric tokenization of the field. If there are more tokens, they will be ignored for the sort key. Default is `5`.
bufferSize	internal buffer size for the underlying `KeywordTokenizer`. Default is 256 or `KeywordTokenizer.DEFAULT_BUFFER_SIZE`

4. Gradle test report

The current test report is here