Searching with suggestions is a useful feature for every search solution.

Imagine a user enters search terms

Elasticsearch is a distributed, scalable, RESTful open-source search engine based on Lucene. One of its most attractive features is its extensibility through plugins.

Sometimes it is reported as a disadvantage that Elasticsearch is developed by a single person, Shay Banon. But it should be emphasized that he equipped Elasticsearch with a convenient plugin mechanism to encourage participation. The plugin mechanism lets other developers extend Elasticsearch with many interesting features, such as analyzers, rivers, or connection mechanisms to external libraries and systems, without having to step into every detail of the Elasticsearch core.

In this writeup, I want to show how easy it is to make use of some of the most interesting features of Elasticsearch. The example explains

  • how to extend Elasticsearch with a RESTful command called _termlist
  • how to distribute and collect basic data structures within the shard architecture
  • how to expose Lucene's IndexReader terms() method via the Elasticsearch API

Before we get to the gory details, the first step is to step back and get an overview of the Elasticsearch philosophy.

  • We need to understand the API that Elasticsearch uses to receive requests and to send responses.
  • We need an action which distributes the job from a node to the shards.
  • And finally, we need the implementation of the real low-level work on the shard level.

The plan of attack:

  • Write a Plugin class, for attaching the plugin to the Elasticsearch core
  • Write a REST action class for API interaction
  • Write Request and Response classes for the node that receives the REST action
  • Write a Transport action that is dedicated to the process that works on shard level

Source code

The source code of the example plugin can be found at

https://github.com/jprante/elasticsearch-index-termlist

Writing a Plugin class

For the example plugin, we choose the name termlist. The class extends AbstractPlugin. Elasticsearch is divided into logical parts called modules, and each plugin may hook into them. For our simple plugin, we need to hook into a RestModule and an ActionModule of the Elasticsearch core. By using injection, the Elasticsearch core invokes the onModule() methods of our plugin and sets up the necessary structures in the background for us. All we have to do is:

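// Register the REST handler that serves the _termlist command.
public void onModule(RestModule module) {
    module.addRestAction(RestTermlistAction.class);
}

// Bind the generic action to the transport action that runs on the shards.
public void onModule(ActionModule module) {
    module.registerAction(TermlistAction.INSTANCE, TransportTermlistAction.class);
}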

So we have declared three action classes, RestTermlistAction, TermlistAction, and TransportTermlistAction, which we need to implement in the following steps.
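To illustrate how a core could pick the right onModule() overload for each module, here is a minimal, self-contained sketch that dispatches by parameter type using reflection. It is not Elasticsearch code: ToyModule, the simplified RestModule and ActionModule, TermlistPluginSketch, and OnModuleDispatcher are hypothetical stand-ins, and actions are registered as plain strings instead of classes.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; they only mirror the names used in this article.
interface ToyModule {}

class RestModule implements ToyModule {
    final List<String> restActions = new ArrayList<>();
    void addRestAction(String name) { restActions.add(name); }
}

class ActionModule implements ToyModule {
    final List<String> actions = new ArrayList<>();
    void registerAction(String action, String transportAction) {
        actions.add(action + " -> " + transportAction);
    }
}

class TermlistPluginSketch {
    public void onModule(RestModule module) {
        module.addRestAction("RestTermlistAction");
    }

    public void onModule(ActionModule module) {
        module.registerAction("TermlistAction", "TransportTermlistAction");
    }
}

class OnModuleDispatcher {
    // Invokes every public onModule(X) overload whose parameter type accepts
    // the given module, and returns how many overloads were called.
    static int dispatch(Object plugin, ToyModule module) {
        int calls = 0;
        for (Method m : plugin.getClass().getMethods()) {
            if (m.getName().equals("onModule")
                    && m.getParameterCount() == 1
                    && m.getParameterTypes()[0].isInstance(module)) {
                try {
                    m.setAccessible(true); // the toy plugin class is not public
                    m.invoke(plugin, module);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("onModule failed", e);
                }
                calls++;
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        TermlistPluginSketch plugin = new TermlistPluginSketch();
        RestModule rest = new RestModule();
        ActionModule action = new ActionModule();
        dispatch(plugin, rest);
        dispatch(plugin, action);
        System.out.println(rest.restActions);
        System.out.println(action.actions);
    }
}
```

The point of the design is that the plugin author only declares which modules are of interest by the parameter type of each onModule() method; the wiring happens elsewhere.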

Writing a REST Action class

In the action class RestTermlistAction, we register the REST URI path patterns and implement the main request/response method handleRequest, which evaluates the query parameters and writes the JSON response.

Here, we just need to fetch the values out of the URI info, transfer them to a global request TermlistRequest, and invoke a client that will produce a TermlistResponse. Elasticsearch offers a RestXContentBuilder API for easy JSON wrapping of our response data. The term list we want to generate is an array of strings, which is built by writing the line

builder.array("terms", response.getTermlist().toArray());

When failures occur, we use Elasticsearch's XContentThrowableRestResponse to create a failure response.
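To make the shape of the "terms" array concrete, the following sketch renders a list of terms as JSON by hand. It does not use the Elasticsearch builder; TermsJsonSketch and its methods are hypothetical, and the assumption that the array sits alone in the top-level object is made only for illustration (the real response may carry additional fields).

```java
import java.util.Arrays;
import java.util.List;

class TermsJsonSketch {
    // Quotes a string and escapes the characters JSON does not allow raw.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    // Renders {"terms":[...]} for the given terms, in list order.
    static String render(List<String> terms) {
        StringBuilder sb = new StringBuilder("{\"terms\":[");
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(quote(terms.get(i)));
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        // prints {"terms":["elastic","search"]}
        System.out.println(render(Arrays.asList("elastic", "search")));
    }
}
```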

Writing a Request and Response class

The termlist plugin should work on the whole cluster or on Elasticsearch indexes. This is done by formulating the request TermlistRequest as an extension of a BroadcastOperationRequest. This kind of request will be distributed automagically to all the shards that participate in storing the index (or indexes) requested. If no indexes are given in the request, the whole cluster is addressed.

We add a custom parameter field besides the index parameter, so that we can expose the terms of Lucene's IndexReader filtered by a given field name. Confining term lists to the terms of a single field is a good idea, for instance if there are a lot of unique terms in the index.

Elasticsearch requires us to make the request and the response streamable. Streams are used by the network transport layer to serialize the request to the nodes. All we need to do is write a few lines of code for the field parameter in readFrom(StreamInput in) and writeTo(StreamOutput out). This mechanism is responsible for both submitting and receiving the parameters, from node to node.

The TermlistResponse goes in the opposite direction after the work is done, from the participating shards to the node that will produce the API response. At instantiation of the class, we know the numbers of total shards, successful shards, and failed shards, and a set of terms, the termlist for the RestXContentBuilder. The response must be streamable just like the request class.
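The symmetry between writeTo and readFrom can be sketched with the JDK's DataOutput and DataInput standing in for Elasticsearch's StreamOutput and StreamInput. This is an assumption made to keep the example self-contained; TermlistRequestSketch and its roundTrip helper are hypothetical and not part of the plugin. The essential rule is that both methods handle the fields in exactly the same order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class TermlistRequestSketch {
    String[] indices = new String[0];
    String field; // null means "do not filter by field"

    // Serializes the request; readFrom must consume the same sequence.
    void writeTo(DataOutput out) throws IOException {
        out.writeInt(indices.length);
        for (String index : indices) {
            out.writeUTF(index);
        }
        out.writeBoolean(field != null); // presence flag for the optional field
        if (field != null) {
            out.writeUTF(field);
        }
    }

    // Restores the request from the byte stream written by writeTo.
    void readFrom(DataInput in) throws IOException {
        indices = new String[in.readInt()];
        for (int i = 0; i < indices.length; i++) {
            indices[i] = in.readUTF();
        }
        field = in.readBoolean() ? in.readUTF() : null;
    }

    // Simulates sending the request to another node and reading it there.
    static TermlistRequestSketch roundTrip(TermlistRequestSketch request) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            request.writeTo(new DataOutputStream(bytes));
            TermlistRequestSketch copy = new TermlistRequestSketch();
            copy.readFrom(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TermlistRequestSketch request = new TermlistRequestSketch();
        request.indices = new String[] {"books"};
        request.field = "title";
        TermlistRequestSketch copy = roundTrip(request);
        System.out.println(copy.indices[0] + " / " + copy.field);
    }
}
```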

Writing the Action class

The action class TermlistAction is glue code that ties together TermlistRequest, TermlistResponse, and a TermlistRequestBuilder into a logical unit. All we need to do is take care of the generics of the class and fill out the instantiation helpers newRequestBuilder() and newResponse(). From now on, the Elasticsearch internal client knows how to construct the request and response structures for our action. But up to this point, no shard logic is involved.

Writing the Transport Action class

The shard-level logic is done by TransportTermlistAction, a class extending TransportBroadcastOperationAction. This action class holds the real execution code of our _termlist command. Here, we can inject an IndicesService, which is the magic Elasticsearch doorway to the Lucene IndexReader internals. In the shardOperation method, we can extract an Engine.Searcher object for our current shard ID, from which we obtain the Lucene IndexReader terms in a loop.

After collecting the terms into a string set, we send them back in a ShardTermlistResponse. ShardTermlistRequest and ShardTermlistResponse are sibling classes to TermlistRequest and TermlistResponse on the shard level, extending BroadcastShardOperationRequest and BroadcastShardOperationResponse respectively.
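The scatter/gather flow described above can be sketched without Elasticsearch or Lucene. In this toy model a shard is just a map from field name to terms, a null entry simulates a failed shard, and the node merges the per-shard term sets while counting successes and failures. ShardTermlistSketch, its nested Response class, and the broadcast method are hypothetical names chosen for illustration; the real plugin reads terms from the IndexReader instead.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

class ShardTermlistSketch {
    // Node-level result: shard counters plus the merged, sorted term set.
    static final class Response {
        final int totalShards;
        final int successfulShards;
        final int failedShards;
        final SortedSet<String> termlist;

        Response(int total, int successful, int failed, SortedSet<String> termlist) {
            this.totalShards = total;
            this.successfulShards = successful;
            this.failedShards = failed;
            this.termlist = termlist;
        }
    }

    // Shard-level work: collect the terms of one shard, optionally
    // restricted to a single field (field == null means all fields).
    static SortedSet<String> shardOperation(Map<String, List<String>> shard, String field) {
        if (shard == null) {
            throw new IllegalStateException("shard not available");
        }
        SortedSet<String> terms = new TreeSet<>();
        for (Map.Entry<String, List<String>> entry : shard.entrySet()) {
            if (field == null || field.equals(entry.getKey())) {
                terms.addAll(entry.getValue());
            }
        }
        return terms;
    }

    // Node-level work: run the shard operation on every shard and merge.
    static Response broadcast(List<Map<String, List<String>>> shards, String field) {
        int successful = 0;
        int failed = 0;
        SortedSet<String> merged = new TreeSet<>();
        for (Map<String, List<String>> shard : shards) {
            try {
                merged.addAll(shardOperation(shard, field));
                successful++;
            } catch (RuntimeException e) {
                failed++; // a failing shard does not abort the whole request
            }
        }
        return new Response(shards.size(), successful, failed, merged);
    }
}
```

Using a sorted set for the merge removes duplicates that occur when the same term lives on several shards, which is the common case in a distributed index.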

Besides that, we can control threading and some other shard-related settings on this level. For example, it is possible to address only a subset of shards, such as the primary shards, to save resources.

Summary

We have learned how to write a simple Elasticsearch plugin by understanding that such an extension is separated into three actions, one for REST, one for the node level to orchestrate the shards, and one for the shard level to perform the low-level work. Each level is connected by a pair of Request/Response classes. While the Request/Response classes can stream data, the REST action builds the response in memory with a RestXContentBuilder.

By extending the Request classes, we learned how we can also pass additional parameters we need for our operation. By extending the Response classes, we learned how to pass our own result structure back to the REST API surface.

We are now able to

  • write basic Elasticsearch plugin code
  • extend the REST API of Elasticsearch with a custom command
  • expose Lucene's IndexReader to the Elasticsearch API

Of course, there is a lot more that Elasticsearch can offer by using the plugin architecture. In this text, we could only scratch the surface. Have fun exploring Elasticsearch for more exciting opportunities!

