As WordPress is not very friendly to embedded code & has too many features, I switched to Octopress. Even though there are a couple of decent migration scripts around, I prefer to move posts manually: no tool can do justice to the original layout of my posts, and Octopress is a good opportunity to improve on the WordPress layouts. As such, some of my old posts will be moved and some will just be deleted forever. By the end of next month my old WordPress blog will contain just a redirect to my current domain: paulsabou.com.

Assuming that you want to be able to change an ES index in more advanced ways (e.g. adding custom analyzers), you will need to be able to recreate the index without disturbing your production system.

An efficient way to do this with ES alone is to make use of index aliases. An alias is basically just a smart forwarder that can map a name to one or more indexes. See more on aliases here: http://www.elasticsearch.org/guide/reference/api/admin-indices-aliases.html

Creating the index & the alias

You want to create an index with the name “bars” and then use/change it without any production “glitch”.

What you should do (a sketch follows the list):

  1. create an index with the name “bars_v1”
  2. create an alias with the name “bars” that points to “bars_v1”
  3. put all your data & percolated queries in “bars_v1”
  4. make sure that everyone who uses the index calls the alias “bars” instead of the actual index “bars_v1”
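With the ES REST API this initial setup could look roughly like this (a minimal sketch; the localhost:9200 endpoint and the empty index settings are assumptions):

```bash
# create the actual index (settings & mappings omitted for brevity)
curl -XPUT 'http://localhost:9200/bars_v1'

# point the alias "bars" at the real index "bars_v1"
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
  "actions": [
    { "add": { "index": "bars_v1", "alias": "bars" } }
  ]
}'
```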

Note: there is a difference between adding documents & adding percolated queries when using aliases

  • when you put plain documents in the “bars_v1” index you can use either “bars_v1” (the actual index name) or “bars” (the alias); both work the same (your operation will have the same effect)
  • when you add percolated queries to “bars_v1” make sure you use the actual index name (“bars_v1”) and not the alias “bars”; if you use the alias “bars” instead of the actual index name “bars_v1”, the percolated queries will not work (see the sketch below)
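For illustration, with the classic (pre-1.0) percolator API the query is registered in the special _percolator index under the actual index name, so the alias never enters the picture (the query name and body here are made up for this sketch):

```bash
# register a percolated query: note the ACTUAL index name "bars_v1", not the alias
curl -XPUT 'http://localhost:9200/_percolator/bars_v1/open_bars' -d '
{
  "query": { "term": { "status": "open" } }
}'
```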

Major changes to the index without any downtime

You want to make major changes to your index (“bars_v1”) but you cannot afford to take the index offline, so you should work on another index and then just redirect the alias when you are done.

What you should do:

  1. create an index with the name “bars_v2”
  2. put all your data & percolated queries in “bars_v2”
  3. when you are done, change the alias “bars” to point to the “bars_v2” index (the remove & add can be done atomically, as shown below)
  4. delete the index “bars_v1”
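The switch itself goes through the _aliases endpoint; both actions are sent in a single call, which ES applies atomically (host and port are assumptions):

```bash
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
  "actions": [
    { "remove": { "index": "bars_v1", "alias": "bars" } },
    { "add":    { "index": "bars_v2", "alias": "bars" } }
  ]
}'

# once the alias points to bars_v2, the old index can go
curl -XDELETE 'http://localhost:9200/bars_v1'
```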

Exact matching is a basic feature of many search apps. One simple application of this would be matching a location name to a document. For example: retrieve the document associated with the name “San Francisco”. To be able to do this we need two things:

  1. configure the index schema so that the “name” field is never tokenized: “San Francisco” and “San Mateo” will be indexed as two terms and not three (“San”, “Francisco”, “Mateo”)
  2. make sure that the query does exact matching: when searching for “San” it should not match “San Mateo” or “San Francisco”, but just “San”

1. Tokenization

Tokenization is one of the most important things to control in ElasticSearch/Lucene. Whenever we need to match proper names (countries, cities, companies, etc.) we need to do exact matching. As such the query “San Francisco” should not match “San Mateo”, but only “San Francisco”.

To get this behaviour we have to make sure that specific fields are not tokenized at indexing time, so that the whole text gets treated as one big term (or token). The effect of this is that the text “San Mateo” is indexed as the single term “San Mateo” instead of the two terms “San” & “Mateo”.

One of the easiest ways to do this is to use a PatternTokenizer that never matches anything. The pattern tokenizer is configured with a pattern that matches the token separators; if we give it a pattern that can never match anything (such as $^), it will find no separators and will therefore treat the whole text as one token.
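In the index settings this could look roughly as follows (the analyzer/tokenizer names, the bar type and the name field are assumptions of this sketch):

```bash
curl -XPUT 'http://localhost:9200/bars_v1' -d '
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "no_split": { "type": "pattern", "pattern": "$^" }
      },
      "analyzer": {
        "exact": { "tokenizer": "no_split" }
      }
    }
  },
  "mappings": {
    "bar": {
      "properties": {
        "name": { "type": "string", "analyzer": "exact" }
      }
    }
  }
}'
```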

2. Exact matching

There are two ways to do exact matching:

  1. using an ES term query
  2. using an ES string query

If we are concerned only with exact matching, then the only difference between them is that the term query does not accept custom analyzer filters, while the string query does. This might prove useful if we need ES to apply some text transformations on the query before execution. Three filters come to mind:

  1. lowercase: transforms the query into lowercase
  2. ascii folding: transforms non-ASCII characters into their closest ASCII equivalents
  3. trim: eliminates leading/trailing whitespace

I believe it’s wiser to have those transformations done on the server side (the ElasticSearch side), as this gives us the behaviour basically for free: no matter where we call ElasticSearch from, our queries are always processed the same way. If we used the term query, we would need to apply all those transformations on the client side; if we use multiple programming languages to send queries, this could prove rather difficult to maintain.
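Building on the earlier sketch, the three filters could be chained onto the same analyzer; lowercase, asciifolding and trim are the built-in ES token filters, while no_split and exact are the made-up names from before:

```bash
curl -XPUT 'http://localhost:9200/bars_v1' -d '
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "no_split": { "type": "pattern", "pattern": "$^" }
      },
      "analyzer": {
        "exact": {
          "tokenizer": "no_split",
          "filter": ["lowercase", "asciifolding", "trim"]
        }
      }
    }
  }
}'
```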

The main challenge in using a string query for matching is that the string query is tokenized twice. The reason for this is that the ElasticSearch query string is internally translated into a Lucene query string, which accepts a minimal DSL. As such your query can contain logical operators like AND, OR, NOT, etc., which is very useful for a basic site search box. The effect of this is that the custom tokenizers configured in the index document mapping run only after the DSL tokenizer has finished. To give an example:

  1. we have configured our custom analyzer so that there is no tokenization
  2. we search with a query like “San Mateo”
  3. first, ES/Lucene executes the DSL tokenization and our expression is tokenized as two expressions: “San” & “Mateo”
  4. secondly, it applies our tokenizer on each separate expression: once for “San” and once for “Mateo”
  5. finally, it generates two term queries, one for “San” and one for “Mateo”, and combines them in a boolean query with an OR: “San OR Mateo”
  6. we get all documents that match either “San” or “Mateo” (probably none, as there is no place in the world named just “San” or just “Mateo”)

So the DSL tokenization can give us the impression that the custom analyzers we have set have no effect. But they do; we just expect them to be the only ones that run. In reality they are just the second link in a chain that starts with the DSL tokenization.

The solution to this is rather simple. We have to put extra quotes inside our query to tell the DSL tokenizer that it should not tokenize “San Mateo” as two tokens but leave it as one expression. So, change the query from “San Mateo” to “\”San Mateo\”” and things will work out as expected.
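As a sketch (the index, field and host are assumptions), the escaped version of the query could be sent like this:

```bash
curl -XGET 'http://localhost:9200/bars/_search' -d '
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "\"San Mateo\""
    }
  }
}'
```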

The problem:

We would like to have an application that has two layers:

  1. a non-dynamic layer
  2. a dynamic layer

The non-dynamic layer could include things like:

  • database sessions
  • mail services
  • templating engines
  • basic beans/components etc.

To keep things simple: a bean/component belongs to the non-dynamic layer if you are OK with changing it only through redeploys. The examples mentioned (database sessions, etc.) clearly belong to this category, as we almost never want to change the database session settings while the application runs.

The dynamic layer could include things like:

  • configurable modules
  • rule processing chains
  • logging

The rule of thumb would be: anything that you might want to change while the application runs (without any redeploy) belongs to the dynamic layer.

The Spring Java framework can help a lot with this. There are mainly three ways to do it:

  1. multiple hierarchical application contexts
  2. dynamic language support
  3. OSGi

The easiest (and best performing) solution is to use multiple hierarchical application contexts.

1. Creating hierarchical application contexts

The principle behind this is that Spring application contexts form a tree, built from the bottom up. As such, every application context that we load (from an XML file or just a plain String) can have another application context as its parent.

The fact that the tree is built bottom-up is a useful metaphor to remember, as it constrains the visibility of the application contexts (and of the beans defined inside them):

  1. the child can see the parent
  2. the parent cannot see the child.

This makes sense: when the parent context was loaded, there was no child context available yet. As such the visibility/scope goes only one way. You cannot use the parent context to get a bean from the child context; only the other way around: use the child context to get a bean from the parent context.
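A minimal sketch of this one-way visibility (the context file names and bean names are assumptions):

```java
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class VisibilityDemo {
    public static void main(String[] args) {
        // the parent context: the stable, non-dynamic layer
        ClassPathXmlApplicationContext parent =
                new ClassPathXmlApplicationContext("parent-context.xml");

        // the child context is built "on top of" the parent
        ClassPathXmlApplicationContext child =
                new ClassPathXmlApplicationContext(
                        new String[] { "child-context.xml" }, parent);

        child.getBean("mailService");  // OK: the child can see the parent

        parent.getBean("nightlyJob");  // throws NoSuchBeanDefinitionException:
                                       // the parent cannot see the child
    }
}
```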

2. Managing multiple contexts – the simplest way to go

The easiest/cleanest way to work in this situation is to think of the child application context as a “second class citizen”. As such it can contain beans that refer to the parent/stable context, but you should avoid cross-referencing/cross-using beans from one child context to another (this is not possible through the Spring application context loader, but it could be done through various hacks in your code). A good example of a child application context would be that of a job that gets configured as a chain of multiple resources/beans available in the parent:
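A sketch of such a child context (the bean names are made up; fetchStep and transformStep are assumed to live in the parent context, and com.example.Job is assumed to take a list of steps):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd">

    <!-- a "job plan": an assembly of beans that already exist in the parent context -->
    <bean id="nightlyImportJob" class="com.example.Job">
        <property name="steps">
            <list>
                <ref bean="fetchStep"/>      <!-- defined in the parent context -->
                <ref bean="transformStep"/>  <!-- defined in the parent context -->
            </list>
        </property>
    </bean>
</beans>
```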

As you can see, the child context just defines “job plans”, which can be fanciful assemblies of existing parent beans. Whenever you want to call a job (e.g. by id) you just need to know in which child context the job was defined and then use that particular context to retrieve the bean. The straightforward way to do this is to keep a Map<String,AbstractApplicationContext> somewhere accessible in the parent context and, whenever you add a new child context, update your bookkeeping with all the beans from the child context that you know you will need to call directly (e.g. in the above example it would make sense to get all the beans that are an instance of the com.example.Job class, as I’m sure that I will call just job beans directly). A registry along those lines is sketched below.
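This is one possible shape for it (class and method names are assumptions; com.example.Job is the job class from the example above):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.AbstractApplicationContext;
import org.springframework.context.support.FileSystemXmlApplicationContext;

import com.example.Job;

/** Keeps track of the dynamic child contexts, keyed by a logical name. */
public class JobContextRegistry {

    private final ApplicationContext parent;
    private final Map<String, AbstractApplicationContext> children =
            new ConcurrentHashMap<String, AbstractApplicationContext>();

    public JobContextRegistry(ApplicationContext parent) {
        this.parent = parent;
    }

    /** Load: build a new child context on top of the long-lived parent. */
    public void load(String name, String xmlLocation) {
        children.put(name, new FileSystemXmlApplicationContext(
                new String[] { xmlLocation }, parent));
    }

    /** Collect the Job beans defined in one child context. */
    public Map<String, Job> jobsIn(String name) {
        return children.get(name).getBeansOfType(Job.class);
    }

    /** Refresh: re-read the (possibly updated) XML and rebuild the beans. */
    public void refresh(String name) {
        children.get(name).refresh();
    }

    /** Close: retire a child context and all the jobs it defines. */
    public void close(String name) {
        children.remove(name).close();
    }
}
```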

2.1 Load, Refresh & Close

The main reason why we want to have child application contexts is that we want our jobs to be flexible: we need to be able to create, update & delete jobs without any redeploy. This is straightforward in our situation:

If you have some mechanism to update the child application context XML files, all you need to do is refresh the child application contexts (or close and reload them), and you get a new “version” of the jobs.
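With the registry sketched above, the whole lifecycle stays a few lines (the file path and names are again made up; parentContext is the long-lived parent):

```java
JobContextRegistry registry = new JobContextRegistry(parentContext);

// create: load a brand-new set of jobs, then look them up by type
registry.load("import-jobs", "/etc/myapp/jobs/import-jobs.xml");
Map<String, Job> jobs = registry.jobsIn("import-jobs");

// update: after the XML file changed on disk, rebuild the beans in place
registry.refresh("import-jobs");

// delete: retire the whole set of jobs, no redeploy involved
registry.close("import-jobs");
```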

3. Pitfalls

  1. Don’t cross-reference beans from two different child contexts. What happens when you end up closing one of the contexts while the other one still needs it?
  2. Don’t use beans from the child context in the parent context. You cannot do this through Spring directly, but you can still assign them through your own code. When the child application context gets closed or refreshed, you would still keep references to the old versions.