As WordPress is not very friendly to embedded code and has far more features than I need, I switched
to Octopress. Even though there are a couple of decent migration
scripts around, I prefer to move the posts manually. No tool can do justice to the original
layout of my posts, and Octopress is a good opportunity to improve on the WordPress layouts.
As such, some of my old posts will be moved and some will simply be deleted forever. By the end
of next month my old WordPress blog will contain just a redirect to my current domain: paulsabou.com.
You want to create an index with the name “bars” and then use/change it without any production “glitch”.
What you should do:
create an index with the name “bars_v1”
create an alias with the name “bars” that points to “bars_v1”
put all your data & percolated queries in “bars_v1”
make sure that everyone who uses the index calls the alias “bars” instead of the actual index “bars_v1”
Note: the difference between adding documents & percolated queries when using aliases
when you put just documents in the “bars_v1” index you can use either “bars_v1” (the actual index name) or “bars” (the alias); both work the same (your operation will have the same effect)
when you add percolated queries to “bars_v1”, make sure you use the actual index name (“bars_v1”) and not the alias “bars”; if you use the alias instead of the actual index name, the percolated queries will not work
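The alias step above goes through the standard aliases endpoint; a minimal sketch, using the index and alias names from the text:

```json
{
  "actions": [
    { "add": { "index": "bars_v1", "alias": "bars" } }
  ]
}
```

This body would be POSTed to /_aliases. Per the note above, percolated queries should still be registered against the concrete name “bars_v1”, not against “bars”.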
Major changes to the index without any downtime
You want to make major changes to your index (“bars_v1”) but you cannot afford to take the index offline,
so you should work on another index and then just redirect the alias when you are done.
What you should do:
create an index with the name “bars_v2”
put all your data & percolated queries in “bars_v2”
when you are done, change the alias “bars” to point to the “bars_v2” index (this can be done atomically)
delete the index “bars_v1”
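Both alias actions can be sent in a single call to /_aliases, which is what makes the switch atomic; a sketch with the names from the text:

```json
{
  "actions": [
    { "remove": { "index": "bars_v1", "alias": "bars" } },
    { "add":    { "index": "bars_v2", "alias": "bars" } }
  ]
}
```

Because both actions are part of one request, no client ever observes a moment where “bars” points to nothing.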
Exact matching is a basic feature of many search apps. One simple application of this is matching a location name to a document. For example: retrieve the document associated with the name “San Francisco”.
To be able to do this we need two things:
configure the index schema so that the “name” field is never tokenized: “San Francisco” and “San Mateo” will be indexed as two terms, not three (“San”, “Francisco”, “Mateo”)
make sure the query does exact matching: when searching for “San” it should not match “San Mateo” or “San Francisco”, only “San”
Tokenization is one of the most important things to control in ElasticSearch/Lucene.
Whenever we need to match proper names (countries, cities, companies etc.) we need to do exact
matching: the query “San Francisco” should not match “San Mateo”, only “San Francisco”.
We therefore have to make sure that these fields are not tokenized at indexing time and that the whole text
gets treated as one big term (or token). The effect is that the text “San Mateo” is indexed as the
single term “San Mateo” instead of the two terms “San” and “Mateo”.
One of the easiest ways to do this is to use a PatternTokenizer that never matches anything.
The pattern tokenizer is configured with the pattern for the token separators, so here it should be
configured with a pattern that matches nothing (the pattern $^ never matches anything). As the tokenizer then finds no separators, it considers the whole text as one token.
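As a sketch, the index settings could look like the following (the tokenizer and analyzer names no_split and exact_name are invented for illustration, and the document type “bar” is assumed; only the “name” field comes from the examples above):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "no_split": { "type": "pattern", "pattern": "$^" }
      },
      "analyzer": {
        "exact_name": { "type": "custom", "tokenizer": "no_split" }
      }
    }
  },
  "mappings": {
    "bar": {
      "properties": {
        "name": { "type": "string", "analyzer": "exact_name" }
      }
    }
  }
}
```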
If we are concerned only about exact matching, then the only difference between the term query and the string query is that the term query does not accept custom analyzers/filters, while the string query does. This might prove useful if we need ES to apply some text transformations to the query before execution. Two filters come to mind:
ascii folding: transforms all the non-ASCII characters into ASCII characters
trim: eliminates the leading/trailing whitespace
I believe it’s wiser to have those transformations done on the server side (the ElasticSearch side), as this gives them to us basically for free: no matter where we call ElasticSearch from, our queries are always processed the same way. If we used the term query we would need to apply all those transformations on the client side, and if we send queries from multiple programming languages this could prove rather difficult to maintain.
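Sticking with the non-splitting analyzer idea, the two filters can simply be appended to the analyzer chain (asciifolding and trim are built-in ElasticSearch token filter names; the analyzer/tokenizer names here are invented):

```json
{
  "analysis": {
    "tokenizer": {
      "no_split": { "type": "pattern", "pattern": "$^" }
    },
    "analyzer": {
      "exact_name": {
        "type": "custom",
        "tokenizer": "no_split",
        "filter": [ "asciifolding", "trim" ]
      }
    }
  }
}
```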
The main challenge in using a string query for matching is that the string query is tokenized twice. The reason is that the ElasticSearch query string is internally translated into a Lucene query string, which accepts a minimal DSL. As such your query can contain logical operators like AND, OR, NOT, etc., which is very useful for a basic site search box.
The effect of this is that the custom tokenizers configured in the index document mapping run only after the DSL tokenizer has finished. To give an example:
we have configured our custom analyzer so that there is no tokenization
we search with a query like “San Mateo”
first, ES/Lucene executes the DSL tokenization and our expression is split into two expressions: “San” and “Mateo”
second, it applies our tokenizer to each separate expression: once for “San” and once for “Mateo”
finally, it generates two term queries, one for “San” and one for “Mateo”, and combines them into a boolean query with an OR: “San OR Mateo”
we get all documents that match either “San” or “Mateo” (probably none, as no place in the world is named just “San” or just “Mateo”)
So the DSL tokenization can give us the impression that the custom analyzers we have set have no effect. But they do; we just expect them to be the only ones that run, while in reality they are the second link in a chain that starts with the DSL tokenization.
The solution is rather simple. We have to put extra quotes inside our query to tell the DSL tokenizer that it should not split “San Mateo” into two tokens but leave it as one expression. So change the query from “San Mateo” to “\”San Mateo\”” and things will work as expected.
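Wrapped in a query body, the escaped version might look like this (the field name “name” is assumed, as in the earlier examples):

```json
{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "\"San Mateo\""
    }
  }
}
```

The inner escaped quotes make the Lucene DSL treat San Mateo as a single phrase instead of two OR-ed terms.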
We would like to have an application that has two layers:
a non-dynamic layer
a dynamic layer
The non-dynamic layer could include things like:
* database sessions
* mail services
* templating engines
* basic beans/components
To keep things simple: a bean/component belongs to the non-dynamic layer if you are OK
with changing it only through redeploys. The examples mentioned (i.e. database sessions, etc.)
clearly belong to this category, as we almost never want to change the database session settings
while the application is running.
The dynamic layer could include things like:
rule processing chains
The rule of thumb is: anything that you might want to change while the application is running
(without any redeploy) belongs to the dynamic layer.
The Spring Java framework can help a lot with this. There are two main ways to do it:
multiple hierarchical application contexts
dynamic language support
The easiest (and best performing) solution is to use multiple hierarchical application contexts.
1. Creating hierarchical application contexts
The principle behind this is that the Spring application context is similar to a
tree, built bottom-up. Every application context that we load (from an XML file
or just a plain String) can have another application context as its parent.
The fact that the tree is built bottom-up is a useful metaphor to remember, as it constrains the visibility of the application contexts (and of the beans defined inside them):
the child can see the parent
the parent cannot see the child.
This makes sense: when the parent context was loaded, there was no child context available yet.
So the visibility/scope goes only one way. You cannot use the parent context to get a bean from the child context; only the other way around: use the child context to
get a bean from the parent context.
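In Spring itself the parent is passed at construction time, e.g. via new ClassPathXmlApplicationContext(new String[] { "child.xml" }, parentContext). Spring aside, the one-way visibility rule can be sketched with a minimal, hypothetical lookup class (all names invented):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal, Spring-independent sketch of hierarchical bean lookup:
// a child context falls back to its parent, but a parent knows
// nothing about its children.
class SimpleContext {
    private final SimpleContext parent; // null for the root context
    private final Map<String, Object> beans = new HashMap<String, Object>();

    SimpleContext(SimpleContext parent) {
        this.parent = parent;
    }

    void register(String name, Object bean) {
        beans.put(name, bean);
    }

    // Resolve locally first, then walk up to the parent.
    // The parent never walks down into its children.
    Object getBean(String name) {
        Object bean = beans.get(name);
        if (bean == null && parent != null) {
            return parent.getBean(name);
        }
        return bean;
    }
}
```

With this sketch, child.getBean("mailService") finds a bean registered in the parent, while calling getBean on the parent for a child-only bean finds nothing.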
2. Managing multiple contexts – the simplest way to go
The easiest/cleanest way to work in this situation is to think of the child application context as a “second class citizen”. It can contain beans that
refer to the parent/stable context, but you should avoid cross-referencing beans from one child context to another (this is not possible through the Spring
application context loader, though it could be done through various hacks in your code).
A good example of a child application context is that of a job that gets configured as a chain of multiple resources/beans available in the parent:
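A hypothetical child context of this shape (the bean id and the steps property are invented for illustration; templatingEngine and mailService stand for beans defined in the parent context, and com.example.Job is the job class named later in the text):

```xml
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans.xsd">

    <!-- A "job plan": an assembly of beans that already
         live in the parent context. -->
    <bean id="nightlyReportJob" class="com.example.Job">
        <property name="steps">
            <list>
                <ref bean="templatingEngine"/>
                <ref bean="mailService"/>
            </list>
        </property>
    </bean>

</beans>
```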
As you can see, the child context just defines “job plans”, which can be fanciful assemblies of existing parent beans.
Whenever you want to call a job (i.e. by id) you just need to know in which child context the job was defined, and then
use that particular context to instantiate the bean. The straightforward way to do this is to keep a Map<String,AbstractApplicationContext>
somewhere accessible in the parent context and, whenever you add a new child context, update the map with all the beans from the child context
that you know you will need to call directly (in the example above it would make sense to register all the beans that are instances of the
com.example.Job class, as I’m sure that I will only call job beans directly).
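In Spring, the job beans of a freshly loaded child context can be found with childContext.getBeansOfType(Job.class). Ignoring the Spring types, the registry idea can be sketched like this (Runnable stands in for com.example.Job, a plain map stands in for the child context, and all names are invented):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the lookup map: jobId -> the child "context" (here just a
// plain map) that defines that job. This map lives in the parent layer.
class JobDirectory {
    private final Map<String, Map<String, Object>> ownerByJobId =
            new HashMap<String, Map<String, Object>>();

    // Called whenever a child context is loaded: index only the beans
    // we will call directly (here: the Runnable ones).
    void childLoaded(Map<String, Object> childContext) {
        for (Map.Entry<String, Object> entry : childContext.entrySet()) {
            if (entry.getValue() instanceof Runnable) {
                ownerByJobId.put(entry.getKey(), childContext);
            }
        }
    }

    // Resolve a job id to the bean, via the owning child context.
    Object lookupJob(String jobId) {
        Map<String, Object> owner = ownerByJobId.get(jobId);
        return owner == null ? null : owner.get(jobId);
    }
}
```

When a child context is closed or refreshed, re-running childLoaded with the new context replaces the stale entries, which is exactly the update step described above.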
2.1 Load, Refresh & Close
The main reason why we want to have child application contexts is that we want our jobs to be flexible: we need to be able to create, update & delete jobs without any redeploy. This is straightforward in our situation:
if you have some mechanism to update the child application context XML files, all you need to do is refresh the child application contexts (or close and
reload them), and you get a new “version” of the jobs.
Don’t cross-reference beans from two different child contexts. What happens when you end up closing one of the contexts while the other one still needs it?
Don’t use beans from the child context in the parent context. You cannot do this through Spring directly, but you can still assign them through your code. When the child application context gets closed or refreshed, you would
still keep references to beans from the old version.