The road so far….

May 16, 2010

Some useful components to tweak Lucene

Filed under: java, lucene — Rahul Sharma @ 9:55 am

Lucene is a comprehensive package for indexing and searching, and several other top-level packages have been built on it. But if you are working directly with Lucene-core, here is a list of components that I have found quite handy for tweaking it to suit your requirements.

PerFieldAnalyzerWrapper

In Lucene a Document consists of Fields and is parsed using an Analyzer to produce index Terms. Each Field can be configured independently: whether it should be analyzed, stored, or both. But how can we configure different Fields to be analyzed using independent mechanisms, given that the IndexWriter uses only one Analyzer for a Document?

The PerFieldAnalyzerWrapper comes to our rescue in such situations. It is a decorator over an existing set of Analyzers and takes a map of field names to Analyzers. When the IndexWriter invokes the PerFieldAnalyzerWrapper, it consults the map for each document field and uses the analyzer configured against that field. It also has a default analyzer, which it falls back to when it does not find a matching analyzer in the map.

Using this wrapper we can create a Document with multiple Fields, each mapped to a separate kind of analysis.
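Here is a minimal sketch of how the wrapper is typically wired up against the Lucene 3.0 API; the field name "partNumber" and the directory variable are placeholders:

```java
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.util.Version;

// Default analyzer used for any field not present in the map
PerFieldAnalyzerWrapper wrapper =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
// Hypothetical "partNumber" field: index it as a single untokenized term
wrapper.addAnalyzer("partNumber", new KeywordAnalyzer());

// The wrapper is handed to the IndexWriter like any other Analyzer
IndexWriter writer =
    new IndexWriter(directory, wrapper, IndexWriter.MaxFieldLength.UNLIMITED);
```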

PatternAnalyzer

This analyzer can be found in Lucene's contrib modules. It can be configured with a regular expression that splits the field text into terms, with behaviour similar to String.split(). It combines the functionality of LetterTokenizer, LowerCaseTokenizer, WhitespaceTokenizer, and StopFilter with regular expressions.

As per the documentation, the analyzer is considerably faster than the normal Lucene tokenizers and can serve as a building block for incrementally built filter chains.

In any case, if you are just looking for patterns in the text, this is the analyzer to use.
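A minimal sketch, assuming the Lucene 3.0 contrib jar that ships PatternAnalyzer (org.apache.lucene.index.memory in that release); the pattern here is an illustrative choice:

```java
import java.util.regex.Pattern;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.index.memory.PatternAnalyzer; // contrib jar in 3.0
import org.apache.lucene.util.Version;

// The pattern matches the separators (like String.split()): split on any
// run of non-alphanumeric characters, lowercase the terms, drop stop words
PatternAnalyzer analyzer = new PatternAnalyzer(
    Version.LUCENE_30,
    Pattern.compile("[^A-Za-z0-9]+"),
    true,                                  // toLowerCase
    StopAnalyzer.ENGLISH_STOP_WORDS_SET);  // stop words (may be null)
```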

NGramTokenizer/EdgeNGramTokenizer

If you want to handle typos in the text or provide autocomplete-style functionality, an NGramTokenizer needs to be in place. This tokenizer breaks the text into smaller terms of varying sizes, specified as a range. If the lower end of the range is too small, say 2, it will generate a large number of tokens, and this will affect the scoring of documents. If you combine it with other analyzers, you also need to look into boosting, minimum-match clauses, and so on.

The EdgeNGramTokenizer is a variant of NGramTokenizer that builds tokens anchored to an edge of the input. It can be configured to use only one side, either front or back, but not both. Note that it may generate tokens with leading/trailing blanks.
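A short sketch of both tokenizers from the contrib analyzers module (Lucene 3.0 API), using a literal string as input:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// All 2- and 3-character grams of "lucene": lu, uc, ce, en, ne, luc, uce, ...
NGramTokenizer ngrams = new NGramTokenizer(new StringReader("lucene"), 2, 3);

// Grams anchored at the front edge only: lu, luc, luce - autocomplete style
EdgeNGramTokenizer edges = new EdgeNGramTokenizer(
    new StringReader("lucene"), EdgeNGramTokenizer.Side.FRONT, 2, 4);

// Pulling the terms out of a token stream
TermAttribute term = edges.addAttribute(TermAttribute.class);
while (edges.incrementToken()) {
    System.out.println(term.term()); // lu, luc, luce
}
```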

NumericField/NumericUtils

Lucene by default treats everything as text. But numbers are not really text: if you search for “effect” you may also get “affect”, because “affect” differs by just one letter while the rest matches, and thus may be considered a match. The same is not true for numbers: 11 is entirely different from 111 and should not be matched. Moreover, for numbers we probably want range queries, where we can say 10 < x < 15.

In order to support this, Lucene introduced NumericField, where a number can be stored and operated on as a numeric value. It also has a corresponding utility class, NumericUtils, for string-to-number conversions and vice versa.
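A minimal sketch against the Lucene 3.0 API; the field name "price" is a placeholder:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;

// Index 12 as a true numeric value in a hypothetical "price" field
Document doc = new Document();
doc.add(new NumericField("price").setIntValue(12));

// Range query for 10 < price < 15 (both endpoints exclusive)
NumericRangeQuery<Integer> range =
    NumericRangeQuery.newIntRange("price", 10, 15, false, false);
```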

Explanation

Often when you have a query (Boolean, MultiTerm, MultiPhrase, etc.) that has several subqueries, documents you were not expecting may pop up in the results. In such cases you need to figure out how those unexpected documents got there.

This is where the Explanation class comes into the picture: it can explain the scoring mechanism that was used for a result. It gives you a detailed structure of which terms matched what and how the score was derived.

An Explanation object can be obtained using the API Searcher.explain(query, docId). This gives an explanation for a document against the executed query.
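For example, assuming an existing searcher and query (Lucene 3.0 API):

```java
import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.TopDocs;

// Assumes an IndexSearcher 'searcher' and a Query 'query' already exist
TopDocs hits = searcher.search(query, 10);
Explanation explanation = searcher.explain(query, hits.scoreDocs[0].doc);
System.out.println(explanation); // nested breakdown of how the score was built
```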

SetBasedFieldSelector

Let us say that I was looking for some terms and found the document id “X” matching them. Now I want to get the Document and the corresponding Field value. But Documents in Lucene can contain a large number of fields, and it does not make sense to load the full Document when I am interested in just one field. This behaviour can be tweaked using SetBasedFieldSelector. It takes one set of fields to load eagerly and another set to load lazily; fields in neither set are not loaded at all. We can pass this FieldSelector to a variant of IndexSearcher.doc(), which will load the Document with the required fields only.
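A short sketch, assuming an existing searcher and docId; “title” and “body” are placeholder field names:

```java
import java.util.Collections;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SetBasedFieldSelector;

// Load "title" eagerly, "body" lazily; all other fields are skipped
SetBasedFieldSelector selector = new SetBasedFieldSelector(
    Collections.singleton("title"),   // eager fields
    Collections.singleton("body"));   // lazy fields

// Assumes an IndexSearcher 'searcher' and a matching 'docId'
Document doc = searcher.doc(docId, selector);
String title = doc.get("title");
```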

NIOFSDirectory

The IndexSearcher and IndexReader classes are thread-safe, i.e. you can use them to execute multiple queries simultaneously, but the underlying Directory implementation can seriously hurt index performance. If you are using the vanilla implementation of FSDirectory, i.e. SimpleFSDirectory, there are synchronized calls underneath. Replace it with the NIOFSDirectory implementation to get concurrent, thread-safe access to the underlying index.
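A minimal sketch (Lucene 3.0 API); the index path is a placeholder:

```java
import java.io.File;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.NIOFSDirectory;

// Uses java.nio positional reads, so concurrent threads do not serialize
Directory directory = new NIOFSDirectory(new File("/path/to/index"));
IndexSearcher searcher = new IndexSearcher(directory, true); // read-only
```

As of Lucene 3.0, the static factory FSDirectory.open() already picks NIOFSDirectory on most non-Windows platforms, so it is worth preferring the factory unless you need to force an implementation.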

ParallelMultiSearcher

In case the index becomes large, Lucene performance will not remain optimal: a normal query will take a decent amount of time to execute. In such cases the index can be sliced and then searched in parallel. The ParallelMultiSearcher can be used here to work simultaneously over multiple indexes. It finds the results for each of the smaller indexes and then merges them before returning.
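A sketch, assuming two index directories firstDirectory and secondDirectory and an existing query (Lucene 3.0 API):

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.TopDocs;

// One searcher per index slice; each slice is searched in its own thread
IndexSearcher first = new IndexSearcher(firstDirectory, true);
IndexSearcher second = new IndexSearcher(secondDirectory, true);
ParallelMultiSearcher searcher =
    new ParallelMultiSearcher(new IndexSearcher[] { first, second });

TopDocs hits = searcher.search(query, 10); // merged, globally ranked results
```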

TimeLimitingCollector

As said before, on larger indexes Lucene will take a substantial amount of time. But how can you bound the query execution time, so that you can fine-tune it?

The TimeLimitingCollector is one such entity: with it you can specify the amount of time you would like the query to execute. If the query takes longer, it terminates the collection and throws an exception.
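A sketch wrapping a standard collector with a one-second budget, assuming an existing searcher and query (Lucene 3.0 API):

```java
import org.apache.lucene.search.TimeLimitingCollector;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;

// Collect the top 10 hits, but give up after 1000 ms
TopScoreDocCollector topDocs = TopScoreDocCollector.create(10, true);
TimeLimitingCollector collector = new TimeLimitingCollector(topDocs, 1000);
try {
    searcher.search(query, collector);
} catch (TimeLimitingCollector.TimeExceededException e) {
    // The hits gathered before the timeout are still available below
}
TopDocs hits = topDocs.topDocs();
```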

These are just a few of the many components available in Lucene. All of the components mentioned here are available in Lucene version 3.0 and above.
