The road so far….

May 1, 2010

Introduction to Lucene: understanding the components and the mechanism

Filed under: java, lucene — Rahul Sharma @ 9:39 pm

I have been working with Lucene for quite some time now, and here I will try to share my experience with it: what it is, what its components are, and how it can be used.

Say you are reading a book, for example “Head First Design Patterns”, and you need to find out about the Strategy pattern. How will you proceed?

You can start from the table of contents and, if you are lucky, find it listed as a section of some chapter. Or, if you already know the chapter, you can go straight to it and look for the section that describes the Strategy pattern. Or you can go to the back of the book and look up the term “Strategy pattern” in the index, where you will find the term along with the pages on which it appears. The last approach is the most efficient one.

With the explosion of information around us, we need an efficient search mechanism, and Lucene provides one. It builds an index (just like the one you were looking at in the book) that can then be used for efficient search.

Lucene components:

An index in Lucene, just like the index of a book, consists of different components:

Lucene components : Book components
Index             : Index at the back
Terms             : Words in the index of the book
Document          : A page of the book
Fields            : Some important text of a page
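To make the analogy concrete, here is a minimal sketch of a back-of-the-book index in plain Java. It is a toy model and not Lucene's API: the class name BookIndex and its methods are made up for this example, and it only shows the core idea of mapping each term to the pages (documents) that contain it.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy back-of-the-book index (hypothetical, not a Lucene class). */
final class BookIndex {
    // term -> sorted page numbers; TreeMap keeps the terms alphabetical, like a printed index
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    /** Records every whitespace-separated, lower-cased word of a page under its page number. */
    void addPage(int pageNumber, String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(pageNumber);
            }
        }
    }

    /** Returns the pages that mention the term, or an empty set if it is not indexed. */
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        BookIndex index = new BookIndex();
        index.addPage(24, "the strategy pattern defines a family of algorithms");
        index.addPage(37, "the observer pattern notifies dependents");
        System.out.println(index.lookup("strategy")); // [24]
        System.out.println(index.lookup("pattern"));  // [24, 37]
    }
}
```

Looking up a term is a single map access, however many pages there are, which is why the index at the back beats flipping through the chapters.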

How can we create such an index?

To create an index we need a Document, which holds the content to be indexed. A Document should contain at least one Field. The content of each field is parsed into Terms, and those are what go into the index. To do this we need an Analyzer, which analyzes the data and gives back meaningful terms/words.
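As a rough illustration of what an analyzer does, the sketch below turns the text of a field into terms using plain Java. It is not one of Lucene's Analyzer classes: the name ToyAnalyzer, the tokenizing rule and the stop-word list are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy analyzer (hypothetical): field text in, list of terms out. */
final class ToyAnalyzer {
    // Stop-word list chosen only for this example
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "is", "to"));

    /** Lower-cases, splits on anything that is not a letter or digit, and drops stop words. */
    static List<String> analyze(String fieldText) {
        List<String> terms = new ArrayList<>();
        for (String token : fieldText.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [strategy, pattern, behavioural, pattern]
        System.out.println(analyze("The Strategy Pattern is a behavioural pattern."));
    }
}
```

Different rules (keeping case, stemming, other stop words) would produce different terms from the same text, which is why the choice of analyzer matters.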

Now we can generate all the required index components, but we need someone to coordinate them to get a meaningful index out of them. The IndexWriter does this task and creates an index that is stored in a Directory (which can be on disk, in memory, etc.). Put these six things in place (Document, Field, Terms, Analyzer, IndexWriter and Directory) and we have a Lucene index.
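The way these pieces fit together can be sketched in plain Java as follows. Again this is a toy model, not Lucene's API: Doc, InMemoryDirectory, Writer and simpleAnalyzer are hypothetical stand-ins for Document, Directory, IndexWriter and Analyzer, and the "field:term" key format is just a convenience of this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

final class ToyIndexing {
    /** Stand-in for a Document: a bag of named fields. */
    static final class Doc {
        final Map<String, String> fields = new LinkedHashMap<>();
        Doc field(String name, String text) { fields.put(name, text); return this; }
    }

    /** Stand-in for a Directory: where the finished index lives (here, plain memory). */
    static final class InMemoryDirectory {
        final List<Doc> storedDocs = new ArrayList<>();
        // "field:term" -> ids of the documents containing that term in that field
        final Map<String, List<Integer>> postings = new TreeMap<>();
    }

    /** Stand-in for an IndexWriter: runs each field through the analyzer and records the terms. */
    static final class Writer {
        private final InMemoryDirectory dir;
        private final Function<String, List<String>> analyzer;

        Writer(InMemoryDirectory dir, Function<String, List<String>> analyzer) {
            this.dir = dir;
            this.analyzer = analyzer;
        }

        /** Adds a document and returns the id assigned to it. */
        int addDocument(Doc doc) {
            int docId = dir.storedDocs.size();
            dir.storedDocs.add(doc);
            for (Map.Entry<String, String> f : doc.fields.entrySet()) {
                for (String term : analyzer.apply(f.getValue())) {
                    List<Integer> ids =
                            dir.postings.computeIfAbsent(f.getKey() + ":" + term, k -> new ArrayList<>());
                    // ids are appended in increasing order, so checking the last entry avoids duplicates
                    if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) ids.add(docId);
                }
            }
            return docId;
        }
    }

    /** Minimal analyzer assumed for this sketch: lower-case and split on whitespace. */
    static List<String> simpleAnalyzer(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) if (!t.isEmpty()) terms.add(t);
        return terms;
    }

    public static void main(String[] args) {
        InMemoryDirectory dir = new InMemoryDirectory();
        Writer writer = new Writer(dir, ToyIndexing::simpleAnalyzer);
        writer.addDocument(new Doc().field("title", "Strategy Pattern").field("body", "a family of algorithms"));
        writer.addDocument(new Doc().field("title", "Observer Pattern"));
        System.out.println(dir.postings.get("title:pattern")); // [0, 1]
    }
}
```

The writer is the only piece that knows about all the others, which mirrors the coordinating role described above.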

OK, so now we have the index in place, but how do we use it?

To search an index we need an IndexSearcher, and we have to build a Query that can run on the index stored in the Directory. There is one thing to take care of here. The index consists of terms as they were produced by our Analyzer, and the query must produce matching terms if we want our data back, so we must use the same Analyzer on the query text as we used while indexing. Once the query is fired, it gives back the list of ids of the documents found for the terms you searched (just like the page numbers you see in the index of the book), along with a score for each that signifies how good a match it is. You can then fetch the documents from the index using the document id and the IndexSearcher.
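Here is a small plain-Java sketch of why the same analyzer has to be used on both sides. It is a toy model rather than Lucene's IndexSearcher: the class ToySearcher is hypothetical, and the score is simply the number of matching term occurrences, which is a stand-in for Lucene's real scoring and nothing more.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ToySearcher {
    private final List<String> storedDocs = new ArrayList<>();
    // term -> (docId -> number of times the term occurs in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    /** The one analyzer shared by indexing and searching: lower-case, split on non-alphanumerics. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) if (!t.isEmpty()) terms.add(t);
        return terms;
    }

    /** Indexes the text and returns the document id assigned to it. */
    int add(String text) {
        int docId = storedDocs.size();
        storedDocs.add(text);
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
        return docId;
    }

    /** Looks up terms exactly as given; each hit is {docId, score}, best score first. */
    List<int[]> searchTerms(List<String> terms) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (String term : terms) {
            postings.getOrDefault(term, new HashMap<>())
                    .forEach((doc, tf) -> scores.merge(doc, tf, Integer::sum));
        }
        List<int[]> hits = new ArrayList<>();
        scores.forEach((doc, score) -> hits.add(new int[] {doc, score}));
        hits.sort((x, y) -> y[1] != x[1] ? Integer.compare(y[1], x[1]) : Integer.compare(x[0], y[0]));
        return hits;
    }

    /** Analyzes the query text with the indexing analyzer before looking the terms up. */
    List<int[]> search(String queryText) { return searchTerms(analyze(queryText)); }

    /** Fetches the stored document for a hit. */
    String doc(int docId) { return storedDocs.get(docId); }

    public static void main(String[] args) {
        ToySearcher s = new ToySearcher();
        s.add("Strategy pattern: the strategy is chosen at runtime"); // doc 0
        s.add("Observer pattern");                                    // doc 1
        for (int[] hit : s.search("Strategy")) {
            System.out.println(hit[0] + " score=" + hit[1] + " -> " + s.doc(hit[0]));
        }
        // Skipping the analyzer: the index only knows "strategy", so "Strategy" finds nothing
        System.out.println(s.searchTerms(Collections.singletonList("Strategy")).size()); // 0
    }
}
```

The last line of main is the point of the example: the raw query term never went through the analyzer, so it does not match anything that is in the index.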

Besides using the API, there are tools that you can use to search and peek into the index. Luke is one such tool. Point it at the directory where you created the index and you can explore it any way you like.

BOTTOM-LINE:

Lucene is a powerful framework that can be used to index any type of information. There is a whole bunch of Analyzers that can interpret the content of the same Lucene document in different ways. Earlier versions of Lucene shipped all kinds of analyzers in the core, but from 3.0 onwards most of them have been moved out of the core package into the lucene-analyzers contrib package.

It also provides a wide range of queries that can be used to query the index in different ways, with different degrees of fuzziness. TermQuery matches a term exactly; then there are BooleanQuery, WildcardQuery, SpanQuery and more. You can also mark the documents and fields that are more important while building the index, using document boosting and field boosting respectively. And you can mark the more important terms of your search by applying a boost to them at query time.
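To give a feel for how these query types differ, the following plain-Java sketch evaluates exact, wildcard and boolean lookups against a tiny term-to-documents map. It only imitates the spirit of TermQuery, WildcardQuery and BooleanQuery; the class ToyQueries and its methods are hypothetical and are not Lucene classes. The wildcard rule assumed here is the usual one: `*` for any run of characters and `?` for exactly one.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Pattern;

final class ToyQueries {
    // term -> ids of the documents containing it
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    void index(int docId, String... terms) {
        for (String t : terms) postings.computeIfAbsent(t, k -> new TreeSet<>()).add(docId);
    }

    /** Exact term lookup, in the spirit of TermQuery. */
    Set<Integer> term(String t) {
        return new TreeSet<>(postings.getOrDefault(t, Collections.emptySet()));
    }

    /** Pattern lookup, in the spirit of WildcardQuery: '*' is any run of characters, '?' is one. */
    Set<Integer> wildcard(String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') regex.append(".*");
            else if (c == '?') regex.append('.');
            else regex.append(Pattern.quote(String.valueOf(c)));
        }
        Pattern compiled = Pattern.compile(regex.toString());
        Set<Integer> result = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> e : postings.entrySet()) {
            if (compiled.matcher(e.getKey()).matches()) result.addAll(e.getValue());
        }
        return result;
    }

    /** Both clauses must match, like two required clauses of a boolean query. */
    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.retainAll(b);
        return r;
    }

    /** Either clause may match, like two optional clauses of a boolean query. */
    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> r = new TreeSet<>(a);
        r.addAll(b);
        return r;
    }

    public static void main(String[] args) {
        ToyQueries q = new ToyQueries();
        q.index(0, "strategy", "pattern");
        q.index(1, "observer", "pattern");
        q.index(2, "strategic", "planning");
        System.out.println(q.term("strategy"));                         // [0]
        System.out.println(q.wildcard("strateg*"));                     // [0, 2]
        System.out.println(and(q.term("pattern"), q.wildcard("obs*"))); // [1]
    }
}
```

The exact lookup is the strictest, the wildcard widens the net, and the boolean combinators let you narrow or widen it again. Boosting is left out of this sketch.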

Lucene also offers quite a few performance tuning mechanisms. First, you can choose between the different implementations of Directory (in memory, on disk, NIO and more), which determines where the index is stored. Then there are options to split the index and keep each part small, since Lucene takes longer to search and index as an index grows. These multiple indexes can then be searched in parallel using the ParallelMultiSearcher.
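The idea of searching several smaller indexes in parallel and merging the results can be sketched with plain Java threads. This is only an illustration of the approach and not how ParallelMultiSearcher is implemented: Shard and searchAll are hypothetical names, and the sketch merges plain document names without any scoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ToyParallelSearch {
    /** One small index: term -> names of the documents containing it. */
    static final class Shard {
        private final Map<String, Set<String>> postings = new HashMap<>();

        Shard add(String docName, String text) {
            for (String t : text.toLowerCase().split("\\s+")) {
                if (!t.isEmpty()) postings.computeIfAbsent(t, k -> new TreeSet<>()).add(docName);
            }
            return this;
        }

        Set<String> search(String term) {
            return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
        }
    }

    /** Searches every shard on its own thread and merges the hits into one sorted set. */
    static Set<String> searchAll(List<Shard> shards, String term) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Callable<Set<String>>> tasks = new ArrayList<>();
            for (Shard shard : shards) tasks.add(() -> shard.search(term));
            Set<String> merged = new TreeSet<>();
            // invokeAll blocks until every shard has answered
            for (Future<Set<String>> f : pool.invokeAll(tasks)) merged.addAll(f.get());
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Shard> shards = Arrays.asList(
                new Shard().add("doc-a", "strategy pattern"),
                new Shard().add("doc-b", "observer pattern").add("doc-c", "template method"));
        System.out.println(searchAll(shards, "pattern")); // [doc-a, doc-b]
    }
}
```

Each shard stays small and is searched independently, so the overall time is bounded by the slowest shard rather than by the sum of all of them.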

Resources:

Apache Lucene

Lucene Contrib package

Luke

2 Comments »

  1. great post!I found it really comperhensive! 🙂

    Comment by nikos lianeris — May 5, 2010 @ 10:10 pm

  2. […] This post was mentioned on Twitter by AleksandarMarinković. AleksandarMarinković said: RT @DZone "Introduction to Lucene: understanding the components and the mechanism" http://dzone.com/uCgL […]

    Pingback by Tweets that mention Introduction to Lucene: understanding the components and the mechanism | The road so far…. -- Topsy.com — May 6, 2010 @ 3:27 pm

