Introduction to Lucene.Net

by Rohit Kukreti

Rohit Kukreti describes the basic steps for using Lucene.Net within an ASP.Net application.

What is Lucene.Net?

Lucene.Net is a C# port of the original Lucene search engine library, which is written in Java. It provides a framework (APIs) for creating applications with full text search.

Lucene.Net can be downloaded from http://incubator.apache.org/lucene.net/download.html. At the time of writing, it is undergoing incubation at the Apache Software Foundation (ASF).

Why Use Lucene.Net?

You can use Lucene.Net to add more power to an existing search in your ASP.Net web application or website. It can also be used to index and search documents (Word, PDF, etc.) within your application.

This article describes how we can use Lucene.Net to add full text search to our ASP.Net applications. Any search function consists of two basic steps: first index the text, then search it. We will use Lucene.Net for both steps.

In this example we will read the content of a text file and index it using Lucene.Net. First download the dll and add a reference to it in your project.

How to Use Lucene.Net

Indexing the text

There are a few things to understand before we start indexing.

1. Analyzer - Reads the text and breaks it into words (tokens). It can also be used to remove 'noise words' (common words which you would not want to index).

2. Fields - Content holders with a name and a value.

3. Documents - The unit of indexing and search. A document is a collection of fields. Documents are added to the index and are returned as a list of results.

4. Index - A collection of documents.

5. IndexWriter - Writes documents to the index.
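
To make these terms concrete, here is a minimal toy sketch in plain C# that does not use Lucene.Net at all. The ToyIndex class, its stop-word list, and its method names are hypothetical and exist only for illustration; Lucene.Net's real analyzers and index format are far more sophisticated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only: none of these types or members exist in Lucene.Net.
public class ToyIndex
{
    // Hypothetical "noise word" list; a real analyzer ships its own.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "a", "an", "and", "the", "is", "of", "to" };

    // The "index": token -> ids of the documents that contain it.
    private readonly Dictionary<string, SortedSet<int>> postings =
        new Dictionary<string, SortedSet<int>>();

    // Each "document" is a collection of named fields (name -> value).
    private readonly List<Dictionary<string, string>> documents =
        new List<Dictionary<string, string>>();

    // "Analyzer": lower-case the text, split it into tokens, drop stop words.
    public static List<string> Analyze(string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };
        return text.ToLowerInvariant()
                   .Split(separators, StringSplitOptions.RemoveEmptyEntries)
                   .Where(token => !StopWords.Contains(token))
                   .ToList();
    }

    // "IndexWriter": keep the document and record where each token occurs.
    public int AddDocument(Dictionary<string, string> fields)
    {
        int docId = documents.Count;
        documents.Add(fields);
        foreach (string value in fields.Values)
        {
            foreach (string token in Analyze(value))
            {
                SortedSet<int> ids;
                if (!postings.TryGetValue(token, out ids))
                {
                    ids = new SortedSet<int>();
                    postings[token] = ids;
                }
                ids.Add(docId);
            }
        }
        return docId;
    }

    // "Searcher": look up a single word and return the matching documents.
    public List<Dictionary<string, string>> Search(string word)
    {
        List<string> tokens = Analyze(word);
        SortedSet<int> ids;
        if (tokens.Count == 0 || !postings.TryGetValue(tokens[0], out ids))
        {
            return new List<Dictionary<string, string>>();
        }
        return ids.Select(id => documents[id]).ToList();
    }
}
```

In this toy model, a search for "FOX" finds a document whose text field is "The quick brown fox" because both the indexed text and the search word pass through the same Analyze step, while a search for "the" finds nothing because stop words are never indexed.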

Code for creating the index file

// Requires: using Lucene.Net.Analysis; using Lucene.Net.Analysis.Standard; using Lucene.Net.Index;
string strIndexDir = @"D:\Index";
Lucene.Net.Store.Directory indexDir = Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir));
// The Version parameter is used for backward compatibility. Stop words can also be passed to avoid indexing certain words.
Analyzer std = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
// Create an IndexWriter object; passing true creates a new index, replacing any existing one.
IndexWriter idxw = new IndexWriter(indexDir, std, true, IndexWriter.MaxFieldLength.UNLIMITED);
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
Lucene.Net.Documents.Field fldText = new Lucene.Net.Documents.Field("text", System.IO.File.ReadAllText(@"d:\test.txt"), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.ANALYZED, Lucene.Net.Documents.Field.TermVector.YES);
doc.Add(fldText);
// Write the document to the index.
idxw.AddDocument(doc);
// Optimize and close the writer.
idxw.Optimize();
idxw.Close();
Response.Write("Indexing Done");

The parameters passed when creating the Field are:

1. Lucene.Net.Documents.Field.Store.YES - The field value is stored in the index and can be returned with the search results. Passing NO does not store the value, so it cannot be shown in the results.

2. Lucene.Net.Documents.Field.Index.ANALYZED - The field is run through the analyzer and can be searched. NO means the field is not indexed and cannot be searched. NOT_ANALYZED means the field can be searched, but its whole value is indexed as a single term without using the analyzer.

3. Lucene.Net.Documents.Field.TermVector.YES - Stores, for each document, the list of terms in the field and the number of times each one occurs.
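
The effect of the Store and Index options can be sketched with a small in-memory index. This is only an illustration, assuming the Lucene.Net 2.9.x API used in this article; the class name FieldOptionsSketch, the field names "id" and "body", and their values are hypothetical, and RAMDirectory is used instead of a disk folder so that nothing is written to disk.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Sketch only; assumes the Lucene.Net 2.9.x API used elsewhere in this article.
public static class FieldOptionsSketch
{
    // Builds a one-document index in memory.
    public static Directory BuildIndex()
    {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(
            dir,
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
            true, // create a new index
            IndexWriter.MaxFieldLength.UNLIMITED);
        try
        {
            Document doc = new Document();

            // Stored and indexed as one exact term: suited to identifiers.
            doc.Add(new Field("id", "DOC-42",
                Field.Store.YES, Field.Index.NOT_ANALYZED));

            // Analyzed for full text search, but the original text is not stored.
            doc.Add(new Field("body", "Lucene.Net adds full text search",
                Field.Store.NO, Field.Index.ANALYZED));

            writer.AddDocument(doc);
        }
        finally
        {
            writer.Close();
        }
        return dir;
    }

    // Reads the first document back to show which field values were stored.
    public static Document ReadFirstDocument(Directory dir)
    {
        IndexReader reader = IndexReader.Open(dir, true); // true = read-only
        try
        {
            return reader.Document(0);
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Under these assumptions, ReadFirstDocument(BuildIndex()).Get("id") should return the stored value, while Get("body") should return null, because that field was indexed for searching but its text was not stored.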

It is recommended to call IndexWriter.Optimize() once indexing is complete. It merges the index segments so that searches run as fast as possible; the merge itself can take a while on a large index.

The first part, indexing the text, is now complete. We will now search the index for the text entered in a textbox.

Searching the text

We will build the search query using the QueryParser class. There are more Query classes available in Lucene.Net, such as TermQuery, RangeQuery, etc., which can be used for different requirements. To create a search query we need to pass the Analyzer object and the name of the field in the index to search in.

string strIndexDir = @"D:\Index";
// Use the same analyzer for searching that was used for indexing.
Analyzer std = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", std);
// Search is the TextBox control that holds the user's search text.
Lucene.Net.Search.Query qry = parser.Parse(Search.Text);
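
As a rough comparison between QueryParser and a hand-built TermQuery, the sketch below builds a query for the "text" field both ways. It assumes the Lucene.Net 2.9.x API; the class and method names (QuerySketch, FromUserInput, ExactTerm) are hypothetical.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

// Sketch only; assumes the Lucene.Net 2.9.x API used elsewhere in this article.
public static class QuerySketch
{
    // QueryParser runs the input through the analyzer and understands
    // query syntax such as quoted phrases and AND/OR.
    public static Query FromUserInput(string userInput)
    {
        StandardAnalyzer analyzer =
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
        QueryParser parser =
            new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", analyzer);
        return parser.Parse(userInput);
    }

    // TermQuery matches one exact term and does not use an analyzer, so the
    // word is lower-cased here to match what StandardAnalyzer put in the index.
    public static Query ExactTerm(string word)
    {
        return new TermQuery(new Term("text", word.ToLowerInvariant()));
    }
}
```

QueryParser suits free text typed by a user, while TermQuery suits values the code already knows exactly. Note that Parse throws an exception when the input is not valid query syntax, so text coming straight from a textbox may need to be escaped (for example with QueryParser.Escape) or wrapped in error handling.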

After creating the query object, we will use an IndexReader object to open the index in read-only mode.

// Provide the directory where the index is stored.
Lucene.Net.Store.Directory directory = Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir));
// Passing true opens the index in read-only mode.
Lucene.Net.Search.Searcher srchr = new Lucene.Net.Search.IndexSearcher(Lucene.Net.Index.IndexReader.Open(directory, true));

Lucene.Net gathers the search results (documents) in Collectors. There are different Collectors available in Lucene.Net. In this example we will use "TopScoreDocCollector", which keeps the top-scoring documents and returns them sorted by relevance score, highest first. The create method of "TopScoreDocCollector" accepts two parameters: the maximum number of documents required (int) and a flag (bool) that tells the collector whether documents will be scored in increasing document ID order.

TopScoreDocCollector cllctr = TopScoreDocCollector.create(100, true);

Once the collector object is ready we will perform the search and get the results from the collector in a ScoreDoc array.

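// Run the query; the matching documents are gathered by the collector.
srchr.Search(qry, cllctr);
ScoreDoc[] hits = cllctr.TopDocs().scoreDocs;
for (int i = 0; i < hits.Length; i++)
{
    int docId = hits[i].doc;
    float score = hits[i].score; // Relevance score of this hit.
    Lucene.Net.Documents.Document doc = srchr.Doc(docId);
    Response.Write("Searched from Text: " + doc.Get("text"));
}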

This is just an introduction to Lucene.Net. There are a lot of other areas to be explored, such as different Analyzers, QueryParsers, Collectors, etc.

Happy learning.

This article was originally published on Wednesday Jan 18th 2012