An overview of Lucene.Net

Wednesday, Apr 14, 2010 3 minute read Tags: c# .net examine umbraco-examine

Please note, this document is a work in progress and will be expanded over time.

What is Lucene.Net?

Although you can read the official word on the Lucene.Net project site, I’ll give an abridged version here, explained the way I understand it.

Lucene.Net is an exact port of the Java Lucene search API, which comprises indexers, analyzers and searchers. There are very few differences between the two frameworks, so you can read the Java API documentation (which is really all you have to go on) and it will match up with the functionality. The only real differences are that the namespacing in the .NET API is .NET-ish, and some of the API has been re-cased to match a more .NET style.

Lucene takes string data which is then passed into an analyzer and serialized into an index file. Lucene works with strings, and it only understands strings. How it understands strings is defined by the analyzer which you are using.
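To make the role of the analyzer concrete, here is a minimal toy sketch in Python. It is not the Lucene.Net API: the functions `whitespace_analyzer`, `simple_analyzer` and `build_index`, and the stop word list, are all made up for illustration. It only shows the idea that the same strings produce a different index depending on which analyzer is used.

```python
import re
from collections import defaultdict

# Hypothetical stop word list, chosen only for this illustration.
STOP_WORDS = {"a", "an", "the", "is", "of"}


def whitespace_analyzer(text):
    """Split on whitespace only; case and punctuation are preserved."""
    return text.split()


def simple_analyzer(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_index(docs, analyzer):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


docs = {1: "An overview of Lucene.Net", 2: "Lucene is a search library"}

# With the simple analyzer both documents end up under the token "lucene".
print(build_index(docs, simple_analyzer)["lucene"])
# With the whitespace analyzer "Lucene.Net" and "Lucene" are different tokens.
print(sorted(build_index(docs, whitespace_analyzer)))
```

A search for `lucene` would find both documents in the first index and neither in the second, which is why the choice of analyzer matters so much.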

Once you have your data in an index, you get it back out via a searcher. A searcher takes a query, which uses a construct similar to other search engines (here’s the query syntax documentation). Lucene then returns Documents, each of which references the point in the index file where a result is located and can be deserialized into a set of fields representing the original string data you passed in.
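The round trip described above (fields in, query, fields back out) can be sketched with a toy in-memory index. Again this is Python and not Lucene.Net: `ToyIndex`, `analyze` and the single-clause `field:text` query handling are assumptions made for the example, and a real query parser supports far more than this.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


class ToyIndex:
    """Keeps the original field values plus a lookup from token to doc ids."""

    def __init__(self):
        self.stored = []     # doc id -> original field values
        self.postings = {}   # (field name, token) -> set of doc ids

    def add(self, fields):
        """Store the fields and index every token of every field."""
        doc_id = len(self.stored)
        self.stored.append(dict(fields))
        for name, value in fields.items():
            for token in analyze(value):
                self.postings.setdefault((name, token), set()).add(doc_id)
        return doc_id

    def search(self, query):
        """Handle one 'field:text' clause; every token in text must match.

        Returns the stored fields of each hit, in doc id order.
        """
        field, _, text = query.partition(":")
        hits = None
        for token in analyze(text):
            ids = self.postings.get((field, token), set())
            hits = ids if hits is None else hits & ids
        return [self.stored[i] for i in sorted(hits or ())]


index = ToyIndex()
index.add({"title": "Lucene.Net overview", "body": "indexers analyzers searchers"})
index.add({"title": "Examine basics", "body": "an overview of examine"})

# Only the first document has "overview" in its title field.
print(index.search("title:overview"))
```

Note that the query text goes through the same `analyze` function as the indexed data; if the two differed, tokens in the query might never line up with tokens in the index.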

What is Lucene.Net not?

In a word, smart. Lucene has no smarts about it: it doesn’t understand file types, and it doesn’t really understand dates or numbers. I’m often asked “Can Lucene index x?” The simple answer is “No”, but really the answer is “Yes”: if you’re able to represent it as a string, you can have Lucene handle it. This opens up some interesting possibilities. Say you want to index an Office document: if it’s an OpenXML document then it’s relatively easy, as the OpenXML API is quite good at extracting text.
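As an illustration of "represent it as a string", one common approach for dates and numbers is to write them as fixed-width strings, so that comparing the strings gives the same order as comparing the original values. The sketch below is plain Python with hypothetical helper names (`date_to_term`, `int_to_term`); it is not how Lucene.Net itself encodes these values.

```python
from datetime import datetime


def date_to_term(value):
    """Format a datetime as yyyyMMddHHmmss so string order equals time order."""
    return value.strftime("%Y%m%d%H%M%S")


def int_to_term(value, width=10):
    """Zero-pad a non-negative integer so string order equals numeric order."""
    if value < 0:
        raise ValueError("this toy encoding only handles non-negative integers")
    return str(value).zfill(width)


published = datetime(2010, 4, 14, 9, 30, 0)
print(date_to_term(published))

# Unpadded numbers sort incorrectly as strings ("9" comes after "10"),
# while the padded forms sort the same way the numbers do.
print(sorted(["9", "10"]))
print(sorted([int_to_term(9), int_to_term(10)]))
```

Once a value is in this form it is just another string as far as the index is concerned, which is exactly the level at which Lucene operates.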

Understanding Lucene terminology

To avoid getting completely lost with Lucene, you need to understand the terms it uses.

  • Document: A record within the Lucene index, made up of fields. Whenever you’re working with data from the index you’re working with a Document.
  • Field: A single piece of data associated with a Document. A field may or may not be indexed, depending on how you insert it into your index. This defines how you can interact with it and how Lucene will treat it.
  • Term: A part of a Lucene query. A Term is made up of a left and a right part, looking like this: Field:Query. The left part is the name of the field you’re scoring against; the right part is the data to use when scoring.
  • Score: Lucene orders results by how well they score against a search query. Scores are generated by comparing the Document’s Fields to the search query.
  • Analyzer: An analyzer defines how the indexer or searcher will handle the data. There are many different analyzers in Lucene, and each handles indexing and searching in subtly different ways.
  • Indexer: The Indexer is responsible for serializing a Document and storing it within the index file.
  • Searcher: The Searcher takes a Query and retrieves a list of Documents out of a Lucene index.
  • Query: A Query is made up of a group of Terms and Boolean Operations which are passed into a searcher to retrieve Documents out of the Lucene index. The Query is also used to determine the score of a Document within the record set.
  • Boolean Operation: AND, OR and NOT are the Boolean Operations, which affect how a Term is handled within a Query.
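
To show how Terms, Boolean Operations and Scores fit together in a Query, here is a small toy evaluator in Python. Everything in it is an assumption made for the example: the `(operation, field, text)` clause format, the `matches` and `run_query` helpers, and above all the scoring, which simply counts matching clauses. Real Lucene scoring is considerably more involved.

```python
def matches(doc, field, text):
    """True if the token appears in the named field (case-insensitive)."""
    return text.lower() in doc.get(field, "").lower().split()


def run_query(docs, clauses):
    """Evaluate clauses of the form (operation, field, text).

    AND clauses must match, NOT clauses must not match, and OR clauses
    are optional but raise the score. Returns (score, doc id) pairs,
    highest score first.
    """
    results = []
    for doc in docs:
        hits = [(op, matches(doc, field, text)) for op, field, text in clauses]
        if any(op == "AND" and not hit for op, hit in hits):
            continue  # a required term is missing
        if any(op == "NOT" and hit for op, hit in hits):
            continue  # an excluded term is present
        score = sum(1 for op, hit in hits if op != "NOT" and hit)
        if score == 0:
            continue  # nothing positive matched
        results.append((score, doc["id"]))
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return results


docs = [
    {"id": 1, "title": "lucene overview", "tags": "search .net"},
    {"id": 2, "title": "lucene internals", "tags": "search java"},
    {"id": 3, "title": "umbraco examine", "tags": "search .net"},
]

# title:lucene AND (optionally) tags:.net, but NOT title:internals
query = [("AND", "title", "lucene"), ("OR", "tags", ".net"), ("NOT", "title", "internals")]
print(run_query(docs, query))
```

Document 2 is dropped by the NOT clause and document 3 by the AND clause, so only document 1 comes back, with a score reflecting the two clauses it matched.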