Lucene is a Java-based library.
It is not a web application and it doesn't run as a server.
It doesn't have any configuration files.
Lucene provides APIs (Java classes: Document, Field, IndexWriter, IndexSearcher, ...) that can be used to index and search documents.
A document in Lucene is a collection of fields which are name-value pairs.
A value can be a string, number, date, location, ...
Lucene supports multi-valued fields, that is, fields that can store several values.
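The document model described above can be pictured with a small plain-Java sketch. `ToyDocument` is a hypothetical class written for this illustration, not Lucene's `Document` API; it only shows the idea of name-value pairs where one name may carry several values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a document: field name -> list of values.
// This is NOT Lucene's Document class, only an illustration of the data model.
class ToyDocument {
    private final Map<String, List<Object>> fields = new LinkedHashMap<>();

    // Adding the same field name more than once makes the field multi-valued.
    ToyDocument add(String name, Object value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        return this;
    }

    // All values stored under a field name (empty list if the field is absent).
    List<Object> values(String name) {
        return fields.getOrDefault(name, List.of());
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument()
                .add("title", "Lucene in a nutshell") // string value
                .add("pages", 12)                     // numeric value
                .add("tag", "search")                 // first value of "tag"
                .add("tag", "java");                  // second value of "tag"
        System.out.println(doc.values("tag"));
    }
}
```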
When indexing a text (document), Lucene will use text analyzers to tokenize the unstructured text into a stream of words (tokens).
Lucene can be configured to apply further operations on the extracted tokens, so, for example, a token can be discarded, substituted, or reduced (stemming).
The result of text analysis is a list of tokens called terms (text -> tokens -> terms).
Text analysis can be applied to each field of a document and Lucene will index the terms of each field.
Lucene indexes each term of a field along with its ordinal position in the text and a link to the documents that contain it
(this is why the index is called an inverted index).
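As a rough picture of that structure, here is a minimal positional inverted index in plain Java. `ToyInvertedIndex` and its methods are names invented for this sketch; Lucene's real index format is far more elaborate, and the sketch assumes the terms have already been produced by text analysis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy positional inverted index: term -> (document id -> positions of the term).
// Illustration only; this is not how Lucene stores its index.
class ToyInvertedIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Index the already-analyzed terms of one field of one document.
    void add(int docId, List<String> terms) {
        for (int position = 0; position < terms.size(); position++) {
            postings.computeIfAbsent(terms.get(position), t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
        }
    }

    // Look up a term: which documents contain it, and at which positions.
    Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term, Map.of());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, List.of("quick", "brown", "fox"));
        index.add(2, List.of("lazy", "brown", "dog", "brown"));
        System.out.println(index.lookup("brown")); // {1=[1], 2=[1, 3]}
    }
}
```

The lookup goes from a term to its documents rather than from a document to its terms, which is the "inverted" direction.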
Text Analysis includes tokenization and filtering:
- Tokenizers: They extract tokens from the provided unstructured text (by splitting on whitespace, regular expressions, ...).
- Filters: They process the stream of tokens extracted by the tokenizers.
  They can remove unneeded tokens from the stream (punctuation, ...),
  execute operations on tokens (lower-casing, upper-casing, ...),
  and add new tokens to the stream (synonyms, ...).
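The tokenizer-then-filters chain can be sketched in plain Java as follows. This is a toy under stated assumptions: `ToyAnalyzer`, its stop-word list, and its synonym table are all made up for the example and are not Lucene's `Tokenizer` or `TokenFilter` classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy analysis chain: one tokenizer followed by three filters.
// Plain Java for illustration; these are not Lucene's analysis classes.
class ToyAnalyzer {
    // Hypothetical stop-word list and synonym table, chosen for the example.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "of");
    private static final Map<String, String> SYNONYMS = Map.of("quick", "fast");

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Tokenizer: split on runs of characters that are neither letters nor digits,
        // which also drops punctuation.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue;
            // Filter 1: lower-case the token.
            String term = token.toLowerCase(Locale.ROOT);
            // Filter 2: discard unneeded tokens (stop words).
            if (STOP_WORDS.contains(term)) continue;
            terms.add(term);
            // Filter 3: add a synonym as an extra token right after the original.
            String synonym = SYNONYMS.get(term);
            if (synonym != null) terms.add(synonym);
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick, brown fox!")); // [quick, fast, brown, fox]
    }
}
```

Folding the filters into one loop keeps the sketch short; a real chain would typically keep each filter as a separate, reusable stage.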
When searching for documents, the query string is parsed by a query parser (lucene, dismax, and edismax are the parser names used by Solr, which is built on Lucene),
and the same text analysis (if configured) is applied to each field of the query before the search is executed.
A relevance score will be assigned to each document of the search results.
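The point that the query goes through the same analysis as the indexed text, and that every hit gets a score, can be illustrated with a toy search in plain Java. `ToySearch` is a hypothetical class; it scans documents instead of using an inverted index, and its score is simply the number of occurrences of query terms in a document, which is an assumption made for brevity and not Lucene's scoring formula.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy search: query and documents share one analysis step, and each match is scored.
// Illustration only; the score is a plain occurrence count, not Lucene's relevance score.
class ToySearch {
    private final Map<Integer, List<String>> docs = new HashMap<>();

    // Shared analysis for index time and query time:
    // lower-case, then split on anything that is not a letter or digit.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) terms.add(token);
        }
        return terms;
    }

    void index(int docId, String text) {
        docs.put(docId, analyze(text));
    }

    // Returns docId -> score for every document matching at least one query term.
    Map<Integer, Integer> search(String query) {
        List<String> queryTerms = analyze(query);
        Map<Integer, Integer> scores = new HashMap<>();
        for (Map.Entry<Integer, List<String>> doc : docs.entrySet()) {
            int score = 0;
            for (String term : doc.getValue()) {
                if (queryTerms.contains(term)) score++;
            }
            if (score > 0) scores.put(doc.getKey(), score);
        }
        return scores;
    }
}
```

Because both sides are lower-cased by the same `analyze` method, a query such as `Brown DOG` still matches text indexed as `The BROWN dog`; skipping analysis on either side would break that match.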
Lucene provides:
- Text analysis that transforms a text string into a list of terms (tokenizers, filters)
- Query parser
- Scoring algorithm
- Inverted index to find documents related to an indexed term
- Search features (query completion, query spell checker, highlighter, ...)
- Search enhancing features (faceted navigation, spatial search, ...)