Apache Solr - Introduction

Solr
Solr is an open source enterprise search server. Its indexing and search features are built on Lucene. Solr exposes Lucene features through configuration files. Solr's configuration can be managed using its REST APIs or SolrJ.

Solr provides:
- An Admin UI (Solr configuration, query interface, schema browser, index and query analyzer, statistics, ...).
- Configuration files (schema, config, ...).
- Faceted search.
- Extended DisMax (eDisMax) query parser.
- Cluster management (leveraging ZooKeeper), distributed search, and index replication.
- Multiple search cache implementations.

Solr search features:
- Search query
- Faceting (field value facets, query facets, numeric range facets, date range facets)
- MoreLikeThis (similarity)
- Spell Checking
- Highlighting
- Suggester
- Result Grouping
- Result Clustering
- Spatial search (filter results by location information)
- Query boosting (influencing document scores)
- Pagination of results
- Query debugging
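
Several of these features are switched on through request parameters. The sketch below shows how a paginated, faceted, highlighted query URL might be assembled. It is only an illustration: the host, the collection name "products", the field names, and the helper build_select_url are assumptions, not part of the original text. The parameter names (q, start, rows, facet, facet.field, hl, hl.fl) are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, page=0, page_size=10,
                     facet_fields=(), highlight_fields=()):
    """Build a URL for a /select request (hypothetical helper)."""
    params = [
        ("q", query),
        ("start", page * page_size),  # offset of the first result to return
        ("rows", page_size),          # number of results per page
    ]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on
        params.extend(("facet.field", field) for field in facet_fields)
    if highlight_fields:
        params.append(("hl", "true"))
        params.append(("hl.fl", ",".join(highlight_fields)))
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Assumed local Solr instance and collection, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr", "products", "title:laptop",
    page=2, page_size=20,
    facet_fields=["brand", "category"],
    highlight_fields=["title"],
)
print(url)
```

Page 2 with 20 results per page translates to start=40 and rows=20, which is how result pagination is usually expressed in a Solr query.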

Request Handlers:
Solr uses request handlers to process requests. Each handler is defined in the "solrconfig.xml" file and is mapped to a URL path that starts with the "/" character, so it responds to the HTTP requests sent to that path. The behaviour of a request handler can be adjusted using parameters, which can be submitted as URL query parameters or POST form parameters. Parameters can also be defined inside the request handler element in "solrconfig.xml": the defaults element holds values that are used when a request does not supply the parameter (so such parameters are not required when querying Solr), the appends element holds values that are added to those supplied by the request, and the invariants element holds values that cannot be overridden when querying Solr.
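
The interaction between request parameters and the defaults, appends, and invariants elements can be modelled in a few lines. This is a simplified sketch of the behaviour described above, not Solr's actual code: the function resolve_params and the sample parameter values are hypothetical.

```python
def resolve_params(request, defaults=None, appends=None, invariants=None):
    """Merge request parameters with handler-level settings.

    Every argument maps a parameter name to a list of values.
    """
    resolved = {}
    # defaults: used only when the request does not supply the parameter
    for name, values in (defaults or {}).items():
        resolved[name] = list(values)
    # request parameters replace any default with the same name
    for name, values in request.items():
        resolved[name] = list(values)
    # appends: added to whatever values are already present
    for name, values in (appends or {}).items():
        resolved.setdefault(name, []).extend(values)
    # invariants: always win, whatever the request asked for
    for name, values in (invariants or {}).items():
        resolved[name] = list(values)
    return resolved


resolved = resolve_params(
    request={"q": ["laptop"], "rows": ["50"], "fq": ["brand:acme"], "wt": ["xml"]},
    defaults={"rows": ["10"], "df": ["text"]},
    appends={"fq": ["inStock:true"]},
    invariants={"wt": ["json"]},
)
print(resolved)
```

In this example the request overrides the default rows value, keeps the default df, gets an extra fq filter appended, and cannot change wt because it is declared as an invariant.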

Lucene
Lucene is a Java-based library. It is not a web application and it doesn't run as a server. It doesn't have any configuration files.

Lucene provides APIs (Java classes: Document, Field, IndexWriter, IndexSearcher, ...) that can be used to index and search documents.

A document in Lucene is a collection of fields, which are name-value pairs. A value can be a string, number, date, location, ...

Lucene supports multi-valued fields, that is, fields that can store an array of values.

When indexing a text (document), Lucene uses text analyzers to tokenize the unstructured text into a stream of words (tokens). Lucene can be configured to apply further operations on the extracted tokens; for example, a token can be discarded, substituted, or reduced to its root form (stemming). The result of text analysis is a list of tokens called terms (text -> tokens -> terms). Text analysis can be applied to each field of a document, and Lucene indexes the terms of each field. Each term is stored along with its ordinal position in the text and a link to the documents that contain it. Because the index maps terms to documents, rather than documents to terms, it is called an inverted index.
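
The idea of an inverted index with term positions can be sketched as follows. This is a toy model, not Lucene's on-disk format: the function build_inverted_index, the sample documents, and the use of lowercasing plus whitespace splitting as a stand-in for text analysis are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map (field, term) to {doc_id: [positions]}.

    docs maps a document id to a dict of field name -> text.
    """
    index = defaultdict(dict)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            # Stand-in for real text analysis: lowercase, split on whitespace.
            for position, term in enumerate(text.lower().split()):
                index[(field, term)].setdefault(doc_id, []).append(position)
    return index


docs = {
    1: {"title": "Solr search server", "body": "Solr is built on Lucene"},
    2: {"title": "Lucene library", "body": "Lucene is a search library"},
}
index = build_inverted_index(docs)

# Which documents contain "lucene" in their body, and at which positions?
print(index.get(("body", "lucene")))
```

Looking up a term returns the documents that contain it directly, without scanning every document, which is the point of inverting the document-to-terms relationship.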

Text Analysis includes tokenization and filtering:
- Tokenizers: They extract tokens from the provided unstructured text (for example by splitting on whitespace or on a regular expression).
- Filters: They process the stream of tokens extracted by the tokenizers. They can remove unneeded tokens from the stream (punctuation, ...), transform tokens (lower-case, upper-case, ...), and add new tokens to the stream (synonyms, ...).
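
A tokenizer followed by a chain of filters can be pictured as a small pipeline. The sketch below is written in plain Python rather than with Lucene's Java classes; the function names, the synonym table, and the sample sentence are hypothetical and only illustrate the three kinds of filter behaviour listed above (removing, transforming, and adding tokens).

```python
import re


def whitespace_tokenizer(text):
    """Tokenizer: split the raw text on whitespace."""
    return text.split()


def strip_punctuation(tokens):
    """Filter: remove punctuation, dropping tokens that become empty."""
    for token in tokens:
        cleaned = re.sub(r"[^\w]", "", token)
        if cleaned:
            yield cleaned


def lowercase(tokens):
    """Filter: transform every token to lower case."""
    for token in tokens:
        yield token.lower()


SYNONYMS = {"laptop": ["notebook"]}  # illustrative synonym table


def add_synonyms(tokens):
    """Filter: emit each token followed by its synonyms, if any."""
    for token in tokens:
        yield token
        yield from SYNONYMS.get(token, [])


def analyze(text, tokenizer, filters):
    """Run the tokenizer, then each filter in order; return the terms."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return list(tokens)


terms = analyze(
    "Fast Laptop - really fast!",
    whitespace_tokenizer,
    [strip_punctuation, lowercase, add_synonyms],
)
print(terms)  # ['fast', 'laptop', 'notebook', 'really', 'fast']
```

The order of the filters matters: the synonym lookup here expects lower-case tokens, so it runs after the lowercase filter.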

When searching for documents, Lucene parses the query string using a query parser (in Solr the parser can be lucene, dismax, or edismax) and applies the same text analysis (if configured) to each field of the query before executing the search. A relevance score is then assigned to each document in the search results.
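
The reason for analyzing the query with the same chain as the indexed text is that query terms and indexed terms must match exactly. The sketch below illustrates this with a deliberately naive setup: the analyze and search functions are hypothetical, the documents are scanned one by one instead of going through an inverted index, and the score is a raw count of matching terms, which is only a stand-in and not Lucene's scoring algorithm.

```python
from collections import Counter


def analyze(text):
    """Stand-in analysis shared by indexing and querying."""
    return text.lower().split()


def search(query, docs):
    """Return (doc_id, score) pairs, best score first."""
    query_terms = analyze(query)  # same analysis as the documents
    results = []
    for doc_id, text in docs.items():
        term_counts = Counter(analyze(text))
        # Naive score: how often the query terms occur in the document.
        score = sum(term_counts[term] for term in query_terms)
        if score > 0:
            results.append((doc_id, score))
    return sorted(results, key=lambda hit: (-hit[1], hit[0]))


docs = {
    "a": "Solr is a search server",
    "b": "Lucene is a search library and Solr uses the Lucene library",
    "c": "ZooKeeper manages the cluster",
}
hits = search("Search LIBRARY", docs)
print(hits)  # [('b', 3), ('a', 1)]
```

Without the shared lowercasing step, the query term "LIBRARY" would not match the indexed term "library" and document "b" would score lower.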

Lucene provides:
- Text analysis that transforms a text string into a list of terms (tokenizers, filters).
- Query parser
- Scoring algorithm
- Inverted index to find documents related to an indexed term.
- Search features (query completion, query spell checker, highlighter, ...)
- Search enhancing features (faceted navigation, spatial search, ...)