Before we dive in and try to understand what Solr is, we need to know a little about Lucene.

So what is Lucene?

  • In simple terms, it is a search and index storage engine for web-scale projects.
  • It is highly scalable and, most importantly, an open source library.
  • It can handle terabytes of data.
  • Some of the companies that use Lucene are Apple and X (formerly known as Twitter).

If you are interested in the list of companies powered by Lucene, please follow the link below.

  • It helps you index and search with its powerful, accurate and efficient search algorithms.

Why not use a traditional database instead of these search engines?

  • A database can’t return results for complex queries the way a search engine does with facets, more like this, suggestions, highlighting and much more.
  • A database is mostly for structured data.
  • A database usually takes longer to return results for text searches.
  • A database has a limited set of data types to work with.
  • Facets – These are the categories by which we filter the results of our query. A common example can be seen on e-commerce sites like Amazon.
  • The toys’ age range, interest and others are all facets.
  • Query – These are the search terms or questions we enter in the search box.
  • Suggest – These are the suggestions we get for the query we search for.
  • For example, toys is the query, and toys for girls and toys for boys are the suggested keywords.
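
The three terms above (query, facets, suggest) can be illustrated with a small in-memory sketch. This is not how Lucene or Solr implement them; the catalogue, the field names (age_range, interest) and the helper functions are all hypothetical and exist only to show what each term means.

```python
from collections import Counter

# Hypothetical product catalogue; names and field values are made up for illustration.
PRODUCTS = [
    {"name": "Toy train", "age_range": "3-5", "interest": "vehicles"},
    {"name": "Toy robot", "age_range": "6-8", "interest": "science"},
    {"name": "Toy car", "age_range": "3-5", "interest": "vehicles"},
    {"name": "Story book", "age_range": "3-5", "interest": "reading"},
]


def search(query, products):
    """Query: return the products whose name contains the search term."""
    term = query.lower()
    return [p for p in products if term in p["name"].lower()]


def facet_counts(results, field):
    """Facets: count how many results fall under each value of a field."""
    return Counter(p[field] for p in results)


def suggest(prefix, products):
    """Suggest: return product names that start with what the user typed."""
    start = prefix.lower()
    return sorted(p["name"] for p in products if p["name"].lower().startswith(start))


results = search("toy", PRODUCTS)
age_facets = facet_counts(results, "age_range")
suggestions = suggest("toy", PRODUCTS)
```

Here age_facets holds the counts a shop would show next to each age-range filter, and suggestions holds the keywords it would offer while the user is still typing.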

Additional reasons for using Lucene

  • These search engines can handle millions of transactions smoothly.
  • It is platform independent, and its interoperability is excellent.
  • It can perform a wide range of operations, such as indexing and searching, ranked searching (boosting), powerful query types (phrase queries, prefix queries, wildcard queries, proximity queries), fielded searching (e.g. search by title and author), multiple-index searching with merged results, flexible (dynamic) faceting, joins, highlighting and grouping of results.
  • The best part is that we can plug in our own ranking model.

What is Solr?

  • Solr is an enterprise-grade web search platform built on top of Lucene.
  • It can also work directly with Lucene Core if required.
  • Solr uses the Lucene Java API.
  • It offers fast information retrieval in real time.
  • It is meant for real-time apps.
  • Also keep in mind that it is very different from Hadoop.

Note: Elasticsearch, another popular enterprise search platform, is also built on top of Lucene.

Solr is similar to the Google search engine, but Google Search handles far more complex queries and has its own algorithm for ranking results.

What is Hadoop?

  • It is used for analytics on large data sets.
  • It is slow and meant for offline applications.
  • The Hadoop Distributed File System (HDFS) is the storage layer of Hadoop’s architecture.
  • It is written in Java; the Thrift framework may be used to fetch files from a client.
  • Hadoop is very different from Solr.

Can we crawl web pages in Solr?

Of course we can. Lucene and Solr don’t offer it out of the box, but we can implement it with Apache Nutch.

What is a Search System?

Let’s understand a few important concepts of search systems.

Indexing, analysis and search are the three main components of Solr.

Indexing

  • Indexing is the processing of original data into a highly efficient cross-reference lookup to facilitate rapid search. Solr uses a data handler to index different types of data.
  • Think of the index like the list of words at the back of a book, i.e. a glossary with the pages where each word can be found.
  • The amount of data to search is drastically reduced by using this glossary, i.e. a lookup data structure.
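
The back-of-the-book idea above is usually called an inverted index. A minimal sketch, assuming three made-up documents and plain whitespace splitting (Lucene's real index format is far more elaborate), could look like this:

```python
from collections import defaultdict

# Hypothetical documents keyed by a document id, for illustration only.
DOCS = {
    1: "solr is built on lucene",
    2: "lucene is a search library",
    3: "hadoop is used for analytics",
}


def build_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup(index, word):
    """Return the sorted ids of the documents containing the word."""
    return sorted(index.get(word.lower(), set()))


index = build_index(DOCS)
lucene_docs = lookup(index, "lucene")
```

A search for a word is now a single dictionary lookup instead of a scan over every document, which is why the glossary makes searching so much faster.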

Analyse

  • Solr does not index text directly.
  • The text is broken into a series of individual elements called tokens.
  • Memory is required to store the glossary and the cache; how much depends on how much we want to cache in order to answer queries in real time.

What is Tokenisation?

Tokenisation is the splitting of sentences into single words, or tokens, each of which has a position, an offset and a length. In Solr we can define how we want to analyse a document and index its tokens.
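
As a rough illustration of what a token carries, here is a small sketch that splits a sentence on word characters and records the position, offset and length of each token. The Token type and the tokenize function are hypothetical; Solr's real tokenizers and filters are configured in its schema and do considerably more than this.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    text: str      # lowercased token text
    position: int  # index of the token within the sentence (0-based)
    offset: int    # character offset where the token starts
    length: int    # number of characters in the token


def tokenize(sentence):
    """Split a sentence into word tokens with position, offset and length."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", sentence)):
        word = match.group()
        tokens.append(Token(word.lower(), position, match.start(), len(word)))
    return tokens


tokens = tokenize("Solr indexes tokens")
```

Keeping the offset and length alongside each token is what later makes features such as highlighting possible, since the original text span can be located again.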

Which Document Types Does Solr Support?

JSON, XML, or any other type. We just have to tell Solr about the format.

Apache Solr is a highly flexible search platform, with an optional schemaless mode, that is capable of indexing a wide range of document types. Solr does not impose strict limitations on the types of documents it can index, and it can handle various data formats. The supported document types in Solr include, but are not limited to:

  1. Text Documents: Solr is commonly used to index and search textual data, such as articles, blog posts, web pages, PDFs, and various types of documents in plain text, HTML, or other text-based formats.
  2. JSON Documents: Solr supports JSON documents, making it easy to work with structured data in JSON format. You can index JSON documents directly into Solr and perform searches on them.
  3. XML Documents: Solr can index XML documents, including RSS feeds, XML files, and other structured data in XML format. It offers flexible options for handling XML data.
  4. CSV Data: Solr can index data in CSV (Comma-Separated Values) format, which is useful for handling tabular data or structured datasets.
  5. HTML Documents: Solr can index HTML documents, allowing you to perform full-text searches on web pages, extract content, and build search applications for web content.
  6. Office Documents: Solr can index Microsoft Office documents (Word, Excel, PowerPoint), OpenDocument formats, and other office file formats. Plugins or parsers may be required for these formats.
  7. PDF Documents: Solr supports PDF indexing, enabling full-text search within PDF documents. It can extract text and metadata from PDF files.
  8. Images: While Solr primarily focuses on text-based content, you can index image metadata, captions, and other related text using Solr’s capabilities. To index the actual content of images, you may need to use OCR (Optical Character Recognition) and associated technologies.
  9. Geospatial Data: Solr provides support for geospatial data, enabling indexing and searching of geographic information, such as latitude and longitude coordinates.
  10. Binary Data: Solr can handle binary data, but it’s common to extract metadata or textual content from binary files (like images or proprietary document formats) for indexing and searching.
  11. Custom Data Formats: Solr’s flexibility allows you to index and search custom data formats by defining custom document transformers and parsers.

Solr’s flexibility is one of its key strengths, and it can be adapted to work with various document types. To effectively handle certain document types, you may need to configure the appropriate field types, analyzers, and parsers in Solr’s schema to ensure that the data is indexed and searched correctly. Additionally, you may need to use plugins, libraries, or external tools to support specific document formats or data extraction.
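
To make the JSON case concrete, the sketch below only prepares the pieces of an indexing request and sends nothing over the network. It assumes a local Solr on its default port 8983 and a collection called "articles"; the collection name and the document fields are hypothetical and would have to match your own setup and schema.

```python
import json

SOLR_BASE_URL = "http://localhost:8983/solr"  # assumption: local Solr on the default port
COLLECTION = "articles"                       # hypothetical collection name

# Hypothetical documents; the field names must exist in (or be accepted by) your schema.
DOCUMENTS = [
    {"id": "1", "title": "What is Lucene?", "author": "Jane"},
    {"id": "2", "title": "What is Solr?", "author": "John"},
]


def build_update_request(docs, collection=COLLECTION):
    """Return the URL, headers and JSON body for posting documents to Solr."""
    url = f"{SOLR_BASE_URL}/{collection}/update?commit=true"
    headers = {"Content-Type": "application/json"}
    body = json.dumps(docs)
    return url, headers, body


url, headers, body = build_update_request(DOCUMENTS)
```

The Content-Type header is how we tell Solr about the format; the same update endpoint also accepts other formats such as XML and CSV when the header says so. The resulting url, headers and body can be sent as a POST request with any HTTP client.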

Scoring in Solr?

Every document is scored based on its matches and on the algorithm we use to return results.

  • Pure Boolean Model – If a term is not matched, no scoring is done. Only documents with matching terms are returned.
  • Vector Space Model – When you write a query, it is transformed into a vector in a multi-dimensional space. The documents are transformed into vectors in the same space, and results are returned according to how closely each document vector overlaps with the query vector. Relevance and similarity are expressed as the weights given to terms.
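
The two models can be compared with a small sketch. It is a simplification and not Lucene's actual scoring formula: the Boolean check assumes that every query term must be present, and the vector space score is a plain cosine similarity over raw term counts. The function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Turn text into a vector of term counts (one dimension per word)."""
    return Counter(text.lower().split())


def boolean_match(query, document):
    """Pure Boolean model: the document matches only if it has every query term."""
    doc_terms = set(term_vector(document))
    return all(term in doc_terms for term in term_vector(query))


def cosine_score(query, document):
    """Vector space model: cosine of the angle between query and document vectors."""
    q, d = term_vector(query), term_vector(document)
    dot = sum(q[term] * d[term] for term in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


close = cosine_score("solr search", "solr search platform")
far = cosine_score("solr search", "hadoop analytics solr")
```

The Boolean model only answers yes or no, while the cosine score lets us rank: close is higher than far because the first document shares both query terms and the second shares only one.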
