Apache Solr use cases
Some interesting use cases for Apache Solr are listed below.
- Drupal integration
- Hathi trust
- Auto suggestions
- Spatial search application
- Clustering support with carrot2
- Near real time search
- Loggly = full text search in logs
- Solandra = SOLR + Cassandra
- Category Browsing through the facets
- Jetwick – open Twitter search
- Plaxo – online address management
- Replace FAST or Google search
- Search application prototyping
- Solr as a whitelist
Drupal Integration:
This use case can be seen as a generic one for integrating Solr into PHP projects. For PHP integration you can choose the HTTP interface for querying and retrieve the results as XML or JSON.
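The HTTP interface works the same way from any language. The sketch below uses Python rather than PHP purely for illustration. The base URL and the core name in the usage note are assumptions, and `/select` with `wt=json` is Solr's standard query handler and JSON response writer.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(base_url, query, rows=10):
    """Build a URL for Solr's standard /select handler, asking for JSON."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/select?{urlencode(params)}"


def parse_docs(raw_json):
    """Extract the matching documents from a Solr JSON response body."""
    return json.loads(raw_json)["response"]["docs"]


def search(base_url, query, rows=10):
    """Run the query over HTTP (needs a reachable Solr instance)."""
    with urlopen(build_select_url(base_url, query, rows)) as resp:
        return parse_docs(resp.read().decode("utf-8"))
```

With a running instance, something like `search("http://localhost:8983/solr/drupal", "title:solr")` would return the matching documents as a list of dictionaries.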
Hathi Trust:
Hathi Trust is a good example that proves Solr’s ability to search big digital libraries. To quote directly from the article: “the index of our one million book index is over 200 gigabytes. So we expect to end up with a two terabyte index for 10 million books.”
Auto Suggestions:
There are two approaches to implementing auto-suggestion (also known as auto-completion) with Solr.
If you want to push it to the extreme, you can keep the Lucene index entirely in RAM.
Spatial Search Applications:
Spatial search can be helpful in several ways, e.g. for fingerprint search, bioinformatics, or facial search. The easiest way is implemented in Jetwick to prevent duplicate tweets, but its cost is O(n), where n is the number of query terms. This is fine for 10 or fewer terms, but O(1) would be better. The technique that achieves this is known as locality-sensitive hashing.
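To make the idea concrete, here is a minimal sketch of locality-sensitive hashing using MinHash signatures split into bands. It is not Jetwick's implementation. The class name, the hash function, and the number of hashes and bands are all assumptions chosen for illustration. The point is that a lookup costs a fixed number of bucket accesses (one per band), no matter how many terms the document has.

```python
import hashlib
from collections import defaultdict


def _hash(seed, term):
    """Deterministic 64-bit hash of a term for a given seed."""
    data = f"{seed}:{term}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(terms, num_hashes=16):
    """For each seed keep the smallest hash over all terms (terms must be non-empty)."""
    return tuple(min(_hash(seed, t) for t in terms) for seed in range(num_hashes))


def band_keys(signature, bands=8):
    """Split the signature into bands; each band becomes one bucket key."""
    rows = len(signature) // bands
    return [(b, signature[b * rows:(b + 1) * rows]) for b in range(bands)]


class LshIndex:
    """Buckets documents so that similar ones are likely to share a bucket."""

    def __init__(self):
        self.buckets = defaultdict(set)

    def add(self, doc_id, terms):
        for key in band_keys(minhash_signature(terms)):
            self.buckets[key].add(doc_id)

    def candidates(self, terms):
        """Collect candidate near-duplicates with one dict access per band."""
        found = set()
        for key in band_keys(minhash_signature(terms)):
            found |= self.buckets.get(key, set())
        return found
```

Candidates returned this way are only likely duplicates, so a real application would still compare them with the new document before discarding anything.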
It is built with open source software, and its “zero click” information is served by Solr using the dismax query handler.
Clustering Support with Carrot2:
This is one of the contributed plugins of Solr. With Carrot2 you can add clustering. Clustering is the assignment of a set of observations into subsets, called clusters, so that the observations in the same cluster are similar in some sense. The figure below is a visual example from Carrot2.
Near Real Time Search:
Solr isn’t real time yet, but you can tune it to the point where it becomes near real time. That means the time (the ‘real-time latency’) a document needs to become searchable after it is indexed stays below 60 seconds, even if you update frequently. To do this, you can set up two indices: one write-only index “W” for the indexer and one read-only index “R” for your application. Index R points to the same data directory as W, which has to be configured in the solrconfig.xml of R.
To make sure that your users and the R index see the documents indexed by W, you need to trigger an empty commit every minute.
wget -q "http://localhost:port/solr/update?stream.body=%3Ccommit/%3E" -O /dev/null
Every time such a commit is triggered, a new searcher without any cache entries is created. This can hurt performance for visitors who hit the empty cache directly after the commit, but you can fill the cache with static queries with the help of the newSearcher entry in your solrconfig.xml. In addition, the autowarmCount property has to be tuned, which fills the cache of a new searcher from old entries.
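If you prefer a small script over a cron job, the once-a-minute empty commit can be sketched as follows. This assumes the same `stream.body` update URL as the wget call above. The function names, and the port and path used in the usage note, are assumptions.

```python
import time
from urllib.request import urlopen


def commit_url(host, port, path="solr"):
    """URL that sends an empty <commit/> via the stream.body parameter."""
    return f"http://{host}:{port}/{path}/update?stream.body=%3Ccommit/%3E"


def commit_forever(url, interval_seconds=60):
    """Trigger an empty commit on every interval (runs until interrupted)."""
    while True:
        with urlopen(url) as resp:
            resp.read()  # discard the response body, like -O /dev/null
        time.sleep(interval_seconds)
```

Calling `commit_forever(commit_url("localhost", 8983))` would then keep index R at most about a minute behind W.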
Loggly = Full text search in the logs:
Feeding log files into Solr and searching them in near real time shows that Solr can handle large amounts of data and query that data quickly. I’ve set up a simple project where I’m doing similar things, but Loggly has done a lot more to make the same task real-time and distributed. You need to keep the write index as small as possible, otherwise the commit time will increase too much.
Loggly creates a new Solr index every 5 minutes and includes it when searching, using the distributed capabilities of Solr. They merge the cores to keep the number of indices small, but this is not as simple as it sounds. Watch this video to get some details about their work.
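A rough sketch of how a query could span several such 5-minute indices: Solr's distributed search takes a `shards` parameter with a comma-separated list of `host:port/path` entries. The core naming scheme (`logs_YYYYMMDD_HHMM`) and the helper names below are hypothetical and are not taken from Loggly.

```python
from datetime import datetime, timedelta


def window_start(ts, minutes=5):
    """Round a timestamp down to the start of its time window."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)


def shards_param(host, start, end, minutes=5):
    """Value for Solr's `shards` parameter covering every window in [start, end]."""
    shards = []
    current = window_start(start, minutes)
    while current <= end:
        core = current.strftime("logs_%Y%m%d_%H%M")  # hypothetical naming scheme
        shards.append(f"{host}/solr/{core}")
        current += timedelta(minutes=minutes)
    return ",".join(shards)
```

The resulting string would be passed as `shards=...` alongside the usual `q` parameter, so that one request is fanned out to all cores in the requested time range.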
Solandra = Solr + Cassandra:
Solandra combines Solr and the distributed database Cassandra, which was created by Facebook for its inbox search and then open sourced. At the moment Solandra is not intended for production use. There are still some bugs, and the distributed limitations of Solr apply to Solandra as well. The developers are working hard to make Solandra better.
Jetwick can now run on Solandra just by changing the solrconfig.xml. Solandra also has the advantages of being real-time (no optimize, no commit!) and distributed without any major setup involved. The same is true for Solr Cloud.
Category Browsing through Facets:
Solr offers facets, which make it easy to show the user some useful filter options like those shown in the “Drupal integration” example. It is even possible to browse a deep category tree. The main advantage here is that the categories depend on the query. This way the user can further filter the search results with the category tree you provide. Here is an example where this feature is implemented for one of the biggest second-hand stores in Germany. A click on ‘Schauspieler’ shows its sub-items.
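As a small sketch of query-dependent category counts: the parameters below are Solr's standard faceting parameters, while the field name `category` and the helper names are assumptions. In the default JSON format Solr returns facet values as a flat list of alternating values and counts, which the second function turns into a dictionary.

```python
def facet_params(query, field="category", selected=None):
    """Query parameters requesting facet counts, optionally filtered by a category."""
    params = [
        ("q", query),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.mincount", 1),  # hide categories without matches
    ]
    if selected:
        # A filter query narrows the results to the clicked category.
        params.append(("fq", f'{field}:"{selected}"'))
    return params


def facet_counts(response, field="category"):
    """Turn Solr's flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))
```

Because the counts are computed for the current query and filters, the category list shrinks or grows with every search, which is what makes the browsing query-dependent.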
Jetwick – Open Twitter Search:
You may have noticed that Twitter is using Lucene under the hood. Twitter has a quite extreme use case: over 1,000 tweets per second and over 12,000 queries per second, yet the real-time latency is under 10 seconds! However, the relevancy at that volume is often not that good in my opinion. Twitter search often contains a lot of duplicates and noise.
I’m mentioning Jetwick here because it makes extensive use of facets, which provide all the filters to the user. Facets are used for the RSS-like feature (saved searches), for the various filters such as language and retweet count on the left, and to get trending terms and links on the right.
Plaxo – Online Address Management:
Plaxo.com, which is now owned by Comcast, hosts addresses online for more than 40 million people and offers smart search through those addresses with the help of Solr. Plaxo tries to get the latest ‘social’ information about your contacts through blog posts, tweets, etc. Plaxo also tries to reduce duplicates.
Replace FAST or Google Search:
A number of users report that they have migrated from a commercial search solution like FAST or Google Search Appliance (GSA) to Solr (or Lucene). The reasons for that migration vary: FAST dropped Linux support, and Google can cause integration problems. The main reason for me is that Solr isn’t a black box: you can tweak the source code, maintain old versions, and fix your bugs faster!
Search Application Prototyping:
With the help of the already integrated Velocity plugin and the data import handler it is possible to create an application prototype for your search within a few hours. The next version of Solr makes Velocity easier to use.
Solr as a Whitelist:
Imagine you are the new Google and you have a lot of different types of data to display, e.g. ‘news’, ‘video’, ‘music’, ‘maps’, ‘shopping’ and many more. Some of those types can only be retrieved from legacy systems, and you only want to show the most appropriate types based on your business logic. E.g. a query which contains ‘New York City’ should lead to results from ‘maps’, but ‘new yorker’ should prefer results from the ‘shopping’ type.
With Solr you can set up a whitelist index that helps to decide which type is more important for the search query. For example, if you get more, or more relevant, results for the ‘shopping’ type, then you should prefer results from this type. Without the whitelist index, i.e. with all data in separate indices or systems, it would be almost impossible to compare the relevancy.
The whitelist index can be used as shown in the following steps.
- Query the whitelist index
- Decide which data types to display
- Query the sub-systems
- Display results from the selected types only
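The four steps above can be sketched roughly as follows. This is only one possible reading: it assumes each whitelist hit carries a `type` and a relevance `score`, and it ranks the types by their summed score. The function names and the callables passed in are hypothetical stand-ins for the real whitelist index and sub-systems.

```python
from collections import defaultdict


def select_types(whitelist_hits, max_types=1):
    """Step 2: rank data types by the summed score of their whitelist hits."""
    totals = defaultdict(float)
    for hit in whitelist_hits:
        totals[hit["type"]] += hit["score"]
    ranked = sorted(totals, key=lambda t: totals[t], reverse=True)
    return ranked[:max_types]


def whitelist_search(query, query_whitelist, subsystems, max_types=1):
    """Steps 1-4: query the whitelist, pick types, query only those sub-systems."""
    hits = query_whitelist(query)             # step 1
    chosen = select_types(hits, max_types)    # step 2
    # Steps 3 and 4: only the selected sub-systems are queried and returned.
    return {t: subsystems[t](query) for t in chosen}
```

Summing scores is a simple choice that rewards both more hits and more relevant hits. Any other business rule could be plugged into `select_types` instead.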
Solr is also useful in scientific applications, such as DNA search systems. It can be used with completely different alphabets, so that it is possible to query nucleotide sequences instead of words.
Another idea is to use Solr to build a very personalized search. Every user could drag and drop any websites they want and query them afterwards. E.g. most of the time I only need Stack Overflow, some mailing lists and some wikis in the results, but normal web search engines return results that are too cluttered. My final idea is a Solr-based app: a Lucene/Solr implementation of desktop search.