Posts

Showing posts from 2013

Tika File Digester

I had a requirement to create a module that reads, persists and indexes the content of any kind of document or file. I spent a few hours trying out several APIs and concluded that Apache Tika is the most suitable one for file digesting and content extraction. It supports a large number of MIME types. The version I am using is Apache Tika 1.4. It supports the following document formats:

HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image formats
Video formats
Java class files and archives
The mbox format

Maven dependency

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>...</version>
</dependency>
<dependency>...

Solr Cloud 4.5

SolrCloud

SolrCloud is designed to provide a highly available, fault-tolerant environment that can index data for searching. In this environment, data can be organized into multiple pieces, or shards, that can reside on several machines, with replicas providing redundancy for both scalability and fault tolerance. Integration with ZooKeeper helps manage the overall environment so that both indexing and search requests are routed properly.

The following section explains:

How to distribute data over multiple instances by using ZooKeeper and creating shards.
How to create redundancy for shards by using replicas.
How to create redundancy for the overall cluster by running multiple ZooKeeper instances.

This section explains SolrCloud and its inner workings in detail, but before you dive in, it's best to have an idea of what it is you're trying to accomplish. This page provides a simple tutorial that explains how SolrCloud works on a p...
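
To make the idea of shards and replicas a little more concrete, here is a minimal sketch in plain Java. It is not Solr code and does not call any Solr or ZooKeeper API: the class name, both helper methods and the node naming scheme are hypothetical, and the hash is simply `String.hashCode`, which is an assumption made for illustration (Solr's own document router hashes ids differently). It only shows the general shape: a document id is mapped to exactly one shard, and every shard is hosted by several replicas so that losing one node does not lose the data.

```java
import java.util.ArrayList;
import java.util.List;

public class ShardRoutingSketch {

    // Hypothetical helper: map a document id to one of numShards shards.
    // Math.floorMod keeps the result in [0, numShards) even when hashCode() is negative.
    public static int shardFor(String docId, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        return Math.floorMod(docId.hashCode(), numShards);
    }

    // Hypothetical helper: list the replicas that host a given shard.
    // The "shardN_replicaM" names are illustrative labels only.
    public static List<String> nodesFor(int shard, int replicationFactor) {
        List<String> nodes = new ArrayList<>();
        for (int r = 0; r < replicationFactor; r++) {
            nodes.add("shard" + (shard + 1) + "_replica" + (r + 1));
        }
        return nodes;
    }

    public static void main(String[] args) {
        int numShards = 2;          // data is split into two pieces
        int replicationFactor = 2;  // each piece is stored twice

        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            int shard = shardFor(id, numShards);
            System.out.println(id + " -> shard" + (shard + 1)
                    + " hosted on " + nodesFor(shard, replicationFactor));
        }
    }
}
```

In a real cluster this bookkeeping is not done by hand: the cluster state kept in ZooKeeper tells each node which shards and replicas exist, which is what allows indexing and search requests to be routed to the right place.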