Tika File Digester
I had a requirement to create a module that reads, persists, and indexes the content of any kind of document or file. I spent a few hours trying out several APIs and concluded that Apache Tika is the best fit for file digesting and content extraction. It supports a large number of MIME types. The version I have been using is Apache Tika 1.4, which supports the following document formats:

- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format

Maven dependency

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>...</version>
    </dependency>
    <dependency>...