Tika File Digester
I was asked to create a module that reads, persists, and indexes the content of any kind of document or file. I tried out several APIs, spending a few hours playing around with each, and found Apache Tika to be the most suitable one for file digesting and content extraction. It supports a large number of MIME types.
The version I have been using is Apache Tika 1.4. It supports the following document formats:
- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format
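To get a feel for how these formats map to MIME types, here is a minimal sketch that uses the `Tika` facade class from tika-core. The class name `TikaDetectExample`, the helper `detectByName`, and the sample file names other than `my_test.docx` are made up for illustration. Detection here is based on the file name only, so treat the result as a guess rather than a verified type.

```java
import org.apache.tika.Tika;

public class TikaDetectExample {

    // Hypothetical helper: guesses the MIME type from the file name alone.
    static String detectByName(String fileName) {
        // Tika.detect(String) looks only at the name, so no file has to exist.
        return new Tika().detect(fileName);
    }

    public static void main(String[] args) {
        // Sample names are assumptions used only for this illustration.
        String[] names = {"my_test.docx", "report.pdf", "photo.jpg"};
        for (String name : names) {
            System.out.println(name + " -> " + detectByName(name));
        }
    }
}
```

For a more reliable answer, the facade's `detect` can also be given a `File` or an `InputStream`, in which case Tika inspects the leading bytes of the content as well.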
Maven dependency
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>...</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>...</version>
</dependency>
Tika as a command line utility
Tika ships with its own command line utility for trying it out, which is what first attracted me to Tika.
The following command opens a GUI to play with Tika:
usage: java -jar tika-app.jar -g
Read Files using Tika
The following Java code shows how to read, or digest, a document using Tika.
public void fileContentRead() throws IOException, SAXException, TikaException
{
String fileName = "my_test.docx";
The output of this method can look like this:
Easy Shooting Mode : Manual
Image Type : Canon EOS 350D DIGITAL
Model : Canon EOS 350D DIGITAL
Metering Mode : Evaluative
Quality : Fine
Shutter/Auto Exposure-lock Buttons : AF/AE lock
ISO Speed Ratings : 400
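The listing above consists of metadata name/value pairs. Below is a minimal sketch of code that could print pairs in that shape, assuming Tika 1.x's `AutoDetectParser`, `BodyContentHandler`, `Metadata`, and `ParseContext` classes. The class name `TikaFileDigester` and the helper `readMetadata` are hypothetical and are not taken from the original method. Which metadata names appear depends entirely on the file type and on the parsers available on the classpath.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.TreeMap;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class TikaFileDigester {

    // Hypothetical helper: parses the file and returns its metadata sorted by name.
    static Map<String, String> readMetadata(String fileName)
            throws IOException, SAXException, TikaException {
        Metadata metadata = new Metadata();
        // -1 disables the handler's default write limit on extracted text.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream input = new FileInputStream(fileName)) {
            // AutoDetectParser detects the type and delegates to a matching parser.
            new AutoDetectParser().parse(input, handler, metadata, new ParseContext());
        }
        // After parsing, handler.toString() would hold the extracted body text.
        Map<String, String> result = new TreeMap<>();
        for (String name : metadata.names()) {
            // Metadata.get returns the first value stored under the name.
            result.put(name, metadata.get(name));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String fileName = args.length > 0 ? args[0] : "my_test.docx";
        for (Map.Entry<String, String> entry : readMetadata(fileName).entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }
}
```

Returning the pairs as a map, rather than printing them inside the parsing method, keeps the parsing step reusable when the values need to be persisted or indexed later.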
Looking ahead, I'm focusing on integrating Apache Tika with a Solr server so that files are digested and their contents automatically indexed in Solr, which is my ultimate goal.