Tika File Digester

- November 12, 2013

I had a requirement to create a module that reads, persists and indexes the content of any kind of document or file. I tried out several APIs and spent a few hours playing around with them, and concluded that Apache Tika is the most suitable API for file digesting and content extraction. It supports a large number of MIME types.

The version I am currently using is Apache Tika 1.4, which supports a wide range of document formats.

Maven dependency

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>...</version>
</dependency>

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>...</version>
</dependency>

Tika as a command line utility

Tika ships with its own command line utility for trying it out, which is what first attracted me to Tika.

The following command opens a GUI for playing with Tika.

usage: java -jar tika-app.jar -g

Read Files using Tika

The following Java code shows how to read, or digest, a document using Tika.

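    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.SAXException;

    public void fileContentRead() throws IOException, SAXException, TikaException {
        String fileName = "my_test.docx";
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        Metadata metadata = new Metadata();
        // Auto detect the parser according to the MIME type of the file.
        Parser parser = new AutoDetectParser();
        long start = System.currentTimeMillis();
        // The handler writes the extracted body text into the byte buffer.
        BodyContentHandler handler = new BodyContentHandler(os);

        // try-with-resources closes the stream even if parsing fails.
        try (InputStream stream = new FileInputStream(fileName)) {
            System.out.println("Stream Data :" + stream.available());
            parser.parse(stream, handler, metadata, new ParseContext());

            // Print every metadata field Tika extracted, one per line.
            for (String name : metadata.names()) {
                System.out.println(name + ":\t" + metadata.get(name));
            }
            System.out.println("\n\nContent :\n" + new String(os.toByteArray()));
            System.out.println(String.format("------------ Processing took %d millis\n\n",
                    System.currentTimeMillis() - start));
        }
    }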

The metadata part of the output looks like the following (this sample comes from a photo taken with a Canon camera):

Easy Shooting Mode : Manual
Image Type : Canon EOS 350D DIGITAL
Model : Canon EOS 350D DIGITAL
Metering Mode : Evaluative
Quality : Fine
Shutter/Auto Exposure-lock Buttons : AF/AE lock
ISO Speed Ratings : 400
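
When only the MIME type and the plain text are needed, the Tika facade class can be a shorter route than wiring up the parser, handler and metadata by hand. The following is a minimal sketch and not a replacement for the method above. It assumes both tika-core and tika-parsers are on the classpath. The class name TikaFacadeSketch and its two helper methods are made up for illustration, and my_test.docx is the same placeholder file name used earlier.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class, named here only for illustration.
public class TikaFacadeSketch {

    // Returns the MIME type Tika detects from the file content and file name.
    static String detectType(Tika tika, File file) throws IOException {
        return tika.detect(file);
    }

    // Extracts the plain-text body; the parser is chosen automatically.
    static String extractText(Tika tika, File file) throws IOException, TikaException {
        return tika.parseToString(file);
    }

    public static void main(String[] args) throws IOException, TikaException {
        // Falls back to the placeholder file name when no argument is given.
        File file = new File(args.length > 0 ? args[0] : "my_test.docx");
        Tika tika = new Tika();
        System.out.println("Type    : " + detectType(tika, file));
        System.out.println("Content :\n" + extractText(tika, file));
    }
}
```

Note that parseToString truncates long documents at a default limit, which can be changed with setMaxStringLength on the Tika instance. This sketch also does not print the metadata, so the AutoDetectParser approach above is still the one to use when the metadata fields matter.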

Looking ahead, my ultimate goal is to integrate Apache Tika with a Solr server so that files are digested and their contents are indexed in Solr automatically.
