Tika File Digester




I was asked to build a module that reads, persists, and indexes the content of any kind of document or file. After spending a few hours trying out several APIs, I found Apache Tika to be the most suitable one for file digesting and content extraction. It supports a large number of MIME types.

The version I am currently using is Apache Tika 1.4, which supports a wide range of document formats.


Maven dependency

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>...</version>
</dependency>

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>...</version>
</dependency>


Tika as a command line utility

Tika ships with its own command-line utility for trying it out, which is what first attracted me to it.
The following command opens a GUI for experimenting with Tika.

usage: java -jar tika-app.jar -g



Read Files using Tika

The following Java code shows how to read (digest) a document using Tika.

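import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

    public void fileContentRead() throws IOException, SAXException, TikaException
    {
        String fileName = "my_test.docx";

        ByteArrayOutputStream os = new ByteArrayOutputStream();
        Metadata metadata = new Metadata();
        // Auto-detect the parser according to the MIME type of the file.
        Parser parser = new AutoDetectParser();
        // The handler writes the extracted body text into the byte buffer.
        BodyContentHandler handler = new BodyContentHandler(os);

        long start = System.currentTimeMillis();

        // try-with-resources closes the stream even if parsing fails.
        try (InputStream stream = new FileInputStream(fileName))
        {
            // available() is only an estimate of the bytes readable without blocking.
            System.out.println("Stream Data :" + stream.available());
            parser.parse(stream, handler, metadata, new ParseContext());
        }

        // Print every metadata field that Tika extracted.
        for (String name : metadata.names())
        {
            System.out.println(name + ":\t" + metadata.get(name));
        }

        System.out.println("\n\nContent :\n" + new String(os.toByteArray()));

        System.out.println(String.format("------------ Processing took %s millis\n\n",
             System.currentTimeMillis() - start));
    }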

For an image taken with a digital camera, for example, the metadata part of the output can look like this:

Easy Shooting Mode : Manual
Image Type : Canon EOS 350D DIGITAL
Model : Canon EOS 350D DIGITAL
Metering Mode : Evaluative
Quality : Fine
Shutter/Auto Exposure-lock Buttons : AF/AE lock
ISO Speed Ratings : 400
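
The AutoDetectParser used above picks a parser based on the detected MIME type of the file. To give a rough feel for what content-based detection means, here is a minimal sketch that uses only the Java standard library. It is not Tika's implementation: the class name MagicSniffer, its methods, and the tiny hand-picked signature table are all hypothetical and only illustrate the idea of matching the leading "magic" bytes of a file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of magic-byte sniffing; NOT part of Apache Tika.
public class MagicSniffer {

    // Hand-picked leading-byte signatures; a real detector knows far more.
    private static final byte[] PDF  = {'%', 'P', 'D', 'F'};
    private static final byte[] ZIP  = {'P', 'K', 3, 4};
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};

    /** Guesses a MIME type from the leading bytes of some content. */
    public static String sniffBytes(byte[] data) {
        if (startsWith(data, PDF))  return "application/pdf";
        if (startsWith(data, ZIP))  return "application/zip";
        if (startsWith(data, JPEG)) return "image/jpeg";
        return "application/octet-stream"; // unknown content
    }

    /** Reads up to four bytes from the stream and sniffs them. */
    public static String sniff(InputStream in) throws IOException {
        byte[] head = new byte[4];
        int total = 0;
        // A single read() may return fewer bytes than requested, so loop.
        while (total < head.length) {
            int r = in.read(head, total, head.length - total);
            if (r < 0) break; // end of stream
            total += r;
        }
        return sniffBytes(Arrays.copyOf(head, total));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(new ByteArrayInputStream(pdf))); // prints application/pdf
    }
}
```

Note that this sketch would report a .docx file as application/zip, because a .docx is a ZIP container. Telling such formats apart takes more than the first few bytes, which is one reason to leave detection to a library like Tika.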

Looking ahead, my ultimate goal is to integrate Apache Tika with a Solr server so that files are digested and their contents are indexed in Solr automatically.


