The road so far….

May 20, 2010

Apache Tika = Power to Parse Almost Everything

Filed under: java, lucene — Rahul Sharma @ 10:19 pm

The other day I was browsing through the subprojects that have evolved under Lucene. There are quite a few of them, organised for varied purposes, and that is how I landed at Apache Tika. Initially it sounded like just another parser, but later, when I started playing with it, it turned out to be a lot of fun.

Apache Tika

The Tika project is quite powerful: it can parse many types of documents in order to extract their metadata and structured text. It combines various existing libraries to extract data and text from different kinds of documents. Just as the hood of a car hides the complexity of the components underneath, the Tika library hides the complexity of parsing and presents a very simple API for reading data from every kind of supported file.

It has a Parser interface that is used to parse a document and read its data. You create an instance of a Parser and call its parse API, with a few arguments, to get the required data. The project currently supports around 14 different formats, e.g. text, PDF, audio, video, images, archives, MS Office, XML, emails etc. For each supported format the package has a Parser implementation that it employs to read that format.

Besides this usual set of parsers, one per supported type, it also has an AutoDetectParser, which internally uses a detector to determine the format of the document being parsed and then automatically delegates to the appropriate parser.

Using the parser API is quite simple; the parse method takes four arguments:

  • InputStream : the stream of data to be parsed
  • ContentHandler : the handler into which the parser writes the structured text
  • Metadata : a metadata object in which the parser returns the metadata it finds
  • ParseContext : context information you would like to pass on to the parser
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.junit.Test;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

@Test
public void testTikaRoundOne() throws Exception {
  Parser parser = new AutoDetectParser();
  parseLocationUsingParser("Tika_EN.txt", parser);
  parseLocationUsingParser("Tika_EN.pdf", parser);
  parseLocationUsingParser("Tika_EN.pdf", new PDFParser());
  parseLocationUsingParser("TestTikka.class", parser);
}

private void parseLocationUsingParser(String resourceLocation, Parser parser)
    throws IOException, SAXException, TikaException {
  InputStream input = this.getClass().getClassLoader()
      .getResourceAsStream(resourceLocation);
  // The handler collects the structured text; the metadata object is filled
  // in by the parser as it discovers properties of the document.
  ContentHandler textHandler = new BodyContentHandler();
  Metadata metadata = new Metadata();
  parser.parse(input, textHandler, metadata, new ParseContext());
  input.close();
  System.out.println("Title: " + metadata.get("title"));
  System.out.println("Author: " + metadata.get("Author"));
  System.out.println("content: " + textHandler.toString());
  int count = 1;
  for (String name : metadata.names()) {
    System.out.println("Metadata " + count + " ----> name : "
        + name + "  value : " + metadata.get(name));
    count++;
  }
}

The Parser just needs a stream of data to analyse; it is not concerned with the specifics of the stream, i.e. where it comes from or how it was opened. The code using the API takes care of the stream's maintenance, i.e. opening and closing it. Since the parser works on streams, it can parse any supported format for which a stream is available, no matter where the bytes originate; for example, you can point it at a web page:

@Test
public void testTikaRoundTwoParseSite() throws Exception {
  // Spring's UrlResource gives us an InputStream over the page contents.
  Resource resource = new UrlResource("https://devlearnings.wordpress.com");
  InputStream input = resource.getInputStream();
  ContentHandler textHandler = new BodyContentHandler();
  Metadata metadata = new Metadata();
  Parser parser = new AutoDetectParser();
  parser.parse(input, textHandler, metadata, new ParseContext());
  input.close();
  System.out.println("Title: " + metadata.get("title"));
  System.out.println("metadata: " + metadata);
  System.out.println("content: " + textHandler.toString());
}

Here, using Spring's Resource abstraction, I have obtained a handle to the URL as a UrlResource and presented its stream to the parser.
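
One caveat: since the calling code owns the stream, a more defensive variant of the test above would close it in a finally block, so that a parse failure cannot leak the underlying connection (a minimal sketch of the same test body):

InputStream input = resource.getInputStream();
try {
  parser.parse(input, textHandler, metadata, new ParseContext());
} finally {
  input.close(); // closed even if the parser throws
}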

The metadata that the parser generates conforms to the Dublin Core metadata names, along with some format-specific names.
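
For instance, with the metadata object from the examples above, lookups can use constants instead of raw strings (assuming the Tika 0.x API of the time, where the Metadata class exposes the Dublin Core names as constants):

String title = metadata.get(Metadata.TITLE);     // equivalent to metadata.get("title")
String creator = metadata.get(Metadata.CREATOR); // the Dublin Core "creator" name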

Add the Power Of Lucene

Using Tika we can parse any supported format and generate the corresponding structured text and metadata. This structured text and metadata can then be fed to the Lucene engine to build an index: each Lucene Document can contain the metadata fields along with the content. Just make sure you do not put null fields into the Document, or you will get NullPointerExceptions.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;

Document getDocument(ParsedData data) {
  Field authorField = getField("author", data.author);
  Field locField = getField("location", data.location);
  Field titleField = getField("title", data.title);
  Field contentField = getField("content", data.content);
  // Nothing usable was extracted, so there is no document to index.
  if (authorField == null && locField == null && titleField == null
      && contentField == null) {
    return null;
  }
  Field[] fields = new Field[] { authorField, locField, titleField,
      contentField };
  Document doc = new Document();
  for (Field field : fields) {
    if (field != null) {
      doc.add(field); // only non-null fields go into the document
    }
  }
  return doc;
}

private Field getField(String fname, String data) {
  if (data == null || fname == null) {
    return null;
  }
  // Store the value and analyse it so it is both searchable and retrievable.
  return new Field(fname, data, Store.YES, Index.ANALYZED);
}

In this way we can generate a comprehensive index derived from the various attributes of varied source data.
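
To close the loop, here is a minimal sketch of how the generated documents could be fed into a Lucene index (assuming the Lucene 3.x API of the time; the method name indexParsedData and the parsedDataList collection of Tika parsing results are hypothetical):

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

void indexParsedData(List<ParsedData> parsedDataList) throws IOException {
  // Build the index from the documents produced by getDocument() above.
  IndexWriter writer = new IndexWriter(
      FSDirectory.open(new File("tika-index")), // index directory (hypothetical path)
      new StandardAnalyzer(Version.LUCENE_30),  // analyses the ANALYZED fields
      true,                                     // create a fresh index
      IndexWriter.MaxFieldLength.UNLIMITED);
  for (ParsedData data : parsedDataList) {
    Document doc = getDocument(data);
    if (doc != null) { // skip entries where every field was null
      writer.addDocument(doc);
    }
  }
  writer.close();
}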

4 Comments »

  1. How can we extract charts and tables from a PDF document using Tika?

    Comment by supriya raikhelkar — September 20, 2010 @ 2:16 pm

    • You can parse the PDF (containing charts/tables) using the built-in PDFParser or the AutoDetectParser, as shown in the examples above. The table and chart values come back as part of the extracted text in the ContentHandler.

      Comment by Rahul Sharma — September 20, 2010 @ 10:09 pm

  2. Hello there,

    I’m actually doing research on how a document search can be built, and I landed at Apache Tika. I’m not sure how to get started with it. I’m a PHP/Python developer with a Linux and standard Apache HTTP server background.
    What background is needed in order to work with this technology?

    Thank You
    Sai

    Comment by Sai — March 3, 2011 @ 1:00 pm

    • Hi Sai,

      Tika is a Java API that can be used to read different types of documents.
      If you are looking to make documents searchable, there are two steps: read the documents, and then index them for search.
      Since Tika is only available in Java, it may not work for you; I am not sure what you could use to read the documents from PHP or Python.
      For making the documents searchable you can use PyLucene, a Python port of Lucene.

      regards
      Rahul

      Comment by Rahul Sharma — March 3, 2011 @ 9:57 pm

