Is it possible to extract text by page for Word/PDF files using Apache Tika?


All the documentation I can find seems to suggest that I can only extract the entire file's content. I need to extract pages individually. Do I need to write my own parser for that? Is there some obvious method that I am missing?

Actually, Tika does handle pages (at least in PDF) by sending the elements <div><p> before a page starts and </p></div> after a page ends. You can easily set up a page count in your handler using this (just counting pages using only <p>):

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public abstract class MyContentHandler implements ContentHandler {
    // Element that marks a page boundary in the events Tika sends
    private String pageTag = "p";
    protected int pageNumber = 0;
    ...

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        if (pageTag.equals(qName)) {
            startPage();
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (pageTag.equals(qName)) {
            endPage();
        }
    }

    protected void startPage() throws SAXException {
        pageNumber++;
    }

    protected void endPage() throws SAXException {
        // Nothing to do by default; subclasses can override this hook
    }
    ...
}

When doing this with PDF you may run into a problem where the parser doesn't send the text lines in the proper order - see "Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood)" on how to handle this.
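To actually collect the text of each page rather than only count pages, the same idea can be extended to buffer characters between the page start and end events. The following is a minimal sketch, not anything Tika ships: `PageTextHandler` and `getPages` are hypothetical names, and it assumes, as described above, that the element passed in as `pageTag` wraps exactly one page. It relies only on the standard `org.xml.sax` classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): stores the text of each page separately.
class PageTextHandler extends DefaultHandler {
    private final String pageTag;                 // element assumed to wrap one page
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current;                // null while outside a page

    PageTextHandler(String pageTag) {
        this.pageTag = pageTag;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (pageTag.equals(qName)) {
            current = new StringBuilder();        // a new page begins
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);    // keep only text inside a page
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (pageTag.equals(qName) && current != null) {
            pages.add(current.toString().trim()); // the page is complete
            current = null;
        }
    }

    // Read-only view of the collected page texts, in document order.
    List<String> getPages() {
        return Collections.unmodifiableList(pages);
    }
}
```

With Tika on the classpath, a handler like this could be passed as the ContentHandler argument of `Parser.parse(stream, handler, metadata, context)`, after which `getPages().get(0)` would hold the first page's text. Which element marks a page depends on the Tika version: as far as I know, more recent releases wrap each PDF page in `<div class="page">` and use `<p>` for paragraphs, so it is worth dumping the raw XHTML events once and adjusting the check (for example by inspecting the `class` attribute in `atts`) before relying on the count.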

