Is it possible to extract text by page for Word/PDF files using Apache Tika?
All the documentation I can find seems to suggest that I can only extract the entire file's content, but I need to extract pages individually. Do I need to write my own parser for that? Is there some obvious method that I am missing?
Actually, Tika does handle pages (at least in PDF) by sending the elements <div><p> before a page starts and </p></div> after a page ends. You can set up a page count in your handler using this (just counting pages using <p>):
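    import org.xml.sax.Attributes;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public abstract class MyContentHandler implements ContentHandler {

        // Element whose start/end marks a page boundary
        private String pageTag = "p";
        protected int pageNumber = 0;

        ...

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if (pageTag.equals(qName)) {
                startPage();
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            if (pageTag.equals(qName)) {
                endPage();
            }
        }

        // Called when a page begins; subclasses can override to reset per-page state
        protected void startPage() throws SAXException {
            pageNumber++;
        }

        // Called when a page ends; subclasses can override to flush per-page text
        protected void endPage() throws SAXException {
            return;
        }

        ...
    }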
When doing this with PDF files you may run into a problem where the parser doesn't send text lines in the proper order. See "Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood)" on how to handle this.
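For what it's worth, here is a minimal sketch of the same idea taken one step further: collecting the text of each page instead of only counting pages. It assumes the XHTML stream wraps each page in `<div class="page">`, which is what Tika's PDF parser emits in the versions I have looked at; other versions or formats may mark pages differently, so check the actual output first. `PageTextHandler` and `extractPages` are made-up names, and the sample input is hand-written XHTML fed through the JDK's own SAX parser, so nothing here calls Tika itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: treats every <div class="page"> as one page
// (an assumption about the markup) and collects the text inside it.
class PageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a page
    private int divDepth;          // <div> nesting depth within the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(qName)) {
            return;
        }
        if (current == null) {
            if ("page".equals(atts.getValue("class"))) {
                current = new StringBuilder();
                divDepth = 1;
            }
        } else {
            divDepth++; // nested <div> inside a page
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return;
        }
        if ("p".equals(qName)) {
            current.append('\n'); // keep paragraphs on separate lines
        } else if ("div".equals(qName) && --divDepth == 0) {
            pages.add(current.toString().trim());
            current = null; // page closed
        }
    }

    List<String> getPages() {
        return pages;
    }

    // Parses an XHTML string with the JDK SAX parser and returns one string per page.
    static List<String> extractPages(String xhtml) {
        try {
            PageTextHandler handler = new PageTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getPages();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written stand-in for the XHTML a parser might emit
        String sample = "<html><body>"
                + "<div class=\"page\"><p>first page</p></div>"
                + "<div class=\"page\"><p>second page</p><p>more text</p></div>"
                + "</body></html>";
        List<String> pages = extractPages(sample);
        for (int i = 0; i < pages.size(); i++) {
            System.out.println("Page " + (i + 1) + ": " + pages.get(i));
        }
    }
}
```

With Tika you would skip `extractPages` and pass the handler straight to the parser, e.g. `parser.parse(stream, handler, metadata, context)`, then read `getPages()` afterwards. Matching on `qName` works both there and with the plain JDK parser used above.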