Problem extracting text with pdfbox.

807588 Jun 9 2009

I'm trying to search for some text in a PDF file using PDFBox. Once I find the text, I mark the page so I can split it out using the pdfsplitter. This has been working for a little over a year. However, last month we received a new batch of PDFs to parse that were generated with Ghostscript, and all of them failed to parse. While debugging, I noticed that the COSStrings I get out of these PDFs seem to contain invalid characters (). At first I thought this was a batch of bad PDFs, but I can open them just fine. I also tried using PDFTextStripper to retrieve the text, and it displayed the text just fine as well.

The code I am using is taken from one of the PDFBox examples (PDFBox-0.7.3\src\org\pdfbox\examples\pdmodel\ReplaceString.java). It is shown below; the only difference in my version is that, instead of replacing the string and saving the file, I search for the text and save the page number where I find it:

PDDocument doc = null;
try
{
    doc = PDDocument.load( inputFile );
    List pages = doc.getDocumentCatalog().getAllPages();
    for( int i=0; i<pages.size(); i++ )
    {
        PDPage page = (PDPage)pages.get( i );
        PDStream contents = page.getContents();
        PDFStreamParser parser = new PDFStreamParser(contents.getStream() );
        parser.parse();
        List tokens = parser.getTokens();
        for( int j=0; j<tokens.size(); j++ )
        {
            Object next = tokens.get( j );
            if( next instanceof PDFOperator )
            {
                PDFOperator op = (PDFOperator)next;
                //Tj and TJ are the two operators that display
                //strings in a PDF
                if( op.getOperation().equals( "Tj" ) )
                {
                    //Tj takes one operand and that is the string
                    //to display, so let's update that operand
                    COSString previous = (COSString)tokens.get( j-1 );
                    String string = previous.getString();
                    string = string.replaceFirst( strToFind, message );
                    previous.reset();
                    previous.append( string.getBytes() );
                }
                else if( op.getOperation().equals( "TJ" ) )
                {
                    //TJ takes an array whose string elements are displayed
                    COSArray previous = (COSArray)tokens.get( j-1 );
                    for( int k=0; k<previous.size(); k++ )
                    {
                        Object arrElement = previous.getObject( k );
                        if( arrElement instanceof COSString )
                        {
                            COSString cosString = (COSString)arrElement;
                            String string = cosString.getString();
                            string = string.replaceFirst( strToFind, message );
                            cosString.reset();
                            cosString.append( string.getBytes() );
                        }
                    }
                }
            }
        }
        //now that the tokens are updated we will replace the
        //page content stream.
        PDStream updatedStream = new PDStream(doc);
        OutputStream out = updatedStream.createOutputStream();
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens( tokens );
        page.setContents( updatedStream );
    }
    doc.save( outputFile );
}
finally
{
    if( doc != null )
    {
        doc.close();
    }
}
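
A minimal sketch of the search variation described above (recording the page number instead of replacing the string) might look like the following. It deliberately leaves PDFBox out so that it runs on its own: each inner list is a stand-in for the strings that COSString.getString() returns for the Tj/TJ operands of one page. The class name PageTextSearch and the method findPages are hypothetical and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: records which pages contain a search string.
class PageTextSearch {

    /**
     * pageStrings.get(i) holds the strings taken from the Tj/TJ operands of page i
     * (stand-ins for COSString.getString() results).
     * Returns the zero-based indexes of the pages where any string contains strToFind.
     */
    static List<Integer> findPages(List<List<String>> pageStrings, String strToFind) {
        List<Integer> matches = new ArrayList<Integer>();
        for (int i = 0; i < pageStrings.size(); i++) {
            for (String s : pageStrings.get(i)) {
                if (s.contains(strToFind)) {
                    matches.add(i);   // mark this page for splitting
                    break;            // one hit per page is enough
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample data: three pages, only the second mentions "Account".
        List<List<String>> pages = Arrays.asList(
            Arrays.asList("Cover sheet", "Summary"),
            Arrays.asList("Account 12345", "Balance due"),
            Arrays.asList("Appendix"));
        System.out.println(findPages(pages, "Account")); // prints [1]
    }
}
```

Note that this kind of search only matches text that sits entirely within a single operand string, and only if that string already holds readable characters.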

If anyone knows why this code does not extract the text the way PDFTextStripper does, or how I can modify it, I would greatly appreciate the help.

Thanks.

Locked on Jul 7 2009

Added on Jun 9 2009