Issue
- I created a subclass of [org.apache.tika.parser.microsoft.ExcelExtractor], shown below, to extract the embedded documents of an Excel (*.xls) file.
public class CustomExcelExtractor extends ExcelExtractor {
    private CustomAbstractPOIFSExtractor poi;

    public CustomExcelExtractor(ParseContext context, Metadata metadata, Path outputDir) {
        super(context, metadata);
        poi = new CustomAbstractPOIFSExtractor();
    }

    @Override
    public void parse(
            DirectoryNode root, XHTMLContentHandler xhtml,
            Locale locale) throws IOException, SAXException, TikaException {
        // Extract embedded documents
        for (Entry entry : root) {
            if (entry.getName().startsWith("MBD")
                    && entry instanceof DirectoryEntry) {
                try {
                    poi.extractEmbeddedOfficeDoc((DirectoryEntry) entry, null,
                            xhtml, embeddedCnt);
                } catch (TikaException e) {
                    // ignore parse errors from embedded documents
                }
            }
        }
    }

    private class CustomAbstractPOIFSExtractor {
        private TikaConfig config = TikaConfig.getDefaultConfig();

        /**
         * Handle an office document that's embedded at the POIFS level
         */
        protected void extractEmbeddedOfficeDoc(
                DirectoryEntry dir, String resourceName,
                XHTMLContentHandler xhtml, int embeddedCnt)
                throws IOException, SAXException, TikaException {
            if (dir.hasEntry("Package")) {
                return;
            }
            // It's regular OLE2:
            POIFSDocumentType type = POIFSDocumentType.detectType(dir);
            try {
                if (type == POIFSDocumentType.WORDDOCUMENT) {
                    FileOutputStream fos = new FileOutputStream(new File("test.doc"));
                    HWPFDocument document = new HWPFDocument((DirectoryNode) dir);
                    document.write(fos);
                    document.close();
                } else if (type == POIFSDocumentType.POWERPOINT) {
                    FileOutputStream fos = new FileOutputStream(new File("test.ppt"));
                    HSLFSlideShowImpl document = new HSLFSlideShowImpl((DirectoryNode) dir);
                    document.write(fos);
                    document.close(); // After this call, the remaining embedded documents can no longer be extracted
                }
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }
}
- The [extractEmbeddedOfficeDoc] method writes each embedded document to a file.
After I call [HSLFSlideShowImpl.close()] on the embedded PowerPoint document (the one written to "test.ppt"), the loop cannot continue extracting the other embedded documents. The following exception is thrown:
java.lang.IndexOutOfBoundsException: Block 1079 not found
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:486)
at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:169)
at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:142)
at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully(NDocumentInputStream.java:257)
at org.apache.poi.poifs.filesystem.NDocumentInputStream.readUShort(NDocumentInputStream.java:305)
at org.apache.poi.poifs.filesystem.DocumentInputStream.readUShort(DocumentInputStream.java:182)
at org.apache.poi.hssf.record.RecordInputStream$SimpleHeaderInput.readRecordSID(RecordInputStream.java:115)
at org.apache.poi.hssf.record.RecordInputStream.readNextSid(RecordInputStream.java:198)
at org.apache.poi.hssf.record.RecordInputStream.<init>(RecordInputStream.java:132)
at org.apache.poi.hssf.record.RecordInputStream.<init>(RecordInputStream.java:120)
at org.apache.poi.hssf.record.RecordFactoryInputStream.<init>(RecordFactoryInputStream.java:184)
at org.apache.poi.hssf.record.RecordFactory.createRecords(RecordFactory.java:491)
at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:348)
at extractor.CustomExcelExtractor$CustomAbstractPOIFSExtractor.extractEmbeddedOfficeDoc(CustomExcelExtractor.java:324)
at extractor.CustomExcelExtractor.parse(CustomExcelExtractor.java:119)
at extractor.CustomOfficeParser.parse(CustomOfficeParser.java:76)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
at extractor.ExtractEmbeddedByTika.extract(ExtractEmbeddedByTika.java:37)
at main.ExcelEmbeddedtExtractor.main(ExcelEmbeddedtExtractor.java:61)
Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 552960 in stream of length -1
at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read(ByteArrayBackedDataSource.java:42)
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:484)
... 18 more
Additional Information
- The exception does NOT occur after I call [HWPFDocument.close()].
I would like to fix this issue, but I don't know its root cause. Please help me!
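- A possible reading of the trace, which I have not confirmed: the message "stream of length -1" suggests that the data backing the whole .xls file had already been released when the next embedded entry was read, perhaps because [HSLFSlideShowImpl.close()] also closes the file system shared by all entries. The sketch below only models that suspected mechanism. SharedStore, EmbeddedDoc and SharedCloseSketch are hypothetical names made up for illustration; they are not POI or Tika classes.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the data shared by all embedded entries (not a POI class). */
final class SharedStore {
    private byte[] buffer;
    private long size;

    SharedStore(byte[] data) {
        this.buffer = data;
        this.size = data.length;
    }

    byte[] read(int offset, int length) {
        if (offset + length > size) {
            throw new IndexOutOfBoundsException("Unable to read " + length
                    + " bytes from " + offset + " in stream of length " + size);
        }
        return Arrays.copyOfRange(buffer, offset, offset + length);
    }

    /** Closing drops the data, so every later read fails. */
    void close() {
        buffer = null;
        size = -1;
    }
}

/** Hypothetical embedded document that reads its bytes from the shared store. */
final class EmbeddedDoc {
    private final SharedStore store;
    private final int offset;
    private final int length;
    private final boolean closesStore;

    EmbeddedDoc(SharedStore store, int offset, int length, boolean closesStore) {
        this.store = store;
        this.offset = offset;
        this.length = length;
        this.closesStore = closesStore;
    }

    byte[] contents() {
        return store.read(offset, length);
    }

    /** Assumption under test: some document types also close the shared store. */
    void close() {
        if (closesStore) {
            store.close();
        }
    }
}

final class SharedCloseSketch {
    /** Reads and closes each entry in turn; returns how many were read before a failure. */
    static int extractAll(SharedStore store, boolean[] closesStore) {
        int extracted = 0;
        for (int i = 0; i < closesStore.length; i++) {
            EmbeddedDoc doc = new EmbeddedDoc(store, i * 4, 4, closesStore[i]);
            try {
                doc.contents();
                extracted++;
            } catch (IndexOutOfBoundsException e) {
                break; // the shared store was closed by an earlier entry
            } finally {
                doc.close();
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        // The second entry closes the shared store, so the third cannot be read: prints 2
        System.out.println(extractAll(new SharedStore(new byte[12]),
                new boolean[] {false, true, false}));
        // No entry closes the shared store, so all three are read: prints 3
        System.out.println(extractAll(new SharedStore(new byte[12]),
                new boolean[] {false, false, false}));
    }
}
```

If this is what happens, closing only the FileOutputStream that my code opened, and leaving the embedded document object unclosed until the whole file has been processed, might avoid the error. That is an untested guess.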
Thanks in advance.