Java EE (Java Enterprise Edition) General Discussion

Apache POI: java.lang.IndexOutOfBoundsException occurs after closing the FileSystem

Hieu Tran, Jul 9 2017 (edited Jul 24 2017)

Issue

- I created a subclass of [org.apache.tika.parser.microsoft.ExcelExtractor], shown below, to extract the documents embedded in an Excel (*.xls) file.

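import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Path;
import java.util.Locale;

import org.apache.poi.hslf.usermodel.HSLFSlideShowImpl;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.DirectoryEntry;
import org.apache.poi.poifs.filesystem.DirectoryNode;
import org.apache.poi.poifs.filesystem.Entry;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ExcelExtractor;
import org.apache.tika.parser.microsoft.OfficeParser.POIFSDocumentType;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.SAXException;

public class CustomExcelExtractor extends ExcelExtractor {
    private CustomAbstractPOIFSExtractor poi;

    public CustomExcelExtractor(ParseContext context, Metadata metadata, Path outputDir) {
        super(context, metadata);
        poi = new CustomAbstractPOIFSExtractor();
    }

    @Override
    public void parse(
            DirectoryNode root, XHTMLContentHandler xhtml,
            Locale locale) throws IOException, SAXException, TikaException {
        // Counter passed through to the helper; not otherwise used in this excerpt
        int embeddedCnt = 0;

        // Extract embedded documents (OLE2 entries whose names start with "MBD")
        for (Entry entry : root) {
            if (entry.getName().startsWith("MBD")
                    && entry instanceof DirectoryEntry) {
                try {
                    poi.extractEmbeddedOfficeDoc((DirectoryEntry) entry, null,
                            xhtml, embeddedCnt);
                } catch (TikaException e) {
                    // ignore parse errors from embedded documents
                }
            }
        }
    }

    private class CustomAbstractPOIFSExtractor {
        private TikaConfig config = TikaConfig.getDefaultConfig();

        /**
         * Handle an office document that's embedded at the POIFS level
         */
        protected void extractEmbeddedOfficeDoc(
                DirectoryEntry dir, String resourceName,
                XHTMLContentHandler xhtml, int embeddedCnt)
                throws IOException, SAXException, TikaException {
            // Skip OOXML packages; only regular OLE2 documents are handled here
            if (dir.hasEntry("Package")) {
                return;
            }

            // It's regular OLE2:
            POIFSDocumentType type = POIFSDocumentType.detectType(dir);

            try {
                if (type == POIFSDocumentType.WORDDOCUMENT) {
                    // try-with-resources closes the output file even if write() fails
                    try (FileOutputStream fos = new FileOutputStream(new File("test.doc"))) {
                        HWPFDocument document = new HWPFDocument((DirectoryNode) dir);
                        document.write(fos);
                        document.close();
                    }
                } else if (type == POIFSDocumentType.POWERPOINT) {
                    try (FileOutputStream fos = new FileOutputStream(new File("test.ppt"))) {
                        HSLFSlideShowImpl document = new HSLFSlideShowImpl((DirectoryNode) dir);
                        document.write(fos);
                        document.close(); // After this call, extracting further embedded documents fails
                    }
                }
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }
}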

- The [extractEmbeddedOfficeDoc] method writes each embedded document to a file.

After I call [HSLFSlideShowImpl.close()] on the document written to "test.ppt", the loop can no longer extract the remaining embedded documents. The next extraction throws the following exception:

java.lang.IndexOutOfBoundsException: Block 1079 not found
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:486)
    at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:169)
    at org.apache.poi.poifs.filesystem.NPOIFSStream$StreamBlockByteBufferIterator.next(NPOIFSStream.java:142)
    at org.apache.poi.poifs.filesystem.NDocumentInputStream.readFully(NDocumentInputStream.java:257)
    at org.apache.poi.poifs.filesystem.NDocumentInputStream.readUShort(NDocumentInputStream.java:305)
    at org.apache.poi.poifs.filesystem.DocumentInputStream.readUShort(DocumentInputStream.java:182)
    at org.apache.poi.hssf.record.RecordInputStream$SimpleHeaderInput.readRecordSID(RecordInputStream.java:115)
    at org.apache.poi.hssf.record.RecordInputStream.readNextSid(RecordInputStream.java:198)
    at org.apache.poi.hssf.record.RecordInputStream.<init>(RecordInputStream.java:132)
    at org.apache.poi.hssf.record.RecordInputStream.<init>(RecordInputStream.java:120)
    at org.apache.poi.hssf.record.RecordFactoryInputStream.<init>(RecordFactoryInputStream.java:184)
    at org.apache.poi.hssf.record.RecordFactory.createRecords(RecordFactory.java:491)
    at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:348)
    at extractor.CustomExcelExtractor$CustomAbstractPOIFSExtractor.extractEmbeddedOfficeDoc(CustomExcelExtractor.java:324)
    at extractor.CustomExcelExtractor.parse(CustomExcelExtractor.java:119)
    at extractor.CustomOfficeParser.parse(CustomOfficeParser.java:76)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
    at extractor.ExtractEmbeddedByTika.extract(ExtractEmbeddedByTika.java:37)
    at main.ExcelEmbeddedtExtractor.main(ExcelEmbeddedtExtractor.java:61)
Caused by: java.lang.IndexOutOfBoundsException: Unable to read 512 bytes from 552960 in stream of length -1
    at org.apache.poi.poifs.nio.ByteArrayBackedDataSource.read(ByteArrayBackedDataSource.java:42)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.getBlockAt(NPOIFSFileSystem.java:484)
    ... 18 more
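
A possible reading of this trace, offered as an assumption rather than a confirmed diagnosis: the "stream of length -1" in the cause suggests that the backing data source had already been closed when the next embedded document was read. That would happen if close() on a document opened from a DirectoryNode also closes the file system it shares with the parent workbook. The sketch below models only that ownership pattern in plain Java. SharedStore, EmbeddedDoc and extractAll are hypothetical names, not POI or Tika classes, and the sketch does not explain why HWPFDocument.close() behaved differently.

```java
import java.io.Closeable;

public class SharedStoreDemo {

    /** Hypothetical stand-in for the container file system backing all embedded documents. */
    static final class SharedStore implements Closeable {
        private byte[] data;

        SharedStore(byte[] data) {
            this.data = data;
        }

        byte read(int index) {
            if (data == null) {
                // Mimics reading from a data source that has already been closed
                throw new IndexOutOfBoundsException(
                        "Unable to read from " + index + " in stream of length -1");
            }
            return data[index];
        }

        @Override
        public void close() {
            data = null;
        }
    }

    /** Hypothetical stand-in for a document opened on one entry of the shared store. */
    static final class EmbeddedDoc implements Closeable {
        private final SharedStore store;

        EmbeddedDoc(SharedStore store, int offset) {
            this.store = store;
            store.read(offset); // the constructor reads eagerly, like a parser would
        }

        @Override
        public void close() {
            store.close(); // closes the shared store, not only this document
        }
    }

    /** Opens each embedded document in turn and returns how many opened before a failure. */
    public static int extractAll(int count, boolean closeEach) {
        SharedStore store = new SharedStore(new byte[count]);
        int extracted = 0;
        for (int i = 0; i < count; i++) {
            try {
                EmbeddedDoc doc = new EmbeddedDoc(store, i);
                extracted++;
                if (closeEach) {
                    doc.close();
                }
            } catch (IndexOutOfBoundsException e) {
                break; // the store was closed by an earlier document
            }
        }
        store.close(); // the owner closes the store once, after the loop
        return extracted;
    }

    public static void main(String[] args) {
        System.out.println(extractAll(3, true));  // prints 1
        System.out.println(extractAll(3, false)); // prints 3
    }
}
```

If this is what happens here, leaving the embedded documents unclosed and letting the code that opened the outer file close it once, after the loop, would avoid the failure.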

Additional Information

- The exception does NOT occur after I call the [HWPFDocument.close()] method.

I would like to fix this, but I don't know the root cause. Please help me!

Thanks in advance.
