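java.io.IOException: RandomAccessBuffer already closed when reading a PDPage

Tags: java runtime-error pdfbox random-access

I wrote a program that converts a PDF to Excel. The conversion takes a long time (100 pages = 10 minutes). It runs fine for about 15-20 minutes, and then the error occurs while reading a PDPage.

Is it possible that the Java GC "cleans up" variables before the program has finished?

Code: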

private class Search_Text implements Runnable {

    private int x, y, width, height;
    private PDPage pdPage;
    private Object lock;
    private ArrayList<Object[]> result;
    private PDFTextStripperByArea strip;

    public Search_Text(int x, int y, int width, int height, PDPage pdPage, Object lock) throws IOException {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
        this.pdPage = pdPage;
        this.lock = lock;
        this.result = new ArrayList<>();
        this.strip = new PDFTextStripperByArea();
    }

    @Override
    public void run() {

        if (height < 10) {
            int upper = y;
            int bottom = 1;
            ArrayList<Object[]> st = new ArrayList<>();
            String str = "";
            while (upper + bottom <= y + height) {
                strip.addRegion("cell", new Rectangle(x, upper, width, bottom));
                //System.out.println("prova.Pdf2Excell.log_extract()BEFORE LOCK" + init);
                synchronized (lock) {
                    try {
                        strip.extractRegions(pdPage);
                    } catch (IOException ex) {
                        Logger.getLogger(Pdf2Excell.class.getName()).log(Level.SEVERE, null, ex);
                    }
                }
                str = strip.getTextForRegion("cell");
                if (!emptyString(str)) {

                    st.add(new Object[]{str, upper + bottom, upper});
                    upper += bottom;
                    bottom = 1;

                    while (upper + bottom < height + y && !emptyString(str)) {
                        strip.addRegion("cell", new Rectangle(x, upper, width, bottom));

                        synchronized (lock) {
                            try {
                                strip.extractRegions(pdPage);
                            } catch (IOException ex) {
                                Logger.getLogger(Pdf2Excell.class.getName()).log(Level.SEVERE, null, ex);
                            }
                        }
                        str = strip.getTextForRegion("cell");
                        upper++;
                        //System.out.println("prova.Pdf2Excell.pdf2EX()DENTRO");
                    }
                } else {
                    bottom += 1;
                    //System.out.println("prova.Pdf2Excell.pdf2EX()UPPER;;"+upper+";;BOTTOM;;" + bottom);
                }
                if (upper == y) {
                    st.add(new Object[]{"", y + height, upper});
                }
                result = st;
            }
        } else {
            try {
                int half_rec = height / 2;
                Rectangle first_rec = new Rectangle(x, y, width, half_rec);
                Rectangle last_rec = new Rectangle(x, y + half_rec, width, height - half_rec);

                Search_Text first_search = new Search_Text(x, y, width, half_rec, pdPage, lock);
                Search_Text last_search = new Search_Text(x, y + half_rec, width, height - half_rec, pdPage, lock);

                Thread first = new Thread(first_search);
                Thread last = new Thread(last_search);

                strip.addRegion("cell", first_rec);
                synchronized (lock) {

                    strip.extractRegions(pdPage);

                }
                String temp = strip.getTextForRegion("cell");
                if (!emptyString(temp)) {
                    first.start();
                }

                strip.addRegion("cell", last_rec);
                synchronized (lock) {
                    strip.extractRegions(pdPage);
                }
                temp = strip.getTextForRegion("cell");
                if (!emptyString(temp)) {
                    last.start();
                }
                first.join();
                last.join();
                result = first_search.getResult();
                ArrayList<Object[]> temp_res = last_search.getResult();
                for (int i = 0; i < temp_res.size(); i++) {
                    result.add(temp_res.get(i));
                }
            } catch (InterruptedException | IOException ex) {
                Logger.getLogger(Pdf2Excell.class.getName()).log(Level.SEVERE, null, ex);

            }

        }

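    }

    // getResult() is called above but was not included in the post;
    // it is assumed here to simply return the collected result.
    public ArrayList<Object[]> getResult() {
        return result;
    }
}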

This is the error message:

Exception in thread "Thread-214418" java.lang.RuntimeException: java.io.IOException: RandomAccessBuffer already closed
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:198)
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:205)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:255)
    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:458)
    at org.apache.pdfbox.util.PDFTextStripperByArea.extractRegions(PDFTextStripperByArea.java:153)
    at prova.Pdf2Excell$Search_Text.run(Pdf2Excell.java:954)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: RandomAccessBuffer already closed
    at org.apache.pdfbox.io.RandomAccessBuffer.checkClosed(RandomAccessBuffer.java:325)
    at org.apache.pdfbox.io.RandomAccessBuffer.seek(RandomAccessBuffer.java:105)
    at org.apache.pdfbox.io.RandomAccessFileInputStream.read(RandomAccessFileInputStream.java:96)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at java.io.PushbackInputStream.read(PushbackInputStream.java:139)
    at org.apache.pdfbox.io.PushBackInputStream.read(PushBackInputStream.java:90)
    at org.apache.pdfbox.io.PushBackInputStream.peek(PushBackInputStream.java:68)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.hasNextSpaceOrReturn(PDFStreamParser.java:560)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:408)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:374)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:49)
    at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:193)
    ... 8 more

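Best answer

PDFBox was developed for single-threaded use per document, whereas the OP accesses the same document from multiple threads. This may still work (it is a read-only use case), but proper synchronization is necessary.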
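One way to respect the one-thread-per-document constraint is to funnel all work on a document through a single worker thread and to close the document only after that work has finished. The following is a minimal sketch of that idea, not the OP's code and not PDFBox API: StubDocument, readPage and SingleThreadedAccess are hypothetical names standing in for the real document and page access.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a non-thread-safe document; not a PDFBox class.
final class StubDocument implements AutoCloseable {
    private boolean closed;

    String readPage(int index) throws IOException {
        if (closed) {
            throw new IOException("already closed");
        }
        return "page-" + index;
    }

    @Override
    public void close() {
        closed = true;
    }
}

final class SingleThreadedAccess {
    // Reads all pages on one worker thread and closes the document
    // only after every submitted task has completed.
    static List<String> readAll(int pageCount)
            throws InterruptedException, ExecutionException {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try (StubDocument doc = new StubDocument()) {
            List<Future<String>> pending = new ArrayList<>();
            for (int i = 0; i < pageCount; i++) {
                final int page = i;
                // Every access to doc happens on the single worker thread.
                pending.add(worker.submit(() -> doc.readPage(page)));
            }
            List<String> pages = new ArrayList<>();
            for (Future<String> f : pending) {
                pages.add(f.get()); // blocks until that task has finished
            }
            return pages;
        } finally {
            worker.shutdown();
        }
    }
}
```

Because the results are collected before the try-with-resources block ends, no task can touch the document after close() has run. Other documents can still be processed in parallel, each on its own worker.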

Such synchronization would most likely slow everything down even further. The solution was therefore to use a different architecture altogether, namely to

"take PDFTextStripper, override writeString(String text, List<TextPosition> textPositions), and collect the required information from that List<TextPosition> textPositions. TextPosition contains information on a small piece of text (usually a single letter, I think), including its position."

This turned out to be

"like 4 times faster."
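The gain comes from parsing each page once and sorting the collected positions into cells afterwards, instead of re-running the extraction for every candidate rectangle. The sketch below shows only that second step and makes some assumptions: Glyph is a hypothetical stand-in for the text and coordinates that a TextPosition carries, and RegionCollector is an illustrative name, not a PDFBox class. In a real PDFTextStripper subclass, the overridden writeString would pass each element of textPositions to accept.

```java
import java.awt.Rectangle;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the per-glyph data (text plus position)
// that PDFBox hands to writeString; not a PDFBox class.
final class Glyph {
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Collects glyphs once, then assigns each to the first named region containing it.
final class RegionCollector {
    private final Map<String, Rectangle> regions = new LinkedHashMap<>();
    private final List<Glyph> glyphs = new ArrayList<>();

    void addRegion(String name, Rectangle area) {
        regions.put(name, area);
    }

    // An overridden writeString would call this for every text position.
    void accept(Glyph glyph) {
        glyphs.add(glyph);
    }

    // One pass over the collected glyphs; the page is never parsed again.
    Map<String, String> textByRegion() {
        Map<String, StringBuilder> buffers = new LinkedHashMap<>();
        for (String name : regions.keySet()) {
            buffers.put(name, new StringBuilder());
        }
        for (Glyph g : glyphs) {
            for (Map.Entry<String, Rectangle> e : regions.entrySet()) {
                if (e.getValue().contains(g.x, g.y)) {
                    buffers.get(e.getKey()).append(g.text);
                    break; // glyphs outside every region are simply skipped
                }
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : buffers.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }
}
```

With this shape, the cell rectangles can be refined or changed without touching the PDF again, since all positions are already in memory.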

Source: the Stack Overflow question "java.io.IOException: RandomAccessBuffer already closed when reading a PDPage": https://stackoverflow.com/questions/35012948/
