java - pdfbox ClassCastException

Tags: java pdf pdfbox apache-tika

I want to read the text in the following PDF file. I am using PDFBox version 1.8.8, and I get the following error.

2014-12-18 15:02:59 WARN  XrefTrailerResolver:203 - Did not found XRef object at specified startxref position 4268142
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
2014-12-18 15:03:00 ERROR PDPageNode:202 - No Kids found in getAllKids(). Probably a malformed pdf.
java.lang.ClassCastException: org.apache.pdfbox.cos.COSDictionary cannot be cast to org.apache.pdfbox.cos.COSStream
    at org.apache.pdfbox.pdmodel.common.COSStreamArray.<init>(COSStreamArray.java:68)
    at org.apache.pdfbox.pdmodel.common.PDStream.createFromCOS(PDStream.java:185)
    at org.apache.pdfbox.pdmodel.PDPage.getContents(PDPage.java:639)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:380)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:275)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:288)
    at com.algotree.pdf.test.PdfBoxTest.pdftoText(PdfBoxTest.java:53)
    at com.algotree.pdf.test.PdfBoxTest.main(PdfBoxTest.java:71)

Yes, I have seen many posts about this error, but I still cannot find a way to read this file. Thanks.

file.pdf

Here is my code:

static String pdftoText(String fileName) throws IOException {
        PDFParser parser;
        String parsedText = null;
        PDFTextStripper pdfStripper = new PDFTextStripper();
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        File file = new File(fileName);
        if (!file.isFile()) {
            System.err.println("File " + fileName + " does not exist.");
            return null;
        }
        try {

            parser = new PDFParser(new FileInputStream(file));
        } catch (IOException e) {
            System.err.println("Unable to open PDF Parser. " + e.getMessage());
            return null;
        }
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdfStripper.setSuppressDuplicateOverlappingText(false);
            pdDoc = new PDDocument(cosDoc);
            int endPage=pdDoc.getPageCount();
            if(endPage>300)
                endPage=300;
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(endPage);
            parsedText = pdfStripper.getText(cosDoc);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (cosDoc != null)
                    cosDoc.close();
                if (pdDoc != null)
                    pdDoc.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return parsedText;
    }

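Best answer

This works. It loads the document with PDDocument.loadNonSeq (the non-sequential parser in PDFBox 1.8.x) instead of driving PDFParser directly: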

// Imports for PDFBox 1.8.x
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

static String pdftoText(String fileName) throws IOException {
    String parsedText = null;
    PDFTextStripper pdfStripper = new PDFTextStripper();
    PDDocument pdDoc = null;
    File file = new File(fileName);
    if (!file.isFile()) {
        System.err.println("File " + fileName + " does not exist.");
        return null;
    }
    try {
        // Non-sequential parser; no scratch file is passed (second argument is null)
        pdDoc = PDDocument.loadNonSeq(file, null);
    } catch (IOException e) {
        System.err.println("Unable to open PDF Parser. " + e.getMessage());
        return null;
    }
    try {
        // Extract text from at most the first 300 pages
        int endPage = Math.min(pdDoc.getPageCount(), 300);
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(endPage);
        parsedText = pdfStripper.getText(pdDoc);
        System.out.println(parsedText);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            // Closing the PDDocument also releases the underlying COSDocument
            if (pdDoc != null)
                pdDoc.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    return parsedText;
}
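
The "Did not found XRef object at specified startxref position 4268142" warning in the question suggests that the offset stored at the end of the file does not point at a cross-reference section. If you want to check that yourself before blaming the parser, here is a rough sketch using only the JDK. The class and method names (StartXrefCheck, findStartXref, pointsAtXref) are made up for illustration and are not part of PDFBox, and it assumes the file is small enough to read into memory. It only diagnoses the file; it does not repair it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not a PDFBox class: inspects the startxref pointer of a PDF.
final class StartXrefCheck {

    /** Returns the number written after the last "startxref" keyword, or -1 if none is found. */
    public static long findStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string indices equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        // Skip the whitespace / end-of-line between the keyword and the offset.
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        return i > start ? Long.parseLong(text.substring(start, i)) : -1;
    }

    /** True if the bytes at offset look like an xref table or an indirect object (xref stream). */
    public static boolean pointsAtXref(byte[] pdf, long offset) {
        if (offset < 0 || offset >= pdf.length) {
            return false; // the pointer lies outside the file
        }
        int from = (int) offset;
        int len = Math.min(32, pdf.length - from);
        String head = new String(pdf, from, len, StandardCharsets.ISO_8859_1);
        return head.startsWith("xref") || head.matches("(?s)\\d+\\s+\\d+\\s+obj.*");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StartXrefCheck <file.pdf>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        long offset = findStartXref(pdf);
        System.out.println("startxref offset: " + offset);
        System.out.println("points at xref data: " + pointsAtXref(pdf, offset));
    }
}
```

If the second line prints false, the file's cross-reference pointer is wrong, which would fit the warning in the log.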

Regarding java - pdfbox ClassCastException, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/27542342/