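java - Comparing the text of two PDF files with PDFBox fails even though both files have the same text

I am using PDFBox as a utility in my Selenium automation export tests. We use PDFBox to compare the actual exported PDF file with the expected one and pass or fail the test accordingly. This has worked very smoothly. Recently, however, I came across an actual exported file that looks the same as the expected one (as far as the data is concerned), yet the comparison with PDFBox fails.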

Expected pdf file

Actual pdf file

Below is the generic utility I use to compare PDF files:

    // Assumed imports (PDFBox 2.x, Apache Commons Lang 3, TestNG); LOG is the enclosing class's logger.
    import java.io.File;
    import java.io.IOException;
    import org.apache.commons.lang3.StringUtils;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.testng.TestException;

    private static void arePDFFilesEqual(File pdfFile1, File pdfFile2) throws IOException
    {
        LOG.info("Comparing PDF files ("+pdfFile1+","+pdfFile2+")");
        // try-with-resources closes both documents, even if loading the second one fails
        try (PDDocument pdf1 = PDDocument.load(pdfFile1);
             PDDocument pdf2 = PDDocument.load(pdfFile2))
        {
            int pageCount1 = pdf1.getNumberOfPages();
            int pageCount2 = pdf2.getNumberOfPages();
            if (pageCount1 != pageCount2)
            {
                String message = "Number of pages in the files ("+pdfFile1+","+pdfFile2+") do not match. pdfFile1 has "+pageCount1+" pages, while pdfFile2 has "+pageCount2+" pages";
                LOG.debug(message);
                throw new TestException(message);
            }
            PDFTextStripper pdfStripper = new PDFTextStripper();
            LOG.debug("Page count is :- " + pageCount1);
            // Compare the extracted text page by page (PDFBox page numbers are 1-based)
            for (int i = 0; i < pageCount1; i++)
            {
                pdfStripper.setStartPage(i + 1);
                pdfStripper.setEndPage(i + 1);
                String pdf1PageText = pdfStripper.getText(pdf1);
                String pdf2PageText = pdfStripper.getText(pdf2);
                if (!pdf1PageText.equals(pdf2PageText))
                {
                    String message = "Contents of the files ("+pdfFile1+","+pdfFile2+") do not match on Page no: " + (i + 1)+" pdf1PageText is : "+pdf1PageText+" , while pdf2PageText is : "+pdf2PageText;
                    LOG.debug(message);
                    // StringUtils.difference returns the remainder of the second string from the first mismatch on
                    String difference = StringUtils.difference(pdf1PageText, pdf2PageText);
                    LOG.debug("difference is "+difference);
                    throw new TestException(message+" [[ Difference is ]] "+difference);
                }
            }
            LOG.info("PDF files ("+pdfFile1+","+pdfFile2+") match");
        }
    }

Eclipse shows this difference in the console:

https://s3.amazonaws.com/uploads.hipchat.com/95223/845692/9Ex0QW2fFeRqu8s/upload.png

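I can see that it fails because of symbols such as curly braces {}, the hash sign #, and the exclamation mark !, but I don't know how to fix this.

Can anyone tell me how to fix this?

Best answer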

However recently I came across actual exported file , which looks as same as expected one (as far as data is concerned) , however when comparing it with pdfbox , it is failing

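This can happen, and it should not surprise you. After all, your test does not compare the appearance of the pages in question; it compares the results of text extraction.

The appearance of text on a page depends on the drawing instructions for the relevant glyphs in the respective font file (embedded, in the case of your files). The result of text extraction for that same text, by contrast, depends on the ToUnicode table or the Encoding value in the PDF font information structure for that font.

In fact, while the text in the expected and the actual document uses the same glyphs from the respective fonts, the ToUnicode tables of one font in the expected and the actual document claim that certain glyphs represent different Unicode code points.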
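Because such characters often render identically (or not at all) in a console, it can help to report the mismatch as code points rather than as glyphs. The following is a minimal sketch of that idea, independent of PDFBox; the class name `CodePointDiff` and the method `firstDifference` are hypothetical names chosen for illustration.

```java
public class CodePointDiff {

    // Describes the first position at which the two strings differ,
    // using U+XXXX notation, or returns null if they are equal.
    static String firstDifference(String a, String b) {
        int[] ca = a.codePoints().toArray();
        int[] cb = b.codePoints().toArray();
        int n = Math.min(ca.length, cb.length);
        for (int i = 0; i < n; i++) {
            if (ca[i] != cb[i]) {
                return String.format("index %d: U+%04X vs U+%04X", i, ca[i], cb[i]);
            }
        }
        if (ca.length != cb.length) {
            return String.format("length differs: %d vs %d code points", ca.length, cb.length);
        }
        return null;
    }

    public static void main(String[] args) {
        // Two strings that differ only in one private-use character
        System.out.println(firstDifference("a\uF125", "a\uF126"));
    }
}
```

Passing the two extracted page texts to such a helper would put the differing code points directly into the test's failure message.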

The font in question has three glyphs.

The ToUnicode map of that font in the expected document contains the mappings

<0000> <0000> <0000>
<0001> <0002> [<F125> <F128> ]

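which claim that these three glyphs correspond to U+0000, U+F125, and U+F128, respectively.

The ToUnicode map of that font in the actual document contains the mappings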

<0000> <0000> <0000>
<0001> <0002> [<F126> <F129> ]

which claim that these three glyphs correspond to U+0000, U+F126, and U+F129, respectively.
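To make the effect of the two mappings concrete, here is a small self-contained sketch. It does not use PDFBox and is not how PDFBox implements the lookup; the class `ToUnicodeDemo` and its helpers `bfRange` and `extract` are hypothetical names invented for illustration, and the model is simplified to the array form of the range mapping quoted above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ToUnicodeDemo {

    // Models a range mapping of the form <first> <last> [<u0> <u1> ...]:
    // character code first+i maps to the i-th listed Unicode value.
    static Map<Integer, Integer> bfRange(int first, int... targets) {
        Map<Integer, Integer> map = new LinkedHashMap<>();
        for (int i = 0; i < targets.length; i++) {
            map.put(first + i, targets[i]);
        }
        return map;
    }

    // Decodes character codes to text via the given map;
    // unmapped codes become U+FFFD (an assumption of this sketch).
    static String extract(int[] codes, Map<Integer, Integer> toUnicode) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.appendCodePoint(toUnicode.getOrDefault(code, 0xFFFD));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> expected = bfRange(0x0001, 0xF125, 0xF128);
        Map<Integer, Integer> actual = bfRange(0x0001, 0xF126, 0xF129);
        int[] codes = {0x0001, 0x0002}; // identical glyph codes in both documents

        // Same codes, different maps: the extracted strings are not equal.
        System.out.println(extract(codes, expected).equals(extract(codes, actual))); // prints "false"
    }
}
```

The same character codes, and therefore the same drawn glyphs, yield different extracted strings as soon as the two maps disagree.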

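Your test therefore correctly found a difference between the expected and the actual document, so its failure is the correct result. You do not have to fix anything; the software that generated the actual document has an issue!

(One might argue that the differences lie within the Unicode Private Use Area and therefore do not matter. In that case you would have to update your test to ignore differences between characters from the Unicode Private Use Areas. But you should have been told that before you started writing the test.)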
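If ignoring Private Use Area differences is indeed the agreed requirement, one possible approach is to mask every private-use code point before comparing the extracted texts. The sketch below is only an illustration under that assumption; the class `PuaInsensitiveCompare` and its methods are hypothetical names, and replacing with U+FFFD is a choice of this sketch, not something the answer prescribes.

```java
public class PuaInsensitiveCompare {

    // True if the code point belongs to a Unicode Private Use Area
    // (general category Co).
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    // Replaces each private-use code point with U+FFFD so that texts which
    // differ only in WHICH private-use code points they contain compare equal.
    static String maskPrivateUse(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .forEach(cp -> sb.appendCodePoint(isPrivateUse(cp) ? 0xFFFD : cp));
        return sb.toString();
    }

    static boolean equalsIgnoringPrivateUse(String a, String b) {
        return maskPrivateUse(a).equals(maskPrivateUse(b));
    }

    public static void main(String[] args) {
        System.out.println(equalsIgnoringPrivateUse("a\uF125b", "a\uF126b"));
        System.out.println(equalsIgnoringPrivateUse("a\uF125b", "ab"));
    }
}
```

Masking rather than deleting the characters keeps their positions, so a private-use character that is missing entirely from one document is still reported as a difference.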

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/49256943/
