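java - PDFBox text extraction with bold/italic information does not work for some files

This program works fine for PDF files that I have created, but I have to get bold and italic information for Stedman's Dictionary.pdf, which appears to use some trick to hide this information. Any suggestions will be warmly welcomed.

Note: this is purely voluntary work, done to help some doctor friends.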

    package arspdfbox;

    import java.io.*;
    import org.apache.pdfbox.exceptions.InvalidPasswordException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDStream;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.pdfbox.util.TextPosition;
    import java.io.IOException;
    import java.util.List;

    public class PrintTextLocations extends PDFTextStripper {

        public PrintTextLocations() throws IOException {
            super.setSortByPosition(true);
        }

        public static void main(String[] args) throws Exception {

            PDDocument document = null;
            try {
                File input = new File("Stedman_Medical_Dictionary.pdf");
                //File input = new File("results/FontExample5.pdf");
                document = PDDocument.load(input);
                if (document.isEncrypted()) {
                    try {
                        document.decrypt("");
                    } catch (InvalidPasswordException e) {
                        System.err.println("Error: Document is encrypted with a password.");
                        System.exit(1);
                    }
                }
                PrintTextLocations printer = new PrintTextLocations();
                List allPages = document.getDocumentCatalog().getAllPages();
                //for (int i = 0; i < allPages.size(); i++) {
                for (int i = 99; i < 100; i++) {
                    PDPage page = (PDPage) allPages.get(i);
                    System.out.println("Processing page: " + i);
                    PDStream contents = page.getContents();
                    if (contents != null) {
                        printer.processStream(page, page.findResources(), page.getContents().getStream());
                    }
                }
            } finally {
                if (document != null) {
                    document.close();
                }
            }
        }

        /**
         * @param text The text to be processed
         */
        @Override /* this is questionable, not sure if needed... */
        protected void processTextPosition(TextPosition text)  {
            System.out.println("String[" + text.getXDirAdj() + ","
                    + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                    + text.getXScale() + " height=" + text.getHeightDir() + " space="
                    + text.getWidthOfSpace() + " width="
                    + text.getWidthDirAdj() + "]" + text.getCharacter());
            System.out.append(text.getCharacter() + " <--------------------------------");
            // Font name, followed by the style flags stored in the font descriptor.
            System.out.println(text.getFont().getBaseFont());
            System.out.println(" Italic=" + text.getFont().getFontDescriptor().isItalic());
            System.out.println(" Bold=" + text.getFont().getFontDescriptor().getFontWeight());
            System.out.println(" ItalicAngle=" + text.getFont().getFontDescriptor().getItalicAngle());
            System.out.println(" FixedPitch=" + text.getFont().getFontDescriptor().isFixedPitch());

        }

    }

Best answer

This program works OK for PDF files that I have created but I have to get bold and italic info for Stedman's Dictionary.pdf which appears to have a trick to hide this info.

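Your program works for Stedman's Dictionary, too: the text information on those dictionary-style pages in the PDF uses the same font for normal, bold, italic, and any other text. The styles appear only in the overlaid image, which is just that, an image, and not a source of information for text extraction.

Some details:

Looking, for example, into the content stream of the 132nd document page (numbered 110, chosen at random) shows that the entry for Bal'four's disease has the following source: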

    /F1 22 Tf
    BT
    1 0 0 1 61 2559 Tm
    (Bal'four's)Tj
    ET
    /F1 21.46 Tf
    BT
    1 0 0 1 210 2559 Tm
    (disease')Tj
    ET
    /F1 24.76 Tf
    BT
    1 0 0 1 327 2561 Tm
    ([George)Tj
    ET
    /F1 22.71 Tf
    BT
    1 0 0 1 444 2563 Tm
    (Williatn)Tj
    ET
    /F1 23.33 Tf
    BT
    1 0 0 1 565 2564 Tm
    (Balfour,)Tj
    ET
    /F1 24.76 Tf
    BT
    1 0 0 1 692 2566 Tm
    (English)Tj
    ET
    /F1 23 Tf
    BT
    1 0 0 1 94 2525 Tm
    (physician,)Tj
    ET
    /F1 24.09 Tf
    BT
    1 0 0 1 252 2526 Tm
    (1822-1903.])Tj
    ET
    /F1 25.93 Tf
    BT
    1 0 0 1 447 2530 Tm
    (Chloroma.)Tj
    ET

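That is, every word uses the same font (F1); there are no different styles, only different sizes:

"Bal'four's" at 22
"disease'" at 21.46
"[George" at 24.76
"Williatn" at 22.71
"Balfour," at 23.33
"English" at 24.76
"physician," at 23
"1822-1903.]" at 24.09
"Chloroma." at 25.93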
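A listing like the one above can be derived mechanically from the content stream. The following is only a minimal sketch that works on the raw operator text with a regular expression rather than through PDFBox; the class name `FontSizeListing` and its `Run` type are made up for illustration, and escaped parentheses, `TJ` arrays, and other operators are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FontSizeListing {

    /** One shown string together with the font name and size set by the preceding Tf. */
    public static final class Run {
        public final String font;
        public final double size;
        public final String text;

        Run(String font, double size, String text) {
            this.font = font;
            this.size = size;
            this.text = text;
        }
    }

    // Matches either "/Name size Tf" or "(text)Tj". Escaped parentheses inside
    // strings are not handled; the excerpt in the answer does not contain any.
    private static final Pattern OPERATOR = Pattern.compile(
            "/(\\S+)\\s+([0-9.]+)\\s+Tf|\\(([^)]*)\\)\\s*Tj");

    public static List<Run> parse(String contentStream) {
        List<Run> runs = new ArrayList<>();
        String font = null;
        double size = 0;
        Matcher m = OPERATOR.matcher(contentStream);
        while (m.find()) {
            if (m.group(1) != null) {
                // Tf operator: remember the current font name and size.
                font = m.group(1);
                size = Double.parseDouble(m.group(2));
            } else {
                // Tj operator: record the string with the current font state.
                runs.add(new Run(font, size, m.group(3)));
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // First three words of the excerpt from the answer.
        String excerpt = "/F1 22 Tf\nBT\n1 0 0 1 61 2559 Tm\n(Bal'four's)Tj\nET\n"
                + "/F1 21.46 Tf\nBT\n1 0 0 1 210 2559 Tm\n(disease')Tj\nET\n"
                + "/F1 24.76 Tf\nBT\n1 0 0 1 327 2561 Tm\n([George)Tj\nET\n";
        for (Run run : parse(excerpt)) {
            System.out.println(run.font + " " + run.size + " " + run.text);
        }
    }
}
```

Every run comes back with font `F1`, which is the point of the listing: the only thing that varies from word to word is the size.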

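(The coordinates are scaled by a factor of 0.23945 on the current page; thus, PDFBox will give you numbers scaled by that factor rather than the sizes listed here.)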
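To make that scaling concrete, here is a small arithmetic sketch. It assumes the factor 0.23945 applies uniformly to the font sizes on this page; the class name `ScaledSizes` is hypothetical, and the answer does not say which `TextPosition` getter reports the scaled value.

```java
public class ScaledSizes {

    // Scale factor quoted in the answer for this particular page.
    static final double PAGE_SCALE = 0.23945;

    /** Converts a size from the content stream into the scaled value. */
    public static double scaled(double contentStreamSize) {
        return contentStreamSize * PAGE_SCALE;
    }

    public static void main(String[] args) {
        double[] sizes = {22, 21.46, 24.76, 22.71, 23.33};
        for (double size : sizes) {
            System.out.println(size + " -> " + scaled(size));
        }
    }
}
```

Under that assumption a content stream size of 22 corresponds to roughly 5.27 in the extracted output, so any size comparison has to be made on one scale consistently.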

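The reason you see bold (Bal'four's disease') or italic (Balfour,) text is that this text information is "rendered" in rendering mode 3, i.e. invisible, and a scanned image is shown on top of it. Thus, you do not have any reliable information on the text style (short of applying style-aware OCR to that image).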
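In a content stream, the text rendering mode is set with the `Tr` operator, and mode 3 means the text is neither filled nor stroked. The excerpt in this answer does not show where the mode is set, so the sketch below simply assumes a `3 Tr` appears somewhere before the text. It scans the raw operator text and does not use PDFBox; the class name `InvisibleTextCheck` is made up, and `q`/`Q` state saving, `TJ` arrays, and escaped parentheses are ignored.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvisibleTextCheck {

    // Matches either "n Tr" (set text rendering mode) or "(text)Tj" (show text).
    private static final Pattern OPERATOR = Pattern.compile(
            "(?<![0-9.])([0-9]+)\\s+Tr\\b|\\(([^)]*)\\)\\s*Tj");

    /** Returns true if at least one string is shown and all of them use rendering mode 3. */
    public static boolean allTextInvisible(String contentStream) {
        int mode = 0;            // the rendering mode defaults to 0 (fill)
        boolean sawText = false;
        Matcher m = OPERATOR.matcher(contentStream);
        while (m.find()) {
            if (m.group(1) != null) {
                // Tr operator: the mode stays in effect for the text that follows.
                mode = Integer.parseInt(m.group(1));
            } else {
                sawText = true;
                if (mode != 3) {
                    return false;
                }
            }
        }
        return sawText;
    }

    public static void main(String[] args) {
        String stream = "3 Tr\n/F1 22 Tf\nBT\n1 0 0 1 61 2559 Tm\n(Bal'four's)Tj\nET\n";
        System.out.println("all text invisible: " + allTextInvisible(stream));
    }
}
```

A page where this returns true and which also draws a full-page image is a strong hint of a scanned page with an OCR text layer, in which case font descriptors say nothing about the visible style.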

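That being said, those sizes, if one tries to see any correlation, seem to be smaller for bold text, with the dividing line somewhere between 22 and 22.5 (my impression after looking at three or four dictionary entries). Thus, you might try to derive boldness from small sizes. I would not consider this certain, though: some bold text might be larger, and some non-bold text might be smaller.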
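A size-based guess along those lines could look like the sketch below. The cut-off of 22.25 is an assumption (simply the midpoint of the 22 to 22.5 range mentioned above), the class name `BoldGuess` is hypothetical, and the sizes are the unscaled ones from the content stream. As said above, this is a heuristic that can misclassify words.

```java
public class BoldGuess {

    // Hypothetical cut-off: midpoint of the 22 to 22.5 range from the answer.
    static final double THRESHOLD = 22.25;

    /** Guesses boldness from the unscaled content stream font size. */
    public static boolean looksBold(double unscaledFontSize) {
        return unscaledFontSize < THRESHOLD;
    }

    public static void main(String[] args) {
        String[] words = {"Bal'four's", "disease'", "[George", "Williatn", "Balfour,",
                "English", "physician,", "1822-1903.]", "Chloroma."};
        double[] sizes = {22, 21.46, 24.76, 22.71, 23.33, 24.76, 23, 24.09, 25.93};
        for (int i = 0; i < words.length; i++) {
            String guess = looksBold(sizes[i]) ? "possibly bold" : "probably not bold";
            System.out.println(words[i] + " (" + sizes[i] + "): " + guess);
        }
    }
}
```

For the sample entry this flags only "Bal'four's" and "disease'", which matches the scanned image, but it cannot tell italic from regular text at all.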

This question, "java - PDFBox text extraction with bold/italic information does not work for some files", comes from Stack Overflow: https://stackoverflow.com/questions/21207943/
