character-encoding - PDFBox 2.0: Overcoming dictionary key encoding

Tags: character-encoding pdfbox

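I am using Apache PDFBox 2.0.1 to extract text from PDF forms, including the details of the AcroForm fields. From a radio button field I dig out the appearance dictionary; I am interested in the /N and /D entries (the normal and "down" appearances). Like this (interactive BeanShell):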

field = form.getField(fieldName);
widgets = field.getWidgets();
print("Field Name: " + field.getPartialName() + " (" + widgets.size() + ")");
for (annot : widgets) {
  ap = annot.getAppearance();
  keys = ap.getCOSObject().getDictionaryObject("N").keySet();
  keyList = new ArrayList(keys.size());
  for (cosKey : keys) {keyList.add(cosKey.getName());}
  print(String.join("|", keyList));
}

The output is:

Field Name: Krematorier (6)
Off|Skogskrem
Off|R�cksta
Off|Silverdal
Off|Stork�llan
Off|St Botvid
Nyn�shamn|Off

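The question-mark blobs should be the Swedish characters "ä" or "å". Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1, while PDFBox assumes they are Unicode, I guess.

Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?

The sample PDF form can be downloaded here: http://www.stockholm.se/PageFiles/85478/KYF%20211%20Best%C3%A4llning%202014.pdf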

Best answer

Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1 while PDFBox assumes they are Unicode, I guess.

Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?

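Changing the assumed encoding

PDFBox interprets the encoding of the bytes in a name (only names can be used as dictionary keys in PDF) when it reads the name from the source PDF, in BaseParser.parseCOSName():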

/**
 * This will parse a PDF name from the stream.
 *
 * @return The parsed PDF name.
 * @throws IOException If there is an error reading from the stream.
 */
protected COSName parseCOSName() throws IOException
{
    readExpectedChar('/');
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    int c = seqSource.read();
    while (c != -1)
    {
        int ch = c;
        if (ch == '#')
        {
            int ch1 = seqSource.read();
            int ch2 = seqSource.read();
            if (isHexDigit((char)ch1) && isHexDigit((char)ch2))
            {
                String hex = "" + (char)ch1 + (char)ch2;
                try
                {
                    buffer.write(Integer.parseInt(hex, 16));
                }
                catch (NumberFormatException e)
                {
                    throw new IOException("Error: expected hex digit, actual='" + hex + "'", e);
                }
                c = seqSource.read();
            }
            else
            {
                // check for premature EOF
                if (ch2 == -1 || ch1 == -1)
                {
                    LOG.error("Premature EOF in BaseParser#parseCOSName");
                    c = -1;
                    break;
                }
                seqSource.unread(ch2);
                c = ch1;
                buffer.write(ch);
            }
        }
        else if (isEndOfName(ch))
        {
            break;
        }
        else
        {
            buffer.write(ch);
            c = seqSource.read();
        }
    }
    if (c != -1)
    {
        seqSource.unread(c);
    }
    String string = new String(buffer.toByteArray(), Charsets.UTF_8);
    return COSName.getPDFName(string);
}

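As you can see, after reading the name bytes and interpreting the # escape sequences, PDFBox unconditionally interprets the resulting bytes as UTF-8 encoded. To change this, you therefore have to patch this PDFBox class and replace the charset named at the bottom.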
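To illustrate the effect of that final decoding step, here is a minimal JDK-only sketch (no PDFBox involved). It assumes, based on the asker's RUPS observation, that the "ä" in the key "Nynäshamn" is stored as the single ISO-8859-1 byte 0xE4; the class and method names are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

class NameBytesDemo {
    // Assumed raw bytes of the key "Nyn\u00E4shamn" in the PDF:
    // the a-umlaut is the single ISO-8859-1 byte 0xE4.
    static final byte[] RAW = {'N', 'y', 'n', (byte) 0xE4, 's', 'h', 'a', 'm', 'n'};

    // What the last line of parseCOSName() does with the collected bytes.
    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // What a patched parser using ISO-8859-1 would produce instead.
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // 0xE4 followed by 's' is not valid UTF-8, so the byte becomes U+FFFD.
        System.out.println(asUtf8(RAW));
        // ISO-8859-1 maps every byte to exactly one character.
        System.out.println(asLatin1(RAW));
    }
}
```

Note that swapping the charset globally would in turn break names that really are UTF-8 encoded, so such a patch is only suitable for documents known to use a single-byte encoding.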

Is PDFBox correct here?

According to the specification, when treating a name object as text,

the sequence of bytes (after expansion of NUMBER SIGN sequences, if any) should be interpreted according to UTF-8, a variable-length byte-encoded representation of Unicode in which the printable ASCII characters have the same representations as in ASCII.

(Section 7.3.5 Name Objects, ISO 32000-1)

BaseParser.parseCOSName() implements exactly this.
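The rule quoted from the specification can be sketched in a few lines of plain Java: first expand the # sequences into bytes, then interpret the bytes as UTF-8. This is a simplified illustration, not the PDFBox code; the class and method names are hypothetical, and the input is assumed to be the ASCII text of a name without its leading slash.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class NameEscapeDemo {
    // Expands #xx sequences of a raw name (without the leading '/') into bytes.
    // Assumes all other characters of the raw name are plain ASCII.
    static byte[] expand(String rawName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < rawName.length(); i++) {
            char ch = rawName.charAt(i);
            if (ch == '#' && i + 2 < rawName.length()) {
                int hi = Character.digit(rawName.charAt(i + 1), 16);
                int lo = Character.digit(rawName.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write(hi * 16 + lo);
                    i += 2; // skip the two hex digits just consumed
                    continue;
                }
            }
            out.write(ch);
        }
        return out.toByteArray();
    }

    // Interprets the expanded bytes as UTF-8, as section 7.3.5 recommends.
    static String toText(String rawName) {
        return new String(expand(rawName), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(toText("Nyn#C3#A4shamn")); // a-umlaut written as UTF-8
        System.out.println(toText("Nyn#E4shamn"));    // a-umlaut written as ISO-8859-1
    }
}
```

The first name decodes cleanly because C3 A4 is the UTF-8 encoding of "ä"; the second contains a lone E4 byte and therefore ends up with a replacement character.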

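Nonetheless, PDFBox's implementation is not entirely correct, because interpreting the name as a string when there is no need to is already wrong: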

name objects shall be treated as atomic within a PDF file. Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a conforming reader. However, occasionally the need arises to treat a name object as text

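A PDF library should therefore handle names as byte arrays for as long as possible and only derive a string representation when one is explicitly required; only then should the recommendation above (assume UTF-8) come into play. The specification even points out where this can cause trouble: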

PDF does not prescribe what UTF-8 sequence to choose for representing any given piece of externally specified text as a name object. In some cases, multiple UTF-8 sequences may represent the same logical text. Name objects defined by different sequences of bytes constitute distinct name objects in PDF, even though the UTF-8 sequences may have identical external interpretations.

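Another situation becomes evident in the document at hand: a byte sequence that does not constitute valid UTF-8 is still a valid name. Such names are changed by the method above, though, because every byte or subsequence that cannot be decoded is replaced by the Unicode replacement character "�" (U+FFFD). As a result, different names may collapse into a single one.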
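The collapsing of distinct names can be reproduced with the JDK alone. The two byte sequences below are hypothetical names chosen for illustration (they differ only in one non-ASCII byte); the class name is likewise made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class NameCollisionDemo {
    // Two distinct names: "a", one ISO-8859-1 byte, "b".
    static final byte[] NAME_A = {'a', (byte) 0xE4, 'b'}; // a-umlaut in ISO-8859-1
    static final byte[] NAME_B = {'a', (byte) 0xE5, 'b'}; // a-ring in ISO-8859-1

    // Lenient UTF-8 decoding, as done when the name is turned into a String.
    static String decode(byte[] nameBytes) {
        return new String(nameBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The byte sequences differ, so these are different PDF names ...
        System.out.println(Arrays.equals(NAME_A, NAME_B));
        // ... but both invalid bytes become U+FFFD, so the strings are equal.
        System.out.println(decode(NAME_A).equals(decode(NAME_B)));
    }
}
```

A dictionary keyed by the decoded strings could therefore no longer tell these two entries apart.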

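Another issue is that PDFBox does not act symmetrically when writing the PDF back: instead, it interprets the String representation of the name (which was obtained as a UTF-8 interpretation if the name was read from a PDF) using plain US_ASCII, cf. COSName.writePDF(OutputStream):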

public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        int current = (b + 256) % 256;

        // be more restrictive than the PDF spec, "Name Objects", see PDFBOX-2073
        if (current >= 'A' && current <= 'Z' ||
                current >= 'a' && current <= 'z' ||
                current >= '0' && current <= '9' ||
                current == '+' ||
                current == '-' ||
                current == '_' ||
                current == '@' ||
                current == '*' ||
                current == '$' ||
                current == ';' ||
                current == '.')
        {
            output.write(current);
        }
        else
        {
            output.write('#');
            output.write(String.format("%02X", current).getBytes(Charsets.US_ASCII));
        }
    }
}

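Thus, any interesting Unicode character is replaced by the US_ASCII default replacement character, which I assume is "?".

So it is fortunate that PDF names usually contain only ASCII characters... ;)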
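The asymmetry can be checked without PDFBox. The sketch below restates the escaping logic of the writePDF excerpt in simplified form (returning a String instead of writing to a stream); the class and method names are hypothetical. In Java, String.getBytes(Charset) substitutes the charset's default replacement for unmappable characters, which for US_ASCII is the byte for "?".

```java
import java.nio.charset.StandardCharsets;

class AsciiWriteDemo {
    // Simplified restatement of the escaping in COSName.writePDF shown above.
    static String escapeLikeWritePdf(String name) {
        StringBuilder out = new StringBuilder("/");
        for (byte b : name.getBytes(StandardCharsets.US_ASCII)) {
            int current = b & 0xFF;
            boolean plain = current >= 'A' && current <= 'Z'
                    || current >= 'a' && current <= 'z'
                    || current >= '0' && current <= '9'
                    || "+-_@*$;.".indexOf(current) >= 0;
            if (plain) {
                out.append((char) current);
            } else {
                out.append('#').append(String.format("%02X", current));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The a-umlaut is not ASCII, so getBytes turns it into '?' (0x3F),
        // which is then written as the escape #3F.
        System.out.println(escapeLikeWritePdf("Nyn\u00E4shamn"));
    }
}
```

So even a name that was decoded correctly from UTF-8 would not survive a read/write round trip under this logic.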

Historically

According to the implementation note in the PDF 1.4 Reference,

In Acrobat 4.0 and earlier versions, a name object being treated as text will typically be interpreted in a host platform encoding, which depends on the operating system and the local language. For Asian languages, this encoding may be something like Shift-JIS or Big Five. Consequently, it will be necessary to distinguish between names encoded this way and ones encoded as UTF-8. Fortunately, UTF-8 encoding is very stylized and its use can usually be recognized. A name that is found not to conform to UTF-8 encoding rules can instead be interpreted according to host platform encoding.

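So the sample document at hand appears to follow the conventions of Acrobat 4, i.e. of the last century.

The source code excerpts are from PDFBox 2.0.0, but at first glance they appear unchanged in 2.0.1 and in the development trunk.

Regarding character-encoding - PDFBox 2.0: Overcoming dictionary key encoding, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36964496/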
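The detection strategy described in that implementation note (try UTF-8 strictly, otherwise fall back to a host platform encoding) might look like the following sketch. It uses only the JDK, the class and method names are hypothetical, and it presupposes access to the raw name bytes, which the PDFBox 2.0 parser shown above does not keep; using it would therefore still require a patched parser.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class NameFallbackDecoder {
    // Decodes name bytes as UTF-8 if they are valid UTF-8,
    // otherwise with the given fallback ("host platform") charset.
    static String decode(byte[] nameBytes, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(nameBytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: treat it as a legacy single-byte name.
            return new String(nameBytes, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] legacy = {'N', 'y', 'n', (byte) 0xE4, 's', 'h', 'a', 'm', 'n'};
        byte[] modern = "Nyn\u00E4shamn".getBytes(StandardCharsets.UTF_8);
        // Both print the same text when ISO-8859-1 is the assumed fallback.
        System.out.println(decode(legacy, StandardCharsets.ISO_8859_1));
        System.out.println(decode(modern, StandardCharsets.ISO_8859_1));
    }
}
```

The fallback charset is a guess about the producing system; ISO-8859-1 fits the Swedish sample here, but other documents may need a different one.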
