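I am using Apache PDFBox 2.0.1 to extract text from PDF forms, including the details of the AcroForm fields. From a radio button field I dig out the appearance dictionary. I am interested in the /N and /D entries (the normal and "down" appearances). Like this (interactive Bean shell):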
field = form.getField(fieldName);
widgets = field.getWidgets();
print("Field Name: " + field.getPartialName() + " (" + widgets.size() + ")");
for (annot : widgets) {
ap = annot.getAppearance();
keys = ap.getCOSObject().getDictionaryObject("N").keySet();
keyList = new ArrayList(keys.size());
for (cosKey : keys) {keyList.add(cosKey.getName());}
print(String.join("|", keyList));
}
The output is:
Field Name: Krematorier (6)
Off|Skogskrem
Off|R�cksta
Off|Silverdal
Off|Stork�llan
Off|St Botvid
Nyn�shamn|Off
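The question mark blobs should be the Swedish characters "ä" or "å". Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1, while PDFBox assumes they are Unicode, I guess.
Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?
The sample PDF form can be downloaded here: http://www.stockholm.se/PageFiles/85478/KYF%20211%20Best%C3%A4llning%202014.pdf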
Best answer
Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1 while PDFBox assumes they are Unicode, I guess.
Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?
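Changing the assumed encoding
PDFBox interprets the encoding of the bytes in a name (only names can be used as dictionary keys in a PDF) when the name is read from the source PDF, in BaseParser.parseCOSName():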
/**
* This will parse a PDF name from the stream.
*
* @return The parsed PDF name.
* @throws IOException If there is an error reading from the stream.
*/
protected COSName parseCOSName() throws IOException
{
readExpectedChar('/');
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
int c = seqSource.read();
while (c != -1)
{
int ch = c;
if (ch == '#')
{
int ch1 = seqSource.read();
int ch2 = seqSource.read();
if (isHexDigit((char)ch1) && isHexDigit((char)ch2))
{
String hex = "" + (char)ch1 + (char)ch2;
try
{
buffer.write(Integer.parseInt(hex, 16));
}
catch (NumberFormatException e)
{
throw new IOException("Error: expected hex digit, actual='" + hex + "'", e);
}
c = seqSource.read();
}
else
{
// check for premature EOF
if (ch2 == -1 || ch1 == -1)
{
LOG.error("Premature EOF in BaseParser#parseCOSName");
c = -1;
break;
}
seqSource.unread(ch2);
c = ch1;
buffer.write(ch);
}
}
else if (isEndOfName(ch))
{
break;
}
else
{
buffer.write(ch);
c = seqSource.read();
}
}
if (c != -1)
{
seqSource.unread(c);
}
String string = new String(buffer.toByteArray(), Charsets.UTF_8);
return COSName.getPDFName(string);
}
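As you can see, after reading the name bytes and resolving the # escape sequences, PDFBox unconditionally interprets the resulting bytes as UTF-8. To change this, you have to patch this PDFBox class and replace the charset named in the second-to-last statement.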
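To see why the keys come out garbled, here is a minimal JDK-only sketch that does not involve PDFBox at all. It assumes the name bytes in the file are ISO-8859-1, as RUPS suggests; the class name `NameDecodingDemo` and the hand-written sample bytes are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class NameDecodingDemo {
    // "Råcksta" as ISO-8859-1 would store it (assumption: 'å' is the single byte 0xE5).
    static final byte[] RAW = {'R', (byte) 0xE5, 'c', 'k', 's', 't', 'a'};

    // What an unconditional UTF-8 decode yields: 0xE5 announces a multi-byte
    // sequence, but 'c' is not a continuation byte, so Java substitutes U+FFFD.
    static String asUtf8() {
        return new String(RAW, StandardCharsets.UTF_8);
    }

    // What decoding with the presumed actual encoding yields.
    static String asLatin1() {
        return new String(RAW, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(asUtf8());
        System.out.println(asLatin1());
    }
}
```

The first decoding corresponds to the garbled key in the question's output; the second is presumably what a parser patched to use ISO-8859-1 would produce for this particular file.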
Is PDFBox correct here?
According to the specification, when treating a name object as text
the sequence of bytes (after expansion of NUMBER SIGN sequences, if any) should be interpreted according to UTF-8, a variable-length byte-encoded representation of Unicode in which the printable ASCII characters have the same representations as in ASCII.
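(section 7.3.5 Name Objects, ISO 32000-1)
BaseParser.parseCOSName() implements just that.
Nonetheless, PDFBox's implementation is not completely correct, because interpreting a name as a string when there is no need to is already wrong: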
name objects shall be treated as atomic within a PDF file. Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a conforming reader. However, occasionally the need arises to treat a name object as text
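Thus, a PDF library should handle names as byte arrays for as long as possible and only look for a string representation when one is explicitly required. Only then should the recommendation above (to assume UTF-8) come into play. The specification even indicates where this may cause trouble: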
PDF does not prescribe what UTF-8 sequence to choose for representing any given piece of externally specified text as a name object. In some cases, multiple UTF-8 sequences may represent the same logical text. Name objects defined by different sequences of bytes constitute distinct name objects in PDF, even though the UTF-8 sequences may have identical external interpretations.
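One way this can happen is Unicode normalization; that is my own illustration, the specification does not name a specific cause. The sketch below (class and constant names are hypothetical) encodes the same visible text once with a precomposed "ä" and once with "a" plus a combining diaeresis. The UTF-8 bytes differ, so as PDF names they would be distinct, even though the text is the same after normalization.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

final class EquivalentNamesDemo {
    static final String PRECOMPOSED = "R\u00E4cksta";  // 'ä' as the single code point U+00E4
    static final String DECOMPOSED = "Ra\u0308cksta";  // 'a' followed by combining diaeresis U+0308

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Byte-level comparison, which is what decides PDF name identity.
    static boolean sameBytes() {
        return Arrays.equals(utf8(PRECOMPOSED), utf8(DECOMPOSED));
    }

    // Text-level comparison after normalizing to NFC.
    static boolean sameTextAfterNfc() {
        return Normalizer.normalize(DECOMPOSED, Normalizer.Form.NFC).equals(PRECOMPOSED);
    }

    public static void main(String[] args) {
        System.out.println("same bytes: " + sameBytes());
        System.out.println("same text after NFC: " + sameTextAfterNfc());
    }
}
```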
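Another case becomes apparent in the document at hand: a byte sequence that does not constitute valid UTF-8 is still a valid name. Such names are changed by the method above, though: every unparsable byte or subsequence is replaced by the Unicode replacement character "�" (U+FFFD). Thus, different names may collapse into a single one.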
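The collapse can be reproduced with plain JDK calls. In this sketch the two byte arrays are invented examples (assumed to be ISO-8859-1 encoded names differing only in "å" versus "ä"), and `NameCollisionDemo` is a hypothetical class name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class NameCollisionDemo {
    // Two different names at the byte level (0xE5 = 'å', 0xE4 = 'ä' in ISO-8859-1).
    static final byte[] NAME_A = {'R', (byte) 0xE5, 'c', 'k', 's', 't', 'a'};
    static final byte[] NAME_B = {'R', (byte) 0xE4, 'c', 'k', 's', 't', 'a'};

    static boolean distinctAsBytes() {
        return !Arrays.equals(NAME_A, NAME_B);
    }

    // Both invalid bytes turn into U+FFFD, so the decoded strings are equal.
    static boolean equalAsUtf8Strings() {
        String a = new String(NAME_A, StandardCharsets.UTF_8);
        String b = new String(NAME_B, StandardCharsets.UTF_8);
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println("distinct as bytes: " + distinctAsBytes());
        System.out.println("equal as UTF-8 strings: " + equalAsUtf8Strings());
    }
}
```

Because the substitution is lossy, the original byte cannot be recovered from the decoded string afterwards.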
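Another issue is that PDFBox does not act symmetrically when writing a PDF back. The String representation of a name (which was obtained by a UTF-8 interpretation if it was read from a PDF) is encoded using plain US_ASCII, cf. COSName.writePDF(OutputStream):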
public void writePDF(OutputStream output) throws IOException
{
output.write('/');
byte[] bytes = getName().getBytes(Charsets.US_ASCII);
for (byte b : bytes)
{
int current = (b + 256) % 256;
// be more restrictive than the PDF spec, "Name Objects", see PDFBOX-2073
if (current >= 'A' && current <= 'Z' ||
current >= 'a' && current <= 'z' ||
current >= '0' && current <= '9' ||
current == '+' ||
current == '-' ||
current == '_' ||
current == '@' ||
current == '*' ||
current == '$' ||
current == ';' ||
current == '.')
{
output.write(current);
}
else
{
output.write('#');
output.write(String.format("%02X", current).getBytes(Charsets.US_ASCII));
}
}
}
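Thus, any interesting Unicode character is replaced by the US_ASCII default replacement character, which I assume is "?".
So it is fortunate that PDF names most often contain only ASCII characters... ;)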
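The effect of encoding the name with US_ASCII can be checked with a small JDK-only sketch. `writeName` below is a hypothetical helper that mimics the escaping logic of the excerpt above and returns the result as a string; it is not the PDFBox method itself.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class NameWriteDemo {
    // Mimics the excerpt: encode as US-ASCII, then #-escape everything outside a small whitelist.
    static String writeName(String name) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write('/');
        // Characters that US-ASCII cannot represent become '?' (0x3F) at this point.
        byte[] bytes = name.getBytes(StandardCharsets.US_ASCII);
        for (byte b : bytes) {
            int current = b & 0xFF;
            boolean plain = current >= 'A' && current <= 'Z'
                    || current >= 'a' && current <= 'z'
                    || current >= '0' && current <= '9'
                    || "+-_@*$;.".indexOf(current) >= 0;
            if (plain) {
                out.write(current);
            } else {
                byte[] esc = String.format("#%02X", current).getBytes(StandardCharsets.US_ASCII);
                out.write(esc, 0, esc.length);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(writeName("R\u00E5cksta")); // name containing 'å'
        System.out.println(writeName("R\uFFFDcksta")); // name as PDFBox parsed it from the sample
    }
}
```

In this sketch the replacement character is indeed "?", which is not in the whitelist and is therefore written as #3F. A name that was stored in the file as /R#E5cksta, or with the raw byte 0xE5, would thus come back out as /R#3Fcksta.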
Historically
According to the implementation notes in the PDF 1.4 reference,
In Acrobat 4.0 and earlier versions, a name object being treated as text will typically be interpreted in a host platform encoding, which depends on the operating system and the local language. For Asian languages, this encoding may be something like Shift-JIS or Big Five. Consequently, it will be necessary to distinguish between names encoded this way and ones encoded as UTF-8. Fortunately, UTF-8 encoding is very stylized and its use can usually be recognized. A name that is found not to conform to UTF-8 encoding rules can instead be interpreted according to host platform encoding.
Thus, the sample document at hand seems to follow the conventions of Acrobat 4, i.e. conventions from the last century.
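The heuristic from that implementation note can be sketched with the JDK's strict charset decoder. This is only an illustration under assumptions: it requires access to the raw name bytes (which the PDFBox parser excerpt above does not hand out), `decodeName` and `LegacyNameDecoder` are hypothetical names, and the fallback charset stands in for the "host platform encoding".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class LegacyNameDecoder {
    // Try strict UTF-8 first; if the bytes violate UTF-8 rules, use the fallback charset.
    static String decodeName(byte[] raw, Charset fallback) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            return new String(raw, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {'R', (byte) 0xE5, 'c', 'k', 's', 't', 'a'};
        byte[] utf8 = "R\u00E5cksta".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeName(latin1, StandardCharsets.ISO_8859_1));
        System.out.println(decodeName(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

As the note says, UTF-8 can "usually" be recognized, not always: some byte sequences are valid in both encodings, so such a fallback remains a heuristic.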
The source code excerpts are from PDFBox 2.0.0, but at first glance they do not seem to have changed in 2.0.1 or in the development trunk.
This question, "character-encoding - PDFBox 2.0: Overcoming dictionary key encoding", comes from Stack Overflow: https://stackoverflow.com/questions/36964496/