php - Reading PDF, strange encoding with the TJ operator

Tags: php pdf

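I am currently trying to extract text from a PDF document, but I have run into some strange cases involving the Tj operator. Usually I deal with cases like this: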

   Tc (SOME_TEXT) Tj

Now I have come across cases like this:
   Tm  [
        ( )1.828
        (5)1.841
        (2)1.828
        (2)1.828
        (4)1.841
        (9)1.828
        (.)1.828
        (6)1.841
        (4)
       ]
   TJ 

which converts to the string '52249.64'. Now I have run into yet another strange case:

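   Td  (
        \t\004\007\020\007\016\016\026\020
       )
   Tj

The only information I could find is that the string passed to Tj is always interpreted according to the font's encoding or CMap. (In this case I would expect it to be a CIDFont with a CMap.)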

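I still don't understand it. Are these values indices that point to offsets in some kind of character array, or do I have to decode them? Thanks!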

Best answer

As @Paulo already pointed out in his comment, you should first consult the PDF specification, currently ISO 32000-1; Adobe provides a free copy of it.

On the topic of text extraction, see section 9.10, Extraction of Text Content, and in particular:

9.10.2 Mapping Character Codes to Unicode Values

A conforming reader can use these methods, in the priority given, to map a character code to a Unicode value. Tagged PDF documents, in particular, shall provide at least one of these methods (see 14.8.2.4.2, "Unicode Mapping in Tagged PDF"):

  • If the font dictionary contains a ToUnicode CMap (see 9.10.3, "ToUnicode CMaps"), use that CMap to convert the character code to Unicode.

  • If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Annex D):

    a) Map the character code to a character name according to Table D.1 and the font’s Differences array.

    b) Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.

  • If the font is a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection:

    a) Map the character code to a character identifier (CID) according to the font’s CMap.

    b) Obtain the registry and ordering of the character collection used by the font’s CMap (for example, Adobe and Japan1) from its CIDSystemInfo dictionary.

    c) Construct a second CMap name by concatenating the registry and ordering obtained in step (b) in the format registry–ordering–UCS2 (for example, Adobe–Japan1–UCS2).

    d) Obtain the CMap with the name constructed in step (c) (available from the ASN Web site; see the Bibliography).

    e) Map the CID obtained in step (a) according to the CMap obtained in step (d), producing a Unicode value.

NOTE Type 0 fonts whose descendant CIDFonts use the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection (as specified in the CIDSystemInfo dictionary) shall have a supplement number corresponding to the version of PDF supported by the conforming reader. See Table 3 for a list of the character collections corresponding to a given PDF version. (Other supplements of these character collections can be used, but if the supplement is higher-numbered than the one corresponding to the supported PDF version, only the CIDs in the latter supplement are considered to be standard CIDs.)

If these methods fail to produce a Unicode value, there is no way to determine what the character code represents in which case a conforming reader may choose a character code of their choosing.
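To make the first method in the quoted list more concrete, here is a minimal sketch of a ToUnicode lookup. It is written in Python purely for illustration (the logic should port to PHP directly), and it rests on several assumptions: the CMap stream has already been decompressed to text, it only uses `beginbfchar` sections and the simple three-operand form of `beginbfrange`, and all character codes have a fixed byte width. The names `parse_tounicode`, `decode_codes` and `SAMPLE_CMAP` are hypothetical, and the sample CMap is invented, not taken from the asker's file.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"

def parse_tounicode(cmap_text):
    """Build a {character code: Unicode string} table from ToUnicode CMap text."""
    table = {}
    # bfchar sections hold pairs: <source code> <destination as UTF-16BE hex>
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, section):
            table[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange sections hold triples: <first code> <last code> <first destination>
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        # Assumption: entries using the array form <lo> <hi> [ ... ] are skipped.
        section = re.sub(HEX + r"\s*" + HEX + r"\s*\[.*?\]", "", section, flags=re.S)
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, section):
            width = len(dst) // 2
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                value = int(dst, 16) + offset  # consecutive codes, consecutive targets
                table[code] = value.to_bytes(width, "big").decode("utf-16-be")
    return table

def decode_codes(data, table, width=1):
    """Split raw string bytes into fixed-width codes and map each one to Unicode."""
    chars = []
    for i in range(0, len(data) - width + 1, width):
        code = int.from_bytes(data[i:i + width], "big")
        chars.append(table.get(code, "\ufffd"))  # U+FFFD marks an unmapped code
    return "".join(chars)

# Invented sample data, for illustration only.
SAMPLE_CMAP = """
1 begincodespacerange <00> <FF> endcodespacerange
2 beginbfchar
<04> <0048>
<09> <0054>
endbfchar
1 beginbfrange
<10> <12> <0061>
endbfrange
"""

if __name__ == "__main__":
    table = parse_tounicode(SAMPLE_CMAP)
    print(decode_codes(b"\x09\x04\x10\x07", table))
```

A real extractor would take the code width from the CMap's codespace ranges rather than from a parameter, and would fall back to the other methods in the list when the font has no ToUnicode entry.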



If some of the terms used here are unfamiliar, read up on them in ISO 32000-1 or in the other specifications it references.

So, to get acceptable text extraction results, make your text extractor support the methods described in that section.
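Before any of those mapping methods can be applied, the operands of Tj and TJ have to be turned into raw character codes. The following is a rough sketch of that step, again in Python for illustration only. It assumes the operand bytes have already been isolated from the content stream, handles only literal `( ... )` strings (hex strings `<...>` are ignored), and does not normalise unescaped line endings. `unescape_literal` and `tj_strings` are hypothetical helper names.

```python
# Single-letter escapes defined for PDF literal strings; \( \) and \\ map to themselves.
ESCAPES = {ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09, ord("b"): 0x08, ord("f"): 0x0C}

def unescape_literal(body: bytes) -> bytes:
    """Turn the bytes between the outer ( ) of a literal string into raw character codes."""
    out = bytearray()
    i, n = 0, len(body)
    while i < n:
        c = body[i]
        i += 1
        if c != 0x5C:                      # not a backslash: copy the byte as-is
            out.append(c)
            continue
        if i >= n:                         # trailing backslash: nothing left to escape
            break
        c = body[i]
        if 0x30 <= c <= 0x37:              # \ddd: one to three octal digits
            j = i
            while j < n and j < i + 3 and 0x30 <= body[j] <= 0x37:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif c in (0x0A, 0x0D):            # backslash + end-of-line: line continuation
            i += 1
            if c == 0x0D and i < n and body[i] == 0x0A:
                i += 1
        else:                              # named escape, or the escaped byte itself
            out.append(ESCAPES.get(c, c))
            i += 1
    return bytes(out)

def tj_strings(array_body: bytes):
    """Yield the string operands of a TJ array, skipping the numeric position adjustments."""
    i, n = 0, len(array_body)
    while i < n:
        if array_body[i] != ord("("):
            i += 1
            continue
        depth, j = 1, i + 1
        while j < n and depth:
            c = array_body[j]
            if c == 0x5C:
                j += 1                     # skip the byte following a backslash
            elif c == ord("("):
                depth += 1
            elif c == ord(")"):
                depth -= 1
            j += 1
        yield unescape_literal(array_body[i + 1:j - 1])
        i = j

if __name__ == "__main__":
    tj = rb"[( )1.828(5)1.841(2)1.828(2)1.828(4)1.841(9)1.828(.)1.828(6)1.841(4)]"
    print(b"".join(tj_strings(tj)))
    print(list(unescape_literal(rb"\t\004\007\020\007\016\016\026\020")))
```

The second call shows that the escaped string from the question is just a sequence of small byte values. Those values are character codes, and what they mean depends entirely on the current font's encoding, CMap or ToUnicode map, as described in the quoted section.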

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33412329/
