java - Convert PDF to XML using Java

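I am trying to build a prototype that converts a PDF file into an XML file. The result is odd: all the characters come out as symbols. I think the error is in the StringBuffer that collects the data from the byte array. Can anyone with Java knowledge help?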

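The prototype uses the iText API. To read the PDF file we use the PdfReader class. The data is first read into a byte array, and the strings found in it are then collected in a StringBuffer and turned back into a String. After that we use a StreamResult, which acts as the holder for the result of the transformation to XML.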

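Next, the Transformer class processes XML from a variety of sources and writes the transformation output to a variety of sinks. The TransformerHandler then listens for SAX ContentHandler parse events and transforms them into the result.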

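The startElement() and endElement() methods of the TransformerHandler class create the tags in the XML file. startElement() is called at the beginning of every element and endElement() at the end of every element in the XML document.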

import com.lowagie.text.*;
import com.lowagie.text.pdf.*;
import java.io.*;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.sax.*;
import javax.xml.transform.stream.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class Cp2x {

        static StreamResult streamResult;
        static TransformerHandler handler;
        static AttributesImpl atts;

        public static void main(String[] args) throws IOException {

                try {
                        Document document = new Document();
                        document.open();
                        PdfReader reader = new PdfReader("C:\\helloworld.pdf");
                        PdfDictionary page = reader.getPageN(1);
                        PRIndirectReference objectReference = (PRIndirectReference) page
                                        .get(PdfName.CONTENTS);
                        PRStream stream = (PRStream) PdfReader
                                        .getPdfObject(objectReference);
                        byte[] streamBytes = PdfReader.getStreamBytes(stream);
                        PRTokeniser tokeniser = new PRTokeniser(streamBytes);

                        StringBuffer string_buffer = new StringBuffer();
                        while (tokeniser.nextToken()) {
                                if (tokeniser.getTokenType() == PRTokeniser.TK_STRING) {
                                        string_buffer.append(tokeniser.getStringValue());
                                }
                        }
                        String test = string_buffer.toString();
                        streamResult = new StreamResult("test.xml");
                        initXML();
                        process(test);
                        closeXML();
                        document.add(new Paragraph(".."));
                        document.close();
                } catch (Exception e) {
                        // Do not swallow failures silently; at least report them.
                        e.printStackTrace();
                }
        }

        public static void initXML() throws ParserConfigurationException,
                        TransformerConfigurationException, SAXException {
                SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory
                                .newInstance();

                handler = tf.newTransformerHandler();
                Transformer serializer = handler.getTransformer();
                serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
                serializer.setOutputProperty(
                                "{http://xml.apache.org/xslt}indent-amount", "4");
                serializer.setOutputProperty(OutputKeys.INDENT, "yes");
                handler.setResult(streamResult);
                handler.startDocument();
                atts = new AttributesImpl();
                handler.startElement("", "", "Document", atts);
        }

        public static void process(String s) throws SAXException {
                String[] elements = s.split("\\|");
                atts.clear();
                handler.startElement("", "", "Note", atts);
                handler.characters(elements[0].toCharArray(), 0, elements[0].length());
                handler.endElement("", "", "Note");
        }

        public static void closeXML() throws SAXException {
                handler.endElement("", "", "Document");
                handler.endDocument();
        }
}
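
The SAX calls made by initXML(), process() and closeXML() can be exercised on their own, without any PDF involved, to check that the XML side is not what produces the symbols. The following is a minimal sketch using only JDK classes; the class name SaxWriterSketch and the method toXml are hypothetical, and it writes to a StringWriter instead of test.xml so the result is easy to inspect.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not part of the question's code.
final class SaxWriterSketch {

    /** Wraps the given text in a Document element containing one Note element. */
    static String toXml(String text) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            Transformer serializer = handler.getTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            AttributesImpl noAttributes = new AttributesImpl();
            handler.startDocument();
            // Every startElement() must be matched by an endElement() in reverse order.
            handler.startElement("", "Document", "Document", noAttributes);
            handler.startElement("", "Note", "Note", noAttributes);
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement("", "Note", "Note");
            handler.endElement("", "Document", "Document");
            handler.endDocument();
            return out.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("Hello World"));
    }
}
```

If a known string passed to toXml comes out readable, the garbled characters must already be present in the string extracted from the PDF.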

Best answer

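As @sudmong said, there is an encoding issue: PRTokeniser should not be used to read strings from inside page content streams. It only works properly outside page content streams, because it assumes one particular character encoding, while the encoding of strings inside a page content stream depends entirely on the encoding of the font that is current at that step of the content description. Cf. ISO 32000-1, section 7.3.4.2 Literal Strings for strings outside content streams, and section 9.6.6 Character Encoding for strings inside content streams.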
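
To illustrate why the same bytes can turn into symbols, here is a minimal sketch in plain Java (no iText). It assumes a hypothetical subset font whose encoding assigns the codes 1 to 4 to the glyphs actually used; the names FontEncodingSketch, decodeNaively and decodeWithFont are made up for this example. Real fonts are more involved (multi-byte codes, ToUnicode maps), so this only shows the principle.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not iText API.
final class FontEncodingSketch {

    /** Decodes string bytes the naive way: one byte is one Latin-1 character. */
    static String decodeNaively(byte[] codes) {
        return new String(codes, StandardCharsets.ISO_8859_1);
    }

    /** Decodes string bytes through the code-to-character map of the current font. */
    static String decodeWithFont(byte[] codes, Map<Integer, Character> fontEncoding) {
        StringBuilder text = new StringBuilder();
        for (byte code : codes) {
            Character ch = fontEncoding.get(code & 0xFF); // treat the byte as unsigned
            text.append(ch != null ? ch : '\uFFFD');      // unknown code: replacement char
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Assumed encoding of a subset font: only the glyphs in use get a code.
        Map<Integer, Character> fontEncoding = new HashMap<>();
        fontEncoding.put(1, 'H');
        fontEncoding.put(2, 'e');
        fontEncoding.put(3, 'l');
        fontEncoding.put(4, 'o');

        byte[] shown = {1, 2, 3, 3, 4}; // operand of a text-showing operator

        System.out.println(decodeWithFont(shown, fontEncoding)); // prints "Hello"
        System.out.println(decodeNaively(shown).length());       // 5 control characters
    }
}
```

The byte array is identical in both calls; only the decoder that knows the font's encoding recovers the text.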

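As @BrunoLowagie pointed out, you also completely ignore that page content is not only located in the immediate page content stream but also in XObjects referenced from there, cf. ISO 32000-1, section 8.10 Form XObjects. He also pointed out that the strings in a content stream need not be in reading order, cf. ibidem, section 9.4 Text Objects.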
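
The reading-order point can be sketched as follows. This is a deliberately simple illustration in plain Java, assuming the position of each shown string has already been computed from the text state; the type TextChunk and the method inReadingOrder are hypothetical, and chunks count as being on the same line only if their y coordinates are exactly equal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration, not iText API.
final class ReadingOrderSketch {

    /** A shown string together with the start point of its baseline. */
    static final class TextChunk {
        final float x;
        final float y;
        final String text;

        TextChunk(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Joins chunks top to bottom, then left to right within a line. */
    static String inReadingOrder(List<TextChunk> chunks) {
        List<TextChunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingDouble((TextChunk c) -> -c.y) // larger y is higher on the page
                .thenComparingDouble(c -> c.x));
        StringBuilder out = new StringBuilder();
        Float currentLine = null;
        for (TextChunk chunk : sorted) {
            if (currentLine != null) {
                out.append(chunk.y == currentLine ? ' ' : '\n');
            }
            out.append(chunk.text);
            currentLine = chunk.y;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Order in which the strings happen to appear in the content stream.
        List<TextChunk> streamOrder = Arrays.asList(
                new TextChunk(100f, 700f, "World"),
                new TextChunk(50f, 680f, "Second line"),
                new TextChunk(50f, 700f, "Hello"));
        System.out.println(inReadingOrder(streamOrder));
    }
}
```

Appending strings in stream order, as the question's loop does, would give "WorldSecond lineHello" for this input, whereas sorting by position first yields "Hello World" followed by "Second line".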

You also ignore that the value of the Contents entry of the page dictionary can be either a stream or an array of streams:

The value shall be either a single stream or an array of streams. If the value is an array, the effect shall be as if all of the streams in the array were concatenated, in order, to form a single stream. Conforming writers can create image objects and other resources as they occur, even though they interrupt the content stream. The division between streams may occur only at the boundaries between lexical tokens (see 7.2, "Lexical Conventions") but shall be unrelated to the page’s logical content or organization. Applications that consume or produce PDF files need not preserve the existing structure of the Contents array. Conforming writers shall not create a Contents array containing no elements.

(section 7.7.3.3, Page Objects, in ISO 32000-1)
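
The consequence for a parser is that the content must be assembled before it is tokenised. Below is a minimal sketch in plain Java that models the already decoded streams as byte arrays; the names PageContentsSketch and joinContents are hypothetical. A single stream is treated as a list with one element. Because the split between streams may fall only at token boundaries, the sketch inserts a newline between parts so that the last token of one stream cannot merge with the first token of the next; that separator is a design choice of this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not iText API.
final class PageContentsSketch {

    /** Joins the decoded content streams of one page, in order, into one byte array. */
    static byte[] joinContents(List<byte[]> decodedStreams) {
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        for (int i = 0; i < decodedStreams.size(); i++) {
            if (i > 0) {
                joined.write('\n'); // keep tokens of adjacent streams apart
            }
            byte[] part = decodedStreams.get(i);
            joined.write(part, 0, part.length);
        }
        return joined.toByteArray();
    }

    public static void main(String[] args) {
        List<byte[]> contentsArray = Arrays.asList(
                "BT /F1 12 Tf".getBytes(StandardCharsets.US_ASCII),
                "(Hello) Tj ET".getBytes(StandardCharsets.US_ASCII));
        byte[] content = joinContents(contentsArray);
        System.out.println(new String(content, StandardCharsets.US_ASCII));
    }
}
```

The question's code casts the Contents value straight to PRIndirectReference and PRStream, which covers only the single-stream case. In iText itself, PdfReader.getPageContent(int) returns the content of a page as one byte array and may be a more convenient starting point than resolving the Contents entry by hand.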

If you really want to write a parser yourself, you had better study ISO 32000-1 first.

Also have a look at iText's ...text.pdf.parser package; it already is a fairly good tool for parsing PDF content. If you like, you can help improve it.

Regarding "java - Convert PDF to XML using Java", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/14831528/
