java - Reading the first N characters of a PDF file using PDFBox

I wrote the following function, which uses PDFBox to print out the text in a PDF:

private String readFirstNChars(int N) { // N has not been used
    File currentFile = this.pdfFile;

    try {
        PDFParser parser = new PDFParser(new RandomAccessBufferedFileInputStream(currentFile));
        parser.parse();
        // try-with-resources: closing the PDDocument also closes the underlying COSDocument.
        try (PDDocument pdDocument = new PDDocument(parser.getDocument())) {
            PDFTextStripper pdfTextStripper = new PDFTextStripper();
            // Restrict extraction to the first page only.
            pdfTextStripper.setStartPage(1);
            pdfTextStripper.setEndPage(1);
            return pdfTextStripper.getText(pdDocument);
        }
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}

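I was thinking of printing the first N characters of parsedText, but then I realized that if the files I read can be very large, this approach makes little sense: it loads the entire text into memory and only then takes the first N characters. Is there a way to read just N characters from the PDF?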

Best answer

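You will probably need the source code of PDFParser so that you can add a suitable method to it, or write your own parser. A PDF is more than just readable text, so in essence you need to parse the document, discard everything that is not readable text, and then count the actual text you find.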
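As a possible alternative to modifying PDFParser, the "count the actual text you find" step can be sketched on the output side. The sketch below assumes PDFBox 2.x, where PDFTextStripper offers writeText(PDDocument, Writer) in addition to getText. The names BoundedWriter and LimitReachedException are hypothetical helpers, not part of PDFBox. The writer keeps at most N characters and then throws, which should abort the extraction before later pages are processed. PDFBox presumably still has to read the document structure when the file is opened, so this limits the extracted text, not the initial parsing.

```java
import java.io.IOException;
import java.io.Writer;

/** Hypothetical signal that the character limit was reached; used only to stop extraction early. */
class LimitReachedException extends IOException {
    LimitReachedException(int limit) {
        super("character limit of " + limit + " reached");
    }
}

/** Hypothetical helper: a Writer that keeps at most `limit` characters, then throws. */
class BoundedWriter extends Writer {
    private final int limit;
    private final StringBuilder buffer = new StringBuilder();

    BoundedWriter(int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.limit = limit;
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        // Keep only as many characters as still fit under the limit.
        int remaining = limit - buffer.length();
        buffer.append(cbuf, off, Math.min(len, remaining));
        if (buffer.length() >= limit) {
            // Tell the caller (e.g. the text stripper) to stop producing text.
            throw new LimitReachedException(limit);
        }
    }

    @Override
    public void flush() { }

    @Override
    public void close() { }

    @Override
    public String toString() {
        return buffer.toString();
    }
}

// Assumed usage with PDFBox 2.x (not compiled as part of this sketch):
//
//   BoundedWriter out = new BoundedWriter(N);
//   try (PDDocument doc = PDDocument.load(pdfFile)) {
//       new PDFTextStripper().writeText(doc, out);
//   } catch (LimitReachedException e) {
//       // expected: the first N characters have been collected
//   }
//   String firstN = out.toString();
```

Note that the line and page separators emitted by PDFTextStripper count toward the limit as well. If the document contains fewer than N characters, no exception is thrown and toString() returns everything that was written.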

Regarding java - Reading the first N characters of a PDF file using PDFBox, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31346445/