c# - How do I extract text from a PDF and decode the characters?

I am using iTextSharp to extract text from a PDF document with the following code:

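using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static bool does_document_text_have_keyword(string keyword, 
                       string pdf_src, Report report_object)  // TEST
{
    PdfReader pdfReader = null;
    try
    {
        pdfReader = new PdfReader(pdf_src);
        int count = pdfReader.NumberOfPages;
        for (int page = 1; page <= count; page++)  // PDF page numbers start at 1
        {
           // A fresh strategy per page, so text does not accumulate across pages.
           ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
           string currentText = PdfTextExtractor.GetTextFromPage
                           (pdfReader, page, strategy);
           // Round-trip through the system default code page and back to UTF-8.
           currentText = Encoding.UTF8.GetString
                           (Encoding.Convert
                             (Encoding.Default,
                              Encoding.UTF8,
                              Encoding.Default.GetBytes(currentText)));

           report_object.log(currentText);  // TEST

           if (currentText.IndexOf
                (keyword, StringComparison.OrdinalIgnoreCase) != -1) return true;
        }
        return false;
    }
    catch
    {
        // Any read or parse failure is reported as "keyword not found".
        return false;
    }
    finally
    {
        // Close the reader on every path, including the early return above.
        if (pdfReader != null) pdfReader.Close();
    }
}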

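The problem is that the extracted text has no spaces, as if every space had been replaced with an empty string. The PDF document itself does contain spaces. Does anyone know what is going on here?

Best answer

I believe your problem is SimpleTextExtractionStrategy. From the API documentation at http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/SimpleTextExtractionStrategy.html: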

If the PDF renders text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be inserted into the output.
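
To make the quoted sentence about font metrics more concrete, here is a minimal sketch of how a gap-based heuristic can decide whether to emit a space between two runs of text. This is not iText's actual implementation: the `Chunk` type, the `Join` method, and the "half a space width" threshold are all hypothetical and chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SpaceHeuristicSketch
{
    // Hypothetical type: one run of text and its horizontal extent on a line.
    public sealed class Chunk
    {
        public readonly string Text;
        public readonly double StartX;
        public readonly double EndX;
        public readonly double SpaceWidth; // width of ' ' in the chunk's font

        public Chunk(string text, double startX, double endX, double spaceWidth)
        {
            Text = text;
            StartX = startX;
            EndX = endX;
            SpaceWidth = spaceWidth;
        }
    }

    // Joins chunks left to right, inserting a space only when the horizontal
    // gap between neighbours exceeds half a space width (assumed threshold).
    public static string Join(IEnumerable<Chunk> chunks)
    {
        StringBuilder sb = new StringBuilder();
        Chunk previous = null;
        foreach (Chunk chunk in chunks)
        {
            if (previous != null)
            {
                double gap = chunk.StartX - previous.EndX;
                if (gap > previous.SpaceWidth / 2.0)
                {
                    sb.Append(' ');
                }
            }
            sb.Append(chunk.Text);
            previous = chunk;
        }
        return sb.ToString();
    }
}
```

With a space width of 5, two chunks separated by a gap of 5 units are joined with a space, while a gap of 0.5 units joins them directly. A PDF that positions words explicitly, without real space characters, depends entirely on this kind of guess, which is one way spaces can go missing from extracted text.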

Try using LocationTextExtractionStrategy instead. Its documentation states:

A text extraction renderer that keeps track of relative position of text on page The resultant text will be relatively consistent with the physical layout that most PDF files have on screen.
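
As a rough sketch of that suggestion, assuming iTextSharp 5.x, the keyword search could look like the following. The class name `PdfKeywordSearch` and the method name `ContainsKeyword` are hypothetical; `PdfReader`, `PdfTextExtractor.GetTextFromPage`, and `LocationTextExtractionStrategy` are the iTextSharp types already named in this thread.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfKeywordSearch
{
    // Returns true if any page of the PDF contains the keyword (case-insensitive).
    public static bool ContainsKeyword(string pdfPath, string keyword)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Position-aware strategy, created fresh for each page.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

                if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) != -1)
                {
                    return true;
                }
            }
            return false;
        }
        finally
        {
            // Release the file even when returning early or when parsing throws.
            reader.Close();
        }
    }
}
```

The sketch also leaves out the Encoding.Default to UTF-8 round trip from the question. GetTextFromPage already returns a .NET string, so re-encoding it should not be necessary, and passing it through the system default code page can only lose characters that the code page cannot represent.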

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/13976233/
