I am using iTextSharp to extract text from a PDF document with the following code:
    public static bool does_document_text_have_keyword(string keyword,
        string pdf_src, Report report_object) // TEST
    {
        try
        {
            PdfReader pdfReader = new PdfReader(pdf_src);
            string currentText;
            int count = pdfReader.NumberOfPages;
            for (int page = 1; page <= count; page++)
            {
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                currentText = PdfTextExtractor.GetTextFromPage
                    (pdfReader, page, strategy);
                currentText = Encoding.UTF8.GetString
                    (ASCIIEncoding.Convert
                        (Encoding.Default,
                         Encoding.UTF8,
                         Encoding.Default.GetBytes(currentText)));
                report_object.log(currentText); // TEST
                if (currentText.IndexOf
                    (keyword, StringComparison.OrdinalIgnoreCase) != -1) return true;
            }
            pdfReader.Close();
            return false;
        }
        catch
        {
            return false;
        }
    }
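The problem is that the extracted text contains no spaces, as if every space had been replaced with an empty string, even though the PDF document itself does contain spaces. Does anyone know what is going on here?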
Best answer
I believe your problem is SimpleTextExtractionStrategy. From the API documentation at http://api.itextpdf.com/itext/com/itextpdf/text/pdf/parser/SimpleTextExtractionStrategy.html:
If the PDF renders text in a non-top-to-bottom fashion, this will result in the text not being a true representation of how it appears in the PDF. This renderer also uses a simple strategy based on the font metrics to determine if a blank space should be inserted into the output.
Try using LocationTextExtractionStrategy instead. Its documentation states:
A text extraction renderer that keeps track of relative position of text on page The resultant text will be relatively consistent with the physical layout that most PDF files have on screen.
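As a minimal sketch of that suggestion, the method from the question could be adapted roughly as follows. It assumes iTextSharp 5.x; the class name PdfKeywordSearch and the method name DocumentContainsKeyword are hypothetical, and the logging call from the question is left out. The sketch also drops the Encoding.Default round trip, on the assumption that it is not needed: GetTextFromPage already returns a .NET string, and re-encoding it through the system code page can corrupt characters that code page cannot represent. Whether the spaces come back still depends on how the particular PDF positions its text.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfKeywordSearch
{
    // Hypothetical helper: returns true if any page of the PDF contains the keyword.
    public static bool DocumentContainsKeyword(string keyword, string pdfPath)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Use a fresh location-based strategy for each page so that
                // text chunks collected on earlier pages are not carried over.
                ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
                string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

                if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    return true;
                }
            }
            return false;
        }
        finally
        {
            // Close the reader on every path, including the early return above.
            reader.Close();
        }
    }
}
```

Unlike the code in the question, this version lets exceptions propagate instead of turning them into false, so a missing or unreadable file is not mistaken for "keyword not found".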
This question, "c# - How to extract text from a PDF and decode characters?", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/13976233/