我正在使用 iText7 读取 pdf 文件中的文本。这对于第一页效果很好。之后,页面的内容不知何故变得困惑。因此,在文档的第 3 页,我有包含第 1 页和第 3 页内容的行。第 2 页的文本显示与第 1 页完全相同的行(但在“现实”中它们完全不同)。
- 第 1 页,实际:约 36 行,结果 36 行 -> 太棒了
- 第 2 页,实际:>50 行,结果 36 行(==第 1 页)
- 第 3 页,实际:~16 行,结果 47 行(与第 1 页的行添加和混合)
https://www.dropbox.com/s/63gy5cg1othy6ci/Dividenden_Microsoft.pdf?dl=0
为了阅读该文档,我使用以下代码:
using System;
using System.Collections.Generic;
using System.Linq;
namespace StockMarket
{
class PdfReader
{
/// <summary>
/// Reads PDF file by a given path.
/// </summary>
/// <param name="path">The path to the file</param>
/// <param name="pageCount">The number of pages to read (0=all, 1 by default) </param>
/// <returns></returns>
public static DocumentTree PdfToText(string path, int pageCount=1 )
{
var pages = new DocumentTree();
using (iText.Kernel.Pdf.PdfReader reader = new iText.Kernel.Pdf.PdfReader(path))
{
using (iText.Kernel.Pdf.PdfDocument pdfDocument = new iText.Kernel.Pdf.PdfDocument(reader))
{
var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();
// set up pages to read
int pagesToRead = 1;
if (pageCount > 0)
{
pagesToRead = pageCount;
}
if (pagesToRead > pdfDocument.GetNumberOfPages() || pageCount==0)
{
pagesToRead = pdfDocument.GetNumberOfPages();
}
// for each page to read...
for (int i = 1; i <= pagesToRead; ++i)
{
// get the page and save it
var page = pdfDocument.GetPage(i);
var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, strategy);
pages.Add(txt);
}
pdfDocument.Close();
reader.Close();
}
}
return pages;
}
}
/// <summary>
/// A class representing parts of a PDF document.
/// </summary>
class DocumentTree
{
/// <summary>
/// Constructor
/// </summary>
public DocumentTree()
{
Pages = new List<DocumentPage>();
}
private List<DocumentPage> _pages;
/// <summary>
/// The pages of the document
/// </summary>
public List<DocumentPage> Pages
{
get { return _pages; }
set { _pages = value; }
}
/// <summary>
/// Adds a <see cref="DocumentPage"/> to the document.
/// </summary>
/// <param name="page">The text of the <see cref="DocumentPage"/>.</param>
public void Add(string page)
{
Pages.Add(new DocumentPage(page));
}
}
/// <summary>
/// A class representing a single page of a document
/// </summary>
class DocumentPage
{
/// <summary>
/// Constructor
/// </summary>
/// <param name="pageContent">The pages content as text</param>
public DocumentPage(string pageContent)
{
// set the content to the input
CompletePage = pageContent;
// split the content by lines
var splitter = new string[] { "\n" };
foreach (var line in CompletePage.Split(splitter, StringSplitOptions.None))
{
// add lines to the page if the content is not empty
if (!string.IsNullOrWhiteSpace(line))
{
_lines.Add(new Line(line));
}
}
}
private List<Line> _lines = new List<Line>();
/// <summary>
/// The lines of text of the <see cref="DocumentPage"/>
/// </summary>
public List<Line> Lines
{
get
{
return _lines;
}
}
/// <summary>
/// The text of the complete <see cref="DocumentPage"/>.
/// </summary>
private string CompletePage;
}
/// <summary>
/// A class representing a single line of text
/// </summary>
class Line
{
/// <summary>
/// Constructor
/// </summary>
public Line(string lineContent)
{
CompleteLine = lineContent;
}
/// <summary>
/// The words of the <see cref="Line"/>.
/// </summary>
public List<string> Words
{
get
{
return CompleteLine.Split(" ".ToArray()).Where((word)=> { return !string.IsNullOrWhiteSpace(word); }).ToList();
}
}
/// <summary>
/// The complete text of the <see cref="Line"/>.
/// </summary>
private string CompleteLine;
public override string ToString()
{
return CompleteLine;
}
}
}
页面树是一个简单的页面树,由行(读取页面以“\n”分割)和由单词组成的行(行以“”分割)组成,但txt循环已经包含困惑的内容(所以我的树不会引起问题)。
感谢您的帮助。
最佳答案
一些解析事件监听器,特别是大多数文本提取策略,并不意味着可以在多个页面上重复使用。相反,您应该为每个页面创建一个新实例。
根据经验,每个此类监听器在解析页面时收集信息,然后允许您访问该数据(例如文本提取策略允许您访问收集的页面文本),很可能必须单独实例化如果您不希望累积所有页面的数据,请在每个页面上添加。
因此,在您的代码中移动策略实例化
var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();
进入for
循环:
// for each page to read...
for (int i = 1; i <= pagesToRead; ++i)
{
var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.LocationTextExtractionStrategy();
// get the page and save it
var page = pdfDocument.GetPage(i);
var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, strategy);
pages.Add(txt);
}
或者您可以将循环缩短为
// for each page to read...
for (int i = 1; i <= pagesToRead; ++i)
{
// get the page and save it
var page = pdfDocument.GetPage(i);
var txt = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page);
pages.Add(txt);
}
此 PdfTextExtractor.GetTextFromPage
重载每次都会在内部创建一个新的 LocationTextExtractionStrategy
实例。
关于c# - 为什么我在阅读 PDF 时页面内容会变得困惑?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/59615341/