pdf - How can I get the position of specified text with xpdf or mupdf?

I want to extract certain specified pieces of text from a PDF file, together with their positions.

I know that xpdf and mupdf can parse PDF files, so I think they can help me with this task.

But how do I use these two libraries to get the text positions?

Best answer

If you don't mind using a Python binding for MuPDF, here is a Python solution using PyMuPDF (I am one of its developers):

import json                     # needed for json.loads below
import fitz                     # the PyMuPDF module

doc = fitz.open("input.pdf")    # PDF input file
n = 0                           # page number to read (0-based)
page = doc[n]
wordlist = page.getTextWords()  # list of all words on the page, each with
# its position info (a rectangle containing the word)

# or, if you are only interested in blocks of lines belonging together:
blocklist = page.getTextBlocks()

# If you need more detail, use the JSON-based output, which also gives
# images and their positions, as well as font information for the text.
tdict = json.loads(page.getText("json"))

# Note: newer PyMuPDF releases spell these calls page.get_text("words"),
# page.get_text("blocks") and page.get_text("json").
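
The question asks for the position of specific text, while the word list covers every word on the page. A minimal sketch of narrowing it down follows. It assumes each entry has the layout (x0, y0, x1, y1, word, block_no, line_no, word_no). The helper name find_word_positions and the sample data are hypothetical, made up for illustration, so the snippet runs without a PDF file.

```python
from typing import List, Tuple

# Assumed layout of one entry in the word list:
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]
Rect = Tuple[float, float, float, float]


def find_word_positions(words: List[Word], target: str,
                        ignore_case: bool = True) -> List[Rect]:
    """Return the bounding box of every word that equals `target`."""
    if ignore_case:
        target = target.lower()
    hits: List[Rect] = []
    for x0, y0, x1, y1, text, *_ in words:
        candidate = text.lower() if ignore_case else text
        if candidate == target:
            hits.append((x0, y0, x1, y1))
    return hits


# Hypothetical sample data standing in for the word list of a real page.
sample_words: List[Word] = [
    (72.0, 72.0, 100.5, 84.0, "Invoice", 0, 0, 0),
    (104.0, 72.0, 130.0, 84.0, "total", 0, 0, 1),
    (72.0, 90.0, 98.0, 102.0, "Total", 1, 0, 0),
]

print(find_word_positions(sample_words, "total"))
# → [(104.0, 72.0, 130.0, 84.0), (72.0, 90.0, 98.0, 102.0)]
```

With a real document you would pass wordlist from the snippet above instead of sample_words. Extracted words may carry attached punctuation (for example "total,"), so an exact comparison can miss them, and stripping punctuation first may help. For multi-word phrases, PyMuPDF also provides a page-level search method (searchFor in older releases, search_for in newer ones) that returns the matching rectangles directly.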

If you're interested, we are on GitHub.

The original question, "pdf - How can I get the position of specified text with xpdf or mupdf?", is on Stack Overflow: https://stackoverflow.com/questions/7512674/
