python - How to run multiple files together on Refextract

Tags: python python-3.x reference pdftotext

I'm new to Python and I need to extract references from scientific papers. Here is the code I'm using:

from refextract import extract_references_from_file
import pandas as pd

references = extract_references_from_file('1503.07589.pdf')
dfref = pd.DataFrame(references)
dfref.to_excel('./refs.xlsx')

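With this code I can only extract references from one file at a time, but I need to extract references from several files together. Please tell me whether that is possible and, if so, how to do it. Thanks a lot!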

Best answer

The docs claim that the extracted references are returned as a dict:

Returns a dictionary with extracted references and stats.

That's not quite accurate; the function returns a list of dicts, one dictionary per reference.

So you just need to build up a longer list:

from refextract import extract_references_from_file

higgs_papers = ['1503.07589', '2008.05492']
references = []
for paper in higgs_papers:
    references.extend(extract_references_from_file(f'/tmp/{paper}.pdf'))

Now you have one bigger list, references, which you can turn into a bigger df.
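
As a rough illustration of that last step, here is a minimal sketch of how a list of dicts becomes a DataFrame. The sample entries and their keys (source, raw_ref) are made up for the example and are not guaranteed to match what refextract actually returns.

```python
import pandas as pd

# Hypothetical sample standing in for the combined `references` list.
# The keys below are illustrative only, not real refextract output.
references = [
    {"source": "1503.07589", "raw_ref": "first reference text"},
    {"source": "1503.07589", "raw_ref": "second reference text"},
    {"source": "2008.05492", "raw_ref": "third reference text"},
]

# Each dict becomes one row; each distinct key becomes one column.
dfref = pd.DataFrame(references)
print(dfref.shape)

# Writing to Excel as in the question (requires an Excel engine such as openpyxl):
# dfref.to_excel("./refs.xlsx")
```

If some dicts lack a key that others have, pandas fills the missing cells with NaN, so references from different papers can still share one table.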

You may also find glob convenient:

import glob

files = glob.glob('/tmp/*.pdf')
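
Putting the two ideas together, one possible sketch is shown below. The helper name collect_references and the source_file key are hypothetical, introduced only for this example; the extractor is passed in as a parameter, on the assumption that it takes a path and returns a list of dicts as described above.

```python
import glob
import os


def collect_references(pattern, extract):
    """Run `extract` on every file matching `pattern` and merge the results.

    `extract` is assumed to take a file path and return a list of dicts,
    e.g. refextract's extract_references_from_file.
    """
    rows = []
    for path in sorted(glob.glob(pattern)):  # sorted for a stable order
        for ref in extract(path):
            row = dict(ref)  # copy so the extractor's dict is left untouched
            # Hypothetical extra key recording which file the reference came from.
            row["source_file"] = os.path.basename(path)
            rows.append(row)
    return rows


# Intended usage (not executed here):
# from refextract import extract_references_from_file
# references = collect_references('/tmp/*.pdf', extract_references_from_file)
```

Recording the file name on each row is optional, but it keeps the combined spreadsheet traceable back to the individual papers.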

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/63428294/
