python - Reading tables from a PDF as strings with Tabula

Tags: python tabula

I am using tabula-py 2.0.4 and pandas 1.17.4 on Python 3.7. I am trying to read a PDF table into a DataFrame with tabula.read_pdf:

from tabula import read_pdf
fn = "file.pdf"
print(read_pdf(fn, pages='all', multiple_tables=True)[0])

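The problem is that the values are read as floats rather than strings.

I need them read as strings, so that when a value is 20.0000 I know it is given to four decimal places. At the moment it comes back as 20.0 instead of 20.0000.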
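To make the loss concrete, here is a minimal sketch using only the standard library. The literal "20.0000" stands in for a cell extracted from the PDF, and the variable names are illustrative. It shows that the trailing zeros survive only while the value stays textual (a str, or a Decimal built from that str).

```python
from decimal import Decimal

raw = "20.0000"            # the cell text as it appears in the PDF (illustrative value)

as_float = float(raw)      # a binary float carries no record of trailing zeros
as_decimal = Decimal(raw)  # Decimal keeps the exponent of the original literal

# Number of decimal places, recovered from the text representation
decimals_from_str = len(raw.split(".")[1])
decimals_from_decimal = -as_decimal.as_tuple().exponent

print(as_float)               # 20.0
print(as_decimal)             # 20.0000
print(decimals_from_str)      # 4
print(decimals_from_decimal)  # 4
```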

The input data in the PDF looks like this: [image]

The output of the code above is: [image]

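Best answer

You need to pass a few extra options to tabula.read_pdf. The following example parses a PDF file and interprets the detected columns in two different ways, first forcing every column to str and then letting pandas guess the types: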

import tabula

print(tabula.environment_info())

fname = ("https://github.com/chezou/tabula-py/raw/master/tests/resources/"
         "data.pdf")

# Columns interpreted as str
col2str = {'dtype': str}
kwargs = {'output_format': 'dataframe',
          'pandas_options': col2str,
          'stream': True}
df1 = tabula.read_pdf(fname, **kwargs)

print(df1[0].dtypes)
print(df1[0].head())

# Guessing column type
col2val = {'dtype': None}
kwargs = {'output_format': 'dataframe',
          'pandas_options': col2val,
          'stream': True}
df2 = tabula.read_pdf(fname, **kwargs)

print(df2[0].dtypes)
print(df2[0].head())

The output is:

Python version:
    3.7.6 (default, Jan  8 2020, 13:42:34) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
Java version:
    openjdk version "13.0.2" 2020-01-14
OpenJDK Runtime Environment (build 13.0.2+8)
OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)
tabula-py version: 2.0.4
platform: Darwin-19.3.0-x86_64-i386-64bit
uname:
    uname_result(system='Darwin', node='MacBook-Pro-10.local', release='19.3.0', version='Darwin Kernel Version 19.3.0: Thu Jan  9 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64', machine='x86_64', processor='i386')
linux_distribution: ('Darwin', '19.3.0', '')
mac_ver: ('10.15.3', ('', '', ''), 'x86_64')

None
'pages' argument isn't specified.Will extract only from page 1 by default.
Unnamed: 0    object
mpg           object
cyl           object
disp          object
hp            object
drat          object
wt            object
qsec          object
vs            object
am            object
gear          object
carb          object
dtype: object
          Unnamed: 0   mpg cyl   disp   hp  drat     wt   qsec vs am gear carb
0          Mazda RX4  21.0   6  160.0  110  3.90  2.620  16.46  0  1    4    4
1      Mazda RX4 Wag  21.0   6  160.0  110  3.90  2.875  17.02  0  1    4    4
2         Datsun 710  22.8   4  108.0   93  3.85  2.320  18.61  1  1    4    1
3     Hornet 4 Drive  21.4   6  258.0  110  3.08  3.215  19.44  1  0    3    1
4  Hornet Sportabout  18.7   8  360.0  175  3.15  3.440  17.02  0  0    3    2
'pages' argument isn't specified.Will extract only from page 1 by default.
Unnamed: 0     object
mpg           float64
cyl             int64
disp          float64
hp              int64
drat          float64
wt            float64
qsec          float64
vs              int64
am              int64
gear            int64
carb            int64
dtype: object
          Unnamed: 0   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
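The sample PDF above has no trailing zeros, so the difference between the two runs only shows up in the dtypes. The sketch below illustrates the effect on values like those in the question. It uses pandas alone, with a small in-memory CSV standing in for the extracted table (an assumption made so the example runs without Java or a PDF). The helper read_tables_as_text is hypothetical: it combines the pages='all' and multiple_tables=True arguments from the question with the pandas_options from the answer.

```python
import io

import pandas as pd

# In-memory stand-in for an extracted table (illustrative data, not from the PDF)
csv_text = "name,value\na,20.0000\nb,3.1400\n"

# Default behaviour: pandas infers float64 and the trailing zeros are lost
inferred = pd.read_csv(io.StringIO(csv_text))

# dtype=str keeps every cell exactly as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["value"].tolist())  # [20.0, 3.14]
print(as_text["value"].tolist())   # ['20.0000', '3.1400']


def read_tables_as_text(path, pages="all"):
    """Hypothetical helper: read all tables from a PDF with every cell kept as str."""
    import tabula  # imported lazily; needs tabula-py and a Java runtime

    return tabula.read_pdf(
        path,
        pages=pages,
        multiple_tables=True,
        pandas_options={"dtype": str},
    )
```

Calling read_tables_as_text("file.pdf")[0] would then be the string-preserving counterpart of the call in the question, assuming the file exists and Java is installed.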


A similar question about reading tables from a PDF as strings with Tabula can be found on Stack Overflow: https://stackoverflow.com/questions/60448160/
