python - How do I read a 1.3 GB CSV file with text information into a pandas object in Python?

Tags: python pandas csv text large-data

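I am trying to use "pd.read_csv" to read a 1.3 GB CSV file with two columns and 19,333 rows into a pandas DataFrame in Python, but it keeps failing with the error "CParserError: Error tokenizing data. C error: out of memory". I have tried many of the suggestions posted online, such as using "chunksize", but that does not seem to work either and only produces "Kernel died, restarting". Here is the output from running "pd.read_csv".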

import pandas as pd
import numpy as np
import os

os.chdir("/home/swhan/Downloads")

CORPUS = pd.read_csv('10k_2005_2008_file.csv')
Traceback (most recent call last):

  File "<ipython-input-1-8136c4f0354a>", line 7, in <module>
    CORPUS = pd.read_csv('10k_2005_2008_file.csv')

  File "/home/swhan/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)

  File "/home/swhan/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 401, in _read
    data = parser.read()

  File "/home/swhan/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 939, in read
    ret = self._engine.read(nrows)

  File "/home/swhan/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1508, in read
    data = self._reader.read(nrows)

  File "pandas/parser.pyx", line 848, in pandas.parser.TextReader.read (pandas/parser.c:10415)

  File "pandas/parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10691)

  File "pandas/parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas/parser.c:11437)

  File "pandas/parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:11308)

  File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:27037)

CParserError: Error tokenizing data. C error: out of memory

The CSV file consists of two columns, one for the ID and one for the long text associated with each ID. A subset of it looks like this:

id  text
12  python pandas read data of the form ...
13  how to remove file does not exist error ...
41  pandas unable to find files ...
99  issue with python is not a simple problem ...

[Image: CSV file picture]

Is there a way to read this file into a pandas DataFrame object? By the way, my desktop has 32 GB of RAM. Thanks in advance!

Attempt using Python code with "chunksize"

df = pd.DataFrame()
reader = pd.read_csv("10k_2005_2008_file.csv", chunksize=10**3)
for chunk in reader:
    df = pd.concat([df, chunk], ignore_index=True)

df
Out[6]: 
           ID                                               text
0      255618  ['ITEM1.BUSINESSIn this annual report onForm10...
1       94740  ['Item 1. Business.GeneralCommunity CapitalCor...
2      145200  ['ITEM 1.BUSINESSGeneralCommunityBank Shares o...
3      145201  ['ITEM 1. BUSINESSGeneralCommunity  Bank Share...
4      145202  ['Item 1. BusinessGeneralCommunity Bank Shares...
5      145203  ['Item1.BusinessGeneralCommunityBank Shares of...
6      221548  ['Item1.BusinessOverviewTravelzoo Inc. (the Co...
7      121633  ['Item1. BusinessGeneralSterling Financial Cor...
8      172796  ['Item 1. BusinessGeneralWe are a Maryland cor...
9      172797  ['Item 1. BusinessGeneralWe are a Maryland cor...
10     121632  ['Item 1.BusinessGeneralCompanyGrowthProfitabi...
11      28995  ['ITEM 1. Business.(Dollars in millions)We res...
12      28994  ['ITEM 1. Business.GeneralAt December31, 2004,...
13      28997  ['Item1.Business.GeneralService Corporation In...
14      28996  ['ITEM 1. Business.GeneralAt December31, 2004,...
15     118636  ['Item1.BusinessWe are a broadcast company pri...
16      28993  ['ITEM 1. Business.GeneralAt December31, 2004,...
17     101760  ['ITEM1.BUSINESSCorporateProfileCognex Corpora...
18     145752  ['Item 1: Election of Directors; Nomineesfor D...
19      94744  ['ITEM1.BUSINESS.GeneralCommunityCapital Corpo...
20      28999  ['Item1.Business.GeneralService Corporation In...
21      28998  ['Item1.Business.GeneralService Corporation In...
22       1868  ['ITEM1.BUSINESSCompany OverviewWe are a world...
23     269745  ['Item1"BusinessThe CompanyThe 2004 Reorganiza...
24     181343  ['ITEM 1.  BUSINESSMKS Instruments, Inc. ("the...
25     220768  ['ITEM1. BUSINESS  General  The Company  Sierr...
26     181345  ['Item1.BusinessMKS Instruments, Inc. (the Com...
27     145750  ['Item1. Business   BurlingtonNorthern Santa F...
28     181346  ['Item1.BusinessMKS Instruments, Inc. (the Com...
29     145751  ['Item 1: Election of Directors; Nominees for ...
      ...                                                ...
19303   26477  ['ITEM1.BUSINESS  Precision Castparts Corp. (P...
19304  256145  ['Item1 Business,Item1A Risk Factors, and Item...
19305  222814  ['Item1. Business.  General  Our company, Rock...
19306   73641  ['ITEM 1. BUSINESSGENERALTexas Regional Bancsh...
19307   66997  ['ITEM 1. BUSINESSOur CompanyWe are a leading ...
19308   66996  ['ITEM 1. BUSINESSOur CompanyWe are a leading ...
19309   66994  ['ITEM1. BUSINESS  Our Company  We are a leadi...
19310   66993  ['ITEM 1. BUSINESS   Our CompanyWe are a leadi...
19311    7929  ['Item1. Business(a)General development of bus...
19312  114251  ['Item1.BusinessGeneralTerra Nitrogen Company,...
19313  114250  ['Item1 BusinessGeneralTerra Nitrogen Company,...
19314  198077  ['Item1. BusinessGeneral DescriptionTeam Finan...
19315  162197  ["ITEM 1. BUSINESSWintrust Financial Corporati...
19316   25524  ['Item 1. BusinessEnvironmental. Contamination...
19317  190015  ['Item 1. Description of Business.GeneralEVCI ...
19318    5634  ['Item 1.BusinessGeneral  CDI Corp. (the Compa...
19319    5635  ['Item 1.BusinessGeneral  CDI Corp. (the Compa...
19320  190932  ['ITEM 1.   BUSINESSORGANIZATION AND GENERAL B...
19321  190933  ['ITEM 1.   BUSINESSORGANIZATION AND GENERAL B...
19322    5632  ['Item 1.BusinessGeneral  CDI Corp., (the Comp...
19323    5633  ['Item 1.BusinessGeneral  CDI Corp. (the Compa...
19324   38349  ['Item 1. BusinessThe CompanyNatures SunshineP...
19325  222816  ['Item1 above.Weoperate on a 52/53 week fiscal...
19326  222815  ['Item1. Business.GeneralOur company, Rockwell...
19327  213793  ['Item1.BusinessTvia,Inc. is a fabless semicon...
19328    8489  ['ITEM1.BusinessCrown Crafts, Inc. (the Compan...
19329  224247  ['Item1.Business   GENERAL   We are asolutions...
19330  198076  ['Item1. BusinessGeneral DescriptionTeam Finan...
19331   34149  ['Item1. BusinessVF Corporation, organized in ...
19332   34148  ['Item1 in PartI, Items 5, 6, 7, 7A, 8 and 9A ...

[19333 rows x 2 columns]

Best answer

The pandas docs say:

Note It is worth noting however, that concat (and therefore append) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.

frames = [ process_your_file(f) for f in files ]
result = pd.concat(frames)

So try this approach:

reader = pd.read_csv("10k_2005_2008_file.csv", chunksize=10**3)
df = pd.concat([x for x in reader], ignore_index=True)
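
To make the pattern easier to try without the original 1.3 GB file, here is a minimal sketch of the same idea on a tiny stand-in dataset. The in-memory CSV text, the chunk size of 4, and the variable names csv_text and chunks are assumptions made for illustration only; they do not come from the question. The sketch assumes Python 3 and a reasonably recent pandas.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: a tiny two-column CSV held in memory.
csv_text = "ID,text\n" + "".join(
    "{0},document number {0}\n".format(i) for i in range(10)
)

# With chunksize set, read_csv returns an iterator of DataFrames
# (here at most 4 rows each) instead of a single DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

# Collect all chunks first, then concatenate once. Calling pd.concat inside
# the loop would copy the growing frame on every iteration.
chunks = list(reader)
df = pd.concat(chunks, ignore_index=True)

print(len(chunks))
print(df.shape)
```

Concatenating once avoids the repeated copying, but the final DataFrame still has to fit in memory, so this may not be sufficient on its own if the parsed data is larger than the available RAM.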

Regarding "python - How do I read a 1.3 GB CSV file with text information into a pandas object in Python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45085876/
