python - AttributeError: 'generator' object has no attribute 'to_sql' while creating a dataframe using a generator

Tags: python python-3.x pandas dataframe generator

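I am trying to create a dataframe from a fixed-width file and load it into a PostgreSQL database. My input file is very large (~16 GB, 20 million records), so building a single dataframe consumes most of the available RAM and takes a long time to complete. I therefore decided to use the chunksize option (with a Python generator) and commit the records to the table chunk by chunk. But it fails with the error "AttributeError: 'generator' object has no attribute 'to_sql'".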

Inspired by the answer here: https://stackoverflow.com/a/47257676/2799214

Input file: test_file.txt

XOXOXOXOXOXO9
AOAOAOAOAOAO8
BOBOBOBOBOBO7
COCOCOCOCOCO6
DODODODODODO5
EOEOEOEOEOEO4
FOFOFOFOFOFO3
GOGOGOGOGOGO2
HOHOHOHOHOHO1

sample.py

import pandas.io.sql as psql
import pandas as pd
from sqlalchemy import create_engine

def chunck_generator(filename, header=False,chunk_size = 10 ** 5):
    for chunk in pd.read_fwf(filename, colspecs=[[0,12],[12,13]],index_col=False,header=None, iterator=True, chunksize=chunk_size):
        yield (chunk)

def _generator( engine, filename, header=False,chunk_size = 10 ** 5):
    chunk = chunck_generator(filename, header=False,chunk_size = 10 ** 5)
    chunk.to_sql('sample_table', engine, if_exists='replace', schema='sample_schema', index=False)
    yield row

if __name__ == "__main__":
    filename = r'test_file.txt'
    engine = create_engine('postgresql://ABCD:ABCD@ip:port/database')
    c = engine.connect()
    conn = c.connection
    generator = _generator(engine=engine, filename=filename)
    while True:
       print(next(generator))
    conn.close()

Error:

    chunk.to_sql('sample_table', engine, if_exists='replace', schema='sample_schema', index=False)
AttributeError: 'generator' object has no attribute 'to_sql'

My main goal is to improve performance. Please help me resolve the problem or suggest a better approach. Thanks in advance.

Best answer

`chunck_generator` returns a generator object, not the actual chunks. You need to iterate over that object to get the chunks out of it.

>>> def my_generator(x):
...     for y in range(x):
...         yield y
...
>>> g = my_generator(10)
>>> print(g.__class__)
<class 'generator'>
>>> ele = next(g, None)
>>> print(ele)
0
>>> ele = next(g, None)
>>> print(ele)
1
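
The same point can be shown without pandas or a database. The following is a minimal sketch: `FakeChunk`, `chunk_generator`, and `demo` are hypothetical names invented for illustration, and `FakeChunk.to_sql` only mimics the idea of a chunk that knows how to write itself somewhere. The generator object has no `to_sql` attribute, while each element it yields does.

```python
class FakeChunk:
    """Hypothetical stand-in for a DataFrame chunk; it only mimics to_sql."""

    def __init__(self, rows):
        self.rows = rows

    def to_sql(self, sink):
        # "Write" the rows by appending them to a plain list.
        sink.extend(self.rows)
        return len(self.rows)


def chunk_generator(rows, chunk_size):
    """Yield FakeChunk objects holding at most chunk_size rows each."""
    for start in range(0, len(rows), chunk_size):
        yield FakeChunk(rows[start:start + chunk_size])


def demo():
    gen = chunk_generator(list(range(5)), chunk_size=2)
    # The generator itself is not a chunk, so it has no to_sql attribute.
    generator_has_to_sql = hasattr(gen, "to_sql")
    sink = []
    # Iterating the generator hands back the chunks, which do have to_sql.
    written = [chunk.to_sql(sink) for chunk in gen]
    return generator_has_to_sql, written, sink


if __name__ == "__main__":
    print(demo())
```

Calling `gen.to_sql(...)` directly in this sketch would raise the same kind of `AttributeError` as in the question.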

So to fix your code, you just need to iterate over the generator:

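for chunk in chunck_generator(filename, header=False,chunk_size = 10 ** 5):
    # 'append' rather than 'replace', so a chunk does not overwrite the previous one
    yield chunk.to_sql('sample_table', engine, if_exists='append', schema='sample_schema', index=False)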
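
One related detail: the question drives its generator with `while True: print(next(generator))`. Once a generator is exhausted, `next()` raises `StopIteration`, so that loop would end with an exception even after the `to_sql` problem is fixed, whereas a `for` loop stops cleanly. A small sketch of the difference, using the hypothetical helpers `numbers`, `drain_with_for`, and `drain_with_next`:

```python
def numbers(limit):
    """Yield 0, 1, ..., limit - 1."""
    for n in range(limit):
        yield n


def drain_with_for(gen):
    # A for loop catches StopIteration internally and simply ends.
    return [item for item in gen]


def drain_with_next(gen):
    # Calling next() by hand means handling StopIteration explicitly.
    items = []
    while True:
        try:
            items.append(next(gen))
        except StopIteration:
            break
    return items


if __name__ == "__main__":
    print(drain_with_for(numbers(3)))
    print(drain_with_next(numbers(3)))
```

Both helpers collect the same items. The `for` version is usually preferable because it needs no exception handling.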

But that looks convoluted. I would do it like this:

import pandas as pd
from sqlalchemy import create_engine

def sql_generator(engine, filename, header=False,chunk_size = 10 ** 5):
    # With chunksize set, read_fwf returns an iterator of DataFrames
    frame = pd.read_fwf(
        filename,
        colspecs=[[0,12],[12,13]],
        index_col=False,
        header=None,
        iterator=True,
        chunksize=chunk_size
    )

    for chunk in frame:
        # 'append' instead of 'replace': with 'replace', every chunk would
        # drop the table, and only the last chunk would remain
        yield chunk.to_sql(
            'sample_table',
            engine,
            if_exists='append',
            schema='sample_schema',
            index=False
        )


if __name__ == "__main__":
    filename = r'test_file.txt'
    engine = create_engine('postgresql://USEE:PWD@IP:PORT/DB')
    for sql in sql_generator(engine, filename):
        # prints to_sql's return value (None or a row count, depending on the pandas version)
        print(sql)
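
To try the chunked approach without a PostgreSQL server, here is a self-contained sketch that runs against an in-memory SQLite database. Several things in it are assumptions made for illustration and are not in the original post: SQLite in place of PostgreSQL, the column names `code` and `digit`, the tiny `chunk_size`, and the helper names `load_fixed_width` and `run_demo`. The first chunk replaces the table and later chunks are appended, so the table ends up holding every row.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# The nine sample records from the question's test_file.txt.
SAMPLE_ROWS = [
    "XOXOXOXOXOXO9", "AOAOAOAOAOAO8", "BOBOBOBOBOBO7",
    "COCOCOCOCOCO6", "DODODODODODO5", "EOEOEOEOEOEO4",
    "FOFOFOFOFOFO3", "GOGOGOGOGOGO2", "HOHOHOHOHOHO1",
]


def load_fixed_width(path, conn, table="sample_table", chunk_size=4):
    """Stream a fixed-width file into `table`, one chunk at a time."""
    reader = pd.read_fwf(
        path,
        colspecs=[(0, 12), (12, 13)],
        header=None,
        names=["code", "digit"],  # assumed column names
        chunksize=chunk_size,     # makes read_fwf return an iterator
    )
    total = 0
    for i, chunk in enumerate(reader):
        # Replace the table on the first chunk, append on the rest.
        mode = "replace" if i == 0 else "append"
        chunk.to_sql(table, conn, if_exists=mode, index=False)
        total += len(chunk)
    return total


def run_demo():
    """Write the sample file, load it in chunks, and count stored rows."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test_file.txt")
        with open(path, "w") as fh:
            fh.write("\n".join(SAMPLE_ROWS) + "\n")
        conn = sqlite3.connect(":memory:")
        try:
            loaded = load_fixed_width(path, conn)
            query = "SELECT COUNT(*) FROM sample_table"
            stored = conn.execute(query).fetchone()[0]
        finally:
            conn.close()
    return loaded, stored


if __name__ == "__main__":
    print(run_demo())
```

Only one chunk is held in memory at a time, which is the point of `chunksize` for a 16 GB input. Since the stated goal is performance, the `chunksize` and `method` parameters of `to_sql` may also be worth experimenting with against the real database. Whether they help depends on the driver and the data.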

Regarding "python - AttributeError: 'generator' object has no attribute 'to_sql' while creating a dataframe using a generator", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50119668/
