python - Pandas error when writing chunks to a database using df.to_sql()

Existing database and desired outcome:

I have a fairly large SQLite database (12 GB, with tables of more than 44 million rows) that I would like to modify using Pandas in Python 3.

Example Objective: I hope to read one of these large tables (44 million rows) into a DF in chunks, manipulate the DF chunk, and write the result to a new table. If possible, I would like to replace the new table if it exists, and append each chunk to it.

Because my manipulations only add or modify columns, the new table should have the same number of rows as the original table.

Problem:

The main problem appears to stem from the following line in the code below:

df.to_sql(new_table, con=db, if_exists = "append", index=False)

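When this line runs in the code below, I consistently get one extra chunk of size=N, plus one more row than I expect.
The first time I run this code with a new table name, I get this error: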

 Traceback (most recent call last):
  File "example.py", line 23, in <module>
    for df in df_generator:
  File "/usr/local/lib/python3.5/site-packages/pandas/io/sql.py", line 1420, in _query_iterator
    data = cursor.fetchmany(chunksize)
sqlite3.OperationalError: SQL logic error or missing database

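If I then rerun the script with the same new table name, it runs for every chunk, plus one extra chunk, plus one extra row.

When the df.to_sql() line is commented out, the loop runs for the expected number of chunks.

Test example of the problem, with complete code:

Complete code: example.py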

import pandas as pd
import sqlite3

#Helper Functions Used in Example
def ren(invar, outvar, df):
    df.rename(columns={invar:outvar}, inplace=True)
    return(df)

def count_result(c, table):
    ([print("[*] total: {:,} rows in {} table"
        .format(r[0], table)) 
        for r in c.execute("SELECT COUNT(*) FROM {};".format(table))])


#Connect to Data
db = sqlite3.connect("test.db")
c = db.cursor()
new_table = "new_table"

#Load Data in Chunks
df_generator = pd.read_sql_query("select * from test_table limit 10000;", con=db, chunksize = 5000)

for df in df_generator:
    #Functions to modify data, example
    df = ren("name", "renamed_name", df)
    print(df.shape)
    df.to_sql(new_table, con=db, if_exists = "append", index=False)


#Count if new table is created
try:
    count_result(c, new_table)
except:
    pass

1. Result when #df.to_sql(new_table, con=db, if_exists = "append", index=False)

(the problem line is commented out):

$ python3 example.py 
(5000, 22)
(5000, 22)

This is what I expect, since the example code limits my large table to 10k rows.

2. Result when df.to_sql(new_table, con=db, if_exists = "append", index=False)

a. the problem line is not commented out

b. this is the first time the code is run with a new_table:

$ python3 example.py 
(5000, 22)
Traceback (most recent call last):
  File "example.py", line 23, in <module>
    for df in df_generator:
  File "/usr/local/lib/python3.5/site-packages/pandas/io/sql.py", line 1420, in _query_iterator
    data = cursor.fetchmany(chunksize)
sqlite3.OperationalError: SQL logic error or missing database

3. Result when df.to_sql(new_table, con=db, if_exists = "append", index=False)

a. the problem line is not commented out

b. the above code is run a second time with the new_table:

$ python3 example.py 
(5000, 22)
(5000, 22)
(5000, 22)
(1, 22)
[*] total: 20,001 rows in new_table table

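So the problem is that the code breaks on the first run (Result 2), and on the second run (Result 3) the total row count is more than double what I expected.

Any suggestions on how to resolve this would be greatly appreciated.

Best answer

You can try specifying: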

db = sqlite3.connect("test.db", isolation_level=None)
#  ---->                        ^^^^^^^^^^^^^^^^^^^^
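
A minimal sketch of that suggestion applied to the question's loop, assuming a small throwaway database in place of test.db. The temporary path, the two-column test_table and the 10-row sample are made up for illustration; only the isolation_level=None argument comes from the answer. With isolation_level=None the sqlite3 module runs in autocommit mode and does not open implicit transactions, which is presumably why the setting helps here.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Hypothetical stand-in for the question's test.db, created in a temp directory.
db_path = os.path.join(tempfile.mkdtemp(), "test.db")

# isolation_level=None switches the sqlite3 module to autocommit mode.
db = sqlite3.connect(db_path, isolation_level=None)

# Small made-up source table (10 rows) instead of the 44-million-row one.
db.execute("CREATE TABLE test_table (id INTEGER, name TEXT)")
db.executemany(
    "INSERT INTO test_table VALUES (?, ?)",
    [(i, "row{}".format(i)) for i in range(10)],
)

# Read in chunks of 5 rows, as the question does with chunksize=5000.
df_generator = pd.read_sql_query(
    "SELECT * FROM test_table;", con=db, chunksize=5
)

chunk_shapes = []
for df in df_generator:
    # Same kind of manipulation as the question's ren() helper.
    df = df.rename(columns={"name": "renamed_name"})
    chunk_shapes.append(df.shape)
    df.to_sql("new_table", con=db, if_exists="append", index=False)

# The new table should end up with as many rows as the source query returned.
total_rows = db.execute("SELECT COUNT(*) FROM new_table;").fetchone()[0]
print(chunk_shapes, total_rows)
db.close()
```

If the chunk count or the final row count differs from the source table, the read cursor was probably disturbed by the writes, which is the symptom described in the question.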

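Besides that, you can try increasing the chunk size, because otherwise the time between commits is too short for an SQLite DB, and I guess that is what causes this error... I would also recommend using PostgreSQL, MySQL/MariaDB or something similar. They are much more reliable and better suited to a database of this size...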
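
Separately from the error itself, part of the doubled row count in Result 3 comes from if_exists="append" adding to rows left over from the earlier, failed run. The following is only a sketch of one way to get the "replace the table, then append each chunk" behaviour the question asks for. The helper name rebuild_table, the temporary database and the 10-row sample table are assumptions made for illustration, and the chunksize is a parameter so it can be raised as suggested above.

```python
import os
import sqlite3
import tempfile

import pandas as pd


def rebuild_table(db, source, target, chunksize):
    """Copy source into target chunk by chunk, starting from an empty target.

    Hypothetical helper; table names are trusted and interpolated directly.
    """
    # Drop any leftover target before the SELECT starts, so a rerun does not
    # append on top of rows written by a previous run.
    db.execute('DROP TABLE IF EXISTS "{}"'.format(target))
    chunks = pd.read_sql_query(
        'SELECT * FROM "{}"'.format(source), con=db, chunksize=chunksize
    )
    for df in chunks:
        df = df.rename(columns={"name": "renamed_name"})
        df.to_sql(target, con=db, if_exists="append", index=False)
    return db.execute('SELECT COUNT(*) FROM "{}"'.format(target)).fetchone()[0]


# Made-up sample database standing in for test.db.
db_path = os.path.join(tempfile.mkdtemp(), "test.db")
db = sqlite3.connect(db_path, isolation_level=None)
db.execute("CREATE TABLE test_table (id INTEGER, name TEXT)")
db.executemany(
    "INSERT INTO test_table VALUES (?, ?)",
    [(i, "row{}".format(i)) for i in range(10)],
)

# Running the copy twice should give the same count both times.
first_run = rebuild_table(db, "test_table", "new_table", chunksize=5)
second_run = rebuild_table(db, "test_table", "new_table", chunksize=5)
print(first_run, second_run)
db.close()
```

The table is dropped before the chunked read begins rather than by passing if_exists="replace" on the first chunk, since SQLite may refuse to drop a table while a SELECT is still in progress on the same connection.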

Regarding python - Pandas error when writing chunks to a database using df.to_sql(), a similar question was found on Stack Overflow: https://stackoverflow.com/questions/49039734/
