I am trying to create a DataFrame by reading a csv file whose fields are separated by "#####" (five hash characters).
The code is:
import dask.dataframe as dd
df = dd.read_csv(r'D:\temp.csv', sep='#####', engine='python')
res = df.compute()
The error is:
dask.async.ValueError:
Dask dataframe inspected the first 1,000 rows of your csv file to guess the
data types of your columns. These first 1,000 rows led us to an incorrect
guess.
For example a column may have had integers in the first 1000
rows followed by a float or missing value in the 1,001-st row.
You will need to specify some dtype information explicitly using the
``dtype=`` keyword argument for the right column names and dtypes.
df = dd.read_csv(..., dtype={'my-column': float})
Pandas has given us the following error when trying to parse the file:
"The 'dtype' option is not supported with the 'python' engine"
Traceback
---------
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 263, in execute_task
result = _execute_task(task, data)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 245, in _execute_task
return func(*args2)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/dataframe/io.py", line 69, in _read_csv
raise ValueError(msg)
So how do I get around this?
If I follow the error message, I would have to specify a dtype for every column, which is impractical when I have 100+ columns.
If I read the file without the separator, everything works, but then ##### appears everywhere. Is there a way to strip it out after computing the result into a pandas DataFrame?
Any help is appreciated.
Best Answer
Read the entire file with dtype=object, which means all columns will be interpreted as type object. This should read in correctly and strip the ##### from each row. From there you can turn it into a pandas frame using the compute() method. Once the data is in a pandas frame, you can use the pandas infer_objects method to update the types without having to hard-code them.
import dask.dataframe as dd
df = dd.read_csv(r'D:\temp.csv', sep='#####', dtype='object').compute()
res = df.infer_objects()
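To illustrate what infer_objects does, here is a minimal pandas-only sketch with a toy frame (the column names are made up for the example). It shows an object-dtype column holding real Python integers being re-inferred as int64, while a genuine string column stays object:

```python
import pandas as pd

# Toy frame stored entirely as object dtype, mimicking the state of the
# data after read_csv(..., dtype='object').compute()
df = pd.DataFrame({'ints': [1, 2, 3], 'text': ['a', 'b', 'c']}, dtype='object')

# infer_objects re-examines object columns and restores a better dtype
# where the underlying values allow it
res = df.infer_objects()
print(res.dtypes['ints'])  # int64
print(res.dtypes['text'])  # object
```

One caveat: infer_objects only re-infers columns whose cells already hold typed Python scalars. Columns of numeric strings (which is what a csv parse with dtype='object' actually produces) remain object, and would need an explicit conversion such as pd.to_numeric to become numeric.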
Regarding python - reading a delimited csv in python dask, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34266263/