我有一个这样的txt文件:
`Empty DataFrame
Columns: [0, 1, 2, 3, 4]
Index: []
Empty DataFrame
Columns: [0, 1, 2, 3, 4]
Index: []
0 1 2 \
46 RNA/4v6p.csv,46AA/U/551 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
47 RNA/4v6p.csv,46AA/G/550 RNA/4v6p.csv,46AA/C/34 RNA/4v6p.csv,46WW_cis
48 RNA/4v6p.csv,46AA/A/553 RNA/4v6p.csv,46AA/U/30 RNA/4v6p.csv,46WW_cis
49 RNA/4v6p.csv,46AA/U/552 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
50 RNA/4v6p.csv,46AA/U/1199 RNA/4v6p.csv,46AA/G/1058 RNA/4v6p.csv,46WW_cis
3 4
46 NaN NaN
47 NaN NaN
48 NaN NaN
49 NaN NaN
50 NaN NaN`
我想将其读入一个 3 列的数组中。现在我尝试使用 pd.read_csv(self.filename,delim_whitespace=True) ,但这在尝试读取空 DataFrame 部分时给我带来了很多错误。如何让程序忽略这部分?
编辑: 最佳解决方案是如果我的文件中没有空数据帧。该文件是在很多文件中查找的结果,其中有一些是空的。我以为我已经通过给出异常来过滤空文件,这样在空文件中搜索的效果就不会存储在结果中。我想我是以错误的方式做的。有人可以纠正我吗?
from numpy import numpy.mean as nm
def find_same_direction_chain(self, results):
separation= lambda x: pd.Series([i for i in x.split('/')])
left_chain=self.data[0].apply(separation)
right_chain=self.data[1].apply(separation)
i=1
try:
while i<len(self.data[:])-5:
if nm(left_chain[2][i:i+3])>=nm(left_chain[2][i+2:i+5]) and nm(right_chain[2][i:i+3])>=nm(right_chain[2][i+2:i+5]) and len(self.data[:])>0:
if nm(left_chain[2][i+2:i+5])>=nm(left_chain[2][i+4:i+7]) and nm(right_chain[2][i+2:i+5])>=nm(right_chain[2][i+4:i+7]):
results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5]))
else: pass
i+=1
except ValueError:
results.bin.append(self.filename)
except TypeError:
results.data_structure_error.append(self.filename)
最佳答案
您可以使用:
import pandas as pd
import io
temp=u"""Empty DataFrame
Columns: [0, 1, 2, 3, 4]
Index: []
Empty DataFrame
Columns: [0, 1, 2, 3, 4]
Index: []
0 1 2 \
46 RNA/4v6p.csv,46AA/U/551 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
47 RNA/4v6p.csv,46AA/G/550 RNA/4v6p.csv,46AA/C/34 RNA/4v6p.csv,46WW_cis
48 RNA/4v6p.csv,46AA/A/553 RNA/4v6p.csv,46AA/U/30 RNA/4v6p.csv,46WW_cis
49 RNA/4v6p.csv,46AA/U/552 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
50 RNA/4v6p.csv,46AA/U/1199 RNA/4v6p.csv,46AA/G/1058 RNA/4v6p.csv,46WW_cis
3 4
46 NaN NaN
47 NaN NaN
48 NaN NaN
49 NaN NaN
50 NaN NaN"""
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), delim_whitespace=True, names=range(7))
#remove rows with NaN in columns 0 - 3
df = df.dropna(subset=[0,1,2,3])
#remove rows where first column contains text 'Columns'
df = df[~df.iloc[:,0].str.contains('Columns')]
#shift first row
df.iloc[0,:] = df.iloc[0,:].shift(-3)
#set first column to index
df = df.set_index(df.iloc[:,0])
#remove unnecessary columns
df = df.drop([0,4,5,6], axis=1)
print df
1 2 3
0
46 RNA/4v6p.csv,46AA/U/551 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
47 RNA/4v6p.csv,46AA/G/550 RNA/4v6p.csv,46AA/C/34 RNA/4v6p.csv,46WW_cis
48 RNA/4v6p.csv,46AA/A/553 RNA/4v6p.csv,46AA/U/30 RNA/4v6p.csv,46WW_cis
49 RNA/4v6p.csv,46AA/U/552 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
50 RNA/4v6p.csv,46AA/U/1199 RNA/4v6p.csv,46AA/G/1058 RNA/4v6p.csv,46WW_cis
或者使用 read_csv
中的 skiprows
解决方案:
#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), delim_whitespace=True, names=range(7), skiprows=6)
#remove rows with NaN
df = df.dropna(subset=[0,1,2,3])
#shift first row
df.iloc[0,:] = df.iloc[0,:].shift(-3)
#set first column to index
df = df.set_index(df.iloc[:,0])
#remove unnecessary columns
df = df.drop([0,4,5,6], axis=1)
print df
1 2 3
0
46 RNA/4v6p.csv,46AA/U/551 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
47 RNA/4v6p.csv,46AA/G/550 RNA/4v6p.csv,46AA/C/34 RNA/4v6p.csv,46WW_cis
48 RNA/4v6p.csv,46AA/A/553 RNA/4v6p.csv,46AA/U/30 RNA/4v6p.csv,46WW_cis
49 RNA/4v6p.csv,46AA/U/552 RNA/4v6p.csv,46AA/A/33 RNA/4v6p.csv,46WW_cis
50 RNA/4v6p.csv,46AA/U/1199 RNA/4v6p.csv,46AA/G/1058 RNA/4v6p.csv,46WW_cis
编辑:
您可以尝试更改(我没有示例数据,因此未经测试):
results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5]))
至:
if len(self.data[0:3][i:i+5]) > 0:
results.chains.append(str(self.filename+", "+str(i)+self.data[0:3][i:i+5]))
关于python - 在Python中使用panda读取文件时忽略空DataFrame,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/36130029/