python - Why can't I split a column into two columns in pandas?

I have a DataFrame column, df['anno'], that looks like this:

df.anno

0         type I secretion outer membrane protein, TolC...
1         conserved hypothetical protein [Shigella boyd...
2              Transposase [Congregibacter litoralis KT71]
3         Chain A, The Crystal Structure Of Chlorite Di...
4         chlorite dismutase, partial [uncultured bacte...
5         carbamoyl-phosphate synthase, small subunit [...
6         anthranilate synthase component 1 [endosymbio...
7         chlorite dismutase, partial [bacterium enrich...
8         peptidase dimerization domain protein [Myroid...
9         MULTISPECIES: MFS transporter [Enterobacteria...
10        CAAX amino terminal protease family protein [...
11        Fe-S oxidoreductase [Desulfovibrio africanus ...
12        phosphoenolpyruvate synthase/pyruvate phospha...

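Each row has two parts: 1. the protein name; 2. the microbial species, enclosed in "[......]".

I want to extract the protein name and discard the species, so I decided to first split the column into two columns at the "[" position.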

df2 = pd.DataFrame(df.anno.str.split("[", 1).tolist(), columns = ['protein','species'])

It returns an error:

TypeError: object of type 'NoneType' has no len()

I also tried:

df[['protein','species']] =  df['anno'].str.split('[', expand=True)

It also returned an error:

ValueError: Columns must be same length as key

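How can I do this? Is there another way to extract the protein name? Thanks!

Best answer

I think the problem is that some rows contain more than one [, so add n=1 to split in order to split on the first [ only. To remove the trailing ], use rstrip: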

df[['protein','species']] =  df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)
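To see why the original attempt raised "Columns must be same length as key", here is a minimal sketch. The four-row frame is the small sample used further down in this answer, not the asker's real data. Without n, every [ is a split point, so the row with the most brackets decides how many columns come back.

```python
import pandas as pd

# Small sample frame (same rows as the example later in this answer).
df = pd.DataFrame({"anno": ["protein [q][sd]", "protein", "Transposase [KT71]", None]})

# Without n, "protein [q][sd]" splits into three pieces, so expand=True
# produces three columns, which cannot be assigned to two column names.
unlimited = df["anno"].str.split("[", expand=True)
print(unlimited.shape[1])  # 3

# With n=1 only the first "[" is used, so there are at most two columns.
limited = df["anno"].str.rstrip("]").str.split("[", n=1, expand=True)
print(limited.shape[1])  # 2
```

Rows without any [ simply get a missing value in the second column.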

To split on the last [ instead, use rsplit:

df[['protein','species']] =  df['anno'].str.rstrip(']').str.rsplit('[', expand=True, n=1)

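Another solution is extract, which captures the contents of the last [] pair (note the raw string, so the backslashes reach the regex engine unchanged):

df[['protein','species']] = df['anno'].str.extract(r'(.*)\[(.*)\]', expand=True)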
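One thing to watch with the extract pattern: it requires a bracket pair, so a row without one comes back missing in both columns. If only the protein name is needed, one possible alternative (a sketch, not part of the original answer) is to keep the text before the first [ and skip the second column entirely:

```python
import pandas as pd

# Small sample frame; the "protein" column name is just illustrative.
df = pd.DataFrame({"anno": ["protein [q][sd]", "protein", "Transposase [KT71]", None]})

# extract needs "[...]" to match, so the bracket-free row yields only missing values.
extracted = df["anno"].str.extract(r"(.*)\[(.*)\]", expand=True)
print(extracted.iloc[1].isna().all())  # True

# Take the piece before the first "[" and trim surrounding whitespace.
df["protein"] = df["anno"].str.split("[", n=1).str[0].str.strip()
print(df["protein"].tolist()[:3])  # ['protein', 'protein', 'Transposase']
```

The row whose anno is missing stays missing in the new column.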

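Sample:

df[['protein','species']] = df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)
# turn the "][" between adjacent bracket groups into a comma (regex=True is required in recent pandas)
df['species'] = df['species'].str.replace(r'\]\[', ',', regex=True)
df['protein'] = df['protein'].str.strip()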
print (df)
                 anno      protein species
0     protein [q][sd]      protein    q,sd
1             protein      protein    None
2  Transposase [KT71]  Transposase    KT71
3                None         None    None

Regarding python - Why can't I split a column into two columns in pandas?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46267181/
