I have a DataFrame column, df['anno'], that looks like this:
df.anno
0 type I secretion outer membrane protein, TolC...
1 conserved hypothetical protein [Shigella boyd...
2 Transposase [Congregibacter litoralis KT71]
3 Chain A, The Crystal Structure Of Chlorite Di...
4 chlorite dismutase, partial [uncultured bacte...
5 carbamoyl-phosphate synthase, small subunit [...
6 anthranilate synthase component 1 [endosymbio...
7 chlorite dismutase, partial [bacterium enrich...
8 peptidase dimerization domain protein [Myroid...
9 MULTISPECIES: MFS transporter [Enterobacteria...
10 CAAX amino terminal protease family protein [...
11 Fe-S oxidoreductase [Desulfovibrio africanus ...
12 phosphoenolpyruvate synthase/pyruvate phospha...
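Each row has two parts: 1. the protein name; 2. the microbial species, enclosed in "[...]".
I want to extract the protein name and discard the species, so I decided to first split the column into two at the "[" position.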
df2 = pd.DataFrame(df.anno.str.split("[", 1).tolist(), columns = ['protein','species'])
It returns an error:
TypeError: object of type 'NoneType' has no len()
I also tried:
df[['protein','species']] = df['anno'].str.split('[', expand=True)
It also returns an error:
ValueError: Columns must be same length as key
How can I do this? Is there another way to extract the protein name? Thanks!
Best answer
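I think the problem is that some rows contain more than one [, so the split produces more than two columns. Add n=1 to split so that it splits only at the first [. To remove the trailing ], use rstrip: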
df[['protein','species']] = df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)
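To see why n=1 matters, here is a minimal sketch. The two rows are made-up sample data (not taken from the question), chosen so that one row contains two [ characters; the names unlimited and limited are only for illustration.

```python
import pandas as pd

# Hypothetical sample data: the first row contains two "[" characters.
df = pd.DataFrame({'anno': ['Chain A, protein [x] fragment [Some species]',
                            'Transposase [Congregibacter litoralis KT71]']})

# Without n, every "[" is a split point, so the result has three columns
# and cannot be assigned to the two target columns.
unlimited = df['anno'].str.split('[', expand=True)

# With n=1, only the first "[" is used, so there are at most two columns.
limited = df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)

print(unlimited.shape[1])  # 3
print(limited.shape[1])    # 2
```

Assigning the three-column result to df[['protein','species']] is what triggers "Columns must be same length as key".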
To split at the last [ instead, use rsplit:
df[['protein','species']] = df['anno'].str.rstrip(']').str.rsplit('[', expand=True, n=1)
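The difference between the two calls only shows up when a row has more than one [. A small sketch on one made-up row (the string and the names first and last are assumptions for illustration):

```python
import pandas as pd

# Hypothetical row with two bracketed parts.
s = pd.Series(['protein [q] [sd]'])

# split with n=1 cuts at the first "[" (counting from the left).
first = s.str.rstrip(']').str.split('[', expand=True, n=1)

# rsplit with n=1 cuts at the last "[" (counting from the right).
last = s.str.rstrip(']').str.rsplit('[', expand=True, n=1)

print(first.iloc[0].tolist())  # ['protein ', 'q] [sd']
print(last.iloc[0].tolist())   # ['protein [q] ', 'sd']
```

Which one is appropriate depends on whether the protein name itself can contain brackets.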
Another solution is extract, which captures the contents of the last [] pair:
df[['protein','species']] = df['anno'].str.extract(r'(.*)\[(.*)\]', expand=True)
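One thing to keep in mind with extract: a row that has no [...] at all does not match the pattern, so both columns come back missing for that row. A rough sketch on made-up rows (the Series anno and the column labels assigned here are assumptions for illustration):

```python
import pandas as pd

# Hypothetical rows: one bracket pair, two bracket pairs, no brackets.
anno = pd.Series(['Transposase [Congregibacter litoralis KT71]',
                  'protein [q][sd]',
                  'protein'])

# The first (.*) is greedy, so the pattern locks onto the LAST "[...]" pair.
extracted = anno.str.extract(r'(.*)\[(.*)\]', expand=True)
extracted.columns = ['protein', 'species']

print(extracted['species'].tolist()[:2])     # ['Congregibacter litoralis KT71', 'sd']
print(extracted['protein'].isna().tolist())  # [False, False, True]
```

If rows without a species should keep their protein name, the split-based approach is probably the safer choice.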
Sample:
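import pandas as pd

df = pd.DataFrame({'anno': ['protein [q][sd]', 'protein', 'Transposase [KT71]', None]})
df[['protein','species']] = df['anno'].str.rstrip(']').str.split('[', expand=True, n=1)
# regex=False replaces the literal text "][" (no escaping needed)
df['species'] = df['species'].str.replace('][', ',', regex=False)
df['protein'] = df['protein'].str.strip()
print(df)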
                 anno      protein species
0     protein [q][sd]      protein    q,sd
1             protein      protein    None
2  Transposase [KT71]  Transposase    KT71
3                None         None    None
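Since the goal in the question is only the protein name, the species column can be skipped entirely. This is a sketch of one possible variant, not part of the original answer; it reuses the sample values from the output above and assumes the name never contains a [ itself:

```python
import pandas as pd

df = pd.DataFrame({'anno': ['protein [q][sd]', 'protein',
                            'Transposase [KT71]', None]})

# Keep the text before the first "[" and trim surrounding whitespace.
# Rows with missing values stay missing.
df['protein'] = df['anno'].str.split('[', n=1).str[0].str.strip()

print(df['protein'].tolist()[:3])  # ['protein', 'protein', 'Transposase']
```

Without expand=True, split returns a list per row, and .str[0] picks the first element of each list.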
Regarding "python - Why can't I split a column into two columns in pandas?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46267181/