I have a dataset of many sentences. Taking the sentence below as an example, I want to split it into 2 sub-sentences:
Both whole plasma and the d < 1.006 g/ml density fraction of plasma from 2/2 mice show this broad beta-migration pattern (Fig. 1 B) |T:**1SP3E3| ; |I:**1SP3E3| |L:**1SP3E3| in contrast, 3/3 plasma shows virtually no lipid staining at the beta-position. |T:**1SN3E3| |I:**1SN3E3| |L:**1SN3E3|
It should be split into:
Both whole plasma and the d < 1.006 g/ml density fraction of plasma from 2/2 mice show this broad beta-migration pattern (Fig. 1 B)
and
in contrast, 3/3 plasma shows virtually no lipid staining at the beta-position.
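My code is:
import re

newData = []
for item in Data:
    # split on runs of |...| tags that follow a space
    test2 = re.split(r" (?:\|.*?\| ?)+", item[0])
    test2 = test2[:-1]  # drop the trailing empty string
    for tx in test2:
        newData.append(tx)
print(len(newData))
print(newData)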
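However, I get 3 items in the result, including a lone ";". I checked the original sentence and found that the ";" sits between |T:**1SP3E3| and |I:**1SP3E3|, so I need to remove it from the result. I changed my code to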
test2= re.split(r" (?:\|.*?\| ?;?)+", item[0])
But I still can't get the correct result. Can anyone help? Thanks a lot.
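For reference, here is a minimal sketch of what the modified pattern appears to do on the example sentence, and one possible adjustment. The names `sentence`, `attempt`, `TAG_RUN` and `parts` are illustrative and not from the original post. The `;?` does consume the semicolon, but the space after it ends the repetition, so the next tag run becomes a second match and an empty string is left between the two.

```python
import re

sentence = (
    "Both whole plasma and the d < 1.006 g/ml density fraction of plasma "
    "from 2/2 mice show this broad beta-migration pattern (Fig. 1 B) "
    "|T:**1SP3E3| ; |I:**1SP3E3| |L:**1SP3E3| in contrast, 3/3 plasma shows "
    "virtually no lipid staining at the beta-position. "
    "|T:**1SN3E3| |I:**1SN3E3| |L:**1SN3E3|"
)

# Modified pattern from the question: the first match stops right after ";",
# so an empty string appears between it and the following tag run.
attempt = re.split(r" (?:\|.*?\| ?;?)+", sentence)

# Assumed variant: each repetition may also swallow ";" plus an optional
# space, so the whole tag run (including the semicolon) is one separator.
TAG_RUN = re.compile(r" (?:\|.*?\| ?(?:; ?)?)+")

# Filter out the empty string left after the trailing tag run.
parts = [p for p in TAG_RUN.split(sentence) if p]
print(len(parts))
print(parts)
```

Filtering out empty strings replaces the `[:-1]` slice, which only works when the sentence happens to end with a tag run.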
Best answer
[i.strip() for i in re.sub(r'\|\w:\*\*\w*\|', '', re.sub(r' +', r' ', s.strip())).split(';')]
returns
['Both whole plasma and the d < 1.006 g/ml density fraction of plasma from 2/2 mice show this broad beta-migration pattern (Fig. 1 B)', 'in contrast, 3/3 plasma shows virtually no lipid staining at the beta-position.']
But be careful, because this only works if your text is consistent with your example.
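The one-liner can be unpacked into a small function, which makes the assumptions easier to see. This is only a sketch: `split_clauses`, `TAG` and `example` are hypothetical names, and it is a slight variant of the answer in that it collapses whitespace after removing the tags and drops empty pieces.

```python
import re

# Matches tags shaped like |T:**1SP3E3| (one word character, ":**", an ID).
TAG = re.compile(r"\|\w:\*\*\w*\|")

def split_clauses(s):
    """Remove |X:**...| tags, split on ';', and tidy whitespace."""
    without_tags = TAG.sub("", s)
    pieces = (re.sub(r"\s+", " ", p).strip() for p in without_tags.split(";"))
    # Drop empty pieces, e.g. from a trailing ';'.
    return [p for p in pieces if p]

example = "first clause |T:**1SP3E3| ; |I:**1SP3E3| |L:**1SP3E3| second clause. |T:**1SN3E3|"
print(split_clauses(example))
# ['first clause', 'second clause.']
```

As the caveat says, this assumes ";" only ever appears as a clause separator; a semicolon inside a clause would also cause a split.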
Regarding "python - Splitting a sentence into sub-sentences with the re package in Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34736440/