python - Removing word extensions in Python

Tags: python string

I have a text containing several words, and I want to remove the derivational endings from all of them. For example, I want to strip the suffixes -ed and -ing and keep the base verb, so that "verified" or "verifying" becomes "verify". I found Python's strip method, which removes specific characters from the beginning or end of a string, but that is not quite what I want. Is there a library that does something like this in Python?
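To illustrate why strip is the wrong tool here (a minimal sketch, not part of the original question): strip treats its argument as a set of characters, not as a suffix, and keeps removing any of those characters from either end of the string.

```python
# str.strip takes a *set of characters*, not a suffix:
# it removes any of them from both ends, repeatedly.
print("verified".strip("ed"))   # verifi  -- 'd' then 'e' stripped from the end
print("seed".strip("ed"))       # s       -- strips 'd', 'e', 'e' from the end
print("editing".strip("ed"))    # iting   -- also strips the leading 'e'!
```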

I tried the code from the suggested post, but I noticed that several words get trimmed oddly. For example, with the following text

 We goin all the way
 Think ive caught on to a really good song ! Im writing
 Lookin back on the stuff i did when i was lil makes me laughh
 I sneezed on the beat and the beat got sicka
 #nashnewvideo http://t.co/10cbUQswHR
 Homee
 So much respect for this man , truly amazing guy @edsheeran
 http://t.co/DGxvXpo1OM
 What a day ..
 RT @edsheeran: Having some food with @ShawnMendes
 #VoiceSave  christina
 Im gunna make the sign my signature pose
 You all are so beautiful .. soooo beautiful
 Thought that was a really awesome quote
 Beautiful things don't ask for attention

and after running the code below (where I also strip non-Latin characters and URLs), I get:

 we  goin  all  the  way 
 think  ive  caught  on  to  a  realli  good  song  im  write 
 lookin  back  on  the  stuff  i  did  when  i  wa  lil  make  me  laughh 
 i  sneez  on  the  beat  and  the  beat  got  sicka 
 nashnewvideo 
 home 
 so  much  respect  for  thi  man  truli  amaz  guy 
 what  a  day 
 rt  have  some  food  with 
 voicesav  christina 
 im  gunna  make  the  sign  my  signatur  pose 
 you  all  are  so  beauti  soooo  beauti 
 thought  that  wa  a  realli  awesom  quot 
 beauti  thing  dont  ask  for  attent 

For example, it trims "beautiful" to "beauti", "quote" to "quot", and "really" to "realli". My code is as follows:

 # Python 2; f is the CSV file object opened earlier
 import csv
 import re
 import string

 import nltk

 stemmer = nltk.stem.porter.PorterStemmer()
 lines = []
 reader = csv.reader(f)
 for row in reader:
     # drop @mentions and URLs
     text = re.sub(r"(?:@|https?://)\S+", "", row[2])
     # keep only printable ASCII characters (reassign the result)
     text = "".join(c for c in text if c in string.printable)
     # remove punctuation (Python 2 str.translate), then replace any
     # remaining non-word characters and digits with spaces
     out = text.translate(string.maketrans("", ""), string.punctuation)
     out = re.sub(r"[\W\d]", " ", out.strip())
     stems = [stemmer.stem(word.lower()) for word in out.split()]
     lines.append(" ".join(stems))
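For comparison, a naive regex that simply cuts a trailing -ed or -ing (a hedged sketch, not part of the code above) shows why plain suffix removal is not enough either: the stem it leaves behind is often not a dictionary word.

```python
import re

def naive_unsuffix(word):
    # Cut a single trailing -ing or -ed; purely illustrative.
    return re.sub(r"(?:ing|ed)$", "", word)

print(naive_unsuffix("verifying"))  # verify
print(naive_unsuffix("verified"))   # verifi  -- not the verb 'verify'
print(naive_unsuffix("writing"))    # writ    -- not the verb 'write'
```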

Best Answer

You can use a lemmatizer instead of a stemmer. Here is an example using Python's NLTK:

from nltk.stem import WordNetLemmatizer

s = """
 You all are so beautiful soooo beautiful
 Thought that was a really awesome quote
 Beautiful things don't ask for attention
 """

wnl = WordNetLemmatizer()
print " ".join([wnl.lemmatize(i) for i in s.split()]) #You all are so beautiful soooo beautiful Thought that wa a really awesome quote Beautiful thing don't ask for attention

In some cases it may not do what you expect:

print wnl.lemmatize('going') # going

This is because lemmatize defaults to the noun part of speech; pass pos='v' to lemmatize the word as a verb:

print wnl.lemmatize('going', pos='v') # go

You can then combine the two approaches: stemming and lemmatization.
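One way that combination could look, sketched with stand-ins (a tiny irregular-form dictionary plays the lemmatizer and a crude suffix rule plays the stemmer fallback; the names and dictionary are illustrative, not from NLTK):

```python
# Toy lemma dictionary standing in for a real lemmatizer lookup.
LEMMAS = {"was": "be", "going": "go", "caught": "catch"}

def normalize(word):
    word = word.lower()
    # 1) Try the "lemmatizer" first: exact dictionary lookup.
    if word in LEMMAS:
        return LEMMAS[word]
    # 2) Fall back to "stemming": strip a trailing -ing/-ed, but only
    #    if enough of the word remains to be a plausible stem.
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(normalize("Going"))   # go    (irregular, caught by the dictionary)
print(normalize("jumped"))  # jump  (regular, handled by the suffix rule)
print(normalize("red"))     # red   (too short to strip -ed)
```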

On "python - Removing word extensions in Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23732057/
