python - Extract 2-gram strings from a column that appear in a list

I have a dataframe named df

Gender  Country      Comments
male    USA        machine learning and fraud detection are a must learn
male    Canada     monte carlo method is great and so is hmm,pca, svm and neural net
female  USA        clustering and cloud computing
female  Germany    logistical regression and data management and fraud detection
female  Nigeria    nltk and supervised machine learning
male    Ghana      financial engineering and cross validation and time series

and a list named algorithms

algorithms = ['machine learning','fraud detection', 'monte carlo method', 'time series', 'cross validation', 'supervised machine learning', 'logistical regression', 'nltk','clustering', 'data management','cloud computing','financial engineering']

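So technically, for each row of the 'Comments' column, I'm trying to extract the words that appear in the algorithms list. This is what I'm trying to achieve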

Gender  Country      algorithms
male    USA        machine learning, fraud detection 
male    Canada     monte carlo method, hmm,pca, svm, neural net
female  USA        clustering, cloud computing
female  Germany    logistical regression, data management, fraud detection
female  Nigeria    nltk, supervised machine learning
male    Ghana      financial engineering, cross validation, time series

However, this is what I get

Gender  Country      algorithms
male    USA         
male    Canada     hmm pca svm  
female  USA        clustering
female  Germany    
female  Nigeria    nltk
male    Ghana

Terms like machine learning and fraud detection don't show up. Basically, none of the 2-gram (multi-word) terms do.

This is the code I used

df['algorithms'] = df['Comments'].apply(lambda x: " ".join(x for x in x.split() if x in algorithms))
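
A minimal sketch of why the line above drops the multi-word terms: `split()` produces single words, and no single word can equal a two-word entry such as 'machine learning'. The shortened list and the sample comment here are illustrative only, and the substring check is just one way to see the difference, not a recommended fix (it can also match inside longer words).

```python
algorithms = ["machine learning", "fraud detection", "nltk"]
comment = "nltk and supervised machine learning"

# Word-by-word membership test, as in the question's lambda
tokens = comment.split()
single_word_hits = [t for t in tokens if t in algorithms]

# Phrase-level test: does each list entry occur anywhere in the comment?
phrase_hits = [a for a in algorithms if a in comment]

print(single_word_hits)
print(phrase_hits)
```

Only the one-word entry survives the word-by-word test, which matches the output shown in the question.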

Best answer

You can use pandas.Series.str.findall in combination with join.

import pandas as pd
import re

# ml is the list of phrases to search for; algo is the text column
df['algo_new'] = df.algo.str.findall(f"({ '|'.join(ml) })")

>> out

    col1    gender  algo                                                algo_new
0   usa     male    machine learning and fraud detection are a mus...   [machine learning, fraud detection, clustering]
1   fr      female  monte carlo method is great and so is hmm,pca,...   [monte carlo method]
2   arg     male    logistical regression and data management and ...   [logistical regression, data management, fraud..

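We use join to concatenate the strings in the ml list, adding a | between each string so the pattern captures value 1 OR value 2, and so on. Then we use findall to find all occurrences.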
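
The same idea can be sketched against the question's own column names. This is an illustration under a few assumptions rather than the answer's exact code: it uses a shortened copy of the data, escapes each phrase with `re.escape` in case a phrase contains regex metacharacters, sorts longer phrases first so a phrase is tried before any shorter phrase that is its prefix, and joins the matches into one string with `.str.join`.

```python
import re
import pandas as pd

# Shortened sample of the question's data
df = pd.DataFrame({
    "Gender": ["male", "female", "female"],
    "Country": ["USA", "USA", "Nigeria"],
    "Comments": [
        "machine learning and fraud detection are a must learn",
        "clustering and cloud computing",
        "nltk and supervised machine learning",
    ],
})
algorithms = ["machine learning", "fraud detection", "supervised machine learning",
              "nltk", "clustering", "cloud computing"]

# Longest phrases first, each escaped, joined with | to form "a OR b OR c"
ordered = sorted(algorithms, key=len, reverse=True)
pattern = "|".join(re.escape(a) for a in ordered)

# findall returns a list of matches per row; str.join turns each list into text
df["algorithms"] = df["Comments"].str.findall(pattern).str.join(", ")
print(df[["Gender", "Country", "algorithms"]])
```

Because the regex scans the whole comment instead of individual words, multi-word entries are matched as units.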

Note that this uses an f-string, so you need Python 3.6+. Let me know if you have an older version of Python.
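
For an older interpreter, one possible way to build the same pattern without an f-string is `str.format`. The `ml` list and the sample sentence below are placeholders for illustration, and plain `re.findall` stands in for the per-row work that `str.findall` does.

```python
import re

ml = ["machine learning", "fraud detection"]  # placeholder phrase list

# Same "(a|b|...)" pattern as the f-string version, built with str.format
pattern = "({})".format("|".join(ml))

matches = re.findall(pattern, "machine learning and fraud detection are a must learn")
print(pattern)
print(matches)
```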

For anyone interested in benchmarks (since we have 3 answers), running each solution on 9.6 million rows, 10 times in a row, gives the following results:

AlexK:
- Mean: 14.94 seconds
- Min: 12.43 seconds
- Max: 17.08 seconds
Teddy:
- Mean: 22.67 seconds
- Min: 18.25 seconds
- Max: 27.64 seconds
AbsoluteSpace:
- Mean: 24.12 seconds
- Min: 21.25 seconds
- Max: 27.53 seconds

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/55269359/
