python - How to speed up a list comprehension over a Pandas DataFrame

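Is there a faster way than a list comprehension to filter out words that appear in a set? On a large dataset the list comprehension is somewhat slow.
I have already converted list_stopwords to a set, which takes less time than using a list.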

             date      description
0        2018-07-18    payment receipt
1        2018-07-18    ogsg s.u.b.e.b june 2018 salar
2        2018-07-18    sal admin charge
3        2018-07-19    sms alert charge outstanding
4        2018-07-19    vat onverve*issuance 


import stop_words

list_stopwords = set(stop_words.get_stop_words('en'))

data['description'] = data['description'].apply(lambda x: " ".join([word for word in x.split() if word not in list_stopwords]))

Best answer

A regular expression may work faster.
First, build a regex that matches your stop words:


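import re

list_stopwords = set(stop_words.get_stop_words('en'))
# Join the escaped stop words into one alternation, longest first so that
# a word like "he's" is tried before "he": \b(?:word1|word2|...)\b
re_stopwords = r"\b(?:" + "|".join(re.escape(word) for word in sorted(list_stopwords, key=len, reverse=True)) + r")\b"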

Now apply it to the column:

data['description'] =  data['description'].apply(lambda x: re.sub(re_stopwords,'',x))

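This replaces every stop word with '' (the empty string). The spaces around a removed word stay in place, so doubled spaces can remain in the result.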
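As a minimal, self-contained sketch of the regex approach, the following uses a small hypothetical stop-word set (STOPWORDS) in place of stop_words.get_stop_words('en'), and a hypothetical helper remove_stopwords that also collapses the leftover whitespace. Neither name comes from the original answer.

```python
import re

# Hypothetical stand-in for stop_words.get_stop_words('en')
STOPWORDS = {"a", "an", "the", "for", "of", "he", "he's"}

# One alternation of escaped words, longest first so "he's" is tried before "he"
_pattern = re.compile(
    r"\b(?:"
    + "|".join(re.escape(w) for w in sorted(STOPWORDS, key=len, reverse=True))
    + r")\b"
)

def remove_stopwords(text):
    """Drop stop words, then collapse the whitespace they leave behind."""
    stripped = _pattern.sub("", text)
    return " ".join(stripped.split())

cleaned = remove_stopwords("payment receipt for the june salary")
print(cleaned)
```

With a DataFrame this could be applied as data['description'].apply(remove_stopwords). One difference from the split() version: \b also treats punctuation as a word boundary, so a stop word sitting next to a character such as * or . is removed, whereas split() only drops whole whitespace-separated tokens.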
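I believe it is faster because the regular expression operates on the string directly, whereas your code runs a Python-level loop over the pieces produced by split.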
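Whether the regex actually wins depends on the data and on the size of the stop-word list, so it is worth measuring. The sketch below times both approaches with timeit on hypothetical sample rows (ROWS) and a hypothetical stop-word set (STOPWORDS); it makes no claim about which one comes out ahead.

```python
import re
import timeit

# Hypothetical stop words and sample rows, for illustration only
STOPWORDS = {"a", "an", "the", "for", "of", "in", "on", "to"}
ROWS = [
    "payment receipt for the month of june",
    "sms alert charge on the account",
] * 5000

pattern = re.compile(
    r"\b(?:"
    + "|".join(re.escape(w) for w in sorted(STOPWORDS, key=len, reverse=True))
    + r")\b"
)

def with_split(rows):
    # Original approach: split each row and keep the words not in the set
    return [" ".join([w for w in row.split() if w not in STOPWORDS]) for row in rows]

def with_regex(rows):
    # Regex approach: delete matches, then collapse the leftover whitespace
    return [" ".join(pattern.sub("", row).split()) for row in rows]

split_seconds = timeit.timeit(lambda: with_split(ROWS), number=5)
regex_seconds = timeit.timeit(lambda: with_regex(ROWS), number=5)
print(f"split: {split_seconds:.3f}s  regex: {regex_seconds:.3f}s")
```

Running the same comparison on a sample of the real description column, with the full stop-word list, gives a more reliable answer than either guess.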

To learn more about the regular expression library: w3schools.

More on the \b expression: regular-expressions.

This question, "python - How to speed up a list comprehension over a Pandas DataFrame", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/67164182/
