python - Replacing punctuation in a data frame based on a punctuation list

Using Canopy and Pandas, I have a data frame a defined as:

a=pd.read_csv('text.txt')

df=pd.DataFrame(a)

df.columns=["test"]

text.txt is a single-column file containing a list of strings made up of text, digits and punctuation.

Suppose df looks like:

test

%hgh&12

abc123!!!

porkyfries

I want my result to be:

test

hgh12

abc123

porkyfries

My efforts so far:

from string import punctuation  # import Python's built-in punctuation list

a=pd.read_csv('text.txt')

df=pd.DataFrame(a)

df.columns=["test"]  # define the dataframe

for p in list(punctuation):
    df2=df.test.str.replace(p,'')
    df2=pd.DataFrame(df2)
    df2

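The command above basically just returns the same data set. Any leads are appreciated.

Edit: The reason I am using Pandas is that the data is large, spanning about 1 million rows, and this code will later be applied to lists of up to 30 million rows. Long story short, I need to clean the data for large data sets in a very efficient way.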

Best answer

It is easier to use replace with a suitable regular expression:

In [41]:

import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
         text
0        test
1     %hgh&12
2   abc123!!!
3  porkyfries

[4 rows x 1 columns]

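Use a regular expression whose pattern matches every character that is neither a word character (letter, digit or underscore) nor whitespace: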

In [49]:

# raw string avoids invalid-escape warnings; regex=True is required on pandas 2.0+
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
df
Out[49]:
         text
0        test
1       hgh12
2      abc123
3  porkyfries

[4 rows x 1 columns]
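
If the goal is to strip exactly the characters in Python's punctuation list, as the question title suggests, one possible approach is to build a character class from string.punctuation. The following is a minimal sketch rather than part of the original answer: the clean column and the snake_case sample row are added here purely for illustration. Note that \w matches the underscore, so the [^\w\s] pattern leaves _ in place, whereas this version removes it.

```python
import re
from string import punctuation

import pandas as pd

# Sample data from the question, plus one illustrative row with an underscore.
df = pd.DataFrame({'test': ['%hgh&12', 'abc123!!!', 'porkyfries', 'snake_case']})

# re.escape makes characters such as ']', '\\', '^' and '-' literal,
# so they are safe to place inside a character class.
pattern = '[' + re.escape(punctuation) + ']'

# A single vectorised pass removes every listed punctuation character.
df['clean'] = df['test'].str.replace(pattern, '', regex=True)
print(df)
```

Unlike the loop in the question, which starts again from the original column on every iteration and so only ever keeps the result for the last punctuation character, this applies all removals in a single call.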

This question, "python - Replacing punctuation in a data frame based on a punctuation list", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/21672514/
