python - How to extract sentences from a csv file based on keywords in another csv file and remove them from the main file

Tags: python string pandas csv

I have 2 csv files. One of them contains sentences, like this:

how are you

I want to die

I was home

I went to sleep at work

he has a bad reputation

it was me who went to him

have a good sleep home

The other csv file contains words with their frequencies, like this:

word freq

and 500

you 450

me 300

have 250

your 240

sleep 200

work 150

home 100

die 50

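I am trying to extract the sentences containing words whose frequency is between 300 and 100 into a new csv file, and to delete each sentence from the main csv file once it has been extracted, because I sometimes search again for new keywords or words. This is the code I managed to build, but it does not give me the output I want: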

import sys
import pandas as pd
import re
import string
if len(sys.argv) == 1:
    print("please provide a CSV file to analys")
else:
    fileinput = sys.argv[1]
    dic = sys.argv[2]

wdata = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(wdata.count(' ') == 0)
wdata = pd.read_csv(fileinput, names=['sentences'], skiprows=skip)

data = wdata['sentences'].str.replace('[^\w\s]', ' ')
keywords=pd.read_csv(dic)
keywords=keywords.loc[keywords['freq'].between(100, 300, inclusive=False), 'word']
df1 = data[data['sentences'].str.split(expand=True).isin(keywords).any(axis=1)]
#deleted rows by keywords
df2 = data[~data['sentences'].str.split(expand=True).isin(keywords).any(axis=1)]
print(df1)


I also don't know how to delete the sentences from the main file after extracting them. The output I expect looks like this:

[Image: expected output]

Best answer

I think you need Series.between to select the keywords:

# pandas >= 1.3 expects a string here; older versions used inclusive=False
keywords=keywords.loc[keywords['freq'].between(100, 300, inclusive='neither'), 'word']
print (keywords)
3     have
4     your
5    sleep
6     work
Name: word, dtype: object
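
The `inclusive` argument of Series.between has changed between pandas releases, so the same strict range can also be written with two explicit comparisons, which does not depend on the version. The sketch below is only an illustration: it rebuilds the frequency table from the question in memory, and the names `raw`, `mask` and `selected` are assumptions made for the example, not part of the original code.

```python
import io

import pandas as pd

# Word-frequency table copied from the question (whitespace-separated).
raw = """word freq
and 500
you 450
me 300
have 250
your 240
sleep 200
work 150
home 100
die 50
"""
keywords = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Strictly between 100 and 300: both endpoints are excluded,
# so "me" (300) and "home" (100) are dropped.
mask = (keywords["freq"] > 100) & (keywords["freq"] < 300)
selected = keywords.loc[mask, "word"]

print(selected.tolist())  # ['have', 'your', 'sleep', 'work']
```

If the endpoints should be kept, `>=` and `<=` would be used instead.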

Then filter the sentences with Series.str.split, DataFrame.isin and DataFrame.any:

df1 = data[data.str.split(expand=True).isin(keywords.tolist()).any(axis=1)]
print (df1)
3    I went to sleep at work
6     have a good sleep home
Name: sentences, dtype: object

#deleted rows by keywords
df2 = data[~data.str.split(expand=True).isin(keywords.tolist()).any(axis=1)]
print (df2)
0                  how are you
1                I want to die
2                   I was home
4      he has a bad reputation
5    it was me who went to him
Name: sentences, dtype: object
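
The answer stops at the two filtered Series. One possible way to finish the task (save the extracted sentences to a new file and remove them from the main file) is to write each half back with `to_csv`. This is a minimal sketch under stated assumptions: the file names `sentences.csv` and `extracted.csv` are hypothetical, the files live in a temporary directory, the main file has one sentence per line with no header, and the keyword list is hard-coded to the four words selected above.

```python
import os
import tempfile

import pandas as pd

sentences = [
    "how are you",
    "I want to die",
    "I was home",
    "I went to sleep at work",
    "he has a bad reputation",
    "it was me who went to him",
    "have a good sleep home",
]
keywords = ["have", "your", "sleep", "work"]  # words with 100 < freq < 300

workdir = tempfile.mkdtemp()
main_path = os.path.join(workdir, "sentences.csv")       # hypothetical name
extracted_path = os.path.join(workdir, "extracted.csv")  # hypothetical name

# Create the main file for the demo: one sentence per line, no header.
pd.Series(sentences).to_csv(main_path, index=False, header=False)

data = pd.read_csv(main_path, names=["sentences"])["sentences"]

# True for each sentence that contains at least one keyword as a whole word.
mask = data.str.split(expand=True).isin(keywords).any(axis=1)

# Matching sentences go to the new file; the rest overwrite the main file.
data[mask].to_csv(extracted_path, index=False, header=False)
data[~mask].to_csv(main_path, index=False, header=False)

# Read both files back to check the result.
remaining = pd.read_csv(main_path, names=["sentences"])["sentences"].tolist()
extracted = pd.read_csv(extracted_path, names=["sentences"])["sentences"].tolist()
print(extracted)
print(remaining)
```

Computing the mask once and reusing it with `~mask` keeps the two outputs complementary, so no sentence is lost or duplicated. Because the main file is overwritten in place, keeping a backup copy before running something like this on real data is advisable.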

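Regarding "python - How to extract sentences from a csv file based on keywords in another csv file and remove them from the main file", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60023876/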
