python - Problems trying to generate pandas DataFrame columns from a regex?

Tags: python regex python-3.x pandas nlp

I am working with several .txt files spread across a directory. From all of them, how should I extract specific words or blocks of text (i.e. sentences, paragraphs and tokens defined by a regex) and put them into a pandas DataFrame (i.e. a tabular format), keeping one column with each file's name? So far I created this function to do the task (I know... it is not perfect):

In:

import glob, os, re
import pandas as pd
regex = r'\<the regex>\b'
ind = 'path/dir'
out = 'path/dir'
f = 'path/redirected/output/'


def foo(ind, reg, out):
    for filename in glob.glob(os.path.join(ind, '*.txt')):
        with open(filename, 'r') as file:
            stuff = re.findall(reg, file.read(), re.M)

            # keep every other element of each match tuple
            lis = [t[::2] for t in stuff]
            cont = ' '.join(map(str, lis))
            print(cont)
            with open(out, 'a') as f:
                print(filename.split('/')[-1] + '\t' + cont, file=f)


foo(ind, regex, out)

The output is then redirected into a third file:

Out:

fileName1.txt       
fileName2.txt       stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk
fileName3.txt       stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk
....
fileNameN.txt       stringOrChunk

And this is how I create the DataFrame from the previous file (yes, I know it is ugly):

import pandas as pd
df = pd.read_csv('/path/of/f/', sep='\t', names=['file_names', 'col1'])
df.to_csv('/pathOfNewCSV.csv', index=False, sep='\t')

Finally:

    file_names  col1
0   fileName1.txt   NaN
1   fileName2.txt   stringOrChunk stringOrChunk stringOrChunk...
2   fileName3.txt   stringOrChunk stringOrChunk stringOrChunk...
3   fileName4.txt   stringOrChunk
.....
N   fileNameN.txt   stringOrChunk

So, any idea of how to do this in a more Pythonic and efficient way?
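One way to make this more direct (a sketch, not the asker's actual setup) is to skip the intermediate tab-separated file and build the DataFrame in memory, one row per file. The throwaway corpus created with `tempfile` below is only there so the snippet runs on its own; in practice you would point `directory` at your own .txt files:

```python
import glob, os, re, tempfile

import pandas as pd

# Build a tiny throwaway corpus so the sketch is self-contained.
directory = tempfile.mkdtemp()
samples = {'doc1.txt': 'NOTHING HERE', 'doc2.txt': 'DIRECTLY AND PROBABLY EARLY'}
for name, text in samples.items():
    with open(os.path.join(directory, name), 'w') as fh:
        fh.write(text)

def extract_to_frame(directory, pattern):
    """Collect regex matches per file and return them as a DataFrame."""
    rows = []
    for path in sorted(glob.glob(os.path.join(directory, '*.txt'))):
        with open(path) as fh:
            matches = re.findall(pattern, fh.read(), re.M)
        rows.append({'file_names': os.path.basename(path),
                     'col1': ' '.join(matches) or None})
    return pd.DataFrame(rows, columns=['file_names', 'col1'])

df = extract_to_frame(directory, r'\w+LY\b')
print(df)
```

This also removes the need for the `read_csv` round-trip, since the frame is built from the dict rows directly.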

UPDATE

I uploaded a .zip with some documents as data, so if we wanted to extract all the adverbs from the documents we would do:

a_regex = r"\w+ly"
directory = '/Users/user/Desktop/Docs/'
output_dir = '/Users/user/Desktop/'

foo(directory, a_regex, output_dir)

Then, it should create a table with all the adverbs of the documents:

Files            words
doc1.txt    
doc2.txt    
doc3.txt     DIRECTLY PROBABLY EARLY 
doc4.txt    

Any idea of how to enhance the function above? Also, I do not know if this is the best way of doing this information extraction task (i.e. using just regexes). What about using a string indexer like the Whoosh project, or nltk?

UPDATE

For example, consider creating a dataframe by extracting all the sentences that contain the word JESUITS:

    Files   words1  words2  words3  words4
0   doc1.txt    A GOVERNMENT SPOKESMAN HAS ANNOUNCED THAT WITH...   NaN     NaN     NaN
1   doc2.txt    11/12/98 "THERE WAS NO TORTURE OR MISTREATMENT...   NaN     NaN     NaN
2   doc3.txt    WHAT WE HAD PREDICTED HAS OCCURRED. CRISTIANI ...   SO, THE QUESTION IS: WHO GAVE THE ORDER TO KIL...   THE MASSACRE OF THE JESUITS WAS NOT A PERSONAL...   LET US REMEMBER THAT AFTER THE MASSSACRE OF TH...
3   doc4.txt    IN 11/12/98 OUR VIEW, THE ASSASSINS OF THE JES...   THE ASSASSINATION OF THE JESUITS AGAIN CONFIRM...   NaN     NaN
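A sketch of how such a table could be built (the in-memory `docs` dict and the naive sentence splitter are illustrative assumptions, not the asker's data): collect every sentence containing the keyword per file, then let pandas pad the ragged rows with NaN.

```python
import re

import pandas as pd

# Hypothetical in-memory documents standing in for the .txt files.
docs = {
    'doc1.txt': 'A SPOKESMAN SPOKE. NOTHING RELEVANT HERE.',
    'doc2.txt': 'THE JESUITS WERE MENTIONED HERE. ANOTHER LINE. THE JESUITS AGAIN.',
}

keyword = 'JESUITS'
# Naive sentence split on ., ! or ? -- fine for a sketch, not for real NLP.
sentence_re = re.compile(r'[^.!?]*\b%s\b[^.!?]*[.!?]?' % re.escape(keyword))

rows = {name: [s.strip() for s in sentence_re.findall(text)]
        for name, text in docs.items()}
# One row per file; pandas pads shorter rows with NaN automatically.
df = pd.DataFrame.from_dict(rows, orient='index')
df.columns = ['words%d' % (i + 1) for i in range(df.shape[1])]
df.index.name = 'Files'
print(df.reset_index())
```

For serious sentence splitting, `nltk.sent_tokenize` would be a more robust replacement for the regex above.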

Best Answer

I am not entirely sure I understood the question, but the snippet here is my best effort to solve it with nltk:

from glob import glob
from os.path import join, split

import nltk
import pandas as pd

dir_name = '/tmp/stackovflw/Docs'
file_to_adverb_dict = {}
nltk_adverb_tags = {'RB', 'RBR', 'RBS'}  # taken from nltk.help.upenn_tagset()

for full_file_path in glob(join(dir_name, '*.txt')):
    with open(full_file_path, 'r') as f:
        _, file_name = split(full_file_path)
        tokens = nltk.word_tokenize(f.read().lower())  # lower() first -- nltk seems to behave differently on all-uppercase text, try it...
        adverbs_in_file = [token for token, tag in nltk.pos_tag(tokens) if tag in nltk_adverb_tags]
        # consider using a "set" here to remove duplicates
        file_to_adverb_dict[file_name] = ' '.join(adverbs_in_file).upper()  # converting back to uppercase (your input is all uppercase)

print(pd.DataFrame(list(file_to_adverb_dict.items()), columns=['file_names', 'col1']))
#   file_names                                               col1
# 0   doc4.txt  PROBABLY ABROAD ALFONSO HOWEVER ALWAYS ALREADY...
# 1   doc1.txt                                                NOT
# 2   doc3.txt  DIRECTLY NOT SO SOLELY NOT PROBABLY NOT EVEN N...
# 3   doc2.txt
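Picking up the "set" comment above: if duplicate adverbs should be dropped while keeping first-seen order, `dict.fromkeys` does it in one line (a plain set would lose the order). A tiny standalone illustration:

```python
# De-duplicate while preserving first occurrence (dicts keep
# insertion order in Python 3.7+; a plain set would not).
adverbs = ['NOT', 'PROBABLY', 'NOT', 'EARLY', 'PROBABLY']
unique_adverbs = list(dict.fromkeys(adverbs))
print(' '.join(unique_adverbs))  # NOT PROBABLY EARLY
```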

Note that if you just want to find words ending in "ly" in a specific folder, grep is your friend:

$ grep  -o -i -E  '\w+ly' *.txt
doc3.txt:DIRECTLY
doc3.txt:SOLELY
doc3.txt:PROBABLY
doc3.txt:EARLY
doc4.txt:PROBABLY

-o  print only the matched parts, not the whole line
-i  ignore case
-E  use extended regular expressions

Reducing by file name with awk:

 $ grep  -o -i -E  '\w+ly' *.txt | awk -F':' '{a[$1]=a[$1] " "  $2}END{for( i in a ) print  i,"," a[i]}'
doc4.txt , PROBABLY
doc3.txt , DIRECTLY SOLELY PROBABLY EARLY

Regarding "python - Problems trying to generate pandas DataFrame columns from a regex?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40073691/
