python - TypeError when running a large function in Python

Tags: python csv split

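I'm trying to run a large function that processes a large text file by splitting it into speakers and their speeches, and then breaking each speech down further into its component paragraphs. Here is the code: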

import os
import re
import csv
from bs4 import BeautifulSoup

def driver(folder, input_filename, output_filename1, output_filename2):
    os.chdir(folder)
    with open(input_filename, 'r') as f:
        Hearing = f.read()
    hearing = BeautifulSoup(Hearing)
    hearing = hearing.get_text()
    hearing = hearing.split("RESPONSE TO WRITTEN")
    str (hearing)
    speakers = re.findall("\\n    Mr. [A-Z][a-z]+\.|\\n    Ms. [A-Z][a-z]+\.|\\n    Congressman [A-Z][a-z]+\.|\\n   Congresswoman [A-Z][a-z]+\.|\\n   Chairwoman [A-Z][a-z]+\.|\\n   Chairman [A-Z][a-z]+\.", hearing)
    speakers = list(set(speakers))
    #print speakers
    position = []
    for speaker in speakers:
        x = hearing.find(speakers)
        position.append(x)
        def find_speaker(hearing, speakers):
            position = []
            for speaker in speakers:
                x = hearing.find(speaker)
                if x==-1:
                    x += 1000000
                position.append(x)
                first = min(position)
                name = speakers[position.index(min(position))]
            name_length = len(name)
            chunk = [name, hearing[0:first], hearing[first+name_length:]]
            #return chunk
            chunks = []
            #print hearing
            names = []
            while len(hearing)>10:
                chunk_try = find_speaker(hearing, speakers)
                hearing = chunk_try[2]
                chunks.append(chunk_try[1])
                names.append(chunk_try[0].strip())
                print len(hearing)#0
                chunks.append(hearing)
                chunks = chunks[1:]
                print len(names) 
                print len(chunks)
                data = zip(names, chunks)
                with open(output_filename1,'wb') as f:
                    w=csv.writer(f)
                    w.writerow(['Speaker','Speech'])
                    for row in data:
                        w.writerow(row)
                        paragraphs = str(chunks)
                        print (paragraphs)
                        Paragraphs = paragraphs.split("\\n")
                        data1 = zip(Paragraphs)
                        with open(output_filename2,'wb') as f:
                            w=csv.writer(f)
                            w.writerow(['Paragraphs'])
                            for row in data1:
                                w.writerow(row)
                                return True 
driver("C:/Users/Documents/Congressional Hearings/NHTF Project/Test Set", 'CHRG-107hhrg70750.htm', 'CHRG-107hhrg70750.csv', 'Paragraphs.csv')

However, when I run the driver function, I get the following error:

Traceback (most recent call last):
  File "<pyshell#159>", line 1, in <module>
    driver("C:/Users/mboogie/Documents/Congressional Hearings/NHTF Project/Test Set", 'CHRG-107hhrg70750.htm', 'CHRG-107hhrg70750.csv', 'Paragraphs.csv')
  File "<pyshell#158>", line 9, in driver
    speakers = re.findall("\\n    Mr. [A-Z][a-z]+\.|\\n    Ms. [A-Z][a-z]+\.|\\n    Congressman [A-Z][a-z]+\.|\\n   Congresswoman [A-Z][a-z]+\.|\\n   Chairwoman [A-Z][a-z]+\.|\\n   Chairman [A-Z][a-z]+\.", hearing)
  File "C:\Python27\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

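I think this means that "hearing" is not a string, but when I tried str(hearing) it did not resolve the error. I'm also confused about why the traceback refers to three separate lines of code. Any advice would be greatly appreciated; I've been stuck on this for a while!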

Best answer

The structure of your code is a bit confusing, but I'll do my best to explain what is going on.

When you reach this line:

speakers = re.findall("\\n    Mr. [A-Z][a-z]+\.|\\n    Ms. [A-Z][a-z]+\.|\\n    Congressman [A-Z][a-z]+\.|\\n   Congresswoman [A-Z][a-z]+\.|\\n   Chairwoman [A-Z][a-z]+\.|\\n   Chairman [A-Z][a-z]+\.", hearing)

hearing is a list, because you split it with str.split two lines above:

hearing = hearing.split("RESPONSE TO WRITTEN")

So you get the error because re.findall does not accept a list as its second argument. Instead, it needs a string or buffer.
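
The failure can be reproduced in a few lines. This is only a minimal sketch: the sample text and the shortened pattern are made-up stand-ins, not the asker's real file or regex. On Python 3 the message reads "expected string or bytes-like object" rather than "expected string or buffer", but the cause is the same.

```python
import re

# Hypothetical stand-in for the hearing text (not the asker's real file).
text = "\n    Mr. Smith. Thank you.\nRESPONSE TO WRITTEN questions\n    Ms. Jones. Yes."

# str.split always returns a list, even when the separator occurs only once.
parts = text.split("RESPONSE TO WRITTEN")
print(type(parts).__name__)

# Shortened, illustrative version of the speaker pattern.
pattern = r"\n    (?:Mr|Ms)\. [A-Z][a-z]+\."

try:
    re.findall(pattern, parts)  # a list is not an accepted second argument
except TypeError as exc:
    error_message = str(exc)
    print("TypeError: %s" % error_message)

# Passing a single element of the list, which is a string, works.
first_part_speakers = re.findall(pattern, parts[0])
print(first_part_speakers)
```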

Now, that is the problem. The solution is to make the second argument of re.findall a string. Where that string comes from depends on what you want to do.
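
For instance, if the goal were to collect speakers from every part of the split text, one possible approach is to run the search on each string in the list and merge the results. The sketch below assumes exactly that goal; find_speakers, SPEAKER_PATTERN and the sample text are hypothetical names and data introduced for illustration, and the pattern is a simplified version of the one in the question.

```python
import re

# Simplified, illustrative speaker pattern (an assumption, not the original regex).
SPEAKER_PATTERN = re.compile(
    r"\n +(?:Mr\.|Ms\.|Congressman|Congresswoman|Chairwoman|Chairman) [A-Z][a-z]+\."
)


def find_speakers(parts):
    """Return the unique speaker labels found in any of the text parts."""
    found = set()
    for part in parts:  # each part is a plain string, which findall accepts
        found.update(match.strip() for match in SPEAKER_PATTERN.findall(part))
    return sorted(found)


# Hypothetical sample text standing in for the hearing transcript.
sample = (
    "\n    Mr. Smith. Good morning.\n    Chairman Brown. Thank you.\n"
    "RESPONSE TO WRITTEN QUESTIONS\n    Ms. Jones. My answer follows.\n"
    "    Mr. Smith. Agreed."
)

parts = sample.split("RESPONSE TO WRITTEN")
speakers = find_speakers(parts)
print(speakers)
```

If only the text before the "RESPONSE TO WRITTEN" marker matters, passing parts[0] to re.findall would be enough.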

Judging from this line:

str (hearing)

I think you wanted to turn the list hearing into a string representation of itself. If so, you need to reassign hearing like this:

hearing = str(hearing)
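
The difference between the bare call and the reassignment can be checked with a small stand-in list (the values below are made up for illustration). The sketch also shows what the resulting string looks like: it is the list's printed form, with brackets, quotes, and newlines written as a literal backslash followed by n.

```python
# Hypothetical stand-in for the list produced by str.split.
hearing = ["first part\n", "second part"]

str(hearing)  # builds a string, but the result is thrown away
still_list = isinstance(hearing, list)
print(still_list)

hearing = str(hearing)  # rebinding the name keeps the string
print(hearing)

# The string holds the two characters backslash and n, not a real newline.
has_real_newline = "\n" in hearing
has_escaped_newline = "\\n" in hearing
print(has_real_newline, has_escaped_newline)
```

Because of that escaping, a pattern written to match real newlines may find nothing in the converted string. If the speaker search depends on line breaks, applying re.findall to the original text, or to each element of the list, is probably closer to what was intended.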

This Q&A, "python - TypeError when running a large function in Python", is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/20385262/
