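I am trying to run a large function that processes large text files by splitting them into speakers and their speeches, and then breaking the speeches down further into their component paragraphs. Here is the code: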
import os
import re
import csv
from bs4 import BeautifulSoup

def driver(folder, input_filename, output_filename1, output_filename2):
    os.chdir(folder)
    with open(input_filename, 'r') as f:
        Hearing = f.read()
    hearing = BeautifulSoup(Hearing)
    hearing = hearing.get_text()
    hearing = hearing.split("RESPONSE TO WRITTEN")
    str (hearing)
    speakers = re.findall("\\n Mr. [A-Z][a-z]+\.|\\n Ms. [A-Z][a-z]+\.|\\n Congressman [A-Z][a-z]+\.|\\n Congresswoman [A-Z][a-z]+\.|\\n Chairwoman [A-Z][a-z]+\.|\\n Chairman [A-Z][a-z]+\.", hearing)
    speakers = list(set(speakers))
    #print speakers
    position = []
    for speaker in speakers:
        x = hearing.find(speakers)
        position.append(x)
    def find_speaker(hearing, speakers):
        position = []
        for speaker in speakers:
            x = hearing.find(speaker)
            if x==-1:
                x += 1000000
            position.append(x)
        first = min(position)
        name = speakers[position.index(min(position))]
        name_length = len(name)
        chunk = [name, hearing[0:first], hearing[first+name_length:]]
        #return chunk
    chunks = []
    #print hearing
    names = []
    while len(hearing)>10:
        chunk_try = find_speaker(hearing, speakers)
        hearing = chunk_try[2]
        chunks.append(chunk_try[1])
        names.append(chunk_try[0].strip())
    print len(hearing)#0
    chunks.append(hearing)
    chunks = chunks[1:]
    print len(names)
    print len(chunks)
    data = zip(names, chunks)
    with open(output_filename1,'wb') as f:
        w=csv.writer(f)
        w.writerow(['Speaker','Speech'])
        for row in data:
            w.writerow(row)
    paragraphs = str(chunks)
    print (paragraphs)
    Paragraphs = paragraphs.split("\\n")
    data1 = zip(Paragraphs)
    with open(output_filename2,'wb') as f:
        w=csv.writer(f)
        w.writerow(['Paragraphs'])
        for row in data1:
            w.writerow(row)
    return True

driver("C:/Users/Documents/Congressional Hearings/NHTF Project/Test Set", 'CHRG-107hhrg70750.htm', 'CHRG-107hhrg70750.csv', 'Paragraphs.csv')
However, when I run the driver function, I get the following error:
Traceback (most recent call last):
  File "<pyshell#159>", line 1, in <module>
    driver("C:/Users/mboogie/Documents/Congressional Hearings/NHTF Project/Test Set", 'CHRG-107hhrg70750.htm', 'CHRG-107hhrg70750.csv', 'Paragraphs.csv')
  File "<pyshell#158>", line 9, in driver
    speakers = re.findall("\\n Mr. [A-Z][a-z]+\.|\\n Ms. [A-Z][a-z]+\.|\\n Congressman [A-Z][a-z]+\.|\\n Congresswoman [A-Z][a-z]+\.|\\n Chairwoman [A-Z][a-z]+\.|\\n Chairman [A-Z][a-z]+\.", hearing)
  File "C:\Python27\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
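I think this means that "hearing" is not a string, but when I tried str(hearing) it did not resolve the error. I am also confused about why the traceback refers to three separate lines of code. Any suggestions would be much appreciated. I have been stuck on this for a while!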
Best answer
The structure of your code is a little confusing, but I will do my best to explain what is happening.
When you reach this line:
speakers = re.findall("\\n Mr. [A-Z][a-z]+\.|\\n Ms. [A-Z][a-z]+\.|\\n Congressman [A-Z][a-z]+\.|\\n Congresswoman [A-Z][a-z]+\.|\\n Chairwoman [A-Z][a-z]+\.|\\n Chairman [A-Z][a-z]+\.", hearing)
hearing is a list, because you split it with str.split two lines above:
hearing = hearing.split("RESPONSE TO WRITTEN")
So you get the error because re.findall does not accept a list as its second argument. It requires a string or buffer instead.
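As a rough reproduction of the error, here is a minimal sketch. The sample text and the shortened pattern are made up for illustration and are not taken from the hearing file; the point is only that str.split returns a list and re.findall rejects it.

```python
import re

# Hypothetical sample text standing in for the hearing transcript.
text = "\n    Mr. Smith. Thank you.\n    Ms. Jones. I agree. RESPONSE TO WRITTEN QUESTIONS"

# str.split always returns a list, even when the separator occurs only once.
parts = text.split("RESPONSE TO WRITTEN")

# Simplified pattern (an assumption, shorter than the one in the question).
pattern = r"\n\s+(?:Mr|Ms)\. [A-Z][a-z]+\."

error = None
try:
    re.findall(pattern, parts)  # a list as the second argument
except TypeError as exc:
    error = exc
    print("TypeError:", exc)

# Passing one element of the list (a string) works.
speakers = re.findall(pattern, parts[0])
print(speakers)
```

The exact wording of the TypeError message differs between Python versions, but the cause is the same.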
Now, that is the problem. The solution is to make the second argument to re.findall a string. Where that string comes from depends on what you want to do.
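For example, two possible sources for that string are sketched below. Which one fits depends on your intent, which I am only guessing at; the sample text is hypothetical.

```python
# Hypothetical sample text; the real input would be the hearing transcript.
text = "opening statements RESPONSE TO WRITTEN questions for the record"
parts = text.split("RESPONSE TO WRITTEN")

# Option 1: keep only the text before the marker.
before_marker = parts[0]

# Option 2: keep all of the text, with the marker removed.
without_marker = "".join(parts)

print(repr(before_marker))
print(repr(without_marker))
```

Either result is an ordinary string and can be passed to re.findall as its second argument.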
Judging from this line:
str (hearing)
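I assume you want to turn the list hearing into a string representation of itself. If so, you need to reassign hearing, because str returns a new string and does not modify the list in place: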
hearing = str(hearing)
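A small sketch of the difference, using a made-up two-element list in place of the real data. It also shows one caveat worth checking: the string representation of a list contains escaped newlines (a backslash followed by n) rather than real newline characters, so a pattern that matches real newlines may find nothing in it.

```python
import re

# Hypothetical stand-in for the list produced by str.split.
hearing = ["line one\nline two", " tail"]

str(hearing)                      # result is discarded; hearing is unchanged
still_a_list = isinstance(hearing, list)

as_text = str(hearing)            # keep the result by assigning it to a name
print(as_text)

# The repr escapes newlines, so no real newline character is left in as_text.
has_real_newline = "\n" in as_text
has_escaped_newline = "\\n" in as_text
newline_matches = re.findall(r"\n", as_text)
print(still_a_list, has_real_newline, has_escaped_newline, newline_matches)
```

If the pattern needs to match real newlines, indexing into the list (for example hearing[0]) keeps them intact.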
Regarding python - TypeError when running a large function in Python, the original question can be found on Stack Overflow: https://stackoverflow.com/questions/20385262/