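I want to find the total number of positive and negative words matched in a given text. I have a list of positive words in positive.txt and a list of negative words in negative.txt. If a word matches one in the positive list, I want a simple integer variable to be incremented by 1, and likewise for a word that matches the negative list. My code below gives me the text under @class=[story-hed]. This is the text I want to compare against the positive and negative word lists to get the total counts. My code is: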
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dawn.items import DawnItem


class dawnSpider(BaseSpider):
    name = "dawn"
    allowed_domains = ["dawn.com"]
    start_urls = [
        "http://dawn.com/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//h3[@class="story-hed"]//a/text()').extract()
        items = []
        for site in sites:
            item = DawnItem()
            item['title'] = site
            items.append(item)
        return items
Best answer
The following standalone code solves the problem:
from collections import Counter


def readwords(filename):
    # Read one word per line, dropping trailing whitespace and newlines
    with open(filename) as f:
        return [line.rstrip() for line in f]


positive = readwords('positive.txt')
negative = readwords('negative.txt')

paragraph = 'this is really bad and in fact awesome. really awesome.'

# Count how often each whitespace-separated token occurs
count = Counter(paragraph.split())

pos = 0
neg = 0
for key, val in count.items():
    key = key.rstrip('.,?!\n')  # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val

print(pos, neg)  # Python 3 syntax
Here is what I have in the two input files:
positive.txt:
good
awesome
negative.txt:
bad
ugly
The output is: 2 1
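One possible variation on the counting step, as a rough sketch: tokenizing with a regular expression and lowercasing the text handles punctuation on either side of a word as well as capitalized words, and keeping the word lists in sets makes each lookup cheap. The function name count_sentiment and the tokenization rule are assumptions made for illustration; they are not part of the original answer.

```python
import re


def count_sentiment(text, positive, negative):
    """Return (pos, neg): how many tokens of text appear in each word list."""
    positive = set(positive)
    negative = set(negative)
    # Assumption: a word is a run of letters or apostrophes, compared in lowercase
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    return pos, neg


paragraph = 'this is really bad and in fact awesome. really awesome.'
print(count_sentiment(paragraph, ['good', 'awesome'], ['bad', 'ugly']))  # (2, 1)
```

With the same word lists as above, this gives the same counts (2 positive, 1 negative) for the sample paragraph.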
To implement this in Scrapy, you will probably want to use an item pipeline: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
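As a minimal sketch of what such a pipeline might look like: a Scrapy item pipeline is a plain class with a process_item(self, item, spider) method that receives each scraped item and returns it. The class name SentimentCountPipeline, the 'pos' and 'neg' keys, and the hard-coded default word lists are all hypothetical; in a real project the lists would presumably be loaded from positive.txt and negative.txt.

```python
class SentimentCountPipeline(object):
    """Hypothetical item pipeline: adds 'pos' and 'neg' counts to each item."""

    def __init__(self, positive=('good', 'awesome'), negative=('bad', 'ugly')):
        # Assumption: stand-in word lists; a real project would read
        # positive.txt and negative.txt here instead
        self.positive = set(positive)
        self.negative = set(negative)

    def process_item(self, item, spider):
        pos = neg = 0
        for word in item['title'].split():
            # Strip surrounding punctuation and ignore case before matching
            word = word.strip('.,?!\n').lower()
            if word in self.positive:
                pos += 1
            if word in self.negative:
                neg += 1
        item['pos'] = pos
        item['neg'] = neg
        return item


# Quick check outside Scrapy, using a plain dict in place of DawnItem
pipeline = SentimentCountPipeline()
result = pipeline.process_item({'title': 'Awesome win after a bad start'}, spider=None)
print(result['pos'], result['neg'])
```

For this to run inside the project, the pipeline class would need to be listed in the ITEM_PIPELINES setting, and DawnItem would need pos and neg fields declared alongside title.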
Source: the Stack Overflow question "python - How to find the total number of positive and negative words in a text?": https://stackoverflow.com/questions/22094224/