python - Google News headline tags with Beautiful Soup

Tags: python beautifulsoup sentiment-analysis

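I am trying to pull search results from Google News (for example, for "vaccine") and run some sentiment analysis on the collected headlines.

So far, I can't seem to find the right tag to collect the headlines.

Here is my code: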

from textblob import TextBlob
import requests
from bs4 import BeautifulSoup

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)

    def run (self):
        response = requests.get(self.url)
        print(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')
        headline_results = soup.find_all('div', class_="phYMDf nDgy9d")
        for h in headline_results:
            blob = TextBlob(h.get_text())
            self.sentiment += blob.sentiment.polarity / len(headline_results)
            self.subjectivity += blob.sentiment.subjectivity / len(headline_results)
a = Analysis('Vaccine')
a.run()
print(a.term, 'Subjectivity: ', a.subjectivity, 'Sentiment: ' , a.sentiment)

The sentiment result is always 0 and the subjectivity result is always 0. I suspect the problem is with class_="phYMDf nDgy9d".

Best answer

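If you open that link in a browser, you see the finished state of the page, but requests.get does not execute scripts or load any data beyond the single page you requested. Fortunately, some of the data is present in that response, and you can scrape it. I suggest using an HTML prettifier service such as codebeautify to get a better view of the page structure.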
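A quick way to confirm this is to count how many elements the question's selector matches in the HTML that requests actually received. The sketch below uses two small made-up HTML strings in place of a real response (the markup and the helper name count_headline_divs are assumptions for illustration, not real Google output). A count of 0 means the averaging loop never runs, which would explain why both scores stay at 0.

```python
from bs4 import BeautifulSoup


def count_headline_divs(html, css_class="phYMDf nDgy9d"):
    """Return how many <div> elements carry exactly this class string."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", class_=css_class))


# Hypothetical stand-ins for response.text, not real Google markup.
browser_like_html = '<div class="phYMDf nDgy9d">Some headline</div>'
requests_like_html = '<div id="main"><div>Some headline</div></div>'

print(count_headline_divs(browser_like_html))   # the selector finds the div
print(count_headline_divs(requests_like_html))  # the selector finds nothing
```

In the real script, passing response.text to the same helper would show whether the class is present in what requests downloaded.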

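Also, if you see classes like phYMDf nDgy9d, avoid using them for lookups. They are minified class names, so whenever part of the CSS code changes, the class you are looking for will get a new name.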
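One alternative to minified class names is to match on something more stable, such as the shape of the link target. The sketch below assumes result links have an href starting with /url, which is the same pattern the code in this answer relies on. The sample HTML, its class names, and the function name find_result_links are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Assumption: result links are redirects whose href starts with /url.
RESULT_LINK = re.compile(r"^/url")


def find_result_links(html):
    """Return all <a> tags whose href matches the result-link pattern."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("a", href=RESULT_LINK)


# Hypothetical markup; the class names are deliberately meaningless.
sample_html = (
    '<div class="xYz12"><a href="/url?q=https://example.com/a">First</a></div>'
    '<div class="Qq9ab"><a href="/search?q=more">Not a result</a></div>'
    '<div class="xYz12"><a href="/url?q=https://example.com/b">Second</a></div>'
)

links = find_result_links(sample_html)
print([a.get_text() for a in links])
```

The lookup never mentions a class, so it keeps working if the class names are regenerated, as long as the link pattern itself stays the same.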

What I did may be a bit of overkill, but I managed to dig down to the specific parts, and your code works now.


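When you look at the prettified version of the requested HTML, the content you need is inside the div whose id is main. Its children start with a div for the Google Search header, followed by a style element, and then, after one empty div, come the post div elements. The last two children in that list are the footer and script elements. We can cut all of these off with [3:-2], which leaves (almost) pure data under that tree. If you read the rest of the code after the posts variable, I think it will make sense.

Here is the code: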
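The slicing step can be tried on a small stand-in document. The HTML below is a made-up skeleton that only mirrors the child order described above (header div, style, empty div, posts, footer, script); it is not real Google markup. One addition that is not in the answer's code: .children also yields whitespace text nodes when the markup is indented, so this sketch keeps only Tag objects before slicing.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical skeleton that mirrors the described child order.
sample_html = """
<div id="main">
  <div>Google Search header</div>
  <style></style>
  <div></div>
  <div>post 1</div>
  <div>post 2</div>
  <footer></footer>
  <script></script>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
main_div = soup.find("div", id="main")

# Keep only element children; drop the whitespace text nodes between tags.
tag_children = [child for child in main_div.children if isinstance(child, Tag)]

# Drop the 3 leading non-post elements and the 2 trailing ones.
posts = tag_children[3:-2]
print([post.get_text() for post in posts])
```

If the page layout changes, the fixed offsets 3 and -2 will select the wrong elements, so it is worth printing the sliced list once before parsing it further.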

from textblob import TextBlob
import requests, re
from bs4 import BeautifulSoup
from pprint import pprint

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)

    def run(self):
        response = requests.get(self.url)
        soup = BeautifulSoup(response.text, 'html.parser')
        # All results live under the div with id "main".
        mainDiv = soup.find("div", {"id": "main"})
        # Skip the 3 leading children (header div, style, empty div)
        # and the 2 trailing ones (footer, script).
        posts = list(mainDiv.children)[3:-2]
        # Result links have an href that starts with /url.
        reg = re.compile(r"^/url.*")
        news = []
        for post in posts:
            cursor = post.find_all("a", {"href": reg})
            postData = {}
            # The first link wraps the headline div and the source div.
            postData["headline"] = cursor[0].find("div").get_text()
            postData["source"] = cursor[0].find_all("div")[1].get_text()
            # The element after the second link holds the time-ago span,
            # then "· " and the description text.
            postData["timeAgo"] = cursor[1].next_sibling.find("span").get_text()
            postData["description"] = cursor[1].next_sibling.find("span").parent.get_text().split("· ")[1]
            news.append(postData)
        pprint(news)
        # Average polarity and subjectivity over headline + description.
        for h in news:
            blob = TextBlob(h["headline"] + " " + h["description"])
            self.sentiment += blob.sentiment.polarity / len(news)
            self.subjectivity += blob.sentiment.subjectivity / len(news)
a = Analysis('Vaccine')
a.run()

print(a.term, 'Subjectivity: ', a.subjectivity, 'Sentiment: ' , a.sentiment)

Some output:

[{'description': 'It comes after US health officials said last week they had '
                 'started a trial to evaluate a possible vaccine in Seattle. '
                 'The Chinese effort began on...',
  'headline': 'China embarks on clinical trial for virus vaccine',
  'source': 'The Star Online',
  'timeAgo': '5 saat önce'},
 {'description': 'Hanneke Schuitemaker, who is leading a team working on a '
                 'Covid-19 vaccine, tells of the latest developments and what '
                 'needs to be done now.',
  'headline': 'Vaccine scientist: ‘Everything is so new in dealing with this '
              'coronavirus’',
  'source': 'The Guardian',
  'timeAgo': '20 saat önce'},
 .
 .
 .
Vaccine Subjectivity:  0.34522727272727277 Sentiment:  0.14404040404040402
[{'description': '10 Cool Tech Gadgets To Survive Working From Home. From '
                 'Wi-Fi and cell phone signal boosters, to noise-cancelling '
                 'headphones and gadgets...',
  'headline': '10 Cool Tech Gadgets To Survive Working From Home',
  'source': 'CRN',
  'timeAgo': '2 gün önce'},
 {'description': 'Over the past few years, smart home products have dominated '
                 'the gadget space, with goods ranging from innovative updates '
                 'to the items we...',
  'headline': '6 Smart Home Gadgets That Are Actually Worth Owning',
  'source': 'Entrepreneur',
  'timeAgo': '2 hafta önce'},
 .
 .
 .
Home Gadgets Subjectivity:  0.48007305194805205 Sentiment:  0.3114683441558441

I used the headline and description data for the analysis, but you can change that if you like. You have the data now :)
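The loop in the code averages by dividing each score by the number of items. A sketch of the same calculation with statistics.mean follows. The pairs are made-up values standing in for blob.sentiment.polarity and blob.sentiment.subjectivity, and average_scores is a hypothetical helper name. It returns zeros for an empty list, which is the same result the question saw when no headlines were matched.

```python
from statistics import mean


def average_scores(scores):
    """Average a list of (polarity, subjectivity) pairs.

    Returns (0.0, 0.0) when the list is empty, i.e. nothing was scraped.
    """
    if not scores:
        return 0.0, 0.0
    polarities, subjectivities = zip(*scores)
    return mean(polarities), mean(subjectivities)


# Hypothetical scores, one pair per news item.
sample_scores = [(0.5, 0.25), (0.25, 0.75)]

print(average_scores(sample_scores))
print(average_scores([]))
```

Collecting the pairs first also makes it easy to check how many items were scored before trusting the averages.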

Regarding python - Google News headline tags with Beautiful Soup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60792898/
