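python - Collecting all movie reviews for a given series from IMDb

Tags: python web-scraping imdb

I am trying to collect data from IMDb using Python, but I cannot get all of the reviews. The following code works, but it does not return every available review: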

from imdb import IMDb

ia = IMDb()

ia.get_movie_reviews('13433812') 

Output:

{'data': {'reviews': [{'content': 'Just finished watching the episode 4. Wow, it was so good. Well made mixture of thriller and comedy.I saw a few negative reviews here written after eps 1 or 2. I recommend watching at least up to eps 3 and 4. The real story starts from eps 3. Eps 4 is like a complete well made movie. You will surely enjoy it.',
'helpful': 0,
'title': '',
'author': 'ur129930427',
'date': '28 February 2021',
'rating': None,
'not_helpful': 0},


{'content': 'You can see the cast had a lot of fun making this Italian/Korean would-be mafia thriller, the sort of fun NOT experienced in Hollywood since the days of Burt Reynolds. Vincenzo contains a very absorbing plot, a cast star-struck by designer clothes, interspersed with Italian (and other) Classical music excerpts to set in relief some well written suspense and intrigue. The plot centers on, if we really are to believe it, the endemically CORRUPT upper echelons of S. Korean society. Is it a coincidence that many of the systemic abuses of power and institutional vice that constitute Vincenzo\'s Main Plot are now also going on, this very moment in the USA? It is certainly food for thought. A clear advantage this Korean drama has over mediocre US shows, however is a much softer-handed use of violence, resorting more often to satire to keep the plot moving as opposed to gratuitous savagery now so common in so-called "hit" US shows. So far, so good, Binjenzo!'

I also tried Scrapy code, but it did not return any reviews:

import requests
from scrapy.http import TextResponse
import urllib.parse
from urllib.parse import urljoin
base_url = "https://www.imdb.com/title/tt13433812/reviews?ref_=tt_urv"
r = requests.get(base_url)
response = TextResponse(r.url, body=r.text, encoding='utf-8')
reviews = response.xpath('//*[contains(@id,"1")]/p/text()').extract()
len(reviews)

Output: 0

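Best answer

This should give you all the reviewer names on that page, exhausting every "Load More" button. Feel free to define other fields to fetch, according to your requirements.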

import requests
from bs4 import BeautifulSoup

start_url = 'https://www.imdb.com/title/tt13433812/reviews?ref_=tt_urv'
link = 'https://www.imdb.com/title/tt13433812/reviews/_ajax'

params = {
    'ref_': 'undefined',
    'paginationKey': ''
}

with requests.Session() as s:
    # Send a browser-like User-Agent with every request in this session
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(start_url)

    while True:
        soup = BeautifulSoup(res.text, "lxml")
        for item in soup.select(".review-container"):
            reviewer_name = item.select_one("span.display-name-link > a").get_text(strip=True)
            print(reviewer_name)

        # The "Load More" element carries the key for the next batch of reviews.
        # When it is missing, select_one() returns None and .get() raises
        # AttributeError, which ends the loop.
        try:
            pagination_key = soup.select_one(".load-more-data[data-key]").get("data-key")
        except AttributeError:
            break
        params['paginationKey'] = pagination_key
        res = s.get(link, params=params)
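
As a rough sketch of what "defining other fields" could look like, the helper below parses one page of review markup into dictionaries and also returns the next pagination key. Only `.review-container`, `span.display-name-link > a` and `.load-more-data[data-key]` come from the code above; the function names, the `a.title`, `span.review-date` and `div.text` selectors, and the sample HTML are assumptions made for illustration and may not match IMDb's actual markup. It uses the built-in `html.parser` backend so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup


def _text(node, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    found = node.select_one(selector)
    return found.get_text(" ", strip=True) if found else None


def parse_reviews(html):
    """Parse one page of review markup into (list of review dicts, next key or None)."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for item in soup.select(".review-container"):
        reviews.append({
            "author": _text(item, "span.display-name-link > a"),
            "title": _text(item, "a.title"),          # assumed selector
            "date": _text(item, "span.review-date"),  # assumed selector
            "content": _text(item, "div.text"),       # assumed selector
        })
    more = soup.select_one(".load-more-data[data-key]")
    return reviews, (more.get("data-key") if more else None)


# Hypothetical markup standing in for a real response body.
sample = """
<div class="review-container">
  <a class="title">Great show</a>
  <span class="display-name-link"><a href="/user/ur1/">alice</a></span>
  <span class="review-date">28 February 2021</span>
  <div class="text show-more__control">Loved episode 4.</div>
</div>
<div class="load-more-data" data-key="abc123"></div>
"""

reviews, next_key = parse_reviews(sample)
print(reviews[0]["author"], next_key)
```

Inside the `while True` loop, `parse_reviews(res.text)` could replace the inner `for` loop and the `try`/`except`: collect the returned dictionaries and stop once the returned key is `None`. Because `_text` returns `None` for a missing element, a review without a title or date would not abort the run.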

Regarding "python - Collecting all movie reviews for a given series from IMDb", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68243944/
