python - Scraping h3 from a div using Python

Tags: python html web-scraping beautifulsoup scrape

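I want to use Python 3.6 to scrape the H3 titles inside the DIVs on this page:

https://player.bfi.org.uk/search/rentals?q=&sort=title&page=1

Note that the page number in the URL changes in increments of 1.

I am struggling to return or even identify the titles.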

from requests import get
from bs4 import BeautifulSoup

url = 'https://player.bfi.org.uk/search/rentals?q=&sort=title&page=1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'lxml')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='card card--rentals')
print(type(movie_containers))
print(len(movie_containers))

I have also tried looping over them:

# Loop variable and body must use the same name; select on the parsed soup
for div in html_soup.select("div.card__content"):
    print(div.select_one("h3.card__title").text.strip())

Any help would be great.

Thanks

I am expecting the title of every film on every page, together with the link to the film, e.g. https://player.bfi.org.uk/rentals/film/watch-akenfield-1975-online

Best answer

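The page loads its content via XHR from a different URL, which is why you are missing it. You can mimic the XHR POST request the page makes and alter the JSON that is posted. If you change size, you get more results.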

import requests

data = {"size":1480,"from":0,"sort":"sort_title","aggregations":{"genre":{"terms":{"field":"genre.raw","size":10}},"captions":{"terms":{"field":"captions"}},"decade":{"terms":{"field":"decade.raw","order":{"_term":"asc"},"size":20}},"bbfc":{"terms":{"field":"bbfc_rating","size":10}},"english":{"terms":{"field":"english"}},"audio_desc":{"terms":{"field":"audio_desc"}},"colour":{"terms":{"field":"colour"}},"mono":{"terms":{"field":"mono"}},"fiction":{"terms":{"field":"fiction"}}},"min_score":0.5,"query":{"bool":{"must":{"match_all":{}},"must_not":[],"should":[],"filter":{"term":{"pillar.raw":"rentals"}}}}}
r = requests.post('https://search-es.player.bfi.org.uk/prod-films/_search', json = data).json()
for film in r['hits']['hits']:
    print(film['_source']['title'], 'https://player.bfi.org.uk' + film['_source']['url'])
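
To check the parsing step without hitting the network, the loop above can be pulled into a small function and run against a hand-made dictionary. This is only a sketch: the name `extract_films`, the constant `BASE_URL`, and the sample title are assumptions for illustration; only the `hits -> hits -> _source -> title/url` shape is taken from the response used above.

```python
BASE_URL = 'https://player.bfi.org.uk'

def extract_films(response_json):
    """Return (title, absolute link) pairs from a search response dict."""
    return [
        (hit['_source']['title'], BASE_URL + hit['_source']['url'])
        for hit in response_json['hits']['hits']
    ]

# Hand-made sample that mimics the response shape; not real API output.
sample = {
    'hits': {
        'total': 1,
        'hits': [
            {'_source': {'title': 'Akenfield',
                         'url': '/rentals/film/watch-akenfield-1975-online'}},
        ],
    }
}

for title, link in extract_films(sample):
    print(title, link)
```

With a real response, `extract_films(r)` would take the place of the loop, where `r` is the decoded JSON from the POST request.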

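The actual result count for rentals is in the JSON, at r['hits']['total']. So you can make an initial request with a size much higher than you expect to need, check whether another request is required, and then gather any extras by altering from and size to mop up whatever is outstanding.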

import requests
import pandas as pd

initial_count = 10000
results = []

def add_results(r):
    for film in r['hits']['hits']:
        results.append([film['_source']['title'], 'https://player.bfi.org.uk' + film['_source']['url']])

with requests.Session() as s:
    data = {"size": initial_count,"from":0,"sort":"sort_title","aggregations":{"genre":{"terms":{"field":"genre.raw","size":10}},"captions":{"terms":{"field":"captions"}},"decade":{"terms":{"field":"decade.raw","order":{"_term":"asc"},"size":20}},"bbfc":{"terms":{"field":"bbfc_rating","size":10}},"english":{"terms":{"field":"english"}},"audio_desc":{"terms":{"field":"audio_desc"}},"colour":{"terms":{"field":"colour"}},"mono":{"terms":{"field":"mono"}},"fiction":{"terms":{"field":"fiction"}}},"min_score":0.5,"query":{"bool":{"must":{"match_all":{}},"must_not":[],"should":[],"filter":{"term":{"pillar.raw":"rentals"}}}}}
    r = s.post('https://search-es.player.bfi.org.uk/prod-films/_search', json = data).json()
    total_results = int(r['hits']['total'])
    add_results(r)

    if total_results > initial_count :
        data['size'] = total_results - initial_count
        data['from'] = initial_count
        r = s.post('https://search-es.player.bfi.org.uk/prod-films/_search', json = data).json()
        add_results(r)

df = pd.DataFrame(results, columns = ['Title', 'Link'])
print(df.head())
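
If the server limits how many results a single request may return, one follow-up request might not be enough. A possible variation is to walk the result set in fixed-size batches. The helper below only computes the from/size pairs; `page_windows` and `batch_size` are hypothetical names, not part of the original answer, and whether the endpoint enforces such a limit is an assumption.

```python
def page_windows(total, batch_size):
    """Yield (from, size) pairs that cover `total` results in batches."""
    for start in range(0, total, batch_size):
        # The last batch may be smaller than batch_size.
        yield start, min(batch_size, total - start)

# Example: 25 results fetched 10 at a time.
print(list(page_windows(25, 10)))  # [(0, 10), (10, 10), (20, 5)]
```

Each pair could then be written into `data['from']` and `data['size']` before posting, with `add_results(r)` called after every response.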

This question, "python - Scraping h3 from a div using Python", originally appeared on Stack Overflow: https://stackoverflow.com/questions/56089613/
