python - Unable to scrape with BeautifulSoup

Tags: python web-scraping beautifulsoup

I am trying to scrape images and news URLs from this website. The tags I defined are:

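root_tag = ["div", {"class": "ngp_col ngp_col-bottom-gutter-2 ngp_col-md-6 ngp_col-lg-4"}]
image_tag = ["div", {"class": "low-rez-image"}, "url"]
# Named news_tag because the scraping loop reads news_tag; an empty dict means no attribute filter
news_tag = ["a", {}, "href"]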

The address is stored in url, and my code to scrape the website is:

import re

import requests
from bs4 import BeautifulSoup

ua1 = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
ua2 = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome'
headers = {'User-Agent': ua2,
           'Accept': 'text/html,application/xhtml+xml,application/xml;'
                     'q=0.9,image/webp,*/*;q=0.8'}
session = requests.Session()
response = session.get(url, headers=headers)
webContent = response.content
bs = BeautifulSoup(webContent, 'lxml')
all_tab_data = bs.find_all(root_tag[0], root_tag[1])

result = []
for div in all_tab_data:
    # Link to the article, or None if the lookup fails
    try:
        news_url = div.find(news_tag[0], news_tag[1]).get(news_tag[2])
    except Exception:
        news_url = None

    # Image URL: try a regex over the raw markup first, then fall back to the tag
    try:
        match = re.search(r"https?:[/.\w\s-]*\.(?:jpg|gif|png|jpeg)", str(div))
        if match is not None:
            image_url = match.group(0)
        else:
            image_url = div.find(image_tag[0], image_tag[1]).get(image_tag[2])
    except Exception:
        image_url = None
    result.append([news_url, image_url])

I debugged the code and found that all_tab_data is empty, even though I chose the correct root_tag. So I don't know what I am doing wrong.

Best answer

The content is loaded from JSON, so the elements you are looking for are not in the HTML that requests downloads.
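One way to confirm this kind of situation is to count how often the target element appears in the raw HTML before any JavaScript has run. The following is a minimal sketch that uses only the standard library; the ClassCounter class, the count_in_static_html helper, and both HTML strings are hypothetical stand-ins for response.text, not markup taken from the site.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of one name whose class attribute has all wanted classes."""

    def __init__(self, tag, wanted_classes):
        super().__init__()
        self.tag = tag
        self.wanted = set(wanted_classes.split())
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # A valueless attribute is reported as None, hence the "or".
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.count += 1


def count_in_static_html(html, tag, wanted_classes):
    """Return how many matching tags exist in the HTML string as downloaded."""
    parser = ClassCounter(tag, wanted_classes)
    parser.feed(html)
    return parser.count


# Hypothetical pages: one ships the cards in the HTML itself, the other ships
# an empty container that a script fills in later from a JSON request.
server_rendered = '<div class="ngp_col ngp_col-md-6"><a href="/a">A</a></div>'
js_rendered = '<div id="carousel"></div><script src="app.js"></script>'

print(count_in_static_html(server_rendered, "div", "ngp_col ngp_col-md-6"))
print(count_in_static_html(js_rendered, "div", "ngp_col ngp_col-md-6"))
```

If the count is zero for the downloaded page while the browser's inspector shows the element, the data most likely arrives through a separate request, which can usually be found in the network tab of the browser's developer tools.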

You can get all the image URLs like this:

import requests

url = "https://www.nationalgeographic.com/magazine/_jcr_content/content/promo-carousel.promo-carousel.json"

data = requests.get(url).json()

for item in data:
    for sub_item in item['promo_carousel']:
        p_img = sub_item['promo_image']
        if p_img is not None:
            print(p_img['image']['uri'])

Output:

https://www.nationalgeographic.com/content/dam/animals/2020/09/african-cheetah-snow/african-cheetah-snow-2.jpg
https://www.nationalgeographic.com/content/dam/animals/2020/09/wallaby-atrazine/wallaby-og-a0xh8r-01.jpg
https://www.nationalgeographic.com/content/dam/animals/2020/09/elephant-tuberculosis/r40bfj.jpg
https://www.nationalgeographic.com/content/dam/animals/2020/08/handfish/01-handfish-minden_90392182.jpg
https://www.nationalgeographic.com/content/dam/science/2020/09/08/cal-fire-update/california-fire-palley-mm9468_200905_000229.jpg
https://www.nationalgeographic.com/content/dam/science/2020/09/11/face-mask-recognition/20200901_002_out_mp4_00_00_03_18_still003.jpg
https://www.nationalgeographic.com/content/dam/science/2020/09/10/winds-fires-california/winds-fires-california-2019.jpg
https://www.nationalgeographic.com/content/dam/science/2020/09/10/fire-air-quality/fire-air-pollution-20253854760329.jpg
https://www.nationalgeographic.com/content/dam/science/2020/09/02/autopsy/mm9412_200717_000522.jpg
https://www.nationalgeographic.com/content/dam/magazine/rights-exempt/2020/10/departments/explore/stellar-map-milky-way-og.png
https://www.nationalgeographic.com/content/dam/science/2020/07/31/vaccine/vaccine_20209514426186.jpg
https://www.nationalgeographic.com/content/dam/archaeologyandhistory/rights-exempt/history-magazine/2020/09-10/metric-system/og-french-metric-system.jpg
https://www.nationalgeographic.com/content/dam/archaeologyandhistory/rights-exempt/OG/red-terror-explainer-og.jpg
https://www.nationalgeographic.com/content/dam/archaeologyandhistory/rights-exempt/OG/promo-medieval-pandemic.jpg
https://www.nationalgeographic.com/content/dam/archaeologyandhistory/2020/09/Asian-American-COVID/og_asianamerican.jpg
https://www.nationalgeographic.com/content/dam/archaeologyandhistory/2020/08/goodbye-hong-kong/19-hong-kong-security-law-china.jpg
https://www.nationalgeographic.com/content/dam/travel/commercial/2020/samsung/wyoming/samsung-wyoming-mountain.jpg
https://www.nationalgeographic.com/content/dam/travel/2020-digital/kissing-tourism-sites/gettyimages-3332297.jpg
https://www.nationalgeographic.com/content/dam/travel/2020-digital/thinking-about-traveling/nationalgeographic_1085186.jpg
https://www.nationalgeographic.com/content/dam/science/commercial/2019/domestic/wyss-foundation/wyss-foundation_cfn_natgeo-image-collection_1971120.jpg
https://www.nationalgeographic.com/content/dam/travel/2020-digital/least-visited-US-national-parks/nationalgeographic_2466315.jpg

Edit: To get the titles and article data, use:

for item in data:
    for sub_item in item['promo_carousel']:
        print(f"{sub_item['components'][0]['title']['text']}"
              f"\n{sub_item['uri']}")
        p_img = sub_item['promo_image']
        if p_img is not None:
            print(f"{p_img['image']['uri']}")
        print("-" * len(sub_item['uri']))

Prints (shortened for brevity):

Rare photographs show African cheetahs in snowstorm
https://www.nationalgeographic.com/animals/2020/09/cheetahs-snow-south-africa/
https://www.nationalgeographic.com/content/dam/animals/2020/09/african-cheetah-snow/african-cheetah-snow-2.jpg
------------------------------------------------------------------------------
Wallabies exposed to common weed killer have reproductive abnormalities
https://www.nationalgeographic.com/animals/2020/09/wallaby-sexual-development-impaired-by-atrazine-herbicide/
https://www.nationalgeographic.com/content/dam/animals/2020/09/wallaby-atrazine/wallaby-og-a0xh8r-01.jpg
-------------------------------------------------------------------------------------------------------------
...
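To end up with the same [news_url, image_url] list the question was building, the JSON can be walked in the same way. This is only a sketch: collect_results is a hypothetical helper, and the sample data is invented to mirror just the keys used in this answer (promo_carousel, uri, promo_image, image), so the real response may contain more fields or a different shape.

```python
def collect_results(data):
    """Return [news_url, image_url] pairs from the carousel JSON structure."""
    result = []
    for item in data:
        for sub_item in item.get("promo_carousel", []):
            p_img = sub_item.get("promo_image")
            # promo_image can be None, as the answer's code checks for
            image_url = p_img["image"]["uri"] if p_img else None
            result.append([sub_item.get("uri"), image_url])
    return result


# Invented sample with placeholder URLs; in practice data would come from
# requests.get(url).json() as shown in the answer.
sample = [
    {
        "promo_carousel": [
            {
                "uri": "https://example.com/story-1/",
                "promo_image": {"image": {"uri": "https://example.com/img-1.jpg"}},
            },
            {"uri": "https://example.com/story-2/", "promo_image": None},
        ]
    }
]

for news_url, image_url in collect_results(sample):
    print(news_url, image_url)
```

Entries without an image keep None in the second position, which matches how the loop in the question handled a missing image.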

Regarding python - Unable to scrape with BeautifulSoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63859155/
