I am trying to scrape image and news URLs from this website. The tags I defined are:
root_tag=["div", {"class":"ngp_col ngp_col-bottom-gutter-2 ngp_col-md-6 ngp_col-lg-4"}]
image_tag=["div",{"class":"low-rez-image"},"url"]
news_tag=["a",{"":""},"href"]
The address is url, and the code I use to scrape the site is:
import re

import requests
from bs4 import BeautifulSoup

ua1 = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
ua2 = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome'
headers = {'User-Agent': ua2,
           'Accept': 'text/html,application/xhtml+xml,application/xml;'
                     'q=0.9,image/webp,*/*;q=0.8'}
session = requests.Session()
# url is the page address mentioned above
response = session.get(url, headers=headers)
webContent = response.content
bs = BeautifulSoup(webContent, 'lxml')
all_tab_data = bs.findAll(root_tag[0], root_tag[1])
result = []
for div in all_tab_data:
    # First link inside the card, if any
    try:
        news_url = None
        news_url = div.find(news_tag[0], news_tag[1]).get(news_tag[2])
    except Exception as e:
        news_url = None
    # First image URL found in the card's markup, else fall back to image_tag
    try:
        image_url = None
        div_img = str(div)
        match = re.search(r"(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png|jpeg)", div_img)
        if match is not None:
            image_url = str(match.group(0))
        else:
            image_url = div.find(image_tag[0], image_tag[1]).get(image_tag[2])
    except Exception as e:
        image_url = None
    result.append([news_url, image_url])
I debugged the code and found that all_tab_data is empty, even though I chose the correct root_tag. So I don't know what I am doing wrong.
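Best answer
The content is loaded from JSON, so the cards are not in the HTML that requests downloads. That is why all_tab_data comes back empty even though root_tag matches what the browser shows.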
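One way to confirm this before changing selectors is to check whether the root tag's classes appear at all in the HTML the server returns. The sketch below uses only the standard library. `ClassCounter`, `count_in_static_html`, and both HTML snippets are hypothetical stand-ins made up for illustration, not the real page source, and the matching rule (all listed classes present) is a simplification of how BeautifulSoup compares the class attribute.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags of a given name that carry all of the wanted classes."""

    def __init__(self, tag, wanted_classes):
        super().__init__()
        self.tag = tag
        self.wanted = set(wanted_classes.split())
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class value can be missing or None, so fall back to "".
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.count += 1


def count_in_static_html(html, tag, class_string):
    """Return how many matching tags exist in the given HTML text."""
    parser = ClassCounter(tag, class_string)
    parser.feed(html)
    return parser.count


# Hypothetical stand-in for response.text: the server sends an empty
# container, and a script fills it in later from a JSON endpoint.
static_html = """
<html><body>
  <div id="carousel"></div>
  <script src="app.js"></script>
</body></html>
"""

# Hypothetical stand-in for what the browser shows after the script has run.
rendered_html = """
<div id="carousel">
  <div class="ngp_col ngp_col-bottom-gutter-2 ngp_col-md-6 ngp_col-lg-4">card</div>
</div>
"""

root_class = "ngp_col ngp_col-bottom-gutter-2 ngp_col-md-6 ngp_col-lg-4"
print(count_in_static_html(static_html, "div", root_class))    # 0
print(count_in_static_html(rendered_html, "div", root_class))  # 1
```

If the count is 0 for the downloaded text but the element is visible in the browser's inspector, the markup is most likely built client-side, and the network tab usually shows the JSON request that feeds it.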
You can get all the image URLs like this:
import requests

url = "https://www.nationalgeographic.com/magazine/_jcr_content/content/promo-carousel.promo-carousel.json"
data = requests.get(url).json()

for item in data:
    for sub_item in item['promo_carousel']:
        p_img = sub_item['promo_image']
        if p_img is not None:
            print(p_img['image']['uri'])
Output:
https://www.nationalgeographic.com/content/dam/animals/2020/09/african-cheetah-snow/african-cheetah-snow-2.jpg
https://www.nationalgeographic.com/content/dam/animals/2020/09/wallaby-atrazine/wallaby-og-a0xh8r-01.jpg
https://www.nationalgeographic.com/content/dam/animals/2020/09/elephant-tuberculosis/r40bfj.jpg
https://www.nationalgeographic.com/content/dam/animals/2020/08/handfish/01-handfish-minden_90392182.jpg
https://www.nationalgeographic.com/content/dam/science/2020/09/08/cal-fire-update/california-fire-palley-mm9468_200905_000229.jpg
https://www.nationalgeographic.com/content/dam/science/2020/09/11/face-mask-recognition/20200901_002_out_mp4_00_00_03_18_still003.jpg
https://www.nationalgeographic.com/content/dam/science/2020/09/10/winds-fires-california/winds-fires-california-2019.jpg
https://www.nationalgeographic.com/content/dam/science/2020/09/10/fire-air-quality/fire-air-pollution-20253854760329.jpg
https://www.nationalgeographic.com/content/dam/science/2020/09/02/autopsy/mm9412_200717_000522.jpg
https://www.nationalgeographic.com/content/dam/magazine/rights-exempt/2020/10/departments/explore/stellar-map-milky-way-og.png
https://www.nationalgeographic.com/content/dam/science/2020/07/31/vaccine/vaccine_20209514426186.jpg
https://www.nationalgeographic.com/content/dam/archaeologyandhistory/rights-exempt/history-magazine/2020/09-10/metric-system/og-french-metric-system.jpg
https://www.nationalgeographic.com/content/dam/archaeologyandhistory/rights-exempt/OG/red-terror-explainer-og.jpg
https://www.nationalgeographic.com/content/dam/archaeologyandhistory/rights-exempt/OG/promo-medieval-pandemic.jpg
https://www.nationalgeographic.com/content/dam/archaeologyandhistory/2020/09/Asian-American-COVID/og_asianamerican.jpg
https://www.nationalgeographic.com/content/dam/archaeologyandhistory/2020/08/goodbye-hong-kong/19-hong-kong-security-law-china.jpg
https://www.nationalgeographic.com/content/dam/travel/commercial/2020/samsung/wyoming/samsung-wyoming-mountain.jpg
https://www.nationalgeographic.com/content/dam/travel/2020-digital/kissing-tourism-sites/gettyimages-3332297.jpg
https://www.nationalgeographic.com/content/dam/travel/2020-digital/thinking-about-traveling/nationalgeographic_1085186.jpg
https://www.nationalgeographic.com/content/dam/science/commercial/2019/domestic/wyss-foundation/wyss-foundation_cfn_natgeo-image-collection_1971120.jpg
https://www.nationalgeographic.com/content/dam/travel/2020-digital/least-visited-US-national-parks/nationalgeographic_2466315.jpg
Edit: to get the title and article URL as well, use:
for item in data:
    for sub_item in item['promo_carousel']:
        print(f"{sub_item['components'][0]['title']['text']}"
              f"\n{sub_item['uri']}")
        p_img = sub_item['promo_image']
        if p_img is not None:
            print(f"{p_img['image']['uri']}")
        print("-" * len(sub_item['uri']))
Prints (shortened for brevity):
Rare photographs show African cheetahs in snowstorm
https://www.nationalgeographic.com/animals/2020/09/cheetahs-snow-south-africa/
https://www.nationalgeographic.com/content/dam/animals/2020/09/african-cheetah-snow/african-cheetah-snow-2.jpg
------------------------------------------------------------------------------
Wallabies exposed to common weed killer have reproductive abnormalities
https://www.nationalgeographic.com/animals/2020/09/wallaby-sexual-development-impaired-by-atrazine-herbicide/
https://www.nationalgeographic.com/content/dam/animals/2020/09/wallaby-atrazine/wallaby-og-a0xh8r-01.jpg
-------------------------------------------------------------------------------------------------------------
...
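To end up with the same result list of [news_url, image_url] pairs that the question builds, the loop above can be wrapped in a small helper that tolerates missing fields. This is a minimal sketch, assuming the JSON keeps the shape used above. `collect_promos` and the `sample` data are hypothetical names and values introduced here for illustration.

```python
def collect_promos(data):
    """Return [news_url, image_url] pairs from the promo-carousel JSON.

    Assumes the shape used above: a list of sections, each with a
    'promo_carousel' list whose entries have 'uri' and an optional
    'promo_image' -> 'image' -> 'uri'.
    """
    result = []
    for item in data:
        for sub_item in item.get("promo_carousel") or []:
            news_url = sub_item.get("uri")
            # promo_image can be None, so substitute an empty dict.
            p_img = sub_item.get("promo_image") or {}
            image_url = (p_img.get("image") or {}).get("uri")
            result.append([news_url, image_url])
    return result


# Hypothetical sample in the same shape as the real response.
sample = [
    {"promo_carousel": [
        {"uri": "https://example.com/article-1/",
         "promo_image": {"image": {"uri": "https://example.com/a.jpg"}}},
        {"uri": "https://example.com/article-2/", "promo_image": None},
    ]},
    {"promo_carousel": []},
]

for news_url, image_url in collect_promos(sample):
    print(news_url, image_url)
```

With the real response, `collect_promos(data)` would take the place of the question's `result` loop. Entries without an image keep `None` as their image URL, matching the fallback in the original code.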
Regarding "python - unable to scrape with BeautifulSoup", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/63859155/