python - How do I scrape a page using BeautifulSoup and Python?

Tags: python python-2.7 web-scraping

I'm trying to extract information from the BBC Good Food website, but I'm having trouble narrowing down the data I collect.

Here's what I have so far:

from bs4 import BeautifulSoup
import requests

webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato')
soup = BeautifulSoup(webpage.content)
links = soup.find_all("a")

for anchor in links:
    print("%s %s" % (anchor.get('href'), anchor.text))

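This returns every link on the page along with its text, but I only want the links inside the page's "article" elements. Those are the links to the individual recipes.

With some experimenting I managed to get the text out of the articles, but I can't seem to extract the links.

Best answer

The only two things I can see associated with the article tags are the href and img.src: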

from bs4 import BeautifulSoup
import requests

webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato')
soup = BeautifulSoup(webpage.content)
links = soup.find_all("article")

for ele in links:
    print(ele.a["href"])
    print(ele.img["src"])
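
The same idea can be tried offline on a small, made-up HTML fragment. This is only a sketch: the markup and the helper name `article_links` are hypothetical and are not taken from the real site. It guards against an `article` that has no `a` or `img` tag, because in that case `ele.a` is `None` and indexing it raises a `TypeError`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the search results page.
SAMPLE_HTML = """
<article><a href="/recipes/2075/tomato-soup"><img src="/img/soup.jpg"></a></article>
<article><p>No link here</p></article>
"""

def article_links(html):
    """Return (href, img_src) pairs for every <article> that contains a link."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for ele in soup.find_all("article"):
        anchor = ele.find("a", href=True)  # None when the article has no link
        image = ele.find("img", src=True)  # None when the article has no image
        if anchor is None:
            continue  # skip articles without a link instead of crashing
        pairs.append((anchor["href"], image["src"] if image else None))
    return pairs

print(article_links(SAMPLE_HTML))
```

Passing `"html.parser"` explicitly also avoids the warning Beautiful Soup emits when no parser is named.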

The links are inside the elements with class="node-title":

from bs4 import BeautifulSoup
import requests

webpage = requests.get('http://www.bbcgoodfood.com/search/recipes?query=tomato')
soup = BeautifulSoup(webpage.content)


links = soup.find("div",{"class":"main row grid-padding"}).find_all("h2",{"class":"node-title"})

for l in links:
    print(l.a["href"])

/recipes/681646/tomato-tart
/recipes/4468/stuffed-tomatoes
/recipes/1641/charred-tomatoes
/recipes/tomato-confit
/recipes/1575635/roast-tomatoes
/recipes/2536638/tomato-passata
/recipes/2518/cherry-tomatoes
/recipes/681653/stuffed-tomatoes
/recipes/2852676/tomato-sauce
/recipes/2075/tomato-soup
/recipes/339605/tomato-sauce
/recipes/2130/essence-of-tomatoes-
/recipes/2942/tomato-tarts
/recipes/741638/fried-green-tomatoes-with-ripe-tomato-salsa
/recipes/3509/honey-and-thyme-tomatoes
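
As an alternative way to write the same lookup, here is a minimal sketch using a CSS selector via `select`. The HTML fragment and the function name `recipe_hrefs` are invented for illustration; only the class names come from the answer above. Note that the selector matches the classes `main`, `row` and `grid-padding` in any order, whereas the `find` call above matches the exact class string.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure described in the answer.
SAMPLE_HTML = """
<div class="main row grid-padding">
  <h2 class="node-title"><a href="/recipes/681646/tomato-tart">Tomato tart</a></h2>
  <h2 class="node-title"><a href="/recipes/2075/tomato-soup">Tomato soup</a></h2>
</div>
<h2 class="node-title"><a href="/elsewhere">Outside the results container</a></h2>
"""

def recipe_hrefs(html):
    """Collect hrefs from node-title headings inside the results container only."""
    soup = BeautifulSoup(html, "html.parser")
    selector = "div.main.row.grid-padding h2.node-title a[href]"
    return [a["href"] for a in soup.select(selector)]

for href in recipe_hrefs(SAMPLE_HTML):
    print(href)
```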

To access them, you need to prepend http://www.bbcgoodfood.com:

for l in links:
    print(requests.get("http://www.bbcgoodfood.com{}".format(l.a["href"])).status_code)
200
200
200
200
200
200
200
200
200
200
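
Instead of formatting the string by hand, the prefix can also be added with the standard library's `urljoin`. The sketch below is written for Python 3 (on Python 2.7 the import would be `from urlparse import urljoin`); the helper name `absolutize` and the constant `BASE_URL` are hypothetical, and no network request is made.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.bbcgoodfood.com"  # site root taken from the answer above

def absolutize(hrefs, base=BASE_URL):
    """Turn site-relative hrefs into absolute URLs; absolute hrefs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]

for url in absolutize(["/recipes/681646/tomato-tart", "/recipes/2075/tomato-soup"]):
    print(url)
```

Each resulting URL can then be passed to `requests.get` as in the loop above.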

Regarding "python - How do I scrape a page using BeautifulSoup and Python?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/29421840/
