Scraping with Python 3 and bs4

Tags: python beautifulsoup

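I am trying to scrape the HTML of this site: https://www.idealista.com/venta-viviendas/madrid-madrid/ using Python 3 (in PyCharm). I am only interested in the house prices, so I narrowed the search to certain span elements, as follows: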

import requests
from bs4 import BeautifulSoup


page = requests.get('https://www.idealista.com/venta-viviendas/madrid-madrid/')

soup = BeautifulSoup(page.text, 'html.parser')


prices = soup.find_all("span", {"class": "item-price h2-simulated"})

print(len(prices))
print(prices)

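When I run it, I get: 0 []

This means nothing was found. Also, if I print everything with print(soup), I get very little HTML for such a large page, so it is clearly not fetching all of the content.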
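
Before switching tools, it can help to check what requests actually received. The sketch below is only an illustration: `summarize_response` is a hypothetical helper, not part of requests or bs4. It reports the status code, the body length, and whether the class name from the selector occurs anywhere in the raw HTML.

```python
def summarize_response(status_code, text, needle="item-price"):
    """Return a few facts that help explain an empty find_all() result.

    Hypothetical helper for illustration; it takes plain values so it can
    be used with any HTTP client.
    """
    return {
        "status_code": status_code,
        "body_length": len(text),
        "needle_found": needle in text,
    }


# Example with a made-up short response body (not real output from the site)
sample = summarize_response(200, "<html><body>placeholder</body></html>")
print(sample)

# With requests it could be called as:
#   page = requests.get(url)
#   print(summarize_response(page.status_code, page.text))
```

If `needle_found` is false, the markup being searched for is not in the response at all, so no selector can match it and the parser is not at fault.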

Best answer

The site is dynamic, so you will need a browser automation tool such as selenium:

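import re, collections, itertools
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# Selenium 4 takes the chromedriver path through a Service object
d = webdriver.Chrome(service=Service('/Users/jamespetullo/Downloads/chromedriver'))
d.get('https://www.idealista.com/venta-viviendas/madrid-madrid/')
# Parse the rendered page and collect one container per listing
homes = soup(d.page_source, 'html.parser').find_all('div', {'class': 'item-info-container'})
d.quit()
results = [i.find('div', {'class': re.compile(r'price-row')}).text for i in homes]
# Field names are kept as in the answer; judging by the output, the first value appears to be the current asking price
price = collections.namedtuple('price', ['original', 'current', 'drop'])
# Split each price row on runs of whitespace and drop empty pieces
new_prices = [list(filter(None, re.split(r'\s{2,}', i))) for i in results]
# Pad missing fields with None so every row has three values
final_prices = [price(a, *[a for a, _ in itertools.zip_longest(b, [None, None])]) for a, *b in new_prices]
print(final_prices)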

Output:

[price(original=' 369.000€', current='395.000 €', drop='7%'), price(original=' 1.250.000€ Garaje incluido ', current=None, drop=None), price(original=' 1.590.000€ Garaje incluido', current='1.650.000 €', drop='4%'), price(original=' 1.750.000€ Garaje incluido', current='1.875.000 €', drop='7%'), price(original=' 1.090.000€ Garaje incluido', current='1.195.000 €', drop='9%'), price(original=' 795.000€ ', current=None, drop=None), price(original=' 1.095.000€ Garaje incluido ', current=None, drop=None), price(original=' 355.000€ Garaje incluido ', current=None, drop=None), price(original=' 995.000€ Garaje incluido ', current=None, drop=None), price(original=' 1.130.000€ Garaje incluido', current='1.190.000 €', drop='5%'), price(original=' 850.000€ Garaje incluido ', current=None, drop=None), price(original=' 1.200.000€ Garaje incluido ', current=None, drop=None), price(original=' 990.000€ Garaje incluido ', current=None, drop=None), price(original=' 2.100.000€ ', current=None, drop=None), price(original=' 830.000€ Garaje incluido ', current=None, drop=None), price(original=' 2.390.000€ Garaje incluido ', current=None, drop=None), price(original=' 685.000€ ', current=None, drop=None), price(original=' 1.150.000€ Garaje incluido', current='1.200.000 €', drop='4%'), price(original=' 915.000€ ', current=None, drop=None), price(original=' 1.590.000€ Garaje incluido ', current=None, drop=None), price(original=' 625.000€ Garaje incluido', current='640.000 €', drop='2%'), price(original=' 735.000€', current='760.000 €', drop='3%'), price(original=' 890.000€ Garaje incluido', current='950.000 €', drop='6%'), price(original=' 925.000€', current='999.000 €', drop='7%'), price(original=' 975.000€', current='1.100.000 €', drop='11%'), price(original=' 850.000€ Garaje incluido', current='870.000 €', drop='2%'), price(original=' 1.200.000€ ', current=None, drop=None), price(original=' 1.500.000€ ', current=None, drop=None), price(original=' 1.200.000€ ', current=None, drop=None), 
price(original=' 1.359.000€ ', current=None, drop=None)]
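
The values above are still strings with stray whitespace and extra text such as "Garaje incluido". If numbers are needed, something like the following could be applied to each field. This is a minimal sketch, not part of the original answer: `parse_euros` and `parse_percent` are hypothetical helper names, and the code assumes that "." appears only as a thousands separator, as it does in the output above.

```python
import re


def parse_euros(text):
    """Convert a string such as ' 1.250.000€ Garaje incluido ' to 1250000.

    Assumes '.' is used only as a thousands separator. Returns None when
    the input is None or contains no digits.
    """
    if text is None:
        return None
    match = re.search(r"\d[\d.]*", text)
    if match is None:
        return None
    return int(match.group().replace(".", ""))


def parse_percent(text):
    """Convert a string such as '7%' to 7, or return None."""
    if text is None:
        return None
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


# Sample values copied from the output above
print(parse_euros(" 1.250.000€ Garaje incluido "))  # 1250000
print(parse_euros("395.000 €"))                     # 395000
print(parse_percent("7%"))                          # 7
print(parse_euros(None))                            # None
```

Returning None for missing fields keeps listings without a price drop distinguishable from listings with a drop of zero.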

A similar question about scraping with Python 3 and bs4 can be found on Stack Overflow: https://stackoverflow.com/questions/52629595/
