python - Web scraping data with line breaks

Tags: python web web-scraping beautifulsoup

I'm trying to scrape data from this URL: https://www.apple.com/ca/shop/browse/home/specialdeals/mac/macbook_pro/13

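I'm trying to retrieve the lines that read

8GB 2133MHz LPDDR3 onboard memory

16GB 2133MHz LPDDR3 onboard memory

using BeautifulSoup, from each container in containers = soup.findAll('tr', {'class': 'product'}). The problem is that the text is surrounded by line breaks and multiple newlines, which makes it hard to parse. How can I retrieve it?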

Best answer

Looking at the page source, the best option is to combine BeautifulSoup with a regular expression:

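import re

import requests
from bs4 import BeautifulSoup

url = "https://www.apple.com/ca/shop/browse/home/specialdeals/mac/macbook_pro/13"

r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

# Each product's spec text lives in a <td class="specs"> cell.
for td in soup.select('td.specs'):
    # re.M lets ^ and $ match at every line in the cell, so the surrounding
    # newlines do not matter; re.I makes the match case-insensitive.
    m = re.search(r'^(8|16).*?onboard memory.*?$', td.text, flags=re.M | re.I)
    if not m:
        continue
    print(td.select_one('h3').text.strip())
    # m[0] is the whole matched line, m[1] the captured memory size.
    print('Full text: {} | Memory: {}'.format(m[0].strip(), m[1]))
    print('-' * 80)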

This code finds all products with 8 or 16 GB of memory and prints them:

Refurbished 13.3-inch MacBook Pro 2.3GHz dual-core Intel Core i5 with Retina display - Space Grey
Full text: 8GB of 2133MHz LPDDR3 onboard memory | Memory: 8
--------------------------------------------------------------------------------
Refurbished 13.3-inch MacBook Pro 2.3GHz dual-core Intel Core i5 with Retina display - Silver
Full text: 8GB of 2133MHz LPDDR3 onboard memory | Memory: 8
--------------------------------------------------------------------------------
Refurbished 13.3-inch MacBook Pro 2.0GHz Dual-core Intel Core i5 with Retina Display — Space Grey
Full text: 8GB of 1866MHz LPDDR3 onboard memory | Memory: 8
--------------------------------------------------------------------------------
Refurbished 13.3-inch MacBook Pro 2.3GHz dual-core Intel Core i5 with Retina display - Silver
Full text: 8GB of 2133MHz LPDDR3 onboard memory | Memory: 8
--------------------------------------------------------------------------------
Refurbished 13.3-inch MacBook Pro 2.3GHz dual-core Intel Core i5 with Retina display - Space Grey
Full text: 8GB of 2133MHz LPDDR3 onboard memory | Memory: 8
--------------------------------------------------------------------------------
Refurbished 13.3-inch Macbook Pro 2.9GHz Dual-core Intel Core i5 with Retina Display - Space Grey
Full text: 8GB of 2133MHz LPDDR3 onboard memory | Memory: 8
--------------------------------------------------------------------------------
Refurbished 13.3-inch Macbook Pro 2.9GHz Dual-core Intel Core i5 with Retina Display - Silver
Full text: 8GB of 2133MHz LPDDR3 onboard memory | Memory: 8
--------------------------------------------------------------------------------
Refurbished 13.3-inch Macbook Pro 2.9GHz Dual-core Intel Core i5 with Retina Display - Silver
Full text: 8GB of 2133MHz LPDDR3 onboard memory | Memory: 8
--------------------------------------------------------------------------------
Refurbished 13.3-inch MacBook Pro 3.1GHz dual-core Intel Core i5 with Retina display - Silver
Full text: 8GB of 2133MHz LPDDR3 onboard memory | Memory: 8
--------------------------------------------------------------------------------
Refurbished 13.3-inch MacBook Pro 3.1GHz dual-core Intel Core i5 with Retina display - Space Grey
Full text: 8GB of 2133MHz LPDDR3 onboard memory | Memory: 8
--------------------------------------------------------------------------------
Refurbished 13.3-inch Macbook Pro 3.3GHz Dual-core Intel Core i7 with Retina Display - Space Grey
Full text: 16GB of 2133MHz LPDDR3 onboard memory | Memory: 16
--------------------------------------------------------------------------------
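
If the regular expression feels too opaque, the same result can be reached by splitting the cell text into lines and stripping each one before matching. The following is only a minimal sketch, not part of the original answer: SAMPLE_CELL_TEXT is a made-up string standing in for what td.text might return, and find_memory_line and MEMORY_RE are hypothetical names introduced here for illustration. It runs without network access, so it can be tried even if the page layout has changed.

```python
import re

# Hypothetical cell text, standing in for what td.text might return for
# one product row (blank lines and indentation included on purpose).
SAMPLE_CELL_TEXT = """

    Refurbished 13.3-inch MacBook Pro - Space Grey

        Originally released June 2017


        8GB of 2133MHz LPDDR3 onboard memory

        256GB SSD
"""

# Capture the leading number of gigabytes on a line that ends with
# "onboard memory" (case-insensitive).
MEMORY_RE = re.compile(r"(\d+)GB\b.*\bonboard memory", re.IGNORECASE)


def find_memory_line(cell_text):
    """Return (line, gigabytes) for the first memory line, or None."""
    for raw in cell_text.splitlines():
        line = raw.strip()          # drop indentation and trailing spaces
        if not line:
            continue                # skip the blank lines between entries
        m = MEMORY_RE.fullmatch(line)
        if m:
            return line, int(m.group(1))
    return None


print(find_memory_line(SAMPLE_CELL_TEXT))
```

In the scraping loop above, the same function could be called as find_memory_line(td.text). Unlike the (8|16) pattern, this sketch accepts any memory size, which is an assumption about what is wanted rather than something the question states.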

Source: this question, "python - Web scraping data with line breaks", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/51712984/
