python - Web scraping: extracting product names from a web page

Tags: python web-scraping beautifulsoup

I am trying to get the names of the bags listed on this site: http://www.barneys.com/barneys-new-york/women/bags. So far I have this code:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    url = "http://www.barneys.com/barneys-new-york/women/bags"
    html = urlopen(url)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    product_name = bsObj.findAll("a", {"class": "name-link"})
    print(product_name)

I tried renderContents() and get_text(), but both give me an error (AttributeError).

Best answer

The names are in the product-name divs:

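from bs4 import BeautifulSoup
import requests

# Download the page and parse it with Python's built-in HTML parser.
url = "http://www.barneys.com/barneys-new-york/women/bags"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Each product name is the text of a div with class "product-name".
print([prod.text.strip() for prod in soup.select("div.product-name")])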

This gives you:

['Lizard iPhone® 6 Plus Case', 'Lizard iPhone® 6 Case', 'Peekaboo Large Satchel', 'Embellished Shoulder Bag', 'Rockstud Reversible Tote', 'Hava Shoulder Bag', 'Rockstud Crossbody', 'PS1 Tiny Shoulder Bag', 'Faye Medium Shoulder Bag', 'Rockstud Crossbody', 'Large Shopper Tote', 'Flat Clutch', 'Wicker Small Crossbody', 'Jotty Duffel', 'P.Y.T. Shoulder Bag', 'Hadley Baby Satchel', 'Beckett Small Crossbody', 'Squarit PM Satchel', 'Double Baguette Micro', 'City Victoria Small Satchel', 'Large Zip Pouch', 'Jotty Duffel', 'Jen Small Crossbody', 'Mini Trouble Shoulder Bag', 'Midi Clutch', 'Midi Clutch', 'Two For One Pouch 10', 'Guitar Rockstud Medium Backpack', 'Embellished Large Messenger', 'Papier A4 Side-Zip Tote', 'Nightingale Micro-Satchel', 'Hand-Carved Atlas Clutch', 'Emerald-Cut Minaudière', 'Trouble II Shoulder Bag', 'Intrecciato Olimpia Small Shoulder Bag', 'Rockstud Large Tote', 'Baguette Micro', 'Bindu Small Clutch', 'Emerald-Cut Minaudière', 'Gotham City Hobo', 'Brillant Sellier PM Satchel', 'Flight Weekender Duffel', 'Sac Mesh Bucket Bag', 'Seema Small Satchel', 'Madison Shoulder Bag', 'Sporty Smiley Crossbody', 'Monogram Large Wallet', 'Monogram Card Case']

If you want all of the information, you can get it from the anchor tags with the thumb-link class, inside the div with the id primary:

print(soup.select("#primary a.thumb-link"))

This gives you output like the following:

<a class="thumb-link" href="http://www.barneys.com/vianel-lizard-iphone%C2%AE-6-plus-case-504475332.html" title="Lizard iPhone® 6 Plus Case">
<img alt="Vianel Lizard iPhone® 6 Plus Case" class="gridImg" data-image-alter="http://product-images.barneys.com/is/image/Barneys/504475332_2_detail?$grid_new_fixed$" data-original="http://product-images.barneys.com/is/image/Barneys/504475332_1_tabletop?$grid_new_fixed$" height="370" onerror="this.src='http://demandware.edgesuite.net/aasv_prd/on/demandware.static/Sites-BNY-Site/-/default/dwd89468c5/images/browse_placeholder_image.jpg'" title="Lizard iPhone® 6 Plus Case" width="231"/>
<noscript>
<img alt="Vianel Lizard iPhone® 6 Plus Case" src="http://product-images.barneys.com/is/image/Barneys/504475332_1_tabletop?$grid_new_fixed$" title="Lizard iPhone® 6 Plus Case?$grid_new_fixed$"/>
</noscript>

You can parse the image, title, and so on from each returned a tag.
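As a minimal sketch of that parsing step, the snippet below reads the title, link, and image URL from each a.thumb-link. The sample markup is a hypothetical, trimmed-down stand-in for the real page (the example.com URLs and the helper name parse_thumb_links are assumptions made for illustration), so it runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the output above; URLs are placeholders.
SAMPLE_HTML = """
<div id="primary">
  <a class="thumb-link" href="http://example.com/lizard-case.html" title="Lizard Case">
    <img alt="Vianel Lizard Case" class="gridImg"
         data-original="http://example.com/images/lizard-case.jpg"/>
  </a>
  <a class="thumb-link" href="http://example.com/flat-clutch.html" title="Flat Clutch">
  </a>
</div>
"""


def parse_thumb_links(html):
    """Return one dict per a.thumb-link found inside #primary."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for a in soup.select("#primary a.thumb-link"):
        img = a.find("img")  # None when the anchor has no image
        products.append({
            # .get() returns None instead of raising if an attribute is missing
            "title": a.get("title"),
            "url": a.get("href"),
            "image": img.get("data-original") if img else None,
        })
    return products


products = parse_thumb_links(SAMPLE_HTML)
for product in products:
    print(product)
```

Using .get() rather than indexing with a["title"] keeps the loop from failing on an anchor that lacks one of the attributes; whether the live page always provides them is not something the answer states.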

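With your own code, you need to access the .text attribute of each element, as above. findAll returns a list of tags, and the list itself has no get_text() or renderContents() method, which is why you get the AttributeError: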

product_name = [a.text.strip() for a in bsObj.findAll("a", {"class": "name-link"})]
print(product_name)

This gives you the same result as the first selector:

['Lizard iPhone® 6 Plus Case', 'Lizard iPhone® 6 Case', 'Peekaboo Large Satchel', 'Embellished Shoulder Bag', 'Rockstud Reversible Tote', 'Hava Shoulder Bag', 'Rockstud Crossbody', 'PS1 Tiny Shoulder Bag', 'Faye Medium Shoulder Bag', 'Rockstud Crossbody', 'Large Shopper Tote', 'Flat Clutch', 'Wicker Small Crossbody', 'Jotty Duffel', 'P.Y.T. Shoulder Bag', 'Hadley Baby Satchel', 'Beckett Small Crossbody', 'Squarit PM Satchel', 'Double Baguette Micro', 'City Victoria Small Satchel', 'Large Zip Pouch', 'Jotty Duffel', 'Jen Small Crossbody', 'Mini Trouble Shoulder Bag', 'Midi Clutch', 'Midi Clutch', 'Two For One Pouch 10', 'Guitar Rockstud Medium Backpack', 'Embellished Large Messenger', 'Papier A4 Side-Zip Tote', 'Nightingale Micro-Satchel', 'Hand-Carved Atlas Clutch', 'Emerald-Cut Minaudière', 'Trouble II Shoulder Bag', 'Intrecciato Olimpia Small Shoulder Bag', 'Rockstud Large Tote', 'Baguette Micro', 'Bindu Small Clutch', 'Emerald-Cut Minaudière', 'Gotham City Hobo', 'Brillant Sellier PM Satchel', 'Flight Weekender Duffel', 'Sac Mesh Bucket Bag', 'Seema Small Satchel', 'Madison Shoulder Bag', 'Sporty Smiley Crossbody', 'Monogram Large Wallet', 'Monogram Card Case']
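To see where the AttributeError comes from, here is a small offline sketch. The markup is a hypothetical sample (the nesting of a.name-link inside div.product-name is an assumption, not taken from the real page). Calling get_text() on the whole result of find_all fails, while calling it on each element works.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page structure may differ.
SAMPLE_HTML = """
<div class="product-name"><a class="name-link" href="#"> Peekaboo Large Satchel </a></div>
<div class="product-name"><a class="name-link" href="#"> Flat Clutch </a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all is the current spelling of findAll; it returns a list-like ResultSet.
links = soup.find_all("a", {"class": "name-link"})

try:
    links.get_text()  # the ResultSet is a list of tags, not a single tag
    raised = False
except AttributeError:
    raised = True

# Calling get_text() on each tag works; strip=True trims surrounding whitespace.
names = [a.get_text(strip=True) for a in links]

print(raised)
print(names)
```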

Regarding python - web scraping to extract product names from a web page, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36828121/
