python - BeautifulSoup: extracting text from an anchor tag

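I want to extract:

the src of the image tag, and
the text of the anchor tag inside the div with class data

I managed to extract the img src, but I cannot extract the text from the anchor tag.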

<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>

Here is the link to the entire HTML page.

Here is my code:

for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']

What I want to do is extract the image src (the link) and the title inside the div with class=data, so for example:

 <a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>

should extract:

Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)

Best answer

This should help:

from bs4 import BeautifulSoup

data = '''<div class="image">
        <a href="http://www.example.com/eg1">Content1<img  
        src="http://image.example.com/img1.jpg" /></a>
        </div>
        <div class="image">
        <a href="http://www.example.com/eg2">Content2<img  
        src="http://image.example.com/img2.jpg" /> </a>
        </div>'''

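# Name the parser explicitly; bs4 warns when it has to guess one
soup = BeautifulSoup(data, 'html.parser')

# find_all is the bs4 spelling; findAll is kept as a legacy alias
for div in soup.find_all('div', attrs={'class': 'image'}):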
    print(div.find('a')['href'])
    print(div.find('a').contents[0])
    print(div.find('img')['src'])
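
The snippet above uses a simplified layout in which the link text sits inside div.image. For the layout described in the question, where the title link lives in a sibling div with class data, a minimal sketch might look like the following. The markup, URLs and product names are made up for illustration and are not taken from the real Amazon page. Note that find_next_sibling returns a single tag (or None), so looping over its result walks that tag's children rather than the tag itself, which is a likely reason the original nested loop did not reach the anchor.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question: every div.image is
# followed by a sibling div.data that contains the a.title link.
html = '''
<div class="image"><a href="http://www.example.com/p1"><img src="http://image.example.com/img1.jpg" /></a></div>
<div class="data"><h3><a class="title" href="http://www.example.com/p1">Product One</a></h3></div>
<div class="image"><a href="http://www.example.com/p2"><img src="http://image.example.com/img2.jpg" /></a></div>
<div class="data"><h3><a class="title" href="http://www.example.com/p2">Product Two</a></h3></div>
'''

soup = BeautifulSoup(html, 'html.parser')

results = []
for div in soup.find_all('div', class_='image'):
    img = div.find('img')
    # A single Tag (or None), not a list of tags
    data = div.find_next_sibling('div', class_='data')
    title = data.find('a', class_='title') if data is not None else None
    results.append((
        img['src'] if img is not None else None,
        title.get_text(strip=True) if title is not None else None,
    ))

for src, text in results:
    print(src)
    print(text)
```

get_text(strip=True) is used here instead of contents[0] so that the title is still returned when the anchor contains nested tags or surrounding whitespace.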

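If you are working with Amazon products, you should be using the official API. There is at least one Python package that will ease your scraping issues and keep your activity within the terms of use.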

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/11716380/
