python - Parsing with BeautifulSoup fails with TypeError: coercing to Unicode: need string or buffer, NoneType found

Tags: python web-scraping beautifulsoup web-crawler html5lib

I'm trying to scrape data from an Amazon page, but I get an error when I try to parse the seller's location. Here is my code:

import urllib2
from bs4 import BeautifulSoup

#getting the html
request = urllib2.Request('http://www.amazon.com/gp/offer-listing/0393934241/')
opener = urllib2.build_opener()
#hiding that I'm a webscraper
request.add_header('User-Agent', 'Mozilla/5 (Solaris 10) Gecko')
#opening it up, putting into soup form
html = opener.open(request).read()
soup = BeautifulSoup(html, "html5lib")

#parsing for the seller info
sellers = soup.findAll('div', {'class' : 'a-row a-spacing-medium olpOffer'})
for eachseller in sellers:
    #parsing for price
    price = eachseller.find('span', {'class' : 'a-size-large a-color-price olpOfferPrice a-text-bold'})
    #parsing for shipping costs
    shippingprice = eachseller.find('span', {'class' : 'olpShippingPrice'})
    #parsing for condition
    condition = eachseller.find('span', {'class' : 'a-size-medium'})
    #parsing for seller name
    sellername = eachseller.find('b')
    #parsing for seller location
    location = eachseller.find('div', {'class' : 'olpAvailability'})

    #printing it all out
    print "price, " + price.string + ", shipping price, " + shippingprice.string + ", condition," + condition.string + ", seller name, " + sellername.string + ", location, " + location.string

I get an error message that points at the print statement at the end: TypeError: coercing to Unicode: need string or buffer, NoneType found

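I know it comes from this line, location = eachseller.find('div', {'class' : 'olpAvailability'}), because the code works fine without it, and I know I'm getting a NoneType because that line isn't finding anything. Here is the HTML for the part I'm trying to parse: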

<div class="olpAvailability">
    In Stock. 
        Ships from WI, United States.
    <br/><a href="/gp/aag/details/ref=olp_merch_ship_9/175-0430757-3801038?ie=UTF8&amp;asin=0393934241&amp;seller=A1W2IX7T37FAMZ&amp;sshmPath=shipping-rates#aag_shipping">Domestic shipping rates</a>
         and <a href="/gp/aag/details/ref=olp_merch_return_9/175-0430757-3801038?ie=UTF8&amp;asin=0393934241&amp;seller=A1W2IX7T37FAMZ&amp;sshmPath=returns#aag_returns">return policy</a>.
</div>

I don't understand what is wrong with the location line, or why it isn't extracting the data I want.

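Edit: I figured it out, although I don't know why it works. If I change the print statement to print location.find(text=True), it outputs the location I want. Hopefully this helps someone someday.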
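
A likely explanation for the edit above: in BeautifulSoup, a tag's .string attribute is None whenever the tag has more than one child, and the olpAvailability div contains text, a br tag, and two links. find(text=True) instead returns the first text node inside the tag. The sketch below illustrates this under some assumptions: it uses bs4 with the built-in "html.parser" rather than html5lib, a shortened copy of the HTML from the question instead of the live Amazon page, and illustrative variable names.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the olpAvailability block from the question.
html = """
<div class="olpAvailability">
    In Stock. Ships from WI, United States.
    <br/><a href="#shipping">Domestic shipping rates</a>
    and <a href="#returns">return policy</a>.
</div>
"""

soup = BeautifulSoup(html, "html.parser")
location = soup.find("div", {"class": "olpAvailability"})

# The div has several children (text, <br/>, <a>, more text),
# so .string cannot pick a single string and is None.
multi_child_string = location.string

# find(text=True) returns the first text node inside the tag.
first_text = location.find(text=True).strip()

# A tag whose only child is a text node does have a .string.
link_string = location.find("a").string

print(multi_child_string)
print(first_text)
print(link_string)
```

So the find call in the question may well have matched the div; concatenating location.string (None) into the output would raise the same TypeError either way.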

Best answer

It looks like you are searching for the wrong class name:

<div class="a-column a-span3 olpDeliveryColumn" role="gridcell">
<p class="a-spacing-mini olpAvailability">
<ul class="a-unordered-list a-vertical olpFastTrack">
<li><span class="a-list-item">
            Ships from WI, United States.
        </span></li>
<li><span class="a-list-item">
<a href="/gp/aag/details?ie=UTF8&amp;asin=0393934241&amp;seller=A263RIO308P3G8&amp;sshmPath=shipping-rates#aag_shipping">Shipping rates</a>
                   and <a href="/gp/aag/details?ie=UTF8&amp;asin=0393934241&amp;seller=A263RIO308P3G8&amp;sshmPath=returns#aag_returns">return policy</a>.
        </span></li>
</ul>
</p>
</div>

Change this line in your code:

location = eachseller.find('div', {'class' : 'olpDeliveryColumn'})
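
Because find returns None when nothing matches, and .string can also be None, it may help to guard each lookup before concatenating. The following is a minimal sketch, not the original poster's code: text_or_default is a hypothetical helper, the HTML is a trimmed stand-in for one offer row, and "html.parser" is used in place of html5lib so the example has no extra dependency.

```python
from bs4 import BeautifulSoup


def text_or_default(tag, default="n/a"):
    """Return the tag's whitespace-normalized text, or default if tag is None."""
    if tag is None:
        return default
    # get_text() joins every text node under the tag, so it also works
    # for tags with several children, where .string would be None.
    return " ".join(tag.get_text().split())


# Trimmed stand-in for one seller row; it deliberately has no shipping price.
html = """
<div class="a-row a-spacing-medium olpOffer">
  <span class="olpOfferPrice">$10.00</span>
  <div class="a-column a-span3 olpDeliveryColumn">
    <p class="a-spacing-mini olpAvailability">In Stock.</p>
    <ul><li><span class="a-list-item">Ships from WI, United States.</span></li></ul>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
offer = soup.find("div", {"class": "olpOffer"})

price = text_or_default(offer.find("span", {"class": "olpOfferPrice"}))
shipping = text_or_default(offer.find("span", {"class": "olpShippingPrice"}))
location = text_or_default(offer.find("div", {"class": "olpDeliveryColumn"}))

line = "price, " + price + ", shipping price, " + shipping + ", location, " + location
print(line)
```

With this approach a missing element shows up as a placeholder in the output instead of stopping the loop with a TypeError.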

Source: Stack Overflow, https://stackoverflow.com/questions/17382233/
