我对 Python 和 Scrapy 比较陌生。我正在尝试废弃“购买此商品的客户也购买了”中的链接。 例如:http://www.amazon.com/Confessions-Economic-Hit-John-Perkins-ebook/dp/B001AFF266/ 。 “购买该商品的顾客也购买了”共有 17 页。如果我要求 scrapy 废弃该 url,它只会废弃第一页(6 个项目)。如何让scrapy按下“下一步按钮”来废弃17页中的所有项目?示例代码(只是crawler.py中重要的部分)将不胜感激。感谢您的宝贵时间!
好的。这是我的代码。正如我所说,我是 Python 新手,因此代码可能看起来很愚蠢,但它可以废弃第一页(6 项)。我主要使用 Fortran 或 Matlab 进行工作。如果有时间的话我很想系统地学习Python。
# Code of my crawler.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from beta.items import BetaItem
class AlphaSpider(CrawlSpider):
name = 'alpha'
allowed_domains = ['amazon.com']
start_urls = ['http://www.amazon.com/s/ref=lp_4366_nr_p_n_publication_date_0?rh=n%3A283155%2Cn%3A%211000%2Cn%3A4366%2Cp_n_publication_date%3A1250226011&bbn=4366&ie=UTF8&qid=1384729756&rnid=1250225011']
rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//h3/a',)), callback='parse_item'), )
def parse_item(self, response):
sel = Selector(response)
stuff = BetaItem()
isbn10R = sel.xpath('//li[b[contains(text(),"ISBN-10:")]]/text()').extract()
isbn10 = []
if len(isbn10R) > 0:
isbn10 = [(isbn10R[0].split(' '))[1]]
stuff['isbn10'] = isbn10
starsR = sel.xpath('//div[contains(@id,"averageCustomerReviews")]/span/@title').extract()
stars = []
if len(starsR) > 0:
stars = [(starsR[0].split(' '))[0]]
stuff['stars'] = stars
reviewsR = sel.xpath('//div[contains(@id,"averageCustomerReviews")]/a[contains(@href,"showViewpoints=1")]/text()').extract()
reviews = []
if len(reviewsR) > 0:
reviews = [(reviewsR[0].split(' '))[0]]
stuff['reviews'] = reviews
copsR = sel.xpath('//a[@class="sim-img-title"]/@href').extract()
ncops = len(copsR)
cops = [None] * ncops
if ncops > 0:
for idx, cop in enumerate(copsR):
cops[idx]=((cop.split('dp/'))[1].split('/ref'))[0]
stuff['cops'] = cops
return stuff
最佳答案
据我所知,您能够抓取这些“购买此商品的客户也购买了”的产品详细信息。正如您可能看到的,这些都在 ul
内。在 div
与类“shoveler-content”:
<div id="purchaseButtonWrapper" class="shoveler-button-wrapper">
<a class="back-button" onclick="return false;" style="" href="#Back">
<div class="shoveler-content">
<ul tabindex="-1">
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">
<div id="purchase_B003LSTK8G" class="new-faceout p13nimp" data-ref="pd_sim_kstore_1" data-asin="B003LSTK8G">
...
</div>
</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
</ul>
</div>
<a class="next-button" onclick="return false;" style="" href="#Next">
<span class="auiTestSprite s_shvlNext">...</span>
</a>
</div>
</div>
当您检查所选浏览器的网络事件(通过 Firebug 或 Chrome Inspect 工具)时,当您单击下一个建议产品的“下一步”按钮时,您将看到对此类 URL 的 AJAX 查询:
http://www.amazon.com
/gp/product/features/similarities/shoveler/cell-render.html/ref=pd_sim_kstore?
id=B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG
&pos=7&refTag=pd_sim_kstore&wdg=ebooks_display_on_website
&shovelerName=purchase
(我正在使用此产品页面:http://www.amazon.com/Boomerang-Travels-New-Third-World-ebook/dp/B005CRQ2OE)
id
中有什么查询参数是 ASIN 列表,它们是下一个建议的产品。显示 6 个 12 个 ASIN?可能会为用户可能进行的下一次“下一次”点击进行一些页内缓存。
从这个 AJAX 查询中得到什么?仍然在浏览器的检查工具中,您会看到响应类型为 application/json
,响应数据是一个包含 12 个元素的 JSON 数组,每个元素都是一些 HTML 片段,类似于:
<div class="new-faceout p13nimp" id="purchase_B00261OOWQ" data-asin="B00261OOWQ" data-ref="pd_sim_kstore_7">
<a href="/Home-Game-Accidental-Guide-Fatherhood-ebook/dp/B00261OOWQ/ref=pd_sim_kstore_7" class="sim-img-title" >
<div class="product-image">
<img src="http://ecx.images-amazon.com/images/I/51ZBpvGgsUL._SL500_PIsitb-sticker-arrow-big,TopRight,35,-73_OU01_SS100_.jpg" width="100" alt="" height="100" border="0" />
</div> Home Game: An Accidental Guide to Fatherhood
</a>
<div class="byline">
<span class="carat">›</span>
<a href="http://www.amazon.com/Michael-Lewis/e/B000APZ33E/ref=pd_sim_kstore_bl_7">Michael Lewis</a>
</div>
<div class="rating-price">
<span class="rating-stars">
<span class="crAvgStars" style="white-space:no-wrap;">
<span class="asinReviewsSummary" name="B00261OOWQ">
<a href="http://www.amazon.com/Home-Game-Accidental-Guide-Fatherhood-ebook/product-reviews/B00261OOWQ/ref=pd_sim_kstore_cm_cr_acr_img_7">
<span class="auiTestSprite s_star_4_0 " title="4.1 out of 5 stars" >
<span>4.1 out of 5 stars</span>
</span>
</a>
</span>
(<a href="http://www.amazon.com/Home-Game-Accidental-Guide-Fatherhood-ebook/product-reviews/B00261OOWQ/ref=pd_sim_kstore_cm_cr_acr_txt_7">99</a>)
</span>
</span>
</div>
<div class="binding-platform"> Kindle Edition </div>
<div class="pricetext"><span class="price" style="margin-right:5px">$11.36</span></div>
</div>
因此,您基本上可以在每个 <li>
中获得之前建议产品的原始页面部分中的内容。来自<div class="shoveler-content"><ul>
但是如何将这些 ASIN 代码附加到 AJAX 查询的 id
中参数?
好吧,在产品页面中,您会注意到此部分
<div id="purchaseSimsData"
class="sims-data" style="display:none"
data-baseAsin="B005CRQ2OE" data-featureId="pd_sim"
data-pageId="B005CRQ2OEr_sim_2" data-reftag="pd_sim_kstore"
data-wdg="ebooks_display_on_website" data-widgetName="purchase">
B003LSTK8G,B000VKVZR6,B003E20ZRY,B000RH0C9A,B000RH0CA4,B000YMDQRS,
B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG,
B0018QQQKS,B002OTKEP6,B005PUWUKS,B007V65R54,B00B3VOTTI,B004EYT932,
B002UBRFFU,B000WJSB50,B000RH0DYE,B004JXXKWY,B003E8AJXI,B008TRU7PE,
B00555X8OA,B007OSIOWM,B00DLJIA54,B00139XTG4,B0058Z4NR8,B00ALBR6JG,
B004H0M8QS,B003F3PL7Q,B008UX8YPC,B000U913GG,B003HOXLVQ,B000VWM0MI,
B000SEIU28,B006VE7YS0,B008KPMBIG,B003CIQ57E,B0064EHZY0,B008UX3ITE,
B001NLKY38,B003VIWK4C,B005GSYZRA,B007YGGOVM,B004H4X84K,B00B5ZQ72Y,
B000R1BAH4,B008W02TIG,B000W8HC8I,B0036QVOKU,B000VRBBDC,B00APDGFOC,
B00EOAS0EK,B000QCS888,B001QIGZEK,B0074B55IK,B000FC12C8,B00AP2XVJ0,
B000FCK5YE,B006ID6UAW,B001FA0W5W,B005HFI0X2,B006ZOYM9K,B003SNJZ3Y,
B00C1N5WOI,B008EKORIY,B00C4GRK4W,B004V3WRNU,B00BV6RTUG,B001AFF266,
B00DUM1W3E,B00APDGGCS,B008WOUFIS,B008EKOO46,B008JHXO6S,B005AJM3U6,
B00BKRW6GI,B00CDUVSQ0,B00A287PG2,B009H679WA,B000VDUWMC,B009NF6IRW
</div>
看起来像所有建议的产品 ASIN。
因此,我建议您模拟连续的 AJAX 查询来获取建议的产品,一次 12 个 ASIN,使用 json
解码响应打包,然后解析每个 HTML 片段以提取您想要的产品信息。
关于python - 如何在 "Next"按钮之后使用 scrapy 获取 Amazon.com 链接?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/20455961/