I'm new to web scraping, and I want to scrape information about all the products on a website.
I wrote some sample code to scrape the data, which looks like this:
import re
import json
import datetime

import scrapy

def start_requests(self):
    urls = [
        'https://www.trendyol.com/camasir-deterjani-x-c108713',
        'https://www.trendyol.com/yumusaticilar-x-c103814',
        'https://www.trendyol.com/camasir-suyu-x-c103812',
        'https://www.trendyol.com/camasir-leke-cikaricilar-x-c103810',
        'https://www.trendyol.com/camasir-yan-urun-x-c105534',
        'https://www.trendyol.com/kirec-onleyici-x-c103806',
        'https://www.trendyol.com/makine-kirec-onleyici-ve-temizleyici-x-c144512'
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

def parse(self, response):
    # The first page embeds its product data as JSON in a <script> tag
    data = re.search(r"__SEARCH_APP_INITIAL_STATE__=(.*?});", response.text)
    data = json.loads(data.group(1))
    for p in data["products"]:
        item = TeknosaItem()
        item['rowid'] = hash(str(datetime.datetime.now()) + str(p["id"]))
        item['date'] = str(datetime.datetime.now())
        item['listing_id'] = p["id"]
        item['product_id'] = p["id"]
        item['product_name'] = p["name"]
        item['price'] = p["price"]["sellingPrice"]
        item['url'] = p["url"]
        yield item
The code I wrote is able to scrape the data for all the products listed on the first page, but when you scroll down, the page dynamically loads more data via Ajax GET requests, and that data is not scraped. I watched some videos and read some articles, but I couldn't figure out how to scrape the data that is generated dynamically while scrolling. Any help with this would be greatly appreciated.
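For reference, the embedded-state extraction used in parse above can be exercised on its own. Here is a minimal sketch against a synthetic HTML snippet; the snippet and its product data are invented for illustration, only the regex matches the question's code:

```python
import re
import json

# Synthetic page source mimicking the embedded state (made-up product data).
html = ('<script>window.__SEARCH_APP_INITIAL_STATE__='
        '{"products":[{"id":1,"name":"Example Detergent",'
        '"price":{"sellingPrice":99.9},"url":"/x/p-1"}]};</script>')

# Same regex as in the spider: capture the JSON object assigned to the variable,
# stopping non-greedily at the first "};".
match = re.search(r"__SEARCH_APP_INITIAL_STATE__=(.*?});", html)
state = json.loads(match.group(1))

for p in state["products"]:
    print(p["id"], p["name"], p["price"]["sellingPrice"])  # → 1 Example Detergent 99.9
```

This only ever sees the products serialized into the initial HTML, which is why products loaded later by scrolling never reach the spider.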
I found an infinite-scroll page example on the target website:
Best answer
I don't use Scrapy, but you can adapt the following example of how to get all products from a category (using their Ajax API):
import requests

categories = [
    "camasir-deterjani-x-c108713",
    "yumusaticilar-x-c103814",
    "camasir-suyu-x-c103812",
    "camasir-leke-cikaricilar-x-c103810",
    "camasir-yan-urun-x-c105534",
    "kirec-onleyici-x-c103806",
    "makine-kirec-onleyici-ve-temizleyici-x-c144512",
]

# iterate over categories to construct api_url
# here I will only get products from first category:
api_url = (
    "https://public.trendyol.com/discovery-web-searchgw-service/v2/api/infinite-scroll/"
    + categories[0]
)

payload = {
    "pi": 1,
    "culture": "tr-TR",
    "userGenderId": "1",
    "pId": "0",
    "scoringAlgorithmId": "2",
    "categoryRelevancyEnabled": "false",
    "isLegalRequirementConfirmed": "false",
    "searchStrategyType": "DEFAULT",
    "productStampType": "TypeA",
    "fixSlotProductAdsIncluded": "false",
}

page = 1
while True:
    payload["pi"] = page
    data = requests.get(api_url, params=payload).json()
    if not data["result"]["products"]:
        break
    for p in data["result"]["products"]:
        name = p["name"]
        id_ = p["id"]
        price = p["price"]["sellingPrice"]
        u = p["url"]
        print("{:<10} {:<50} {:<10} {}".format(id_, name[:49], price, u[:60]))
    page += 1
This fetches all the products in the category:
...
237119563 Organik Sertifikalı Çamaşır Deterjanı 63 /eya-clean/organik-sertifikali-camasir-deterjani-p-237119563
90066873 Toz Deterjan Sık Yıkananlar 179 /bingo/toz-deterjan-sik-yikananlar-p-90066873
89751820 Sıvı Çamaşır Deterjanı 2 x3L (100 Yıkama) Renkli 144.9 /perwoll/sivi-camasir-deterjani-2-x3l-100-yikama-renkli-siya
112627101 Sıvı Çamaşır Deterjanı (95 Yıkama) 3L Renkli + 2, 144.9 /perwoll/sivi-camasir-deterjani-95-yikama-3l-renkli-2-7l-cic
95398460 Toz Çamaşır Deterjanı Active Beyazlar Ve Renklile 180.99 /omo/toz-camasir-deterjani-active-beyazlar-ve-renkliler-10-k
...
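The core of the loop above is its stop condition: keep incrementing the "pi" page parameter until the API returns an empty products list. That logic can be isolated and checked without hitting the network. Below is a sketch where the page fetcher is injected as a function; the fake pages are invented for demonstration, a real fetcher would call requests.get(api_url, params=payload).json() as shown above:

```python
def collect_products(fetch_page):
    """Paginate until a page yields no products; fetch_page(pi) returns parsed JSON."""
    products = []
    page = 1
    while True:
        data = fetch_page(page)
        batch = data["result"]["products"]
        if not batch:  # empty page: no more results, stop paginating
            break
        products.extend(batch)
        page += 1
    return products

# Fake two-page API for demonstration; pages beyond 2 come back empty.
fake_pages = {
    1: {"result": {"products": [{"id": 1}, {"id": 2}]}},
    2: {"result": {"products": [{"id": 3}]}},
}

def fake_fetch(pi):
    return fake_pages.get(pi, {"result": {"products": []}})

print(collect_products(fake_fetch))  # → [{'id': 1}, {'id': 2}, {'id': 3}]
```

Separating the pagination loop from the HTTP call this way also makes it straightforward to move the same logic into a Scrapy spider, where each page would instead be a yielded scrapy.Request.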
Regarding "python - Scraping data from an infinite-scroll page with scrapy", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/72609636/