Python scraping with requests and BeautifulSoup

I am trying to scrape with Python requests and BeautifulSoup. Specifically, I am crawling an Amazon web page. I can scrape the first page without any problem.

import requests

r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
# do something with the response

But when I try to scrape the second page, whose URL contains "#2":

r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2")

I see that r still holds the same content as page 1, that is, the same result as this request:

r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")

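I did not expect "#2" to cause trouble when requesting the second page. I also googled the problem but could not find a solution. What is the correct way to request a URL that contains a # value, and how can I fix this? Please advise.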

Best answer

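"#2" is a fragment identifier. The client never sends it to the server, so it is not visible on the server side. The HTML returned for "http://someurl.com/page#123" is the same as for "http://someurl.com/page".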
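As a rough illustration of the point above, the standard library's `urllib.parse.urldefrag` splits a URL into the part that is actually requested and the fragment that stays on the client. This is only a sketch using the URLs from the question; it makes no network request, and the variable names are chosen here for illustration.

```python
from urllib.parse import urldefrag

page1 = "http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers"
page2 = page1 + "#2"

# urldefrag returns the URL without its fragment, plus the fragment itself.
url_sent, fragment = urldefrag(page2)

print(url_sent == page1)  # the requested URL is identical to the page 1 URL
print(fragment)           # the "2" only ever exists on the client side
```

Because both URLs reduce to the same request, the server has no way to tell that page 2 was wanted.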

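In the browser you see the second page because the page's JavaScript reads the fragment identifier, makes an AJAX request, and injects the new content into the page. You should find the URL of that AJAX request (for example in the network panel of the browser's developer tools) and request it directly.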


The URL of that request appears to be:

From it, it is easy to see that we only need to change the value of the "pg" parameter to get the other pages.
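A minimal sketch of changing the "pg" parameter might look like the following. The real AJAX URL is not reproduced here, so `AJAX_URL` is a hypothetical placeholder that has to be replaced with the URL observed in the browser; `with_page` is likewise a helper name invented for this example. Only the "pg" parameter name comes from the answer.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical placeholder: substitute the AJAX URL seen in the browser.
AJAX_URL = "http://example.com/bestsellers/ajax?pg=1"


def with_page(url, page):
    """Return url with its "pg" query parameter set to page."""
    parts = urlsplit(url)
    # Keep the other query parameters; repeated keys would collapse to the last value.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["pg"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Build the URLs for the first three pages.
page_urls = [with_page(AJAX_URL, n) for n in range(1, 4)]
for page_url in page_urls:
    print(page_url)
```

Each resulting URL can then be passed to `requests.get` and the response text parsed with BeautifulSoup, the same way as for the first page.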

Regarding Python scraping with requests and BeautifulSoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30435923/
