Python BeautifulSoup - looping over multiple pages

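I'm trying to first get all the links from a page, then get the URL of the "next" button and keep looping until there are no more pages. I've been trying to get a nested loop to do this, but for some reason BeautifulSoup never parses the second page. It only parses the first page and then stops.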

It's hard to explain, but the code here should make it easier to understand what I'm trying to describe :)

# This site holds the first page the loop should start on. From this page I want to reach page 2, 3, etc.
webpage = urlopen('www.first-page-with-urls-and-next-button.com').read()

soup = BeautifulSoup(webpage)

for tag in soup.findAll('a', {"class": "next"}):

    print tag['href']
    print "\n--------------------\n"

    # The next button is a relative URL, so append it to main-url.com
    soup = BeautifulSoup('http://www.main-url.com/' + re.sub(r'\s', '', tag['href']))

    # For some reason this variable only holds the tag['href']
    print soup

    for taggen in soup.findAll('a', {"class": "homepage target-blank"}):
        print tag['href']

        # Read the page found
        sidan = urlopen(taggen['href']).read()

        # Get the title
        Titeln = re.findall(patFinderTitle, sidan)

        print Titeln

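Any ideas? Sorry for the poor English, I hope I won't get hammered for it :) Please ask if I've explained it badly and I'll do my best to explain more. Oh, and I'm new to Python, as of today (as you may have figured out :)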

Best answer

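If you call urlopen on the new URL and pass the resulting file object to BeautifulSoup, I think you'll be all set. BeautifulSoup parses whatever string it is given as markup and does not download anything, which is why soup ends up holding only the URL text. That is: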

webpage = urlopen('http://www.main-url.com/' + re.sub(r'\s', '', tag['href']))
soup = BeautifulSoup(webpage)
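
The fix above handles one hop to the next page. To keep following the "next" button until it disappears, a while loop is probably simpler than nested for loops. The following is only a minimal sketch, assuming Python 3 and the bs4 package (the question itself uses Python 2). The names crawl, fetch, fetch_page and max_pages are made up for illustration; the class names "next" and "homepage target-blank" come from the question.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch(url):
    """Download a page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Follow "next" links from start_url and collect the target links.

    fetch_page and max_pages are illustrative parameters: the first makes
    the download step replaceable, the second guards against endless loops.
    """
    found = []
    seen = set()
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        # Parse the downloaded HTML, not the URL string itself.
        soup = BeautifulSoup(fetch_page(url), "html.parser")

        # Collect the links the question is interested in on this page.
        for link in soup.find_all("a", class_="homepage target-blank"):
            if link.get("href"):
                found.append(urljoin(url, link["href"]))

        # Resolve the relative "next" href against the current page URL.
        next_tag = soup.find("a", class_="next")
        if next_tag and next_tag.get("href"):
            url = urljoin(url, re.sub(r"\s", "", next_tag["href"]))
        else:
            url = None  # no "next" button: this was the last page
    return found
```

Calling crawl('http://www.main-url.com/') would return the collected hrefs as absolute URLs. urljoin is used in place of string concatenation so that relative and absolute hrefs are both handled, and the seen set stops the loop if a "next" link ever points back to a page already visited.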

Source: a similar question about looping over multiple pages with Python BeautifulSoup on Stack Overflow: https://stackoverflow.com/questions/10340290/
