python - Trouble following links with a web crawler

Tags: python beautifulsoup

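I am trying to create a web crawler that parses all the HTML on a page, grabs the link at a specified position (given via raw_input), follows that link, and then repeats the process a specified number of times (again given via raw_input). I am able to grab the first link and print it successfully. However, I am having trouble "looping" the whole process, and I usually end up grabbing the wrong link. This is the first link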

https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html

(Full disclosure: this question relates to an assignment for a Coursera class.)

Here is my code

import urllib
from BeautifulSoup import *
url = raw_input('Enter - ')
rpt=raw_input('Enter Position')
rpt=int(rpt)
cnt=raw_input('Enter Count')
cnt=int(cnt)
count=0
counts=0
tags=list()
soup=None
while x==0:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
# Retrieve all of the anchor tags
    tags=soup.findAll('a')
    for tag in tags:
        url= tag.get('href')
        count=count + 1
        if count== rpt:
            break
counts=counts + 1
if counts==cnt:        
    x==1       
else: continue
print  url

Best answer

Based on DJanssens's reply, I found the solution;
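For reference, the code in the question has several visible problems: `x` is never defined before `while x==0`, `x==1` is a comparison rather than an assignment, `count` is never reset when a new page is loaded (which would explain picking the wrong link on later pages), and the `counts` bookkeeping sits outside the loop body. Below is a minimal sketch of one way to structure the loop, not necessarily the code this answer arrived at. It assumes Python 3 and, so that it runs without extra packages, swaps BeautifulSoup for the standard library's `html.parser`; with BeautifulSoup, `soup.findAll('a')` would play the same role as the collector class. The names `AnchorCollector`, `follow_links`, and `fetch_page` are illustrative only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def follow_links(url, position, count, fetch_page=fetch):
    """Follow the link at 1-based `position`, `count` times in a row.

    `fetch_page` is a hypothetical hook so the loop can be tried
    without network access; by default it downloads the page.
    """
    for _ in range(count):
        # A fresh collector per page means the link index restarts at 1,
        # which is what the un-reset `count` in the question was missing.
        parser = AnchorCollector()
        parser.feed(fetch_page(url))
        if position > len(parser.hrefs):
            raise ValueError(
                "page %s has only %d links" % (url, len(parser.hrefs))
            )
        # Resolve relative hrefs against the page they came from.
        url = urljoin(url, parser.hrefs[position - 1])
        print(url)
    return url
```

Calling `follow_links(url, rpt, cnt)` with the three values read from the prompts in the question prints each link as it is followed and returns the last one. Using a `for` loop over `range(count)` removes the need for the `x` flag and the separate `counts` counter altogether.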