Python: Mechanize randomly stops the program indefinitely

Tags: python mechanize infinite

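I'm writing some code that uses mechanize to access a website, but quite often when I run it, it hangs indefinitely on a line where I call mechanize.ParseResponse. It doesn't raise an error; I have to interrupt it with CTRL+C. I believe I'm passing the correct arguments to the method, so I'm confused about why the program suddenly stops making progress. Any ideas?

As additional background, I'm running on a Mac.

Any help would be appreciated!

Edit: here is my code.

Note: I run python bikes.py, and it occasionally stops at the following line: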

forms = mechanize.ParseResponse(response, backwards_compat=False)

Sometimes it also stops at:

text = response.read()

# bikes.py
import re
import webbrowser
import mechanize
import urllib

brands = ["cannondale", "felt", "fuji", "giant", "specialized", "trek"]
keywords = ["52", "53", "54", "shimano", "sora", "tiagra", "105", "ultegra", \
"road", "allez", "defy"]
avoid = ["bmx", "mountain", "kids", "fixie", "jacket", "clothing", "fixed gear", \
"hybrid", "mtb"]

def openLink(text):
    text = text.lower()
    open = False
    for word in avoid:
        if word in text:
            return False
    for word in keywords:
        if word in text:
            open = True

    return open

def scourPage(text, fileRead, fileWrite):
    links = re.findall(r'class="row".+?href="(.+?)"', text)

    for link in links:
        if "http:" in link:
            url = link
        else:
            url = homePage + link

        page = urllib.urlopen(url)
        pageText = page.read()
        title = re.search(r'"postingtitle">.{0,10}<span.+?>[\s\'"]+(.+?)[\s\'"]{0,10}</h2>', \
        pageText, re.DOTALL)
        body = re.search(r'"postingbody">(.+?)</section>', pageText, re.DOTALL)
        openBody = False
        openTitle = False

        if body != None:
            body = body.group(1)
            openBody = openLink(body)

        if title != None:
            title = title.group(1)
            openTitle = openLink(title)

        if (openTitle and openBody) and (url not in fileRead) and (title not in fileRead):
            fileWrite.write(title + "\n" + url + "\n")

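    # Close only after every link has been processed; closing inside the loop
    # would make the next write fail on a closed file.
    fileWrite.close()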

homePage = "http://sfbay.craigslist.org"
request = mechanize.Request(homePage)
response = mechanize.urlopen(request)
forms = mechanize.ParseResponse(response, backwards_compat=False)
form = forms[0]

request = form.click()
response = mechanize.urlopen(request)
emptySearch = response.geturl()
request = mechanize.Request(emptySearch)
response = mechanize.urlopen(request)
forms = mechanize.ParseResponse(response, backwards_compat=False)
form = forms[0]

form["catAbb"] = ["bik"]
form["maxAsk"] = "500"
form.find_control("hasPic").items[0].selected = True

for brand in brands:
    form["query"] = brand

    request = form.click()
    response = mechanize.urlopen(request)
    text = response.read()

    fileR = open('bikes.txt', 'r').read()
    fileA = open('bikes.txt', 'a')

    scourPage(text, fileR, fileA)

    fileA.close()

    next = re.findall(r'class="nplink next".{0,50}<a href=\'(.+?)\'>', text, re.DOTALL)

    while len(next) != 0:
        text = urllib.urlopen(next[0]).read()

        fileR = open('bikes.txt', 'r').read()
        fileA = open('bikes.txt', 'a')

        scourPage(text, fileR, fileA)

        fileA.close()

        next = re.findall(r'class="nplink next".{0,50}<a href=\'(.+?)\'>', text, re.DOTALL)

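This code combs through Craigslist ads and tries to filter out the ones I don't want. In this case, I'm looking for a road bike and trying to avoid mountain bikes and other unrelated items.

Update:

After waiting a long time, I interrupted the run from the keyboard again; it had stopped at the line forms = mechanize.ParseResponse(response, backwards_compat=False). I tried running it again and got this error: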

Traceback (most recent call last):
  File "bikes.py", line 97, in <module>
    forms = mechanize.ParseResponse(response, backwards_compat=False)
  File "build/bdist.macosx-10.8-intel/egg/mechanize/_form.py", line 945, in ParseResponse
  File "build/bdist.macosx-10.8-intel/egg/mechanize/_form.py", line 981, in _ParseFileEx
  File "build/bdist.macosx-10.8-intel/egg/mechanize/_form.py", line 758, in feed
  File "build/bdist.macosx-10.8-intel/egg/mechanize/_sgmllib_copy.py", line 110, in feed
  File "build/bdist.macosx-10.8-intel/egg/mechanize/_sgmllib_copy.py", line 192, in goahead
  File "build/bdist.macosx-10.8-intel/egg/mechanize/_form.py", line 654, in handle_charref
  File "build/bdist.macosx-10.8-intel/egg/mechanize/_form.py", line 149, in unescape_charref
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Best answer

Your while loop may be looping forever, which would explain this behavior. Are you sure it isn't?
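One way to rule the loop in or out is to make it impossible for it to run forever. The sketch below is only an illustration, not the asker's code: `crawl_pages`, `fetch`, and `max_pages` are hypothetical names, and the in-memory `pages` dictionary stands in for real HTTP requests. It reuses the "next" link pattern from the question, and stops as soon as a URL repeats or a page limit is reached.

```python
import re

# Same pattern the question uses to find the "next page" link.
NEXT_LINK = re.compile(r'class="nplink next".{0,50}<a href=\'(.+?)\'>', re.DOTALL)


def crawl_pages(first_url, fetch, max_pages=50):
    """Follow "next" links, stopping on a repeated URL or after max_pages.

    fetch is a hypothetical callable that takes a URL and returns page text.
    Returns the list of URLs visited, in order.
    """
    visited = []
    seen = set()
    url = first_url
    while url is not None and url not in seen and len(visited) < max_pages:
        seen.add(url)
        visited.append(url)
        text = fetch(url)
        match = NEXT_LINK.search(text)
        # No "next" link means this was the last page.
        url = match.group(1) if match else None
    return visited


# Tiny in-memory "site" where page b links back to page a.
pages = {
    "a": "<span class=\"nplink next\"> <a href='b'>next</a></span>",
    "b": "<span class=\"nplink next\"> <a href='a'>next</a></span>",
}
print(crawl_pages("a", pages.__getitem__))
```

If the real crawl ends because a URL repeated or the page limit was hit, the loop was the problem; if the program still stalls with this guard in place, the cause lies elsewhere.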

The runtime error you get when you CTRL-C your code doesn't necessarily mean that the code is broken.
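Since the question also reports stalls at response.read(), another possibility worth checking is a network read that never returns. A minimal sketch, assuming the stall is at the socket level: setting a process-wide default timeout with `socket.setdefaulttimeout` turns a silent hang into an exception. The 30-second value and the `read_or_none` helper are illustrative assumptions, and the exact exception type that reaches the caller through mechanize or urllib may differ from `socket.timeout`.

```python
import socket

# 30 seconds is an arbitrary illustrative value, not taken from the question.
TIMEOUT_SECONDS = 30

# Applies to sockets created after this call that do not set their own timeout.
socket.setdefaulttimeout(TIMEOUT_SECONDS)


def read_or_none(response):
    """Return response.read(), or None if the read times out.

    response is any object with a read() method, such as the value
    returned by a urlopen call.
    """
    try:
        return response.read()
    except socket.timeout:
        return None
```

With this in place, a stalled connection should surface as a timeout after the chosen delay instead of blocking until CTRL+C, which helps separate a network stall from a loop that never ends.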

Regarding "Python: Mechanize randomly stops the program indefinitely", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/17894711/
