python - Downloading files in a loop with Python mechanize

Tags: python loops download beautifulsoup mechanize

I'm having trouble downloading multiple files in a loop with Python mechanize. I'm also using Beautiful Soup 4. Neither package's documentation seems to have an answer.

Here is my code; feel free to skip to the actual loop. I've included everything for reference:

import mechanize, cookielib, os, time
from bs4 import BeautifulSoup


fcList = ['abandoned mine land inventory points', 'abandoned mine land inventory polygons', \
          'abandoned mine land inventory sites', 'coal mining operations', 'coal pillar location-mining', \
          'industrial mineral mining operations', 'longwall mining panels', 'mine drainage treatment/land recycling project locations', \
          'mined out areas', 'residual waste operations', 'underground mining permit']

dlLink = 'FTP Download'
dloadPath = 'C:\\Users\\SomeGuy\\Downloads'

# Browser
br = mechanize.Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Select the first (index zero) form
br.select_form(nr=0)

# Input form data
br.form['Keyword']='mining'
br.submit()
html = br.response().read()

# Pass html to beautiful soup for parse
soup = BeautifulSoup(html)
htmlinks = soup.findAll("a")

# Find links with desired text
for htmlink in htmlinks:
    string = str(htmlink.string)
    if string.lower() in fcList:
        print "Matched link!", string + ". attempting download...\n"
        try:
            req = br.click_link(text = string)
            br.open(req)
            print "URL: " + str(br.geturl)
            html = br.response().read()
            soup = BeautifulSoup(html)
            the_tag = soup.find('a', text=dlLink)
            fileURL = the_tag.get('href')
            print fileURL
            # attempt download
            fnam = string.replace(" ", "_")
            fnam = fnam.replace("/", "_")
            f = br.retrieve(fileURL, os.path.join(dloadPath, fnam + ".zip"))
            print f + "\n"
            br.back()
        except:
            print "An unknown error occurred."

Output:
>>> 
Matched link! Abandoned Mine Land Inventory Points. attempting download...

URL: <bound method Browser.geturl of <mechanize._mechanize.Browser instance at 0x02D9D7B0>>
http://www.pasda.psu.edu/data/dep/AMLInventoryPoints2013_04.zip
An unknown error occurred.
Matched link! Abandoned Mine Land Inventory Polygons. attempting download...

An unknown error occurred.
Matched link! Abandoned Mine Land Inventory Sites. attempting download...

An unknown error occurred.
Matched link! Coal Mining Operations. attempting download...

An unknown error occurred.
Matched link! Coal Pillar Location-Mining. attempting download...

An unknown error occurred.
Matched link! Industrial Mineral Mining Operations. attempting download...

An unknown error occurred.
Matched link! Longwall Mining Panels. attempting download...

An unknown error occurred.
Matched link! Mine Drainage Treatment/Land Recycling Project Locations. attempting download...

An unknown error occurred.
Matched link! Mined Out Areas. attempting download...

An unknown error occurred.
Matched link! Residual Waste Operations. attempting download...

An unknown error occurred.
Matched link! Underground Mining Permit. attempting download...

An unknown error occurred.
>>> 

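I suspect the problem is the lack of any wait time between downloads. Whichever file comes first in the loop downloads successfully, no matter which one it is. Or it could be some other mistake I'm not aware of: I only installed mechanize and beautifulsoup yesterday!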

Best answer

Try this:

f = br.retrieve(fileURL, os.path.join(dloadPath, fnam + ".zip"))[0]  
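The likely reason the index matters: as far as I know, mechanize's Browser.retrieve follows the urllib.urlretrieve convention and returns a (filename, headers) tuple rather than a string. If so, print f + "\n" in the question raises a TypeError after the first file has already been saved, the bare except reports it as an unknown error, and br.back() is skipped, which would leave the browser on the detail page so that later click_link calls cannot find their links. The sketch below reproduces only the tuple problem; fake_retrieve and the example.invalid URL are hypothetical stand-ins, and nothing touches the network.

```python
def fake_retrieve(url, filename):
    """Hypothetical stand-in that mimics a (filename, headers) return value."""
    headers = {"Content-Type": "application/zip"}
    return filename, headers


result = fake_retrieve("http://example.invalid/data.zip", "data.zip")

try:
    # Same operation as in the question: tuple + str is not allowed.
    message = result + "\n"
except TypeError as exc:
    error_text = str(exc)
    print("Concatenation failed: %s" % error_text)

# Taking element 0 yields the saved file path, which is a plain string.
saved_path = fake_retrieve("http://example.invalid/data.zip", "data.zip")[0]
message = saved_path + "\n"
print(message)
```

With the index in place, the concatenation works and the code after it (including the step back to the results page) gets a chance to run.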

If that doesn't work, remove the try/except and post the actual error you get.
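An alternative to removing the try/except entirely is to keep it but record the full traceback, so the loop carries on and the real error is still visible. The sketch below is only an illustration under assumptions: download_one is a hypothetical stand-in for the per-link work, and the visited list marks the spot where a cleanup step such as br.back() might go.

```python
import traceback


def download_one(name):
    """Hypothetical stand-in for the per-link work; fails for one name."""
    if name == "bad":
        raise ValueError("simulated failure for %s" % name)
    return name + ".zip"


visited = []   # records each simulated cleanup step
results = {}   # name -> saved file name
errors = {}    # name -> formatted traceback text

for name in ["first", "bad", "last"]:
    try:
        results[name] = download_one(name)
    except Exception:
        # Keep the full traceback instead of a generic message.
        errors[name] = traceback.format_exc()
        print(errors[name])
    finally:
        # Runs after both success and failure.
        visited.append(name)
```

Whether br.back() belongs unconditionally in a finally block depends on whether the browser had actually navigated before the failure; if click_link itself raised, stepping back would go one page too far.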

A similar question about downloading files in a loop with Python mechanize is on Stack Overflow: https://stackoverflow.com/questions/16594607/
