python - Using python requests to pretend to be a browser and download a file

Tags: python python-requests

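I am trying to download a file from this link with the python requests library: http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download

Clicking this link gives you a file (nasdaq.csv) only when you use a browser. I used the Firefox Network Monitor (Ctrl-Shift-Q) to capture all the headers Firefox sends. Now I finally get a 200 response from the server, but still no file. The file this script produces contains parts of the Nasdaq website instead of the CSV data. I have looked at similar questions on this site, and nothing leads me to believe this is impossible with the requests library.

Code: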

import requests

url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"

# Fake Firefox headers
headers = {"Host": "www.nasdaq.com",
           "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
           "Accept-Language": "en-US,en;q=0.5",
           "Accept-Encoding": "gzip, deflate",
           "DNT": "1",
           "Cookie": "clientPrefs=||||lightg; userSymbolList=EOD+&DIT; userCookiePref=true; selectedsymbolindustry=EOD,; selectedsymboltype=EOD,EVERGREEN GLOBAL DIVIDEND OPPORTUNITY FUND COMMON SHARES OF BENEFICIAL INTEREST,NYSE; c_enabled$=true",
           "Connection": "keep-alive"}

# Get the list
response = requests.get(url, headers, stream=True)
print(response.status_code)

# Write server response to file
with open("nasdaq.csv", 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)

Best answer

You don't need to supply any headers:

import requests

url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"

response = requests.get(url, stream=True)
print(response.status_code)

# Write server response to file
with open("nasdaq.csv", 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk: # filter out keep-alive new chunks
            f.write(chunk)
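
A possible reason the script in the question returned site content instead of the CSV: the second positional argument of requests.get is params, not headers, so requests.get(url, headers, stream=True) appends the header dictionary to the query string rather than sending it as headers. The sketch below builds both variants without sending them, so the difference can be inspected offline. The shortened headers dictionary is only for illustration, and whether the extra query parameters were what made the server answer with HTML is an assumption.

```python
import requests

url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"

# Shortened header set, for illustration only.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0"}

# Equivalent to requests.get(url, headers): the dict is treated as query parameters.
as_params = requests.Request("GET", url, params=headers).prepare()

# Equivalent to requests.get(url, headers=headers): the dict is sent as HTTP headers.
as_headers = requests.Request("GET", url, headers=headers).prepare()

print(as_params.url)                      # query string now carries User-Agent=...
print(as_headers.url)                     # query string left alone
print(as_headers.headers["User-Agent"])   # header set as intended
```

If custom headers are ever needed, passing them by keyword (headers=headers) avoids this mix-up.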

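You can also just write the content:

import requests

# url is the same as in the previous snippet
# Write server response to file
with open("nasdaq.csv", 'wb') as f:
    f.write(requests.get(url).content)

Or use urllib:

import urllib.request  # Python 3; on Python 2 the call was urllib.urlretrieve

urllib.request.urlretrieve("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download", "nasdaq.csv")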
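
Since the question describes a 200 response that still contained a web page, it may help to check the response before saving it. The following is a minimal sketch, not part of the original answer: download_csv, its session and timeout parameters, and the rule "a Content-Type containing html means the wrong payload" are all assumptions made for illustration.

```python
import requests


def download_csv(url, path, session=None, timeout=30):
    """Stream url to path, refusing error statuses and HTML payloads.

    Hypothetical helper: the Content-Type check is an assumed heuristic.
    """
    http = session if session is not None else requests
    response = http.get(url, stream=True, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page

    content_type = response.headers.get("Content-Type", "")
    if "html" in content_type.lower():
        raise ValueError("expected a file download, got %r" % content_type)

    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
    return path
```

Calling download_csv(url, "nasdaq.csv") with the url from the snippets above would then either write the file or raise, rather than silently saving a web page.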

All of these methods give you a CSV file of 3137 lines:

"Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote",
"TFSC","1347 Capital Corp.","9.79","58230920","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfsc",
"TFSCR","1347 Capital Corp.","0.15","0","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfscr",
"TFSCU","1347 Capital Corp.","10","41800000","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfscu",
"TFSCW","1347 Capital Corp.","0.178","0","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfscw",
"PIH","1347 Property Insurance Holdings, Inc.","7.51","46441171.61","n/a","2014","Finance","Property-Casualty Insurers","http://www.nasdaq.com/symbol/pih",
"FLWS","1-800 FLOWERS.COM, Inc.","7.87","510463090.04","n/a","1999","Consumer Services","Other Specialty Stores","http://www.nasdaq.com/symbol/flws",
"FCTY","1st Century Bancshares, Inc","7.81","80612492.62","n/a","n/a","Finance","Major Banks","http://www.nasdaq.com/symbol/fcty",
"FCCY","1st Constitution Bancorp (NJ)","12.39","93508122.96","n/a","n/a","Finance","Savings Institutions","http://www.nasdaq.com/symbol/fccy",
"SRCE","1st Source Corporation","30.54","796548769.38","n/a","n/a","Finance","Major Banks","http://www.nasdaq.com/symbol/srce",
"VNET","21Vianet Group, Inc.","20.26","1035270865.78","51099253","2011","Technology","Computer Software: Programming, Data Processing","http://www.nasdaq.com/symbol/vnet",
   ...................................
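
To confirm that a saved file really is this CSV and not a web page, the header row can be parsed with the standard csv module. This is a small sketch under stated assumptions: parse_listing and EXPECTED_HEADER are hypothetical names, the expected columns are taken from the sample above, and the sample string reuses two lines of that output.

```python
import csv
import io

# Column names copied from the sample output above.
EXPECTED_HEADER = ["Symbol", "Name", "LastSale", "MarketCap", "ADR TSO",
                   "IPOyear", "Sector", "Industry", "Summary Quote"]


def parse_listing(text):
    """Return the data rows as dicts, or raise ValueError if text is not the expected CSV."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, [])
    # Every line in the sample ends with a trailing comma, which yields an empty last field.
    header = [name for name in header if name]
    if header != EXPECTED_HEADER:
        raise ValueError("unexpected header: %r" % header[:3])
    # zip() stops at the shorter sequence, so the empty trailing field is dropped.
    return [dict(zip(header, row)) for row in reader if row]


sample = (
    '"Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote",\n'
    '"PIH","1347 Property Insurance Holdings, Inc.","7.51","46441171.61","n/a","2014","Finance","Property-Casualty Insurers","http://www.nasdaq.com/symbol/pih",\n'
)
rows = parse_listing(sample)
print(rows[0]["Symbol"], rows[0]["Name"])
```

Feeding it an HTML page raises ValueError, which makes the failure described in the question visible immediately.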

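If for some reason none of these work for you, you may need to upgrade your version of requests.

Regarding "python - Using python requests to pretend to be a browser and download a file", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34243835/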
