python - How to check whether a URL is downloadable with requests

Tags: python python-3.x url download python-requests

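I am building a downloader application with tkinter and requests, and I recently found a bug in my program. Basically, I want the program to check whether a given URL is downloadable before it starts downloading the URL's content. I used to do this by fetching the URL's headers and checking whether Content-Length is present. That works for some URLs (for example https://www.google.com), but for others (for example a link to a YouTube video) it does not work and it crashes my program. I saw someone on Stack Overflow say that I could check for "attachment" in the Content-Disposition header, but that did not work for me: it returned the same result for downloadable and non-downloadable URLs. What is the best way to do this?
Code mentioned in another Stack Overflow question, which I tried but which did not work: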

import requests
url = 'https://www.google.com'
headers=requests.head(url).headers
downloadable = 'attachment' in headers.get('Content-Disposition', '')
My previous code:
headers = requests.head(url, headers={'accept-encoding': ''}).headers
try:
    print(type(headers['Content-Length']))
    file_size = int(headers['Content-Length'])
except KeyError:
    # Just a class that I defined to raise an exception if the URL was not downloadable
    raise NotDownloadable()
Update: the URL:
https://aspb1.cdn.asset.aparat.com/aparat-video/a5e07b7f62ffaad0c104763c23d7393215613675-360p.mp4?wmsAuthSign=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0b2tlbiI6IjUzMGU0Mzc3ZjRlZjVlYWU0OTFkMzdiOTZkODgwNGQ2IiwiZXhwIjoxNjExMzMzMDQxLCJpc3MiOiJTYWJhIElkZWEgR1NJRyJ9.FjMi_dkdLCUkt25dfGqPLcehpaC32dBBUNDC9cLNiu0
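This is the URL I used for testing. If you open it, it takes you directly to a video that can be downloaded, but checking Content-Disposition returns None, as it does for most of the downloadable and non-downloadable URLs I tried.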

Best answer

According to Request for Comments (RFC) 6266, the Content-Disposition header field:

is not part of the HTTP standard, but since it is widely implemented, we are documenting its use and risks for implementers.


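Because the Content-Disposition header is not always present, you can use an approach that looks not only for that header but also at the individual file types in the Content-Type header.
Here is a list of Content-Types.
The code further down checks for the Content-Disposition header, and it also checks the Content-Type header for some types that are commonly downloadable.
I also added a check for Content-Length, because it can be useful when chunking the file being downloaded.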
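Since Content-Length gives the total size in bytes, one way to chunk a download is to split that size into byte ranges and request each range separately. The sketch below is only an illustration: `byte_ranges` and `range_header` are hypothetical helpers, not part of requests, and the approach assumes the server honours `Range` requests, which none of the headers discussed here guarantee.

```python
def byte_ranges(content_length, chunk_size):
    """Yield inclusive (start, end) byte offsets covering content_length bytes."""
    if content_length < 0 or chunk_size <= 0:
        raise ValueError('content_length must be >= 0 and chunk_size > 0')
    for start in range(0, content_length, chunk_size):
        end = min(start + chunk_size, content_length) - 1
        yield start, end


def range_header(start, end):
    """Build the Range request header for one chunk (hypothetical helper)."""
    return {'Range': f'bytes={start}-{end}'}


# 10 bytes in chunks of 4 gives three ranges, the last one shorter.
for start, end in byte_ranges(10, 4):
    print(range_header(start, end))
```

Each dictionary could be passed as `headers=` to `requests.get`; a `206 Partial Content` status in the response indicates that the server honoured the range.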
Have you considered creating download sub-folders?
  • download_folder/text_files
  • download_folder/pdf_files

or

  • download_folder/01242021/text_files
  • download_folder/01242021/pdf_files
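
The folder layouts above could be derived from the Content-Type header. This is a minimal sketch under assumptions: the function name `target_folder`, the `downloads` root, the folder names, the `other_files` fallback, and the MMDDYYYY date format (chosen to match `01242021` above) are all illustrative choices, not anything requests provides.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical mapping from media type to sub-folder name.
FOLDER_BY_TYPE = {
    'text/plain': 'text_files',
    'application/rtf': 'text_files',
    'application/pdf': 'pdf_files',
}


def target_folder(content_type, root='downloads', day=None):
    """Return the sub-folder for a Content-Type value, optionally dated."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    media_type = content_type.split(';')[0].strip().lower()
    folder = FOLDER_BY_TYPE.get(media_type, 'other_files')
    path = PurePosixPath(root)
    if day is not None:
        path = path / day.strftime('%m%d%Y')
    return path / folder


print(target_folder('application/pdf'))
print(target_folder('text/plain; charset=utf-8', day=date(2021, 1, 24)))
```

`PurePosixPath` is used only so that the printed paths look the same on every platform; `pathlib.Path` would be the natural choice when actually creating the folders.
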
    import requests
    
    urls = ['https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2019-financial'
            '-year-provisional/Download-data/annual-enterprise-survey-2019-financial-year-provisional-csv.csv',
            'http://www.pdf995.com/samples/pdf.pdf', 'https://jeroen.github.io/files/sample.rtf',
            'https://www.cnn.com/2021/01/23/opinions/biden-climate-change-gillette-wyoming-coal-sutter/index.html',
            'https://www.google.com',
            'https://thumbs-prod.si-cdn.com/d4e3zqOM5KUq8m0m-AFVxuqa5ZM=/800x600/filters:no_upscale():focal(554x699:555x700)/https://public-media.si-cdn.com/filer/a4/04/a404c799-7118-459a-8de4-89e4a44b124f/img_1317.jpg',
            'https://www.blank.org']
    
    for url in urls:
        headers = requests.head(url).headers
        Content_Length = [value for key, value in headers.items() if key == 'Content-Length']
        if len(Content_Length) > 0:
            Content_Size = ''.join(map(str, Content_Length))
        else:
            Content_Size = 'The content size was not available.'
    
    
        # The header name is hyphenated; lookups on requests' headers are case-insensitive.
        Content_Disposition_Exists = 'Content-Disposition' in headers
        if Content_Disposition_Exists is True:
            # do something with the file
            pass
        else:
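            # Drop parameters such as "; charset=utf-8" so the exact matches below still work.
            Content_Type = {value.split(';')[0].strip().lower()
                            for key, value in headers.items() if key.lower() == 'content-type'}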
    
            compression_formats = ['application/gzip', 'application/vnd.rar', 'application/x-7z-compressed',
                                   'application/zip', 'application/x-tar']
            compressed_file = bool([file_format for file_format in compression_formats if file_format in Content_Type])
    
            image_formats = ['image/bmp', 'image/gif', 'image/jpeg', 'image/png', 'image/svg+xml', 'image/tiff',
                             'image/webp']
            image_file = bool([file_format for file_format in image_formats if file_format in Content_Type])
    
            text_formats = ['application/rtf', 'text/plain']
            text_file = bool([file_format for file_format in text_formats if file_format in Content_Type])
    
            if compressed_file is True:
                print('Compressed file')
                print(Content_Size)
            elif image_file is True:
                print('Image file')
                print(Content_Size)
            elif text_file is True:
                print('Text file')
                print(Content_Size)
            elif 'application/pdf' in Content_Type:
                print('PDF file')
                print(Content_Size)
            elif 'text/csv' in Content_Type:
                print('CSV File')
                print(Content_Size)
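
The checks above can also be written as a pure function over a headers mapping, which makes them testable without network access. This is a sketch, not the answer's exact logic: `classify`, `DOWNLOADABLE_TYPES`, and the category names are hypothetical, and the type table is deliberately shortened. Header names are compared case-insensitively, and Content-Type parameters such as `; charset=UTF-8` are stripped before matching, since an exact comparison against `text/csv` would otherwise miss `text/csv; charset=utf-8`.

```python
# Hypothetical, shortened table of media types treated as downloadable.
DOWNLOADABLE_TYPES = {
    'application/gzip': 'compressed', 'application/zip': 'compressed',
    'application/x-tar': 'compressed', 'image/jpeg': 'image',
    'image/png': 'image', 'application/rtf': 'text', 'text/plain': 'text',
    'application/pdf': 'pdf', 'text/csv': 'csv',
}


def classify(headers):
    """Return (category, size) for a mapping of response headers.

    category is None when nothing suggests a downloadable file;
    size is None when Content-Length is missing or not a number.
    """
    lowered = {key.lower(): value for key, value in headers.items()}

    length = lowered.get('content-length', '')
    size = int(length) if length.isdigit() else None

    if 'attachment' in lowered.get('content-disposition', '').lower():
        return 'attachment', size

    media_type = lowered.get('content-type', '').split(';')[0].strip().lower()
    return DOWNLOADABLE_TYPES.get(media_type), size


print(classify({'Content-Type': 'text/csv; charset=utf-8', 'content-length': '253178'}))
print(classify({'Content-Type': 'text/html; charset=UTF-8'}))
```

With requests, one way to call it would be `classify(requests.head(url, allow_redirects=True).headers)`; note that `requests.head` does not follow redirects unless `allow_redirects=True` is passed.
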
    

Here is another version that uses functions:
    import requests
    
    urls = ['https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2019-financial'
            '-year-provisional/Download-data/annual-enterprise-survey-2019-financial-year-provisional-csv.csv',
            'http://www.pdf995.com/samples/pdf.pdf', 'https://jeroen.github.io/files/sample.rtf',
            'https://www.cnn.com/2021/01/23/opinions/biden-climate-change-gillette-wyoming-coal-sutter/index.html',
            'https://www.google.com',
            'https://thumbs-prod.si-cdn.com/d4e3zqOM5KUq8m0m-AFVxuqa5ZM=/800x600/filters:no_upscale():focal(554x699:555x700)/https://public-media.si-cdn.com/filer/a4/04/a404c799-7118-459a-8de4-89e4a44b124f/img_1317.jpg',
            'https://www.blank.org']
    
    
    def query_headers(webpage):
        response = requests.get(webpage, stream=True)
        headers = response.headers
        file_name = webpage.rsplit('/', 1)[-1]
    
        # The header name is hyphenated; lookups on requests' headers are case-insensitive.
        Content_Disposition_Exists = 'Content-Disposition' in headers
        if Content_Disposition_Exists is True:
            # do something with the file
            pass
        else:
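            # Drop parameters such as "; charset=utf-8" so the exact matches below still work.
            Content_Type = {value.split(';')[0].strip().lower()
                            for key, value in headers.items() if key.lower() == 'content-type'}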
    
            compression_formats = ['application/gzip', 'application/vnd.rar', 'application/x-7z-compressed',
                                   'application/zip', 'application/x-tar']
            compressed_file = bool([file_format for file_format in compression_formats if file_format in Content_Type])
    
            image_formats = ['image/bmp', 'image/gif', 'image/jpeg', 'image/png', 'image/svg+xml', 'image/tiff',
                             'image/webp']
            image_file = bool([file_format for file_format in image_formats if file_format in Content_Type])
    
            text_formats = ['application/rtf', 'text/plain']
            text_file = bool([file_format for file_format in text_formats if file_format in Content_Type])
            nl = '\n'
    
            if compressed_file is True:
                download_file(file_name, response)
                content_size = get_content_size(headers)
                return f'File Information: file_type: Compressed file, File size: {content_size}, File name: {file_name}'
            elif image_file is True:
                download_file(file_name, response)
                content_size = get_content_size(headers)
                return f'File Information: file_type: Image file, File size: {content_size}, File name: {file_name}'
            elif text_file is True:
                download_file(file_name, response)
                content_size = get_content_size(headers)
                return f'File Information: file_type: Text file, File size: {content_size}, File name: {file_name}'
            elif 'application/pdf' in Content_Type:
                download_file(file_name, response)
                content_size = get_content_size(headers)
                return f'File Information: file_type: PDF file, File size: {content_size}, File name: {file_name}'
            elif 'text/csv' in Content_Type:
                download_file(file_name, response)
                content_size = get_content_size(headers)
                return f'File Information: file_type: CSV file, File size: {content_size}, File name: {file_name}'
            elif 'text/html' in "".join(str(Content_Type)):
                download_file(file_name, response)
                content_size = get_content_size(headers)
                return f'File Information: file_type: HTML file, File size: {content_size}, File name: {file_name}'
            else:
                content_size = get_content_size(headers)
                return f'File Information: file_type:  no file type found, File size: {content_size}, File name: {file_name}'
    
    
    def get_content_size(headers):
        Content_Length = [value for key, value in headers.items() if key == 'Content-Length']
        if len(Content_Length) > 0:
            Content_Size = ''.join(map(str, Content_Length))
            return int(Content_Size)
        else:
            return 0
    
    
    def download_file(filename, file_stream):
        with open(f'{filename}', 'wb') as f:
            f.write(file_stream.content)
    
    
    for url in urls:
        download_info = query_headers(url)
        print(download_info)
    # Output:
    # File Information: file_type: CSV file, File size: 253178, File name: annual-enterprise-survey-2019-financial-year-provisional-csv.csv
    # File Information: file_type: PDF file, File size: 433994, File name: pdf.pdf
    # File Information: file_type: Text file, File size: 9636, File name: sample.rtf
    # File Information: file_type: HTML file, File size: 185243, File name: index.html
    # File Information: file_type: HTML file, File size: 0, File name: www.google.com
    # File Information: file_type: Image file, File size: 78868, File name: img_1317.jpg
    # File Information: file_type: HTML file, File size: 170, File name: www.blank.org
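
`download_file` above reads `file_stream.content`, which loads the whole body into memory even though the request was made with `stream=True`. A possible alternative is to write the body chunk by chunk with `iter_content`. In this sketch the name `save_stream` and the 8192-byte chunk size are assumptions, and the function accepts any object that exposes `iter_content(chunk_size=...)` the way `requests.Response` does.

```python
def save_stream(response, path, chunk_size=8192):
    """Write a streamed response body to path and return the bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written
```

In the version above it could be called as `save_stream(response, file_name)` in place of `download_file(file_name, response)`; the returned byte count can also stand in for the file size when Content-Length is missing.
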
    
    

Regarding "python - How to check whether a URL is downloadable with requests", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/65797228/
