python - Problem downloading multiple PDFs

Tags: python pdf web-scraping pull-request

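After running the code below, I cannot open the downloaded PDFs. The script runs without errors, but the downloaded PDF files are corrupted.

The error message on my computer is: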

Unable to open file. it may be damaged or in a format Preview doesn't recognize.

Why are they corrupted, and how can I fix this?

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://github.com/sonhuytran/MIT8.01SC.2010F/tree/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual"

# If there is no such folder, the script will create one automatically
folder_location = r'/Users/rahelmizrahi/Desktop/ Physics_Solutions'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)

Best answer

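The problem is that you are requesting GitHub 'blob' links when you need the 'raw' links. A 'blob' URL returns the HTML page GitHub uses to display the file, not the PDF itself, so each saved file is HTML with a .pdf extension. The links you are requesting look like this: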

'/sonhuytran/MIT8.01SC.2010F/blob/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf'

But you want:

'/sonhuytran/MIT8.01SC.2010F/raw/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf'
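One way to confirm this diagnosis is to look at the first bytes of a saved file: a PDF begins with the signature `%PDF-`, whereas a 'blob' response begins with HTML markup. The following is only a minimal sketch; the helper name `looks_like_pdf` and the sample byte strings are made up for illustration and are not part of the original answer.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF file signature."""
    # A PDF file starts with "%PDF-" followed by a version number.
    return data.startswith(b"%PDF-")


# Illustrative samples (not real downloads): an HTML page and a PDF header.
html_sample = b"<!DOCTYPE html><html><head></head></html>"
pdf_sample = b"%PDF-1.4\n"

print(looks_like_pdf(html_sample))
print(looks_like_pdf(pdf_sample))
```

Applying the same check to `response.content` before writing it to disk would catch the problem at download time instead of when the file is opened.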

So you just need to adjust the links accordingly. The full code is as follows:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://github.com/sonhuytran/MIT8.01SC.2010F/tree/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual"

# If there is no such folder, the script will create one automatically
folder_location = r'/Users/rahelmizrahi/Desktop/Physics_Solutions'
os.makedirs(folder_location, exist_ok=True)

response = requests.get(url)
response.raise_for_status()  # stop early if the listing page could not be fetched
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # 'blob' links return an HTML viewer page; 'raw' links return the file itself.
    # Replace only the first '/blob/' segment so file names are left untouched.
    pdf_link = link['href'].replace('/blob/', '/raw/', 1)
    pdf_file = requests.get(urljoin(url, pdf_link))
    pdf_file.raise_for_status()  # do not save an error page as a PDF
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(pdf_file.content)
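Replacing every occurrence of the substring 'blob' would also rewrite a file or folder name that happens to contain that word. As a hedged sketch, the conversion can be isolated in a small helper that touches only the first '/blob/' path segment. The helper name `blob_to_raw`, the constant `GITHUB`, and the example path ending in `blob_notes.pdf` are assumptions made for illustration; they do not come from the original answer or the repository.

```python
from urllib.parse import urljoin

GITHUB = "https://github.com"  # assumed base for the site-relative hrefs


def blob_to_raw(href: str) -> str:
    """Turn a site-relative GitHub 'blob' path into an absolute 'raw' URL."""
    # Replace only the first '/blob/' segment, so a name that contains
    # the word 'blob' later in the path is not changed.
    return urljoin(GITHUB, href.replace("/blob/", "/raw/", 1))


# Hypothetical path whose file name contains 'blob'.
example = "/sonhuytran/MIT8.01SC.2010F/blob/master/References/blob_notes.pdf"
print(blob_to_raw(example))
# prints https://github.com/sonhuytran/MIT8.01SC.2010F/raw/master/References/blob_notes.pdf
```

Keeping the conversion in one function also makes it easy to test without making any network requests.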

Regarding "python - Problem downloading multiple PDFs", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58542013/
