python - Sending form requests to SciHub using urllib/urllib2 in Python no longer works

Tags: python pdf urllib2 urllib

I have a small script that I'm quite happy with: it reads one or more references from the clipboard, fetches information about the academic papers from Google Scholar, and then feeds that into SciHub to get the PDFs. For some reason it has stopped working, and I've spent ages trying to figure out why.

Testing shows that the Google (scholarly.py) part of the program works fine; the problem is in the SciHub part.

Any ideas?

Here is a sample reference: Appleyard, S.J., Angeloni, J. and Watkins, R. (2006) Arsenic-rich groundwater in an urban area experiencing drought and increasing population density (Perth, Australia). Applied Geochemistry 21(1), 83-97.
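Stripped to its essentials, the SciHub step is just a form POST; the sketch below shows that call in isolation (the journal URL here is a placeholder, in the real script it comes from Google Scholar). This is the request that now fails:

    import urllib
    import urllib2

    site = 'http://sci-hub.cc/'
    url = 'http://example.com/some-paper'  # placeholder for the URL returned by scholarly
    data = urllib.urlencode({'request': url})  # 'request' is the form field SciHub reads
    results = urllib2.urlopen(site, data)  # this call now fails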

'''Program to automatically find and download items from a bibliography or references list.
This program uses the 'scihub' website to obtain the full-text paper where
available; if no entry is found the paper is ignored and the failed downloads
are listed at the end'''

import scholarly
import win32clipboard
import urllib
import urllib2
import webbrowser
import re

'''Select and then copy the bibliography entries you want to download the
papers for, python reads the clipboard'''
win32clipboard.OpenClipboard()
c = win32clipboard.GetClipboardData()
win32clipboard.EmptyClipboard()
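# Note: the clipboard is emptied here and re-filled further down via
# SetClipboardText() with the cleaned-up text, replacing the original selection.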

'''Cleans up the text: removes line breaks and double spaces etc.'''
c = c.replace('\n', ' ')
c = c.replace('\r', ' ')
while c.find('  ') != -1:
    c = c.replace('  ', ' ')
win32clipboard.SetClipboardText(c)
win32clipboard.CloseClipboard()
print "Working..."

'''bit of regex to extract the title of the paper.
IMPORTANT: the bibliography has to be in
author-date format or you will need to revise this;
at the moment it looks for a year in brackets, then copies all the text until it
reaches a full stop, assuming that this is the paper title. If it is not, it
will either fail or will be using inappropriate search terms.'''


paper_info= re.findall(r"(\d{4}[a-z]*)([). ]+)([ \"])+([\w\s_():,-]*)(.)",c)
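# For the sample reference above, each tuple returned by findall holds:
#   i[0] -> '2006'  (the year, plus an optional letter for entries like 2006a)
#   i[1] -> ')'     (the bracket and/or spaces after the year)
#   i[2] -> ' '     (the space or opening quote before the title)
#   i[3] -> 'Arsenic-rich groundwater in an urban area ...'  (the title text)
#   i[4] -> '.'     (the full stop that ends the title)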
print "Analysing titles"
print "The following titles found:"
print "*************************"
list_of_titles= list()
for i in paper_info:
    print '%s...' % (i[3][:50])
    Paper_title=str(i[3])
    list_of_titles.append(Paper_title)

failed=list()
for title in list_of_titles:
    try:
        search_query = scholarly.search_pubs_query(title)

        info= (next(search_query))

        print "Querying Google Scholar"
        print "**********************"
        print "Looking up paper title:"
        print "**********************"
        print title
        print "**********************"

        url=info.bib['url']
        print "Journal URL found "
        print url
        #url=next(search_query)
        print "Sending URL: ", url


        site='http://sci-hub.cc/'
        data = urllib.urlencode({'request': url})

        print data
        results = urllib2.urlopen(site, data) #this is where it fails


        with open("results.html", "w") as f:
            f.write(results.read())

        webbrowser.open_new("results.html")


    except:
        print "**********************"
        print "No valid journal found for:"
        print title
        print "**********************"
        print "Continuing..."
        failed.append(title)
    continue

if len(failed)==0:
    print 'Complete'

else:
    print '*************************************'
    print 'The following titles did not download: '
    print '*************************************'
    print failed
    print "Please check that these are valid entries"

Best Answer

It works now: I added a 'User-Agent' header and rejigged the urllib calls. It's also much clearer now what the code is actually doing. It was a process of trial and error with lots of different snippets collected from around the web. Hopefully my boss doesn't ask what I achieved today. Someone should create a forum where people can get answers to coding questions...
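The change that actually mattered is isolated below (a minimal sketch; sci-hub.cc was the mirror that worked for me, and the User-Agent string can be any browser-like value):

    import urllib
    import urllib2

    site = 'http://sci-hub.cc/'
    url = 'http://example.com/some-paper'  # placeholder for the URL returned by scholarly

    # Build the Request explicitly so a browser-like User-Agent can be attached;
    # a plain urllib2.urlopen(site, data) identifies itself as Python-urllib,
    # which the site appears to reject.
    r = urllib2.Request(url=site)
    r.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64)')
    r.add_data(urllib.urlencode({'request': url}))  # attaching data makes it a POST
    res = urllib2.urlopen(r)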

    '''Program to automatically find and download items from a bibliography or references list. Here are some journal papers in bibliographic format; just copy the text to the clipboard and run the script.

    Ghaffour, N., T. M. Missimer and G. L. Amy (2013). "Technical review and evaluation of the economics of water desalination: Current and future challenges for better water supply sustainability." Desalination 309(0): 197-207.

    Gutiérrez Ortiz, F. J., P. G. Aguilera and P. Ollero (2014). "Biogas desulfurization by adsorption on thermally treated sewage-sludge." Separation and Purification Technology 123(0): 200-213.

    This program uses the 'scihub' website to obtain the full-text paper where
    available; if no entry is found the paper is ignored and the failed downloads
    are listed at the end'''

    import scholarly
    import win32clipboard
    import urllib
    import urllib2
    import webbrowser
    import re


    '''Select and then copy the bibliography entries you want to download the
    papers for, python reads the clipboard'''
    win32clipboard.OpenClipboard()
    c = win32clipboard.GetClipboardData()
    win32clipboard.EmptyClipboard()

    '''Cleans up the text: removes line breaks and double spaces etc.'''
    c = c.replace('\n', ' ')
    c = c.replace('\r', ' ')
    while c.find('  ') != -1:
        c = c.replace('  ', ' ')
    win32clipboard.SetClipboardText(c)
    win32clipboard.CloseClipboard()
    print "Working..."

    '''bit of regex to extract the title of the paper.
    IMPORTANT: the bibliography has to be in
    author-date format or you will need to revise this;
    at the moment it looks for a year in brackets, then copies all the text until it
    reaches a full stop, assuming that this is the paper title. If it is not, it
    will either fail or will be using inappropriate search terms.'''

    paper_info= re.findall(r"(\d{4}[a-z]*)([). ]+)([ \"])+([\w\s_():,-]*)(.)",c)
    print "Analysing titles"
    print "The following titles found:"
    print "*************************"
    list_of_titles= list()
    for i in paper_info:
        print '%s...' % (i[3][:50])
        Paper_title=str(i[3])
        list_of_titles.append(Paper_title)
    paper_number=0
    failed=list()
    for title in list_of_titles:
        try:
            search_query = scholarly.search_pubs_query(title)

            info= (next(search_query))
            paper_number+=1
            print "Querying Google Scholar"
            print "**********************"
            print "Looking up paper title:"
            print title
            print "**********************"

            url=info.bib['url']
            print "Journal URL found "
            print url
            #url=next(search_query)
            print "Sending URL: ", url

            site='http://sci-hub.cc/'

            r = urllib2.Request(url=site)
            r.add_header('User-Agent','Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')
            r.add_data(urllib.urlencode({'request': url}))
            res= urllib2.urlopen(r)



            with open("results.html", "w") as f:
                f.write(res.read())


            webbrowser.open_new("results.html")
            if paper_number < len(list_of_titles):  # more titles still to process
                print "Next title"
            else:
                continue

        except Exception as e:
            print repr(e)
            paper_number+=1
            print "**********************"
            print "No valid journal found for:"
            print title
            print "**********************"
            print "Continuing..."
            failed.append(title)
        continue

    if len(failed)==0:
        print 'Complete'

    else:
        print '*************************************'
        print 'The following titles did not download: '
        print '*************************************'
        print failed
        print "Please check that these are valid entries"

For python - Sending form requests to SciHub using urllib/urllib2 in Python no longer works, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40733716/

Related articles:

python - plt.savefig does not overwrite an existing file

python - How to serve static files with Flask when overriding the defaults?

java - Exception when adding a PdfFormField to a large PDF

python - How to send utf-8 content in a urllib2 request?

python - Is it safe to mix readline() and line iteration in Python file handling?

python - PIL - adding a semi-transparent polygon to a JPEG

javascript - Binary XHR result to a file blob - jQuery

pdf - Tool to convert PDF files to SWF

Python: Download a large file to a local path and set a custom HTTP header

python - How to make a POST request with urllib2.urlopen without a data parameter