python - How do I get the direct download link inside a page?

Tags: python html python-2.7 html-parsing beautifulsoup

I have this code:

import urllib
from bs4 import BeautifulSoup

f = open('log1.txt', 'w')

url ='http://www.brothersoft.com/tamil-font-513607.html'
pageUrl = urllib.urlopen(url)
soup = BeautifulSoup(pageUrl)

for a in soup.select("div.class1.coLeft a[href]"):
    try:
        suburl = ('http://www.brothersoft.com'+a['href']).encode('utf-8','replace')
        f.write ('http://www.brothersoft.com'+a['href']+'\n')
    except:
        print 'cannot read'
        f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n')

        pass

    content = urllib.urlopen(suburl)
    soup = BeautifulSoup(content)
    for a in soup.select("div.Sever1.coLeft a[href]"):
        try:
            suburl2 = ('http://www.brothersoft.com'+a['href']).encode('utf-8','replace')
            f.write ('http://www.brothersoft.com'+a['href']+'\n')
        except:
            print 'cannot read'
            f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n')

            pass

        content = urllib.urlopen(suburl2)
        soup = BeautifulSoup(content)
        for a in soup.select("span.p a[href]"):
            try:
                print (a['href']).encode('utf-8','replace')
                f.write ('http://www.brothersoft.com'+a['href']+'\n')
            except:
                print 'cannot read'
                f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n')

                pass




f.close()

When I run it, I get this result:

http://www.brothersoft.com/d.php?soft_id=513607&url=http%3A%2F%2Ffiles.brotherso
ft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe&name=Tamil%20Font
http://ask.brothersoft.com/ask-a-question/?topic=1
http://ask.brothersoft.com/
http://www.brothersoft.com/d.php?soft_id=513607&url=http%3A%2F%2Fusfiles.brother
soft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe&name=Tamil%20Font
http://ask.brothersoft.com/ask-a-question/?topic=1
http://ask.brothersoft.com/

But all I need is the direct download link, like this:

http://www.brothersoft.com/d.php?soft_id=513607&url=http%3A%2F%2Ffiles.brothersoft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe&name=Tamil%20Font

Best answer

Instead of the last block:

    for a in soup.select("span.p a[href]"):
        try:
            print (a['href']).encode('utf-8','replace')
            f.write ('http://www.brothersoft.com'+a['href']+'\n')
        except:
            print 'cannot read'
            f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n')

            pass

read the URL from the onload attribute of the body tag:

print soup.find('body')['onload'][10:-2]
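
The slice `[10:-2]` only works while the `onload` value keeps exactly 10 characters before the URL and 2 after it. As a rough sketch of a less position-dependent variant, the snippet below pulls the first quoted http(s) URL out of the attribute text with a regular expression. The helper name `extract_onload_url` and the sample `onload` string are hypothetical, made up for illustration; the real page's handler may look different, so treat this as an assumption rather than a description of the site.

```python
import re

# Matches the first single- or double-quoted http(s) URL in a string.
_QUOTED_URL = re.compile(r"""['"](https?://[^'"]+)['"]""")


def extract_onload_url(onload_value):
    """Return the first quoted http(s) URL in an onload handler, or None."""
    match = _QUOTED_URL.search(onload_value or "")
    return match.group(1) if match else None


# Hypothetical onload value, assumed for this example only.
sample = "download('http://files.example.com/font_tools/keyman.exe')"

print(extract_onload_url(sample))
```

With BeautifulSoup this could be used as `body = soup.find('body')` followed by `extract_onload_url(body.get('onload'))` when `body` is not `None`. The helper returns `None` instead of raising when the attribute is missing or holds no quoted URL.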

Source: a similar question on Stack Overflow, "python - How do I get the direct download link inside a page?": https://stackoverflow.com/questions/18513375/
