python - How do I get URL addresses using BeautifulSoup

Tags: python html web-scraping beautifulsoup web-crawler

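I am trying to scrape URL addresses from <a href=> tags, but on this site the href is #none. How can I crawl the actual URL? I have figured out a lot already, but I can't find any hint for this part.

The anchors look like this: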

<a href="#none" onclick="goDetail(519975);">
title
</a>


from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import re

ssl._create_default_https_context = ssl._create_unverified_context

html = urlopen('https://www.daegu.ac.kr/article/DG159/list')
bs = BeautifulSoup(html, 'html.parser')

nameList = bs.findAll('td', {'class': 'list_left'})

for name in nameList:
    print(name.get_text())
    # Attempt that does not work: BeautifulSoup tags have no get_url method
    print(name.get_url)
    print('\n----------------------------------------------')

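Best answer

You can concatenate the id found in the onclick attribute onto a base URL (this is what the onclick event itself does). The first three links, which have no onclick, use a different base.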

from bs4 import BeautifulSoup as bs
import requests

# Links with onclick="goDetail(id)" use base1; links with a plain href use base2
base1 = 'https://www.daegu.ac.kr/article/DG159/detail/'
base2 = 'https://www.daegu.ac.kr/article/DG159'
r = requests.get('https://www.daegu.ac.kr/article/DG159/list')
soup = bs(r.content, 'lxml')  # the 'lxml' parser requires the lxml package

links = []
for a in soup.select('.board_tbl_list a'):
    if a.has_attr('onclick'):
        # "goDetail(519975);" -> "519975"
        links.append(base1 + a['onclick'].split('(')[1].split(')')[0])
    else:
        links.append(base2 + a['href'])
print(links)
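
The id extraction can also be tried offline, without hitting the site. The following is only a minimal sketch: it swaps the chained split() calls for a regular expression, and it assumes that the detail base URL from the answer is correct and that every onclick has the form goDetail(<digits>). SAMPLE_HTML, GO_DETAIL and extract_detail_links are hypothetical names made up for this illustration, and the sample markup is modelled on the snippet in the question rather than copied from the real page.

```python
import re

from bs4 import BeautifulSoup

# Base URL taken from the answer; assumed to be where goDetail(id) navigates
DETAIL_BASE = 'https://www.daegu.ac.kr/article/DG159/detail/'

# Hypothetical sample markup modelled on the anchor shown in the question
SAMPLE_HTML = '''
<table class="board_tbl_list">
  <tr><td class="list_left">
    <a href="#none" onclick="goDetail(519975);">title</a>
  </td></tr>
</table>
'''

# Captures the numeric id inside goDetail(...)
GO_DETAIL = re.compile(r'goDetail\(\s*(\d+)\s*\)')


def extract_detail_links(html):
    """Return detail URLs for every anchor whose onclick calls goDetail(id)."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    # onclick=True matches only anchors that actually carry the attribute
    for a in soup.find_all('a', onclick=True):
        match = GO_DETAIL.search(a['onclick'])
        if match:
            links.append(DETAIL_BASE + match.group(1))
    return links


print(extract_detail_links(SAMPLE_HTML))
```

Compared with splitting on parentheses, the regular expression simply skips anchors whose onclick calls some other function instead of raising an IndexError, which may or may not be what you want on the real page.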

Regarding "python - How do I get URL addresses using BeautifulSoup", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57343002/
