python - Unable to get headline content when scraping

Tags: python selenium web-scraping beautifulsoup screen-scraping

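I'm new to scraping. I've tried several approaches to this problem without getting the result I want. I want to scrape all the headlines from https://www.accesswire.com/newsroom/. The headlines show up when I inspect the page in the browser, but when I fetch the page with bs4 or selenium I don't get the full page source, and the headlines are missing.

I've tried time.sleep(10), and I've tried loading the page with selenium; neither worked. The headlines sit inside the div with the classes column-15 w-col w-col-9.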

import time

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()
header = {'user-agent': ua.chrome}
url = "https://www.accesswire.com/newsroom/"
response = requests.get(url, headers=header)
time.sleep(12)  # no effect: the response has already been downloaded
soup = BeautifulSoup(response.content, 'html.parser')
time.sleep(12)
headline_Div = soup.find("div", {"class": "column-15 w-col w-col-9"})
print(headline_Div)

All I want is every headline on this page together with its link, or at least the complete page source so that I can extract them myself.

Best answer

You don't need Selenium. Just use requests, which is more efficient, and call the API that the page itself uses.

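import re
import requests
from bs4 import BeautifulSoup as bs

# The newsroom page loads its headlines from this endpoint, which returns JavaScript.
r = requests.get('https://www.accesswire.com/api/newsroom.ashx')
# Capture the HTML string passed to $('#newslist').after('...');
p = re.compile(r" \$\('#newslist'\)\.after\('(.*)\);")
html = p.findall(r.text)[0]
soup = bs(html, 'lxml')
# Each headline is an <a class="headlinelink"> element.
headlines = [(item.text, item['href']) for item in soup.select('a.headlinelink')]
print(headlines)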
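The parsing step above can be tried offline. The following is only a sketch under assumptions: the sample payload is made up to resemble what the endpoint is assumed to return, parse_headlines and HeadlineCollector are hypothetical names, and the standard-library HTMLParser is swapped in for BeautifulSoup so nothing has to be installed. Unlike findall(...)[0], it returns an empty list when the pattern does not match instead of raising IndexError.

```python
import re
from html.parser import HTMLParser

# Same pattern as in the answer: capture what is passed to .after('...').
AFTER_CALL = re.compile(r" \$\('#newslist'\)\.after\('(.*)\);")


class HeadlineCollector(HTMLParser):
    """Collects (text, href) pairs for <a class="headlinelink"> elements."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._inside = False  # True while reading a headline anchor
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "headlinelink" in classes:
            self._inside = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.headlines.append(("".join(self._text).strip(), self._href))
            self._inside = False


def parse_headlines(js_text):
    """Return [(headline, href), ...], or [] if the .after() call is absent."""
    match = AFTER_CALL.search(js_text)
    if match is None:
        return []
    collector = HeadlineCollector()
    collector.feed(match.group(1))
    collector.close()
    return collector.headlines


# Hypothetical payload, shaped like the script the endpoint is assumed to return.
sample = (
    " $('#newslist').after('"
    '<a class="headlinelink" href="/1/first">First headline</a>'
    '<a class="headlinelink" href="/2/second">Second headline</a>'
    "');"
)
print(parse_headlines(sample))
```

With a live response you would pass r.text to parse_headlines. If the real payload escapes quotes inside the string, the captured HTML would need unescaping first; the sample here does not cover that case.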

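Regex explanation:

The pattern matches the literal text  $('#newslist').after('  with the $, the parentheses and the dot escaped, and then greedily captures everything up to the last ); on the same line. The captured group is the HTML string that holds the headlines.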
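To see what the pattern captures, here is a small sketch run against a made-up payload (the sample string is an assumption, not the real response). It also shows that the capture ends with the closing quote of the JavaScript string, which the HTML parser tolerates as stray text. The tightened pattern is my own variation, not part of the answer.

```python
import re

# Hypothetical payload for illustration only.
sample = " $('#newslist').after('<div>one</div><div>two</div>');"

# Pattern from the answer: greedy capture up to the last ");" on the line.
original = re.compile(r" \$\('#newslist'\)\.after\('(.*)\);")
captured = original.findall(sample)[0]
print(repr(captured))  # the capture still ends with the closing quote

# Assumed variation: require the closing quote before ");" so it is left out,
# and let "." match newlines in case the HTML spans several lines.
tightened = re.compile(r"\$\('#newslist'\)\.after\('(.*)'\);", re.DOTALL)
inner = tightened.search(sample).group(1)
print(inner)
```

The greedy (.*) matters here: the HTML itself may contain ); sequences, and a lazy (.*?) would stop at the first one.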

This question and its answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/56264230/
