python - How to save Python web scraping results

Tags: python csv

I'm trying to scrape LexisNexis. I want to retrieve the headline, source, and date of news stories. Below is the code I wrote after performing the search with Selenium. I'm having trouble saving the data to a CSV file: I keep getting encoding errors, and when I don't get an encoding error, the data I get is padded with lots of whitespace and strange characters such as \t\t\t\t\t\t\t and \n.

Here is a sample of what I retrieved:

["\n\t\t\t\tNetworks Continue Hammering Indiana for Sparking a 'Firestorm' Over Religious Freedom Law\n\t\t\t", "\n\t\t\t\tAll Three Networks Pile on Indiana's 'Controversial' Law\n\t\t\t", "\n\t\t\t\tABC Continues Obsessively Bashing 'Controversial' 'Religious Freedom' Bill\n\t\t\t", "\n\t\t\t\tABC, NBC Rush to Paint Trump as a 'Moderate,' 'Trump 2.0'\n\t\t\t", "\n\t\t\t\tCBS Hits the Panic Button, Rails Against Religious Freedom Bills in Georgia, North Carolina\n\t\t\t", "\n\t\t\t\tJihad Report - October 7, 2016\n\t\t\t", "\n\t\t\t\tEducation News Roundup: May 2, 2016\n\t\t\t", "\n\t\t\t\tNBC CBS Keep Up Attack on Religious Freedom Laws\n\t\t\t", "\n\t\t\t\tNBC Slams Indiana Religious Freedom Law...Then Starts Week-Long Series on Faith\n\t\t\t", "\n\t\t\t\tNetworks Again Bash Indiana for Causing 'National Outcry' and 'Uproar' Over Religious Freedom Law\n\t\t\t"]

This happens for the headlines, dates, and sources alike. I'm not sure what I'm doing wrong here.
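For reference, whitespace padding like this can be trimmed either at extraction time with BeautifulSoup's `get_text(strip=True)` or afterwards with `str.strip()`. A minimal sketch on one of the sample strings above:

```python
# One of the scraped strings shown above, with its \n and \t padding
raw = "\n\t\t\t\tJihad Report - October 7, 2016\n\t\t\t"

# str.strip() with no arguments removes all leading and trailing
# whitespace, which includes both \n and \t characters
clean = raw.strip()
print(clean)  # Jihad Report - October 7, 2016
```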

scd = browser.page_source
soup = BeautifulSoup(scd, "lxml")

headlines = []
for headline in soup.findAll('a', attrs={"data-action": "title"}):
    head_line = headline.get_text()
    #head_line.strip('a>, <a data-action="title" href="#">')
    #head_line.encode('utf-8')
    Headlines = head_line.encode()
    headlines.append(head_line)

sources = []
for sources in soup.findAll('a', attrs={"class": "rightpanefiltercontent notranslate", "href": "#"}):
    source_only = sources.get_text()
    source_only.encode('utf-8')
    sources.append(source_only)
Sources = sources.encode()

dates = []
for dates in soup.findAll('a', attrs={"class": "rightpanefiltercontent"}):
    date_only = dates.get_text()
    date_only.strip('<a class="rightpanefiltercontent" href="#">')
    date_only.encode()
    dates.append(date_only)
Dates = dates.encode()

news = [Headlines, Sources, Dates]

result = "/Users/danashaat/Desktop/Tornadoes/IV Search News Results/data.csv"
with open(result, 'w') as result:
    newswriter = csv.writer(result, dialect='excel')
    newswriter.writerow(News)

Also, here is what I get when I find the headlines:

[<a data-action="title" href="#"> Networks Continue Hammering Indiana for Sparking a 'Firestorm' Over Religious Freedom Law </a>, <a data-action="title" href="#"> All Three Networks Pile on Indiana's 'Controversial' Law </a>, <a data-action="title" href="#"> ABC Continues Obsessively Bashing 'Controversial' 'Religious Freedom' Bill </a>, <a data-action="title" href="#"> ABC, NBC Rush to Paint Trump as a 'Moderate,' 'Trump 2.0' </a>, <a data-action="title" href="#"> CBS Hits the Panic Button, Rails Against Religious Freedom Bills in Georgia, North Carolina </a>, <a data-action="title" href="#"> Jihad Report - October 7, 2016 </a>, <a data-action="title" href="#"> Education News Roundup: May 2, 2016 </a>, <a data-action="title" href="#"> NBC CBS Keep Up Attack on Religious Freedom Laws </a>, <a data-action="title" href="#"> NBC Slams Indiana Religious Freedom Law...Then Starts Week-Long Series on Faith </a>, <a data-action="title" href="#"> Networks Again Bash Indiana for Causing 'National Outcry' and 'Uproar' Over Religious Freedom Law </a>]

I've spent hours trying to figure this out, so any help would be appreciated.
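As an aside, two common fixes for the symptoms described above are opening the output file with an explicit `encoding='utf-8'` (which avoids the encoding errors) and writing the three lists with `zip()` so each CSV row holds one headline/source/date triple instead of one giant row per list. A sketch with made-up example data:

```python
import csv

# Made-up example data standing in for the scraped lists
headlines = ["Jihad Report - October 7, 2016", "Education News Roundup: May 2, 2016"]
sources = ["CNS News", "Education Week"]
dates = ["October 7, 2016", "May 2, 2016"]

# newline='' is the csv-module convention; encoding='utf-8' handles
# non-ASCII characters that would otherwise raise encoding errors
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerow(['headline', 'source', 'date'])    # header row
    writer.writerows(zip(headlines, sources, dates))   # one row per story
```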

Best Answer

You can anchor the element search to the `div`s with class `"item"`:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import csv
import re

d = webdriver.Chrome()
d.get('https://www.lexisnexis.com/en-us/home.page')
results = [[(lambda x: x['href'] if i == 'a' else getattr(x, 'text', None))(c.find(i)) for i in ['a', 'time', 'h5', 'p']] for c in soup(d.page_source, 'html.parser').find_all('div', {'class': 'item'})]
with open('lexisNexis.csv', 'w') as f:
    write = csv.writer(f)
    write.writerows([['source', 'timestamp', 'tags', 'headline'], *[re.findall(r'(?<=//www\.)\w+(?=\.com)', a) + b for a, *b in results if all([a, *b])]])

Output:

source,timestamp,tags,headline
law360,04 Sep 2018,Labor & Employment Law,11th Circ. Revives Claim In Ex-Aaron's Worker FMLA Suit
law360,04 Sep 2018,Workers' Compensation,Back To School: Widener's Rod Smolla Talks Free Speech
law360,04 Sep 2018,Tax Law,Ex-Sen. Kyl Chosen To Take Over McCain's Senate Seat
law360,04 Sep 2018,Energy,Mass. Top Court Says Emission Caps Apply To Electric Cos.
lexisnexis,04 Sep 2018,Immigration Law,Suspension of Premium Processing: Another Attack On the H-1B Program (Cyrus Mehta)
law360,04 Sep 2018,Real Estate Law,Privilege Waived For Some Emails In NJ Real Estate Row
law360,04 Sep 2018,Banking & Finance,Cos. Caught Between Iran Sanctions And EU Blocking Statute
law360,04 Sep 2018,Mergers & Acquisitions,Former Paper Co. Tax VP Sues For Severance Pay After Merger
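The `source` column in this output comes from the `re.findall` call: the lookbehind/lookahead pattern `(?<=//www\.)\w+(?=\.com)` extracts the site name from each article's `href`. A standalone sketch with hypothetical URLs:

```python
import re

# Hypothetical hrefs like those scraped from the "item" divs
hrefs = [
    'https://www.law360.com/articles/some-article',
    'https://www.lexisnexis.com/legalnewsroom/some-post',
]

# \w+ matches the word between a literal "//www." (lookbehind)
# and ".com" (lookahead); neither anchor is included in the match
pattern = r'(?<=//www\.)\w+(?=\.com)'
print([re.findall(pattern, h)[0] for h in hrefs])  # ['law360', 'lexisnexis']
```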

A similar question on "python - How to save Python web scraping results" can be found on Stack Overflow: https://stackoverflow.com/questions/52175127/
