I have data stored in a list like this:
date_name = [<a href="/president/washington/speeches/speech-3455">Proclamation of Neutrality (April 22, 1793)</a>,
<a class="transcript" href="/president/washington/speeches/speech-3455">Transcript</a>,
<a href="/president/washington/speeches/speech-3456">Fifth Annual Message to Congress (December 3, 1793)</a>,
<a class="transcript" href="/president/washington/speeches/speech-3456">Transcript</a>,
<a href="/president/washington/speeches/speech-3721">Proclamation against Opposition to Execution of Laws and Excise Duties in Western Pennsylvania (August 7, 1794)</a>]
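The elements of date_name are not str objects. I am trying to extract "Proclamation of Neutrality (April 22, 1793)", "Fifth Annual Message to Congress (December 3, 1793)", and "Proclamation against Opposition to Execution of Laws and Excise Duties in Western Pennsylvania (August 7, 1794)" so that I can get the date of each speech. I want to do this for more than 900 speeches. Here is the code I have been trying, since it solved a similar problem for me in another list-comprehension scenario: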
url = 'http://www.millercenter.org/president/speeches'
connection = urllib2.urlopen(url)
html = connection.read()
date_soup = BeautifulSoup(html)
date_name = date_soup.find_all('a')
del date_name[:203] # delete extraneous html before first link (for obama 4453)
# do something with the following list comprehensions
dater = [tag.get('<a href=') for tag in date_name if tag.get('<a href=') is not None]
# remove all items in list that don't contain '<a href=', as this string is unique
# to the elements in date_name that I want
speeches_dates = [_ for _ in dater if re.search('<a href=',_)]
However, the dater variable comes back as an empty list, so I cannot go on to build speeches_dates.
Best answer
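What you are seeing is a ResultSet, which is a list of Tag instances. When you print a Tag, you get its HTML string representation. What you need is the text of each tag: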
# skip the first 203 links, matching `del date_name[:203]` in the question
date_name = date_soup.find_all('a')[203:]
print([item.get_text(strip=True) for item in date_name])
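As for why dater came back empty: Tag.get() looks up an attribute by name, so the key would be 'href' rather than the markup fragment '<a href='. The sketch below illustrates this on two of the links from the question. The inline HTML string and the variable names (missing, hrefs, texts, titles) are made up for the illustration, and it uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Two sample links copied from the list in the question (inline HTML is an
# assumption for illustration; the real page has many more links).
html = (
    '<a href="/president/washington/speeches/speech-3455">'
    'Proclamation of Neutrality (April 22, 1793)</a>'
    '<a class="transcript" href="/president/washington/speeches/speech-3455">'
    'Transcript</a>'
)
soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all("a")

# '<a href=' is not an attribute name, so get() returns None for every tag.
missing = [tag.get("<a href=") for tag in tags]

# 'href' is the attribute name; get_text() returns the visible link text.
hrefs = [tag.get("href") for tag in tags]
texts = [tag.get_text(strip=True) for tag in tags]

# Drop the "Transcript" links by checking the class attribute
# (BeautifulSoup returns class as a list of class names).
titles = [
    tag.get_text(strip=True)
    for tag in tags
    if "transcript" not in tag.get("class", [])
]

print(missing)
print(titles)
```

Filtering on the class attribute is only one possible way to skip the "Transcript" links; a more specific selector achieves the same thing at query time.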
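Also, as I understand it, you only need the links to the speeches (the ones in the main content, which contain the dates). In that case, make the locator more specific instead of matching every a tag: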
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.millercenter.org/president/speeches'
date_soup = BeautifulSoup(urllib2.urlopen(url), "lxml")
speeches = date_soup.select('div#listing div.title a[href*=speeches]')
for speech in speeches:
    text = speech.get_text(strip=True)
    print(text)
Prints:
Acceptance Speech at the Democratic National Convention (August 28, 2008)
Remarks on Election Night (November 4, 2008)
Inaugural Address (January 20, 2009)
...
Talk to the Cherokee Nation (August 29, 1796)
Farewell Address (September 19, 1796)
Eighth Annual Message to Congress (December 7, 1796)
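Since the stated goal is the date of each speech, the printed strings could then be split into a title and a date. This is a minimal sketch using only the standard library, assuming every entry ends with a parenthesized date in the "Month D, YYYY" form seen in the output above. The names DATE_RE and split_title_and_date are hypothetical helpers and not part of BeautifulSoup. Note that %B matches English month names only under an English or C locale.

```python
import re
from datetime import datetime

# Hypothetical pattern: a title followed by "(Month D, YYYY)" at the end.
DATE_RE = re.compile(
    r"^(?P<title>.*)\s+\((?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})\)$"
)


def split_title_and_date(text):
    """Return (title, date) for a speech string, or (text, None) if no date is found."""
    match = DATE_RE.match(text)
    if match is None:
        return text, None
    date = datetime.strptime(match.group("date"), "%B %d, %Y").date()
    return match.group("title"), date


samples = [
    "Proclamation of Neutrality (April 22, 1793)",
    "Fifth Annual Message to Congress (December 3, 1793)",
    "Transcript",
]
for sample in samples:
    print(split_title_and_date(sample))
```

In the loop from the answer, split_title_and_date(text) could be called in place of print(text); entries without a recognizable date come back with None so they can be filtered out afterwards.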
Source: "python - Remove items from list if not in '<a href'?" on Stack Overflow: https://stackoverflow.com/questions/33090185/