python - Extracting data (episode titles) from a Wikipedia table

I am trying to extract the titles of TV episodes from tables on Wikipedia, using BeautifulSoup and Python. To explain what I have done so far, I am working with two tables:

1:http://en.wikipedia.org/wiki/Community_(season_1)

2:http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)

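In these tables, each episode title is contained in a <td class="summary">. In the first table, the <td> also wraps the title in an <a> element, and I was able to extract the data just fine with the following code: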

import urllib2
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/Community_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# Each title is an <a> directly inside a <td class="summary">
for names in soup.select('td[class="summary"] > a'):
    print names.string

The problem appears with the second table, Two and a Half Men. There the titles sit directly in the <td>, and I use this code to extract them:

import urllib2
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# Here the title text is a direct child of the <td>, with no <a> around it
for lel in soup.select('td[class="summary"]'):
    print lel.string

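But the titles come out with quotation marks around them, i.e. "". I assume stripping the quotes would be easy, but what if, within a single table, some <td> cells contain an <a> and others do not? How can I make Python decide whether it should look for an <a> element?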

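If I remove the > a from the first code block, I get None as output, presumably because both the parent <td> and the child <a> contain strings. If I then go on to use names.strings I get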

<generator object _all_strings at 0x01B1CDA0>

If I use soup.get_text() I get UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 6818, character maps to <undefined>

Please help :)

Best answer

How about using .text?

import urllib2
from bs4 import BeautifulSoup

url = "http://en.wikipedia.org/wiki/Two_and_a_Half_Men_(season_1)"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

for lel in soup.select('td[class="summary"]'):
    # .text joins all strings inside the cell, including any nested <a>
    print lel.text.replace('"', '')  # remove the quote marks as well

This prints all the names without the quotation marks, and it fixes the None problem.

Pilot
Most Chicks Won't Eat Veal
Big Flappy Bastards
etc...
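
To see why .text handles both tables, here is a minimal offline sketch. It is written for Python 3 (the code above is Python 2), and SAMPLE_HTML and episode_titles are hypothetical names made up for illustration. The markup only imitates the two cell layouts described in the question and is not copied from Wikipedia.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the two Wikipedia layouts (not the real markup):
# the first cell wraps the title in <a>, the second holds plain text.
SAMPLE_HTML = """
<table>
  <tr><td class="summary">"<a href="/wiki/Pilot_(Community)">Pilot</a>"</td></tr>
  <tr><td class="summary">"Most Chicks Won't Eat Veal"</td></tr>
</table>
"""


def episode_titles(html):
    """Return the text of every <td class="summary">, without the outer quotes."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for cell in soup.select("td.summary"):
        # get_text() (the method behind .text) concatenates every string in
        # the cell, so it does not matter whether the title is inside an <a>.
        text = cell.get_text().strip()
        # strip('"') removes only the surrounding quote marks.
        titles.append(text.strip('"'))
    return titles


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_cell = soup.select("td.summary")[0]

# .string is None here: the cell has three children (quote, <a>, quote),
# and .string is only defined when a tag has exactly one child.
print(first_cell.string)

print(episode_titles(SAMPLE_HTML))
```

Unlike replace('"', ''), strip('"') leaves any quotation marks inside a title untouched, which may or may not be what you want. As for the UnicodeEncodeError in the question, the 'charmap' codec in the message suggests it comes from printing to a console whose encoding cannot represent u'\u2013', rather than from get_text() itself.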

Regarding "python - Extracting data (episode titles) from a Wikipedia table", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25882447/
