I'm using Python 2.7, requests, and BeautifulSoup to scrape about 50 Wikipedia pages. I created a column in a dataframe containing the partial URL associated with each song's name (these were verified beforehand: when I tested all of the songs, I got response code 200 for each).
My code loops through and appends each partial URL to the main Wikipedia URL. I've been able to pull the page title and other data, but all I really want is the song's length (none of the rest). The length is contained in the infobox (example here: http://en.wikipedia.org/wiki/No_One_Knows).
My code either drags in everything on the page or nothing at all. I think the main problem is the part I've left blank below (the mt = ... line): I've tried various HTML tags there, but I either get nothing back or most of the page.
xyz = df.lengthlink
# column in a dataframe containing partial strings to append to the main Wikipedia URL

def songlength():
    url = ('http://en.wikipedia.org/wiki/' + xyz)
    resp = requests.get(url)
    page = resp.content
    take = BeautifulSoup(page)
    mt = take.find_all(____________)
    sign = mt
    return xyz, sign

for xyz in df.lengthlink:
    print songlength()
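For reference, a minimal sketch of one way the blank could be filled, assuming the BeautifulSoup 4 API and that the length lives in the page's infobox table (class "infobox") in a row whose header reads "Length"; taking the link as a parameter also avoids touching the whole column at once:

import requests
from bs4 import BeautifulSoup

def songlength(link):
    # Fetch one page and walk the infobox rows looking for the 'Length' header
    resp = requests.get('http://en.wikipedia.org/wiki/' + link)
    take = BeautifulSoup(resp.content)
    infobox = take.find('table', class_='infobox')
    if infobox is not None:
        for row in infobox.find_all('tr'):
            header, cell = row.find('th'), row.find('td')
            if header and cell and header.get_text(strip=True) == 'Length':
                return link, cell.get_text(' ', strip=True)
    return link, None

for link in df.lengthlink:
    print songlength(link)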
Edited to add: Using Martijn's suggestion below works for a single URL (i.e. No_One_Knows) but not for my multiple links. It throws this random error:
InvalidSchema Traceback (most recent call last)
<ipython-input-166-b5a10522aa27> in <module>()
2 xyz = df.lengthlink
3 url = 'http://en.wikipedia.org/wiki/' + xyz
----> 4 resp = requests.get(url, params={'action': 'raw'})
5 page = resp.text
6
C:\Python27\lib\site-packages\requests\api.pyc in get(url, **kwargs)
63
64 kwargs.setdefault('allow_redirects', True)
---> 65 return request('get', url, **kwargs)
66
67
C:\Python27\lib\site-packages\requests\api.pyc in request(method, url, **kwargs)
47
48 session = sessions.Session()
---> 49 response = session.request(method=method, url=url, **kwargs)
50 # By explicitly closing the session, we avoid leaving sockets open which
51 # can trigger a ResourceWarning in some cases, and look like a memory leak
C:\Python27\lib\site-packages\requests\sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
459 }
460 send_kwargs.update(settings)
--> 461 resp = self.send(prep, **send_kwargs)
462
463 return resp
C:\Python27\lib\site-packages\requests\sessions.pyc in send(self, request, **kwargs)
565
566 # Get the appropriate adapter to use
--> 567 adapter = self.get_adapter(url=request.url)
568
569 # Start time (approximately) of the request
C:\Python27\lib\site-packages\requests\sessions.pyc in get_adapter(self, url)
644
645 # Nothing matches :-/
--> 646 raise InvalidSchema("No connection adapters were found for '%s'" % url)
647
648 def close(self):
InvalidSchema: No connection adapters were found for '1 http://en.wikipedia.org/wiki/Locked_Out_of_Heaven
2 http://en.wikipedia.org/wiki/No_One_Knows
3 http://en.wikipedia.org/wiki/Given_to_Fly
4 http://en.wikipedia.org/wiki/Nothing_as_It_Seems
Name: lengthlink, Length: 50, dtype: object'
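The error message itself points to the cause: xyz was bound to the entire lengthlink Series, so the concatenation produced one giant "URL" containing the whole column (index numbers included) rather than a single link. A minimal sketch of the fix, assuming df is the question's dataframe, is to pass each link into the function instead of reading the module-level xyz:

import requests

def songlength(link):
    # 'link' is a single partial URL string, not the whole Series
    url = 'http://en.wikipedia.org/wiki/' + link
    resp = requests.get(url, params={'action': 'raw'})
    return link, resp.text

for link in df.lengthlink:
    name, page = songlength(link)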
Best Answer
Rather than trying to parse the HTML output, parse the raw MediaWiki source of the page instead (the action=raw query parameter makes MediaWiki return the page's wikitext rather than rendered HTML); the first line that starts with | Length contains the information you are looking for:
url = 'http://en.wikipedia.org/wiki/' + xyz
resp = requests.get(url, params={'action': 'raw'})
page = resp.text

for line in page.splitlines():
    if line.startswith('| Length'):
        length = line.partition('=')[-1].strip()
        break
Demo:
>>> import requests
>>> xyz = 'No_One_Knows'
>>> url = 'http://en.wikipedia.org/wiki/' + xyz
>>> resp = requests.get(url, params={'action': 'raw'})
>>> page = resp.text
>>> for line in page.splitlines():
...     if line.startswith('| Length'):
...         length = line.partition('=')[-1].strip()
...         break
...
>>> print length
4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>
You can post-process this further as needed to tease apart the richer data here (the radio edit versus the album version).
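Putting the approach together for all 50 links, a sketch along these lines would collect the lengths and split out the labelled versions; the regular expression assumes the <small>(...)</small> markup shown in the demo output, and falls back to the raw value when it doesn't match:

import re
import requests

def get_length(link):
    # Fetch the raw MediaWiki source and pull the infobox '| Length' line
    resp = requests.get('http://en.wikipedia.org/wiki/' + link,
                        params={'action': 'raw'})
    for line in resp.text.splitlines():
        if line.startswith('| Length'):
            return line.partition('=')[-1].strip()
    return None

lengths = {}
for link in df.lengthlink:
    raw = get_length(link)
    if raw is None:
        lengths[link] = None
        continue
    # '4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>'
    # becomes [('4:13', 'Radio edit'), ('4:38', 'Album version')]
    parts = re.findall(r'(\d+:\d+)\s*<small>\(([^)]+)\)</small>', raw)
    lengths[link] = parts if parts else raw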
Source: python - Scraping part of a Wikipedia infobox, https://stackoverflow.com/questions/29725163/