I'm using Python 2.7, requests, and BeautifulSoup to scrape about 50 Wikipedia pages. I created a column in a dataframe containing the partial URL associated with each song's name (these were verified beforehand: when I tested all of the songs, I got response code 200 for each).
My code loops through and appends each partial URL to the main Wikipedia URL. I've been able to pull the page title and other data, but all I really want is the song's length (none of the rest). The length is contained in the infobox (example here: http://en.wikipedia.org/wiki/No_One_Knows).
My code either drags in everything on the page or nothing at all. I think the main problem is the part I've left blank below (the mt = ... line): I've tried various HTML tags there, but I either get nothing back or most of the page.
xyz = df.lengthlink
# column in a dataframe containing partial strings to append to the main Wikipedia URL

def songlength():
    url = ('http://en.wikipedia.org/wiki/' + xyz)
    resp = requests.get(url)
    page = resp.content
    take = BeautifulSoup(page)
    mt = take.find_all(____________)
    sign = mt
    return xyz, sign

for xyz in df.lengthlink:
    print songlength()
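For reference, a minimal sketch of one way the blank could be filled, assuming the BeautifulSoup 4 API and that the length lives in the page's infobox table (class "infobox") in a row whose header reads "Length"; taking the link as a parameter also avoids touching the whole column at once:

import requests
from bs4 import BeautifulSoup

def songlength(link):
    # Fetch one page and walk the infobox rows looking for the 'Length' header
    resp = requests.get('http://en.wikipedia.org/wiki/' + link)
    take = BeautifulSoup(resp.content)
    infobox = take.find('table', class_='infobox')
    if infobox is not None:
        for row in infobox.find_all('tr'):
            header, cell = row.find('th'), row.find('td')
            if header and cell and header.get_text(strip=True) == 'Length':
                return link, cell.get_text(' ', strip=True)
    return link, None

for link in df.lengthlink:
    print songlength(link)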
Edited to add: Using Martijn's suggestion below works for a single URL (i.e. No_One_Knows) but not for my multiple links. It throws this random error:
InvalidSchema Traceback (most recent call last)
<ipython-input-166-b5a10522aa27> in <module>()
2 xyz = df.lengthlink
3 url = 'http://en.wikipedia.org/wiki/' + xyz
----> 4 resp = requests.get(url, params={'action': 'raw'})
5 page = resp.text
6
C:\Python27\lib\site-packages\requests\api.pyc in get(url, **kwargs)
63
64 kwargs.setdefault('allow_redirects', True)
---> 65 return request('get', url, **kwargs)
66
67
C:\Python27\lib\site-packages\requests\api.pyc in request(method, url, **kwargs)
47
48 session = sessions.Session()
---> 49 response = session.request(method=method, url=url, **kwargs)
50 # By explicitly closing the session, we avoid leaving sockets open which
51 # can trigger a ResourceWarning in some cases, and look like a memory leak
C:\Python27\lib\site-packages\requests\sessions.pyc in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
459 }
460 send_kwargs.update(settings)
--> 461 resp = self.send(prep, **send_kwargs)
462
463 return resp
C:\Python27\lib\site-packages\requests\sessions.pyc in send(self, request, **kwargs)
565
566 # Get the appropriate adapter to use
--> 567 adapter = self.get_adapter(url=request.url)
568
569 # Start time (approximately) of the request
C:\Python27\lib\site-packages\requests\sessions.pyc in get_adapter(self, url)
644
645 # Nothing matches :-/
--> 646 raise InvalidSchema("No connection adapters were found for '%s'" % url)
647
648 def close(self):
InvalidSchema: No connection adapters were found for '1 http://en.wikipedia.org/wiki/Locked_Out_of_Heaven
2 http://en.wikipedia.org/wiki/No_One_Knows
3 http://en.wikipedia.org/wiki/Given_to_Fly
4 http://en.wikipedia.org/wiki/Nothing_as_It_Seems
Name: lengthlink, Length: 50, dtype: object'
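The error message itself points to the cause: xyz was bound to the entire lengthlink Series, so the concatenation produced one giant "URL" containing the whole column (index numbers included) rather than a single link. A minimal sketch of the fix, assuming df is the question's dataframe, is to pass each link into the function instead of reading the module-level xyz:

import requests

def songlength(link):
    # 'link' is a single partial URL string, not the whole Series
    url = 'http://en.wikipedia.org/wiki/' + link
    resp = requests.get(url, params={'action': 'raw'})
    return link, resp.text

for link in df.lengthlink:
    name, page = songlength(link)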
Best Answer
Rather than trying to parse the HTML output, parse the raw MediaWiki source of the page instead (the action=raw query parameter makes MediaWiki return the page's wikitext rather than rendered HTML); the first line that starts with | Length contains the information you are looking for:
url = 'http://en.wikipedia.org/wiki/' + xyz
resp = requests.get(url, params={'action': 'raw'})
page = resp.text

for line in page.splitlines():
    if line.startswith('| Length'):
        length = line.partition('=')[-1].strip()
        break
Demo:
>>> import requests
>>> xyz = 'No_One_Knows'
>>> url = 'http://en.wikipedia.org/wiki/' + xyz
>>> resp = requests.get(url, params={'action': 'raw'})
>>> page = resp.text
>>> for line in page.splitlines():
...     if line.startswith('| Length'):
...         length = line.partition('=')[-1].strip()
...         break
...
>>> print length
4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>
You can post-process this further as needed to tease apart the richer data here (the radio edit versus the album version).
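Putting the approach together for all 50 links, a sketch along these lines would collect the lengths and split out the labelled versions; the regular expression assumes the <small>(...)</small> markup shown in the demo output, and falls back to the raw value when it doesn't match:

import re
import requests

def get_length(link):
    # Fetch the raw MediaWiki source and pull the infobox '| Length' line
    resp = requests.get('http://en.wikipedia.org/wiki/' + link,
                        params={'action': 'raw'})
    for line in resp.text.splitlines():
        if line.startswith('| Length'):
            return line.partition('=')[-1].strip()
    return None

lengths = {}
for link in df.lengthlink:
    raw = get_length(link)
    if raw is None:
        lengths[link] = None
        continue
    # '4:13 <small>(Radio edit)</small><br />4:38 <small>(Album version)</small>'
    # becomes [('4:13', 'Radio edit'), ('4:38', 'Album version')]
    parts = re.findall(r'(\d+:\d+)\s*<small>\(([^)]+)\)</small>', raw)
    lengths[link] = parts if parts else raw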
Source: python - Scraping part of a Wikipedia infobox, https://stackoverflow.com/questions/29725163/