python - Getting the top wallpaper from Reddit

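I am trying to get the hottest wallpaper from Reddit's wallpapers subreddit. I use Beautiful Soup to get the HTML of the first wallpaper, then a regex to pull the URL out of its anchor tag. But I often get a URL that my regular expression does not match. Here is the code I am using: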

import re
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.reddit.com/r/wallpapers")
if r.status_code == 200:
    print r.status_code
    text = r.text
    soup = BeautifulSoup(text, "html.parser")

search_string = str(soup.find('a', {'class':'title'}))
photo_url = str(re.search('[htps:/]{7,8}[a-zA-Z0-9._/:.]+[a-zA-Z0-9./:.-]+', search_string).group())

Is there any way around this?

Best answer

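Here is a better way to do it:
Appending .json to the end of a Reddit URL returns a JSON object instead of HTML.
For example, https://www.reddit.com/r/wallpapers serves HTML content, but
https://www.reddit.com/r/wallpapers/.json gives you a JSON object that you can easily work with using Python's json module.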

Here is the same program for getting the hottest wallpaper:

>>> import urllib
>>> import json

>>> data = urllib.urlopen('https://www.reddit.com/r/wallpapers/.json')
>>> wallpaper_dict = json.loads(data.read())

>>> wallpaper_dict['data']['children'][1]['data']['url']
u'http://i.imgur.com/C49VtMu.jpg'

>>> wallpaper_dict['data']['children'][1]['data']['title']
u'Space Shuttle'

>>> wallpaper_dict['data']['children'][1]['data']['domain']
u'i.imgur.com'

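Not only is this cleaner, it also saves you a headache if Reddit changes its HTML layout or someone posts a URL that your regex cannot handle.
As a rule of thumb, it is usually smarter to use JSON than to scrape HTML.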

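PS: The list under ['children'] holds the wallpapers in listing order: the first is the top one, the second is the next one, and so on. So ['data']['children'][2]['data']['url'] gives you the link to the second-hottest wallpaper. (Python lists are zero-indexed, so the very first entry of the list sits at index 0.) You get the gist, right? :)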
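The indexing described in the PS can be wrapped in a small helper. This is a minimal sketch in Python 3, not the answer's own code: the function name wallpaper_links and its skip_stickied parameter are made up for illustration, and it assumes each post's data may carry a boolean 'stickied' field marking pinned posts. It works on the already-parsed dictionary, so it can be tried without a network request.

```python
def wallpaper_links(listing, limit=3, skip_stickied=False):
    """Return up to `limit` (title, url) pairs in listing order.

    `listing` is the dictionary produced by json.loads() on the .json response.
    """
    links = []
    for child in listing['data']['children']:
        if len(links) >= limit:
            break
        post = child['data']
        # Assumption: pinned posts are flagged with a boolean 'stickied' field.
        if skip_stickied and post.get('stickied', False):
            continue
        links.append((post['title'], post['url']))
    return links


# Hand-written sample with the same nesting as the response shown earlier;
# the first entry is a placeholder, not real Reddit data.
sample = {'data': {'children': [
    {'data': {'title': 'Pinned rules post',
              'url': 'https://example.com/rules',
              'stickied': True}},
    {'data': {'title': 'Space Shuttle',
              'url': 'http://i.imgur.com/C49VtMu.jpg',
              'stickied': False}},
]}}

print(wallpaper_links(sample, limit=1, skip_stickied=True))
```

Returning pairs instead of a single URL keeps the title available, which is handy for naming the downloaded file.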

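PPS: What's more, with this method you can use the default urllib module. Usually, when you scrape Reddit, you have to create a fake User-Agent header and pass it with the request (or it gives you a 429 response code), but that is not the case with this method.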
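Should a request be answered with 429 anyway, a custom User-Agent can still be sent using only the standard library. The sketch below assumes Python 3, where the urllib.urlopen call from the session above lives at urllib.request.urlopen; the helper names and the User-Agent string are placeholders chosen for illustration.

```python
import json
import urllib.request

LISTING_URL = 'https://www.reddit.com/r/wallpapers/.json'


def build_request(url=LISTING_URL, user_agent='wallpaper-fetcher/0.1 (example)'):
    """Create a request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch_listing(request, timeout=10):
    """Download the listing and parse the JSON body into a dictionary."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))
```

Calling fetch_listing(build_request()) performs the network request and returns a dictionary that can be indexed the same way as wallpaper_dict in the session above.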

Regarding python - Getting the top wallpaper from Reddit, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34531011/
