python - Scraping in Python

Tags: python web-scraping instagram screen-scraping

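I want to get the caption and the number of likes and comments for a given user's 10 most recent images. With the code below I can only get the most recent one.

Code: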

from selenium import webdriver
from bs4 import BeautifulSoup
import json, time, re

phantomjs_path = r'C:\Users\ravi.janjwadia\Desktop\phantomjs-2.1.1-windows\bin\phantomjs.exe'
browser = webdriver.PhantomJS(phantomjs_path)
user = "barackobama"
browser.get('https://instagram.com/' + user)
time.sleep(0.5)  # give the page a moment to load
soup = BeautifulSoup(browser.page_source, 'html.parser')
# Find the <script> tag that assigns window._sharedData (raw string keeps the regex escape intact)
script_tag = soup.find('script', text=re.compile(r'window\._sharedData'))
# Keep everything after the first '=' and strip the trailing ';' to leave plain JSON
shared_data = script_tag.string.partition('=')[-1].strip(' ;')
result = json.loads(shared_data)
# Prints the caption of the first (most recent) node only
print(result['entry_data']['ProfilePage'][0]['user']['media']['nodes'][0]['caption'])

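Result: LAST CALL: Enter for a chance to meet President Obama this summer before tonight's deadline. → Link in profile.

Best answer

In the line below, you retrieve only the first node (that is, the first image).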

print(result['entry_data']['ProfilePage'][0]['user']['media']['nodes'][0]['caption'])

To get the information for the user's 10 most recent images, slice the list of nodes instead of indexing it.

recent_ten_nodes = result['entry_data']['ProfilePage'][0]['user']['media']['nodes'][:10]
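
As a minimal sketch of how that slice behaves, the snippet below runs it against a hand-made stand-in for `result`. The sample data and the `get_recent_nodes` helper are hypothetical, not real Instagram output; the only thing taken from the question is the nesting of the keys. Slicing with `[:10]` does not raise when the profile has fewer than ten posts, it simply returns what is there.

```python
def get_recent_nodes(result, limit=10):
    """Return up to `limit` media nodes from a parsed _sharedData dict.

    Hypothetical helper; assumes the key nesting shown in the question.
    """
    nodes = result['entry_data']['ProfilePage'][0]['user']['media']['nodes']
    return nodes[:limit]


# Hand-made sample with only three posts (not real data).
sample_result = {
    'entry_data': {
        'ProfilePage': [
            {'user': {'media': {'nodes': [
                {'caption': 'post %d' % i,
                 'likes': {'count': i * 10},
                 'comments': {'count': i}}
                for i in range(3)
            ]}}}
        ]
    }
}

recent = get_recent_nodes(sample_result)
print(len(recent))  # fewer than ten posts exist, so fewer are returned
```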

To print only the caption, the number of likes, and the number of comments, loop over those nodes.

for node in recent_ten_nodes:
    print(node['caption'])
    print(node['likes']['count'])
    print(node['comments']['count'])

How you store these values is up to you.
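
One possible way to store them, offered as a sketch rather than the answer's own method: collect each node into a plain dict, then serialize the list as CSV with the standard library. The helper names `nodes_to_rows` and `rows_to_csv` and the sample nodes are hypothetical. The use of `.get('caption', '')` rests on the assumption that some posts may have no caption key.

```python
import csv
import io


def nodes_to_rows(nodes):
    """Flatten media nodes into dicts holding caption, likes and comments."""
    rows = []
    for node in nodes:
        rows.append({
            # Assumption: a post without a caption may lack the key entirely.
            'caption': node.get('caption', ''),
            'likes': node['likes']['count'],
            'comments': node['comments']['count'],
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=['caption', 'likes', 'comments'])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hand-made sample nodes (not real data); the second has no caption.
sample_nodes = [
    {'caption': 'first', 'likes': {'count': 5}, 'comments': {'count': 1}},
    {'likes': {'count': 7}, 'comments': {'count': 0}},
]

rows = nodes_to_rows(sample_nodes)
csv_text = rows_to_csv(rows)
print(csv_text)
```

To write to disk instead, the same `csv.DictWriter` can be given a file opened with `open(path, 'w', newline='')` in place of the `io.StringIO` buffer.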

Regarding python - Scraping in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37936780/
