javascript - Retrieving a user profile's public Facebook wall posts with Selenium, Scrapy, and Python

Tags: javascript python facebook selenium scrapy

I am trying to retrieve the wall posts from my public profile. I need to check that a message reaches my wall and is delivered within a given timestamp; essentially I am writing a monitoring check to verify delivery through our messaging system. I am getting "No connection could be made because the target machine actively refused it", and I'm not sure why.
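An "actively refused" connection typically means nothing is listening on the port the client dials — here, the Selenium RC server the spider below expects on localhost:4444. A quick stdlib sketch to check the port before starting the crawl (the host/port values are just the ones from the spider):

```python
import socket

def is_port_open(host, port, timeout=2.0):
    """Return True if something is accepting TCP connections on host:port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        return s.connect_ex((host, port)) == 0
    finally:
        s.close()

# The spider below expects the Selenium RC server here:
# is_port_open("localhost", 4444)
```

If this returns False, the Selenium server process is not running (or is bound to a different port), which would produce exactly the "actively refused" error.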

#!/usr/bin/env python

# Many times when crawling we run into problems where content that is rendered on the page is generated with Javascript and therefore scrapy is unable to crawl for it (eg. ajax requests, jQuery craziness). However, if you use Scrapy along with the web testing framework Selenium then we are able to crawl anything displayed in a normal web browser.
#
# Some things to note:
# You must have the Python version of Selenium RC installed for this to work, and you must have set up Selenium properly. Also this is just a template crawler. You could get much crazier and more advanced with things but I just wanted to show the basic idea. As the code stands now you will be doing two requests for any given url. One request is made by Scrapy and the other is made by Selenium. I am sure there are ways around this so that you could possibly just make Selenium do the one and only request but I did not bother to implement that and by doing two requests you get to crawl the page with Scrapy too.
#
# This is quite powerful because now you have the entire rendered DOM available for you to crawl and you can still use all the nice crawling features in Scrapy. This will make for slower crawling of course but depending on how much you need the rendered DOM it might be worth the wait.


    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    import time
    from selenium import selenium

    class SeleniumSpider(CrawlSpider):
        name = "SeleniumSpider"
        start_urls = ["https://www.facebook.com/chronotrackmsgcheck"]

        rules = (
            Rule(SgmlLinkExtractor(allow=('\.html', )), callback='parse_page',follow=True),
        )

        def __init__(self):
            CrawlSpider.__init__(self)
            self.verificationErrors = []
            self.selenium = selenium("localhost", 4444, "*chrome", "https://www.facebook.com/chronotrackmsgcheck")
            self.selenium.start()

        def __del__(self):
            self.selenium.stop()
            print self.verificationErrors
            CrawlSpider.__del__(self)

        def parse_page(self, response):
            item = Item()

            hxs = HtmlXPathSelector(response)
            #Do some XPath selection with Scrapy
            hxs.select('//div').extract()

            sel = self.selenium
            sel.open(response.url)

        #Wait for javascript to load in Selenium
            time.sleep(2.5)

            #Do some crawling of javascript created content with Selenium
            sel.get_text("//div")
            yield item

    SeleniumSpider()

Best answer

Here is the answer. This parses the user profile with Selenium and then extracts only what the page renders as text. If you want to use it you will have to write your own data-mining logic on top, but it works for my purposes.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://www.facebook.com/profileusername")
inputEmail = driver.find_element_by_id("email")
inputEmail.send_keys("facebookemail")
inputPass = driver.find_element_by_id("pass")
inputPass.send_keys("facebookpassword")
inputPass.submit()
page_text = (driver.page_source).encode('utf-8')
soup = BeautifulSoup(page_text, 'html.parser')

parse_data = soup.get_text().encode('utf-8').split('Grant Zukel')  # splitting on your name exactly as Facebook displays it yields one chunk per wall post, since the name appears in every post

latest_message = parse_data[3]
driver.close()
print latest_message
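The split-on-name trick works because the profile owner's name precedes every wall post, so each chunk after the split corresponds to one post. A standalone illustration on invented sample text (the names and posts here are made up, not real scraped output):

```python
# Invented sample of what soup.get_text() might return for a profile page.
page_text = (
    "Facebook navigation Grant Zukel "
    "Yesterday · checking in from the race Grant Zukel "
    "Tuesday · monitor test message 42"
)

# Splitting on the owner's name yields one chunk per post (plus
# whatever preceded the first occurrence of the name).
chunks = page_text.split("Grant Zukel")
latest = chunks[-1].split("·")  # each chunk looks like "time · message"

print(latest[0].strip(), "|", latest[1].strip())  # → Tuesday | monitor test message 42
```

The hard-coded indices in the answer (`parse_data[3]`, `parse_data[4]`) depend on how many times the name happens to appear before the first post, which is why they differ between the two snippets.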

This is how I get the user's latest post:

#!/usr/bin/python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://www.facebook.com/fbusername")
inputEmail = driver.find_element_by_id("email")
inputEmail.send_keys("fbemail")
inputPass = driver.find_element_by_id("pass")
inputPass.send_keys("fbpass")
inputPass.submit()
page_text = (driver.page_source).encode('utf-8')
soup = BeautifulSoup(page_text, 'html.parser')
parse_data = soup.get_text().encode('utf-8').split('Grant Zukel')    
latest_message = parse_data[4].split('·')
driver.close()
time = latest_message[0]
message = latest_message[1]
print time,message
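Since the original goal was a monitoring check that a message lands on the wall within a given window, the scraped post time would still need to be compared against the send time. A minimal sketch of that comparison, assuming the scraped time can be parsed into a `datetime` (the five-minute threshold is an arbitrary example, not from the original post):

```python
from datetime import datetime, timedelta

def delivered_in_time(posted_at, sent_at, window_minutes=5):
    """True if the wall post appeared within the allowed window after sending."""
    lag = posted_at - sent_at
    return timedelta(0) <= lag <= timedelta(minutes=window_minutes)

sent = datetime(2015, 1, 13, 12, 0, 0)
posted = datetime(2015, 1, 13, 12, 3, 0)
print(delivered_in_time(posted, sent))  # → True
```

Facebook's relative timestamps ("Yesterday", "2 hrs") would need their own parsing step before this check can run, which is left out here.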

For this question on retrieving a user profile's public Facebook wall posts with Selenium, Scrapy, and Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/27932356/
