python - Crawling LinkedIn while authenticated with Scrapy

Tags: python linkedin scrapy scraper

So I read through Crawling with an authenticated session in Scrapy and I'm getting hung up. I'm 99% sure my parsing code is correct; I just don't believe the login is actually redirecting and succeeding.

I'm also having trouble with check_login_response(): I'm not sure which page it is actually checking, although looking for "Sign Out" would make sense.
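One quick way to answer "which page is it checking?" is to dump the login response to a local browser tab. This is a debugging sketch of my own (not part of the original post), assuming scrapy.utils.response.open_in_browser is available in the Scrapy version being used; it is meant as a drop-in replacement for check_login_response() in the spider below.

from scrapy.utils.response import open_in_browser

# Inside LinkedPySpider (see the updated code below):
def check_login_response(self, response):
    # Pop the downloaded page open in a local browser tab; a logged-in
    # LinkedIn page contains "Sign Out", while the login form does not.
    open_in_browser(response)
    if "Sign Out" in response.body:
        self.log("Logged in, landed on %s" % response.url)
        return self.initialized()
    self.log("Still on the login page: %s" % response.url)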




====== UPDATED ======

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from linkedpy.items import LinkedPyItem

class LinkedPySpider(InitSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results?type=companies&keywords=&pplSearchOrigin=GLHD&pageKey=member-home&search=Search#facets=pplSearchOrigin%3DFCTD%26keywords%3D%26search%3DSubmit%26facet_CS%3DC%26facet_I%3D80%26openFacets%3DJO%252CN%252CCS%252CNFR%252CF%252CCCR%252CI"]

    def init_request(self):
        #"""This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        #"""Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'session_key': 'user@email.com', 'session_password': 'somepassword'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        #"""Check the response returned by a login request to see if we aresuccessfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..

            return self.initialized() # ****THIS LINE FIXED THE LAST PROBLEM*****

        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id=\'result-set\']/li')
        items = []
        for site in sites:
            item = LinkedPyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items



This was solved by adding 'return' in front of self.initialized().
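For anyone hitting the same wall: Scrapy only schedules requests that a callback returns or yields, and self.initialized() is what produces the requests built from start_urls (this is my reading of how InitSpider behaves, not something spelled out in the original post). Dropping the return value therefore leaves the spider with nothing to crawl:

def check_login_response(self, response):
    if "Sign Out" in response.body:
        # self.initialized()        # wrong: requests are built, then discarded
        return self.initialized()   # right: requests are handed to the engine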

Thanks again! -Mark

Best Answer

class LinkedPySpider(BaseSpider):

should be:

class LinkedPySpider(InitSpider):

Also, you shouldn't override the parse function, as I mentioned in my answer here: https://stackoverflow.com/a/5857202/crawling-with-an-authenticated-session-in-scrapy

If you're not sure how to define rules for extracting links, read the documentation carefully:
http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
http://readthedocs.org/docs/scrapy/en/latest/topics/link-extractors.html#topics-link-extractors
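To make that second point concrete, here is a minimal sketch of my own (not the accepted answer's code), written against the same old scrapy.contrib API as the question. Since rules are only processed by CrawlSpider, the sketch swaps InitSpider for CrawlSpider, performs the login in start_requests(), and routes result pages to a separately named parse_item() callback so parse() is never overridden. The allow= pattern, credentials and selectors are placeholders, not LinkedIn's real markup.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest
from scrapy.selector import HtmlXPathSelector

from linkedpy.items import LinkedPyItem


class LinkedPyCrawlSpider(CrawlSpider):
    name = 'LinkedPyCrawl'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ['http://www.linkedin.com/csearch/results?type=companies']

    # CrawlSpider reserves parse() for its own machinery, so matched links
    # are sent to parse_item instead of overriding parse().
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/csearch/results'),  # placeholder pattern
             callback='parse_item', follow=True),
    )

    def start_requests(self):
        # Fetch the login form before anything else.
        return [Request(self.login_page, callback=self.login)]

    def login(self, response):
        return FormRequest.from_response(
            response,
            formdata={'session_key': 'user@email.com',
                      'session_password': 'somepassword'},
            callback=self.after_login)

    def after_login(self, response):
        if "Sign Out" in response.body:
            # Logged in: emit the real start requests; their default callback
            # is CrawlSpider.parse, which applies the rules above.
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            self.log("Login failed")

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select("//ol[@id='result-set']/li"):
            item = LinkedPyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            yield item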

Regarding python - Crawling LinkedIn while authenticated with Scrapy, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/10953991/
