javascript - Scrapy crawl not working on ASPX site

Tags: javascript python asp.net web-scraping scrapy

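I am scraping the Madrid Assembly's website, which is built with aspx, and I have no idea how to simulate clicks on the links from which I need to get the corresponding politicians. I tried this: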

import scrapy

class AsambleaMadrid(scrapy.Spider):

    name = "Asamblea_Madrid"
    start_urls = ['http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx']

    def parse(self, response):
        # Use the id attribute of each parliamentary-group link as the postback target
        for link_id in response.css('div#moduloBusqueda div.sangria div.sangria ul li a::attr(id)'):
            target = link_id.extract()
            url = "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx"

            # Hidden form fields copied by hand from the page source
            formdata = {'__EVENTTARGET': target,
                        '__VIEWSTATE': '/wEPDwUBMA9kFgJmD2QWAgIBD2QWBAIBD2QWAgIGD2QWAmYPZBYCAgMPZBYCAgMPFgIeE1ByZXZpb3VzQ29udHJvbE1vZGULKYgBTWljcm9zb2Z0LlNoYXJlUG9pbnQuV2ViQ29udHJvbHMuU1BDb250cm9sTW9kZSwgTWljcm9zb2Z0LlNoYXJlUG9pbnQsIFZlcnNpb249MTQuMC4wLjAsIEN1bHR1cmU9bmV1dHJhbCwgUHVibGljS2V5VG9rZW49NzFlOWJjZTExMWU5NDI5YwFkAgMPZBYMAgMPZBYGBSZnXzM2ZWEwMzEwXzg5M2RfNGExOV85ZWQxXzg4YTEzM2QwNjQyMw9kFgJmD2QWAgIBDxYCHgtfIUl0ZW1Db3VudAIEFghmD2QWAgIBDw8WBB4PQ29tbWFuZEFyZ3VtZW50BTRHcnVwbyBQYXJsYW1lbnRhcmlvIFBvcHVsYXIgZGUgbGEgQXNhbWJsZWEgZGUgTWFkcmlkHgRUZXh0BTRHcnVwbyBQYXJsYW1lbnRhcmlvIFBvcHVsYXIgZGUgbGEgQXNhbWJsZWEgZGUgTWFkcmlkZGQCAQ9kFgICAQ8PFgQfAgUeR3J1cG8gUGFybGFtZW50YXJpbyBTb2NpYWxpc3RhHwMFHkdydXBvIFBhcmxhbWVudGFyaW8gU29jaWFsaXN0YWRkAgIPZBYCAgEPDxYEHwIFL0dydXBvIFBhcmxhbWVudGFyaW8gUG9kZW1vcyBDb211bmlkYWQgZGUgTWFkcmlkHwMFL0dydXBvIFBhcmxhbWVudGFyaW8gUG9kZW1vcyBDb211bmlkYWQgZGUgTWFkcmlkZGQCAw9kFgICAQ8PFgQfAgUhR3J1cG8gUGFybGFtZW50YXJpbyBkZSBDaXVkYWRhbm9zHwMFIUdydXBvIFBhcmxhbWVudGFyaW8gZGUgQ2l1ZGFkYW5vc2RkBSZnX2MxNTFkMGIxXzY2YWZfNDhjY185MWM3X2JlOGUxMTZkN2Q1Mg9kFgRmDxYCHgdWaXNpYmxlaGQCAQ8WAh8EaGQFJmdfZTBmYWViMTVfOGI3Nl80MjgyX2ExYjFfNTI3ZDIwNjk1ODY2D2QWBGYPFgIfBGhkAgEPFgIfBGhkAhEPZBYCAgEPZBYEZg9kFgICAQ8WAh8EaBYCZg9kFgQCAg9kFgQCAQ8WAh8EaGQCAw8WCB4TQ2xpZW50T25DbGlja1NjcmlwdAW7AWphdmFTY3JpcHQ6Q29yZUludm9rZSgnVGFrZU9mZmxpbmVUb0NsaWVudFJlYWwnLDEsIDEsICdodHRwOlx1MDAyZlx1MDAyZnd3dy5hc2FtYmxlYW1hZHJpZC5lc1x1MDAyZkVTXHUwMDJmUXVlRXNMYUFzYW1ibGVhXHUwMDJmQ29tcG9zaWNpb25kZWxhQXNhbWJsZWFcdTAwMmZMb3NEaXB1dGFkb3MnLCAtMSwgLTEsICcnLCAnJykeGENsaWVudE9uQ2xpY2tOYXZpZ2F0ZVVybGQeKENsaWVudE9uQ2xpY2tTY3JpcHRDb250YWluaW5nUHJlZml4ZWRVcmxkHgxIaWRkZW5TY3JpcHQFIVRha2VPZmZsaW5lRGlzYWJsZWQoMSwgMSwgLTEsIC0xKWQCAw8PFgoeCUFjY2Vzc0tleQUBLx4PQXJyb3dJbWFnZVdpZHRoAgUeEEFycm93SW1hZ2VIZWlnaHQCAx4RQXJyb3dJbWFnZU9mZnNldFhmHhFBcnJvd0ltYWdlT2Zmc2V0WQLrA2RkAgEPZBYCAgUPZBYCAgEPEBYCHwRoZBQrAQBkAhcPZBYIZg8PFgQfAwUPRW5nbGlzaCBWZXJzaW9uHgtOYXZpZ2F0ZVVybAVfL0VOL1F1ZUVzTGFBc2FtYmxlYS9Db21wb3NpY2lvbmRlbGFBc2FtYmxlYS9Mb3NEaXB1dGFkb3MvUGFnZXMvUmVsYWNpb25BbGZhYmV0aWNhRGlwdXRhZG9zLmFzcHhkZAICDw8WBB8DBQZQcmVuc2EfDgUyL0VTL0JpZW52ZW5pZGFQcmVuc2EvUGFnaW5hcy9CaWVudmVuaWRhUHJlbnNhLmFzcHhkZAIEDw8WBB8DBRpJZGVudGlmaWNhY2nDs24gZGUgVXN1YXJpbx8OBTQvRVMvQXJlYVVzdWFyaW9zL1BhZ2luYXMvSWRlbnRpZmljYWNpb25Vc3Vhcmlvcy5hc3B4ZGQCBg8PFgQfAwUGQ29ycmVvHw4FKGh0dHA6Ly9vdXRsb29rLmNvbS9vd2EvYXNhbWJsZWFtYWRyaWQuZXNkZAIlD2QWAgIDD2QWAgIBDxYCHwALKwQBZAI1D2QWAgIHD2QWAgIBDw8WAh8EaGQWAgIDD2QWAmYPZBYCAgMPZBYCAgUPDxYEHgZIZWlnaHQbAAAAAAAAeUABAAAAHgRfIVNCAoABZBYCAgEPPCsACQEADxYEHg1QYXRoU2VwYXJhdG9yBAgeDU5ldmVyRXhwYW5kZWRnZGQCSQ9kFgICAg9kFgICAQ9kFgICAw8WAh8ACysEAWQYAgVBY3RsMDAkUGxhY2VIb2xkZXJMZWZ0TmF2QmFyJFVJVmVyc2lvbmVkQ29udGVudDMkVjRRdWlja0xhdW5jaE1lbnUPD2QFKUNvbXBvc2ljacOzbiBkZSBsYSBBc2FtYmxlYVxMb3MgRGlwdXRhZG9zZAVHY3RsMDAkUGxhY2VIb2xkZXJUb3BOYXZCYXIkUGxhY2VIb2xkZXJIb3Jpem9udGFsTmF2JFRvcE5hdmlnYXRpb25NZW51VjQPD2QFGkluaWNpb1xRdcOpIGVzIGxhIEFzYW1ibGVhZJ',
                        '__EVENTVALIDATION': '/wEWCALIhqvYAwKh2YVvAuDF1KUDAqCK1bUOAqCKybkPAqCKnbQCAqCKsZEJAvejv84Dtkx5dCFr3QGqQD2wsFQh8nP3iq8',
                        '__VIEWSTATEGENERATOR': 'BAB98CB3',
                        '__REQUESTDIGEST': '0x476239970DCFDABDBBDF638A1F9B026BD43022A10D1D757B05F1071FF3104459B4666F96A47B4845D625BCB2BE0D88C6E150945E8F5D82C189B56A0DA4BC859D'}

            yield scrapy.FormRequest(url=url, formdata=formdata, callback=self.takeEachParty)

    def takeEachParty(self, response):
        print(response.css('ul.listadoVert02 ul li::text').extract())

Looking at the site's source code, I can see what the links look like and how they send the JavaScript postback. This is one of the links I need to access:

<a id="ctl00_m_g_36ea0310_893d_4a19_9ed1_88a133d06423_ctl00_Repeater1_ctl00_lnk_Grupo" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl00$lnk_Grupo&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, true))">Grupo Parlamentario Popular de la Asamblea de Madrid</a>

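I have read a lot of articles about this, but the problem is probably my own lack of knowledge.

Thanks in advance.

EDITED:

SOLUTION: I finally did it! I translated Padraic Cunningham's very helpful code into the Scrapy way. Since I asked the question specifically about Scrapy, I want to post the result in case someone runs into the same problem I had.

Here it is: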

import scrapy
import js2xml


class AsambleaMadrid(scrapy.Spider):

    name = "AsambleaMadrid"
    start_urls = ['http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx']

    def parse(self, response):
        source = response
        hrefs = response.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href").extract()
        form_data = self.validate(source)
        for ref in hrefs:
            # js2xml parses the JS function and its params, which lets us grab the __EVENTTARGET
            js_xml = js2xml.parse(ref)
            _id = js_xml.xpath(
                "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[0]
            form_data["__EVENTTARGET"] = _id.text

            url_diputado = 'http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx'
            # The proper way to send a POST in Scrapy is with FormRequest
            yield scrapy.FormRequest(url=url_diputado, formdata=form_data, callback=self.extract_parties, method='POST')

    def validate(self, source):
        # These hidden fields are the minimum required; they cannot be hardcoded,
        # so read them from the page that was just fetched
        data = {"__VIEWSTATEGENERATOR": source.xpath("//*[@id='__VIEWSTATEGENERATOR']/@value")[0].extract(),
                "__EVENTVALIDATION": source.xpath("//*[@id='__EVENTVALIDATION']/@value")[0].extract(),
                "__VIEWSTATE": source.xpath("//*[@id='__VIEWSTATE']/@value")[0].extract(),
                "__REQUESTDIGEST": source.xpath("//*[@id='__REQUESTDIGEST']/@value")[0].extract()}
        return data

    def extract_parties(self, response):
        source = response
        # Names of the deputies listed for the selected parliamentary group
        name = source.xpath("//ul[@class='listadoVert02']/ul/li/a/text()").extract()
        print(name)

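I hope it is clear. Thanks again, everyone!

Best answer

If you look at the data posted to the form in Chrome or Firebug, you can see that many fields are passed in the POST request. A few of them are essential and must be parsed from the original page. Parsing the ids from the div.sangria ul li a tags is not sufficient, because the data actually posted is slightly different: what gets posted is the argument of the Javascript function WebForm_DoPostBackWithOptions, which lives in the href, not in the id attribute: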

href='javascript:WebForm_DoPostBackWithOptions(new 
 WebForm_PostBackOptions("ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl03$lnk_Grupo", "", true, "", "", false, true))'>

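Sometimes every underscore in the id is simply replaced with a dollar sign, so a str.replace is enough to get the posted value. That is not the case here: only some of the underscores change, and the ones inside the g_36ea0310_... part stay as they are. We could use a regex to parse the href, but I like the js2xml lib, which can parse a javascript function and its args into an xml tree.

The following code, using requests, shows you how to get the data from the initial request and reach all the pages you want: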
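
For readers who would rather avoid the extra dependency, the regex route mentioned above can be sketched roughly as follows. This is only a minimal illustration, not part of the original answer: the helper name event_target is hypothetical, and it assumes the href has already been HTML-unescaped (as lxml and Scrapy return attribute values) and that the target is always the first string argument of WebForm_PostBackOptions.

```python
import re

# Captures the first quoted argument of WebForm_PostBackOptions(...), which is
# the value ASP.NET expects in __EVENTTARGET. Either quote style is accepted.
_TARGET_RE = re.compile(r"""WebForm_PostBackOptions\(\s*(["'])(.*?)\1""")


def event_target(href):
    """Return the postback target found in a javascript: href, or None."""
    match = _TARGET_RE.search(href)
    return match.group(2) if match else None


# Sample href taken from the link shown in the answer.
href = ('javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions('
        '"ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl03$lnk_Grupo", '
        '"", true, "", "", false, true))')
print(event_target(href))
```

If the href comes from raw HTML text instead of a parsed attribute, the &quot; entities would need to be decoded first, for example with html.unescape from the standard library.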

import requests
from lxml import html
import js2xml

post = "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx"


def validate(xml):
    # These hidden fields are the minimum required; they cannot be hardcoded,
    # so read them from the page that was just fetched
    data = {"__VIEWSTATEGENERATOR": xml.xpath("//*[@id='__VIEWSTATEGENERATOR']/@value")[0],
            "__EVENTVALIDATION": xml.xpath("//*[@id='__EVENTVALIDATION']/@value")[0],
            "__VIEWSTATE": xml.xpath("//*[@id='__VIEWSTATE']/@value")[0],
            "__REQUESTDIGEST": xml.xpath("//*[@id='__REQUESTDIGEST']/@value")[0]}
    return data


with requests.Session() as s:
    # make the initial request to get the links/hrefs and the form fields
    r = s.get(
        "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx")
    xml = html.fromstring(r.content)
    hrefs = xml.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href")
    form_data = validate(xml)
    for h in hrefs:
        # parse the javascript: href and pull out the postback target
        js_xml = js2xml.parse(h)
        _id = js_xml.xpath(
            "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[0]
        form_data["__EVENTTARGET"] = _id.text
        # post the form once per link and print the names on the resulting page
        r = s.post(post, data=form_data)
        xml = html.fromstring(r.content)
        print(xml.xpath("//ul[@class='listadoVert02']/ul/li/a/text()"))

If we run the code above, we see the different text output from all the anchor tags:

In [2]: with requests.Session() as s:
   ...:         r = s.get(
   ...:             "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx")
   ...:         xml = html.fromstring(r.content)
   ...:         hrefs = xml.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href")
   ...:         form_data = validate(xml)
   ...:         for h in hrefs:
   ...:                 js_xml = js2xml.parse(h)
   ...:                 _id = js_xml.xpath(
   ...:                     "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[
   ...:                     0]
   ...:                 form_data["__EVENTTARGET"] = _id.text
   ...:                 r = s.post(post, data=form_data)
   ...:                 xml = html.fromstring(r.content)
   ...:                 print(xml.xpath("//ul[@class='listadoVert02']/ul/li/a/text()"))
   ...:         
[u'Abo\xedn Abo\xedn, Sonsoles Trinidad', u'Adrados Gautier, M\xaa Paloma', u'Aguado Del Olmo, M\xaa Josefa', u'\xc1lvarez Padilla, M\xaa Nadia', u'Arribas Del Barrio, Jos\xe9 M\xaa', u'Ballar\xedn Valc\xe1rcel, \xc1lvaro C\xe9sar', u'Berrio Fern\xe1ndez-Caballero, M\xaa In\xe9s', u'Berzal Andrade, Jos\xe9 Manuel', u'Cam\xedns Mart\xednez, Ana', u'Carballedo Berlanga, M\xaa Eugenia', 'Cifuentes Cuencas, Cristina', u'D\xedaz Ayuso, Isabel Natividad', u'Escudero D\xedaz-Tejeiro, Marta', u'Fermosel D\xedaz, Jes\xfas', u'Fern\xe1ndez-Quejo Del Pozo, Jos\xe9 Luis', u'Garc\xeda De Vinuesa Gardoqui, Ignacio', u'Garc\xeda Mart\xedn, Mar\xeda Bego\xf1a', u'Garrido Garc\xeda, \xc1ngel', u'G\xf3mez Ruiz, Jes\xfas', u'G\xf3mez-Angulo Rodr\xedguez, Juan Antonio', u'Gonz\xe1lez Gonz\xe1lez, Isabel Gema', u'Gonz\xe1lez Jim\xe9nez, Bartolom\xe9', u'Gonz\xe1lez Taboada, Jaime', u'Gonz\xe1lez-Mo\xf1ux V\xe1zquez, Elena', u'Gonzalo L\xf3pez, Rosal\xeda', 'Izquierdo Torres, Carlos', u'Li\xe9bana Montijano, Pilar', u'Mari\xf1o Ortega, Ana Isabel', u'Moraga Valiente, \xc1lvaro', u'Mu\xf1oz Abrines, Pedro', u'N\xfa\xf1ez Guijarro, Jos\xe9 Enrique', u'Olmo Fl\xf3rez, Luis Del', u'Ongil Cores, M\xaa Gador', 'Ortiz Espejo, Daniel', u'Ossorio Crespo, Enrique Mat\xedas', 'Peral Guerra, Luis', u'P\xe9rez Baos, Ana Isabel', u'P\xe9rez Garc\xeda, David', u'Pla\xf1iol De Lacalle, Regina M\xaa', u'Redondo Alcaide, M\xaa Isabel', u'Roll\xe1n Ojeda, Pedro', u'S\xe1nchez Fern\xe1ndez, Alejandro', 'Sanjuanbenito Bonal, Diego', u'Serrano Guio, Jos\xe9 Tom\xe1s', u'Serrano S\xe1nchez-Capuchino, Alfonso Carlos', 'Soler-Espiauba Gallo, Juan', 'Toledo Moreno, Lucila', 'Van-Halen Acedo, Juan']
[u'Andaluz Andaluz, M\xaa Isabel', u'Ardid Jim\xe9nez, M\xaa Isabel', u'Carazo G\xf3mez, M\xf3nica', u'Casares D\xedaz, M\xaa Luc\xeda Inmaculada', u'Cepeda Garc\xeda De Le\xf3n, Jos\xe9 Carmelo', 'Cruz Torrijos, Diego', u'Delgado G\xf3mez, Carla', u'Franco Pardo, Jos\xe9 Manuel', u'Freire Campo, Jos\xe9 Manuel', u'Gabilondo Pujol, \xc1ngel', 'Gallizo Llamas, Mercedes', u"Garc\xeda D'Atri, Ana", u'Garc\xeda-Rojo Garrido, Pedro Pablo', u'G\xf3mez Montoya, Rafael', u'G\xf3mez-Chamorro Torres, Jos\xe9 \xc1ngel', u'Gonz\xe1lez Gonz\xe1lez, M\xf3nica Silvana', u'Leal Fern\xe1ndez, M\xaa Isaura', u'Llop Cuenca, M\xaa Pilar', 'Lobato Gandarias, Juan', u'L\xf3pez Ruiz, M\xaa Carmen', u'Manguan Valderrama, Eva M\xaa', u'Maroto Illera, M\xaa Reyes', u'Mart\xednez Ten, Carmen', u'Mena Romero, M\xaa Carmen', u'Moreno Navarro, Juan Jos\xe9', u'Moya Nieto, Encarnaci\xf3n', 'Navarro Lanchas, Josefa', 'Nolla Estrada, Modesto', 'Pardo Ortiz, Josefa Dolores', u'Quintana Viar, Jos\xe9', u'Rico Garc\xeda-Hierro, Enrique', u'Rodr\xedguez Garc\xeda, Nicol\xe1s', u'S\xe1nchez Acera, Pilar', u'Sant\xedn Fern\xe1ndez, Pedro', 'Segovia Noriega, Juan', 'Vicente Viondi, Daniel', u'Vinagre Alc\xe1zar, Agust\xedn']
['Abasolo Pozas, Olga', 'Ardanuy Pizarro, Miguel', u'Beirak Ulanosky, Jazm\xedn', u'Camargo Fern\xe1ndez, Ra\xfal', 'Candela Pokorna, Marco', 'Delgado Orgaz, Emilio', u'D\xedaz Rom\xe1n, Laura', u'Espinar Merino, Ram\xf3n', u'Espinosa De La Llave, Mar\xeda', u'Fern\xe1ndez Rubi\xf1o, Eduardo', u'Garc\xeda G\xf3mez, M\xf3nica', 'Gimeno Reinoso, Beatriz', u'Guti\xe9rrez Benito, Eduardo', 'Huerta Bravo, Raquel', u'L\xf3pez Hern\xe1ndez, Isidro', u'L\xf3pez Rodrigo, Jos\xe9 Manuel', u'Mart\xednez Abarca, Hugo', u'Morano Gonz\xe1lez, Jacinto', u'Ongil L\xf3pez, Miguel', 'Padilla Estrada, Pablo', u'Ruiz-Huerta Garc\xeda De Viedma, Lorena', 'Salazar-Alonso Revuelta, Cecilia', u'San Jos\xe9 P\xe9rez, Carmen', u'S\xe1nchez P\xe9rez, Alejandro', u'Serra S\xe1nchez, Isabel', u'Serra S\xe1nchez, Clara', 'Sevillano De Las Heras, Elena']
[u'Aguado Crespo, Ignacio Jes\xfas', u'\xc1lvarez Cabo, Daniel', u'Gonz\xe1lez Pastor, Dolores', u'Iglesia Vicente, M\xaa Teresa De La', 'Lara Casanova, Francisco', u'Marb\xe1n De Frutos, Marta', u'Marcos Arias, Tom\xe1s', u'Meg\xedas Morales, Jes\xfas Ricardo', u'N\xfa\xf1ez S\xe1nchez, Roberto', 'Reyero Zubiri, Alberto', u'Rodr\xedguez Dur\xe1n, Ana', u'Rubio Ruiz, Juan Ram\xf3n', u'Ruiz Fern\xe1ndez, Esther', u'Sol\xeds P\xe9rez, Susana', 'Trinidad Martos, Juan', 'Veloso Lozano, Enrique', u'Zafra Hern\xe1ndez, C\xe9sar']

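You can add the exact same logic to your spider; I just used requests to show you a working example. You should also be aware that not every asp.net site behaves the same, so you may have to re-validate for every post, as in this related answer.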
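
For sites that do need re-validation on every post, one possible approach is to re-read all hidden inputs from each response before building the next request. The sketch below is a hedged illustration using only the standard library (Python 3); HiddenFieldParser and hidden_fields are hypothetical names, and the sample HTML is made up for demonstration.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <input ... /> tags, because the default
        # handle_startendtag delegates to handle_starttag.
        if tag != "input":
            return
        attrs = dict(attrs)
        if (attrs.get("type") or "").lower() == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(page_html):
    """Return a dict of hidden form fields found in the given HTML text."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    return parser.fields


# Made-up sample page, standing in for the text of a real response.
sample = '''
<form>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="def" />
  <input type="text" name="q" value="ignored" />
</form>
'''
print(hidden_fields(sample))
```

In the requests loop, this could be used as form_data.update(hidden_fields(r.text)) after each s.post, before setting the next __EVENTTARGET. Whether a given site actually requires this is something to check case by case.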

Regarding "javascript - Scrapy crawl not working on ASPX site", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37577405/
