ruby - Parsing JavaScript cookie links from a web page with Ruby, Nokogiri, and Mechanize

Tags: ruby cookies nokogiri mechanize scrape

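Hi everyone,
I need to parse a web page that sets a JavaScript cookie for each link. I can already parse the normal search results, list each product, and import it into a MySQL database.
I am able to scrape each product and its fields from the search results with the following code.
Here is what I have: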

    require 'rubygems'
    require 'logger'
    require 'mechanize'
    require 'mysql2'
    
    agent = Mechanize.new { |a| a.log = Logger.new(STDERR) }
    #agent.set_proxy('a-proxy', '8080')
    agent.read_timeout = 60
    
    # parse a raw cookie string and store each resulting cookie in the agent's jar
    def add_cookie(agent, uri, cookie_string)
      uri = URI.parse(uri)
      Mechanize::Cookie.parse(uri, cookie_string) do |cookie|
        agent.cookie_jar.add(uri, cookie)
      end
    end
    
    
    # get main page
    page = agent.get "http://www.site.com.mx"
    
    # get login form
    form = page.forms.first
    form.correo_ingresar = "user"
    form.password = "password"
    
    # submit login form
    page = agent.submit form
    
    # parse cookies
    myarray = page.body.scan(/SetCookie\(\"(.+)\", \"(.+)\"\)/)
    
    # set session cookies
    myarray.each do |item|
      add_cookie(agent, 'http://www.site.com.mx', "#{item[0]}=#{item[1]}; path=/; domain=www.site.com.mx")
    end
    # show 1000 search results per page
    add_cookie(agent, 'http://www.site.com.mx', "tampag=1000; path=/; domain=www.site.com.mx")
    
    # order results
    add_cookie(agent, 'http://www.site.com.mx', "orden_articulos=existencias asc; path=/; domain=www.site.com.mx")
    
    # section results
    add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=14; path=/; domain=www.site.com.mx")
    
    # get main page
    page = agent.get "http://www.site.com.mx/tienda/index.php"
    
    search_form = page.forms.first
    
    search_result = agent.submit search_form
    
    doc = Nokogiri::HTML(search_result.body)
    
    rows = doc.css("table.articulos tr")
    
    i = 0
    details = rows.collect do |row|
      detail = {}
      [
        [:sku, 'td[3]/text()'],
        [:desc, 'td[4]/text()'],
        [:qty, 'td[5]/text()'],
        [:qty2, 'td[5]/p/b/text()'],
        [:price, 'td[6]/text()']
      ].collect do |name, xpath|
        detail[name] = row.at_xpath(xpath).to_s.strip
      end
      i = i + 1
      detail
    end
    
    # walk through paginator links
    # uniq (not uniq!) so an array is returned even when there are no duplicates
    links = doc.css("a.paginar").map {|l| "http://www.site.com.mx#{l['href']}"}.uniq
    
    links.each do |l|
        page = agent.get l
    
        doc = Nokogiri::HTML(page.body)
    
        rows = doc.css("table.articulos tr")
    
        rows.each do |row|
            detail = {}
            [
                    [:sku, 'td[3]/text()'],
                    [:desc, 'td[4]/text()'],
                    [:qty, 'td[5]/text()'],
                    [:qty2, 'td[5]/p/b/text()'],
                    [:price, 'td[6]/text()']
            ].collect do |name, xpath|
                    detail[name] = row.at_xpath(xpath).to_s.strip
            end
            details << detail
        end
    end
    
    # update db
    client = Mysql2::Client.new(:host => "localhost", :username => "myusername", :password => "mypassword", :database => "mydatabase")
    
    details.each do |d|
        if d[:sku] != ""
            price = d[:price].split
    
            if price[1] == "D"
                currency = 144
            else
                currency = 168
            end
    
            cost = price[0].gsub(",", "").to_f
    
            if d[:qty] == ""
                qty = d[:qty2]
            else
                qty = d[:qty]
            end 
    
            results = client.query("SELECT * FROM jos_vm_product WHERE product_sku = '#{d[:sku]}' LIMIT 1;")
            if results.count == 1
                product = results.first
    
                client.query("UPDATE jos_vm_product SET product_sku = '#{d[:sku]}', product_name = '#{d[:desc]}', product_desc = '#{d[:desc]}', product_in_stock = '#{qty}' WHERE product_id = #{product['product_id']};")
    
                client.query("UPDATE jos_vm_product_price SET product_price = '#{cost}', product_currency = '#{currency}' WHERE product_id = '#{product['product_id']}';")
            else
                client.query("INSERT INTO jos_vm_product(product_sku, product_name, product_desc, product_in_stock) VALUES('#{d[:sku]}', '#{d[:desc]}', '#{d[:desc]}', '#{qty}');")
                last_id = client.last_id
    
                client.query("INSERT INTO jos_vm_product_price(product_id, product_price, product_currency) VALUES('#{last_id}', '#{cost}', #{currency});")
            end
        end
    end
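Now, instead of searching, I want to parse from the category list:
Main page link: http://www.site.com.mx/tienda/articulos.php?opcion=lineas&seccion_mostrar=11
It shows a table like the one below (every entry is a link).
The top name, ACCESORIOS, is the link to the ACCESORIOS category; the bold names listed under it are subcategories, and the names under each bold name are brands. If I click ACCESORIOS, it shows every brand and every subcategory mixed together, and so on.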
ACCESORIOS
Accesorios Multimedia(6)
ACTECK DE MEXICO (5), MANHATTAN (1)
Accesorios P/impres. Punto De Venta(1)
EPSON CORPORATION (1)
Accesorios Para Cableados De Patch Panels(1)
INTELLINET NETWORK SOLUTIONS (1)
Accesorios Para Camaras Digitales(1)
MANHATTAN (1)
Accesorios Para Computadoras De Escritorio(32)
ACTECK DE MEXICO (2), GENERICA (1), MANHATTAN (28), TARGUS (1)
Accesorios Para Computadoras Portatiles(60)
ACTECK DE MEXICO (3), GENIUS (2), HP COMERCIAL (2), HP IMPRESION (1), MANHATTAN (17), PERFECT CHOICES (32), SOLIDEX (1), TARGUS (1), TECH ZONE (1)
Accesorios Para Ipod(3)
ACTECK DE MEXICO (1), PERFECT CHOICES (2)
Accesorios Para Mesas(3)
MANHATTAN (2), PERFECT CHOICES (1)
Accesorios Para Redes(13)
INTELLINET NETWORK SOLUTIONS (5), MANHATTAN (8)
Accesoriso Para Celulares(14)
BLACKBERRY (14)
Adaptador Bluetooth(6)
ACTECK DE MEXICO (1), MANHATTAN (2), PERFECT CHOICES (3)
Adaptadores Para Mouse Y Teclado(3)
MANHATTAN (2), PERFECT CHOICES (1)
Audifono/diademas Y Microfonos(49)
ACTECK DE MEXICO (14), BTO (1), GENIUS (3), LOGITECH (2), MANHATTAN (11), PERFECT CHOICES (18)
This is the HTML for the table. Every link sets cookies in its onClick handler, which is why I have been having a hard time scraping it.
    <table width="95%" cellspacing="0" cellpadding="3" border="0">
    <tbody>
    <tr>
    <td valign="top" align="left" style="font-family: verdana; font-size: 12px" colspan="2"><a onClick="fijar_filtro('codigoseccion_buscar','11')" href="javascript:void(0)" class="busquedas"><b>ACCESORIOS</b></a></td>
    </tr>
    <tr>
    <td width="20" valign="top" align="left"></td>
    <td valign="top" align="left" style="font-family: verdana; font-size: 12px"><a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','338')" href="javascript:void(0)" class="busquedas"><b>Accesorios Multimedia</b>(6)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (5)</a>, <a onClick="SetCookie('codigolinea_buscar','338');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','540')" href="javascript:void(0)" class="busquedas"><b>Accesorios P/impres. Punto De Venta</b>(1)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','540');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','106');" href="javascript:void(0)" class="busquedas">EPSON CORPORATION (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','542');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','361')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Camaras Digitales</b>(1)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','361');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','277')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras De Escritorio</b>(32)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (2)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','530');" href="javascript:void(0)" class="busquedas">GENERICA (1)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (28)</a>, <a onClick="SetCookie('codigolinea_buscar','277');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','357')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Computadoras Portatiles</b>(60)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (3)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','167');" href="javascript:void(0)" class="busquedas">GENIUS (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','694');" href="javascript:void(0)" class="busquedas">HP COMERCIAL (2)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','107');" href="javascript:void(0)" class="busquedas">HP IMPRESION (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (17)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (32)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','212');" href="javascript:void(0)" class="busquedas">SOLIDEX (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','586');" href="javascript:void(0)" class="busquedas">TARGUS (1)</a>, <a onClick="SetCookie('codigolinea_buscar','357');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','691');" href="javascript:void(0)" class="busquedas">TECH ZONE (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1302')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Ipod</b>(3)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1302');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (2)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1175')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Mesas</b>(3)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1175');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','292')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Redes</b>(13)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','635');" href="javascript:void(0)" class="busquedas">INTELLINET NETWORK SOLUTIONS (5)</a>, <a onClick="SetCookie('codigolinea_buscar','292');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (8)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1378')" href="javascript:void(0)" class="busquedas"><b>Accesoriso Para Celulares</b>(14)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1378');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','714');" href="javascript:void(0)" class="busquedas">BLACKBERRY (14)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','1313')" href="javascript:void(0)" class="busquedas"><b>Adaptador Bluetooth</b>(6)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','602');" href="javascript:void(0)" class="busquedas">ACTECK DE MEXICO (1)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','1313');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (3)</a><br>
    <br>
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','555')" href="javascript:void(0)" class="busquedas"><b>Adaptadores Para Mouse Y Teclado</b>(3)</a><br>
    <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','585');" href="javascript:void(0)" class="busquedas">MANHATTAN (2)</a>, <a onClick="SetCookie('codigolinea_buscar','555');SetCookie('codigoseccion_buscar','11');fijar_filtro('codigomarca_buscar','532');" href="javascript:void(0)" class="busquedas">PERFECT CHOICES (1)</a><br>
    </td>
    </tr>
    </tbody>
    </table>
So the question is: what do I need to add to my code to visit each link, given that the links rely on JavaScript cookies?
Cookies used:
Name, value range
codigoseccion_buscar, 11-30
codigomarca_buscar, 100-736
codigolinea_buscar, 15-1385

Best answer

I managed to scrape the contents of one of those links by adding cookies in my Ruby code:

    # set cookies
    add_cookie(agent, 'http://www.site.com.mx', "codigoseccion_buscar=11; path=/; domain=www.site.com.mx")

    add_cookie(agent, 'http://www.site.com.mx', "codigolinea_buscar=; path=/; domain=www.site.com.mx")

    add_cookie(agent, 'http://www.site.com.mx', "codigomarca_buscar=; path=/; domain=www.site.com.mx")

    add_cookie(agent, 'http://www.site.com.mx', "textobuscar=; path=/; domain=www.site.com.mx")

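Oddly, it does not work if I add only one of these cookies. I had to add all of them, even the ones with no value, because each link carries its own set of cookies and clicking it deletes or clears the previously saved ones.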
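
The "set every filter cookie, even the empty ones" behaviour described above could be wrapped in a small helper. This is only a sketch: `FILTER_COOKIES` and `filter_cookie_strings` are hypothetical names, the four cookie names are the ones used in this answer, and the path and domain mirror the `add_cookie` calls shown earlier.

```ruby
# Cookie names taken from the answer above (assumed to be the full set the site checks).
FILTER_COOKIES = %w[
  codigoseccion_buscar
  codigolinea_buscar
  codigomarca_buscar
  textobuscar
].freeze

# Hypothetical helper: builds one cookie string per filter cookie.
# Any filter that is not supplied gets an empty value, so a stale
# value left over from a previous request is cleared.
def filter_cookie_strings(filters, domain = 'www.site.com.mx')
  FILTER_COOKIES.map do |name|
    "#{name}=#{filters.fetch(name, '')}; path=/; domain=#{domain}"
  end
end

cookie_strings = filter_cookie_strings('codigoseccion_buscar' => '11')
cookie_strings.each { |c| puts c }
```

Each resulting string can then be passed to the question's own helper, for example `add_cookie(agent, 'http://www.site.com.mx', c)`, before requesting the listing page.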

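Now I need to scrape those cookie values out of the links, use them as variables, and run a loop or something similar. Can anyone help? A link looks like this: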
    <a onClick="SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')" href="javascript:void(0)" class="busquedas"><b>Accesorios Para Cableados De Patch Panels</b>(1)</a><br>
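
One possible way to pull the cookie names and values out of a link like this is a regular expression over the onClick text. The sketch below is not a confirmed solution: `COOKIE_CALL` and `cookies_from_onclick` are hypothetical names, and it assumes that `fijar_filtro(name, value)` ends up setting a cookie the same way `SetCookie(name, value)` does, which the page source shown here does not prove.

```ruby
# Matches both SetCookie('name','value') and fijar_filtro('name','value').
# Treating fijar_filtro as a cookie setter is an assumption.
COOKIE_CALL = /(?:SetCookie|fijar_filtro)\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/

# Hypothetical helper: turns one onClick attribute value into a
# { cookie_name => cookie_value } hash. A nil attribute yields {}.
def cookies_from_onclick(onclick)
  onclick.to_s.scan(COOKIE_CALL).to_h
end

onclick = "SetCookie('codigomarca_buscar','');fijar_filtro('codigolinea_buscar','542')"
cookies = cookies_from_onclick(onclick)
p cookies
```

With Nokogiri, the attribute could be read from each `a.busquedas` element (the HTML parser usually lowercases attribute names, so `link['onclick']` is the likely key). Each name and value pair could then be passed to `add_cookie` before re-requesting the listing page inside a loop.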

A similar question to "ruby - Parsing JavaScript cookie links from a web page with Ruby, Nokogiri, and Mechanize" can be found on Stack Overflow: https://stackoverflow.com/questions/6286073/
