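xpath - Scrapy - Extracting items from a table

Tags: xpath scrapy

I'm trying to get my head around Scrapy but have hit a few dead ends.

I have two tables on a page and would like to extract the data from each one, then move on to the next page.

The tables look like this (the first is called Y1, the second Y2) and share the same structure.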

<div id="Y1" style="margin-bottom: 0px; margin-top: 15px;">
                                <h2>First information</h2><hr style="margin-top: 5px; margin-bottom: 10px;">                    

                <table class="table table-striped table-hover table-curved">
                    <thead>
                        <tr>
                            <th class="tCol1" style="padding: 10px;">First Col Head</th>
                            <th class="tCol2" style="padding: 10px;">Second Col Head</th>
                            <th class="tCol3" style="padding: 10px;">Third Col Head</th>
                        </tr>
                    </thead>
                    <tbody>

                        <tr>
                            <td>Info 1</td>
                            <td>Monday 5 September, 2016</td>
                            <td>Friday 21 October, 2016</td>
                        </tr>
                        <tr class="vevent">
                            <td class="summary"><b>Info 2</b></td>
                            <td class="dtstart" timestamp="1477094400"><b></b></td>
                            <td class="dtend" timestamp="1477785600">
                            <b>Sunday 30 October, 2016</b></td>
                        </tr>
                        <tr>
                            <td>Info 3</td>
                            <td>Monday 31 October, 2016</td>
                            <td>Tuesday 20 December, 2016</td>
                        </tr>


                    <tr class="vevent">
                        <td class="summary"><b>Info 4</b></td>                      
                        <td class="dtstart" timestamp="1482278400"><b>Wednesday 21 December, 2016</b></td>
                        <td class="dtend" timestamp="1483315200">
                        <b>Monday 2 January, 2017</b></td>
                    </tr>



                </tbody>
            </table>

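As you can see, the structure is a little inconsistent, but as long as I can get each td and output it to CSV, I'll be happy.

I tried using XPath, but that only confused me more.

My last attempt: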

import scrapy

class myScraperSpider(scrapy.Spider):
    name = "myScraper"

    allowed_domains = ["mysite.co.uk"]
    start_urls = (
        'https://mysite.co.uk/page1/',
    )

    def parse_products(self, response):
        products = response.xpath('//*[@id="Y1"]/table')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[1]').extract()[0]
            item['first'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[2]').extract()[0]
            item['last'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[3]').extract()[0]
            yield item

This runs without errors, but it only returns a lot of information about the crawl and no actual results.

Update:

  import scrapy

       class SchoolSpider(scrapy.Spider):
name = "school"

allowed_domains = ["termdates.co.uk"]
start_urls =    (
                'https://termdates.co.uk/school-holidays-16-19-abingdon/',
                )

  def parse_products(self, response):
  products = sel.xpath('//*[@id="Year1"]/table//tr')
 for p in products[1:]:
  item = dict()
  item['hol'] = p.xpath('td[1]/text()').extract_first()
  item['first'] = p.xpath('td[1]/text()').extract_first()
  item['last'] = p.xpath('td[1]/text()').extract_first()
  yield item

This gives me: IndentationError: unexpected indent

If I run the amended script below (thanks to @Granitosaurus) and output to CSV (-o schoolDates.csv), I get an empty file:

import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse_products(self, response):
        products = sel.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[1]/text()').extract_first()
            item['last'] = p.xpath('td[1]/text()').extract_first()
            yield item

Here is the log:

2017-03-23 12:04:08 [scrapy.core.engine] INFO: Spider opened
2017-03-23 12:04:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-23 12:04:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on ...
2017-03-23 12:04:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://termdates.co.uk/robots.txt> (referer: None)
2017-03-23 12:04:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://termdates.co.uk/school-holidays-16-19-abingdon/> (referer: None)
2017-03-23 12:04:08 [scrapy.core.scraper] ERROR: Spider error processing <GET https://termdates.co.uk/school-holidays-16-19-abingdon/> (referer: None)
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\python27\lib\site-packages\scrapy-1.3.3-py2.7.egg\scrapy\spiders\__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-23 12:04:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 467,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 11311,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 3, 23, 12, 4, 8, 845000),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/NotImplementedError': 1,
 'start_time': datetime.datetime(2017, 3, 23, 12, 4, 8, 356000)}
2017-03-23 12:04:08 [scrapy.core.engine] INFO: Spider closed (finished)
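
The NotImplementedError in this log comes from the base spider's own parse method: responses for start_urls are handed to parse by default, and a method called parse_products is never invoked. The following is only a toy sketch of that dispatch. BaseSpider, handle, WrongName, RightName and run are hypothetical stand-ins written for illustration; none of this is Scrapy's actual code.

```python
class BaseSpider:
    """Toy stand-in for a crawler base class (hypothetical, not Scrapy's code)."""

    def parse(self, response):
        # Default callback: useless until a subclass overrides it.
        raise NotImplementedError

    def handle(self, response):
        # Responses are always dispatched to parse() in this sketch.
        return list(self.parse(response))


class WrongName(BaseSpider):
    def parse_products(self, response):  # never reached by handle()
        yield {"page": response}


class RightName(BaseSpider):
    def parse(self, response):  # overrides the default callback
        yield {"page": response}


def run(spider):
    """Return the scraped items, or the name of the error that stopped the run."""
    try:
        return spider.handle("page-1")
    except NotImplementedError:
        return "NotImplementedError"


wrong_result = run(WrongName())
right_result = run(RightName())
print(wrong_result)
print(right_result)
```

Renaming the callback to parse, as the later versions of the spider do, is the smallest change that gets past this error.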

Update 2 (skipped rows): This pushes results to the CSV file, but skips every other row.

The shell shows {'hol': None, 'last': u'\r\n\t\t\t\t\t\t\t\t', 'first': None}

import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[2]/text()').extract_first()
            item['last'] = p.xpath('td[3]/text()').extract_first()
            yield item
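
The skipped rows appear to be the "vevent" rows, where the cell text sits inside a nested <b> tag, so a direct-text lookup such as td[1]/text() finds nothing or only whitespace. The sketch below reproduces that pattern on a trimmed copy of one such row. It uses the standard library's xml.etree.ElementTree rather than Scrapy selectors so that it runs without Scrapy installed; direct_text and all_text are hypothetical helper names, and the mapping to text() and //text() is an approximation.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of one "vevent" row from the table in the question.
ROW = """<tr class="vevent">
    <td class="summary"><b>Info 2</b></td>
    <td class="dtstart" timestamp="1477094400"><b></b></td>
    <td class="dtend" timestamp="1477785600">
    <b>Sunday 30 October, 2016</b></td>
</tr>"""


def direct_text(cell):
    """Text directly inside the cell, before any child tag (roughly td/text())."""
    return cell.text


def all_text(cell):
    """All text inside the cell, nested tags included (roughly td//text()), stripped."""
    return "".join(cell.itertext()).strip()


row = ET.fromstring(ROW)
cells = row.findall("td")

direct = [direct_text(c) for c in cells]  # None, None, then whitespace only
nested = [all_text(c) for c in cells]     # the visible text of each cell

print(direct)
print(nested)
```

The direct lookups give the same shape as the shell output above: nothing for the first two cells and only whitespace for the third.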

Solution: thanks to @vold. This scrapes every page in start_urls and copes with the inconsistent table layout.

# -*- coding: utf-8 -*-
import scrapy
from SchoolDates_1.items import Schooldates1Item

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',
                  'https://termdates.co.uk/school-holidays-3-dimensions',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('td[1]//text()').extract_first()
            item['first'] = product.xpath('td[2]//text()').extract_first()
            item['last'] = ''.join(product.xpath('td[3]//text()').extract()).strip()
            item['url'] = response.url
            yield item
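
For readers who want to try the join-and-strip idea without running a crawl, here is a minimal offline sketch of the same row handling, followed by CSV output. It is not the spider above: it swaps in the standard library's xml.etree.ElementTree and csv modules, works on a trimmed, well-formed copy of the table from the question, and the helper names cell_text, extract_rows and to_csv are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the table from the question.
TABLE = """<div id="Y1">
  <table>
    <thead>
      <tr><th>First Col Head</th><th>Second Col Head</th><th>Third Col Head</th></tr>
    </thead>
    <tbody>
      <tr>
        <td>Info 1</td>
        <td>Monday 5 September, 2016</td>
        <td>Friday 21 October, 2016</td>
      </tr>
      <tr class="vevent">
        <td class="summary"><b>Info 4</b></td>
        <td class="dtstart" timestamp="1482278400"><b>Wednesday 21 December, 2016</b></td>
        <td class="dtend" timestamp="1483315200">
        <b>Monday 2 January, 2017</b></td>
      </tr>
    </tbody>
  </table>
</div>"""

FIELDS = ["hol", "first", "last"]


def cell_text(cell):
    """Join every text fragment in the cell and trim surrounding whitespace."""
    return "".join(cell.itertext()).strip()


def extract_rows(html):
    """Return one dict per body row; the header row has no <td> and is skipped."""
    root = ET.fromstring(html)
    rows = []
    for tr in root.iter("tr"):
        cells = tr.findall("td")
        if len(cells) == len(FIELDS):
            rows.append(dict(zip(FIELDS, map(cell_text, cells))))
    return rows


def to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(TABLE)
csv_text = to_csv(rows)
print(csv_text)
```

Joining all text fragments before stripping is what lets plain cells and <b>-wrapped cells go through the same code path.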

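Best answer

You need to correct your code slightly. Since you have already selected all the rows in the table, there is no need to point to the table again. You can therefore shorten your XPath to something like td[1]//text()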

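# Note: Scrapy only calls this automatically if it is named parse
# (or is passed explicitly as a Request callback).
def parse_products(self, response):
    products = response.xpath('//*[@id="Year1"]/table//tr')
    # ignore the table header row
    for product in products[1:]:
        item = Schooldates1Item()
        # paths are relative to the current row; //text() also reaches text nested in <b>
        item['hol'] = product.xpath('td[1]//text()').extract_first()
        item['first'] = product.xpath('td[2]//text()').extract_first()
        item['last'] = product.xpath('td[3]//text()').extract_first()
        yield item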

I edited my answer because @stutray provided a link to the site.
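
To illustrate why the path inside the loop should be relative to the current row: in Scrapy, an expression that starts with // searches the whole document even when it is called on a row selector, so the original tr[1]/td[1] lookup returns the first row on every pass. The sketch below shows the same effect with the standard library's xml.etree.ElementTree (swapped in so it runs without Scrapy) on a trimmed sample. The <body> wrapper and the variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; the <body> wrapper is added so the div can be found by id.
PAGE = """<body>
  <div id="Y1">
    <table>
      <tbody>
        <tr><td>Info 1</td><td>Monday 5 September, 2016</td></tr>
        <tr><td>Info 3</td><td>Monday 31 October, 2016</td></tr>
      </tbody>
    </table>
  </div>
</body>"""

root = ET.fromstring(PAGE)
rows = root.findall(".//*[@id='Y1']/table/tbody/tr")

# Document-wide path evaluated again on every pass: always the first row.
repeated = [
    root.find(".//*[@id='Y1']/table/tbody/tr[1]/td[1]").text
    for _ in rows
]

# Path relative to the current row: a different cell on each pass.
relative = [row.find("td[1]").text for row in rows]

print(repeated)
print(relative)
```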

Source: a similar question, "xpath - Scrapy - Extracting items from a table", on Stack Overflow: https://stackoverflow.com/questions/42947417/
