python - Converting scrapy to lxml

Tags: python scrapy lxml

My scrapy code looks like this:

for row in response.css("div#flexBox_flex_calendar_mainCal table tr.calendar_row"):
    print "================"
    print row.xpath(".//td[@class='time']/text()").extract()
    print row.xpath(".//td[@class='currency']/text()").extract()
    print row.xpath(".//td[@class='impact']/span/@title").extract()
    print row.xpath(".//td[@class='event']/span/text()").extract()
    print row.xpath(".//td[@class='actual']/text()").extract()
    print row.xpath(".//td[@class='forecast']/text()").extract()
    print row.xpath(".//td[@class='previous']/text()").extract()
    print "================"

I can get the same data using pure Python like this:

from lxml import html
import requests

page = requests.get('http://www.forexfactory.com/calendar.php?day=dec1.2011')

tree = html.fromstring(page.text)

print tree.xpath(".//td[@class='time']/text()")
print tree.xpath(".//td[@class='currency']/text()")
print tree.xpath(".//td[@class='impact']/span/@title")
print tree.xpath(".//td[@class='event']/span/text()")
print tree.xpath(".//td[@class='actual']/text()")
print tree.xpath(".//td[@class='forecast']/text()")
print tree.xpath(".//td[@class='previous']/text()")

But I need to process it row by row. My first attempt at porting to lxml failed:

from lxml import html
import requests

page = requests.get('http://www.forexfactory.com/calendar.php?day=dec1.2011')

tree = html.fromstring(page.text)

# fails: lxml elements have no .css() method
for row in tree.css("div#flexBox_flex_calendar_mainCal table tr.calendar_row"):
    print row.xpath(".//td[@class='time']/text()")
    print row.xpath(".//td[@class='currency']/text()")
    print row.xpath(".//td[@class='impact']/span/@title")
    print row.xpath(".//td[@class='event']/span/text()")
    print row.xpath(".//td[@class='actual']/text()")
    print row.xpath(".//td[@class='forecast']/text()")
    print row.xpath(".//td[@class='previous']/text()")

What is the correct way to port this scrapy code to pure lxml?

Edit: I've gotten a little closer. I can see a table{} object, but I just don't know how to iterate over it.

import urllib2
from lxml import etree


#import requests

def wgetUrl(target):
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except:
        return ''

    return outtxt


url = 'http://www.forexfactory.com/calendar.php?day='
date = 'dec1.2011'

data = wgetUrl(url + date)
parser = etree.HTMLParser()

tree   = etree.fromstring(data, parser)

for elem in tree.xpath("//div[@id='flexBox_flex_calendar_mainCal']"):
    print elem[0].tag, elem[0].attrib, elem[0].text
    # elem[1] is where the table is
    print elem[1].tag, elem[1].attrib, elem[1].text
    print elem[1]
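As for iterating over that table object: an lxml element behaves like a list of its direct children, and `iter()` walks descendants by tag. A small self-contained sketch (the snippet is a stand-in, not the real page):

```python
from lxml import etree

table = etree.fromstring(
    '<table>'
    '<tr class="calendar_row"><td>a</td></tr>'
    '<tr class="calendar_row"><td>b</td></tr>'
    '</table>')

# An element acts like a list of its direct children:
classes = [tr.get('class') for tr in table]
print(classes)

# iter() walks all descendants with a given tag:
cells = [td.text for td in table.iter('td')]
print(cells)
```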

Best answer

I like using lxml for scraping. However, I usually don't use its xpath functionality, opting instead for its ElementPath interface (via find/findall). The syntax is very similar. Below is how I would port your scrapy code.
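To illustrate how close the two syntaxes are, here is a small side-by-side sketch on a stand-in snippet; note that `.xpath(".../text()")` returns strings, while `findall` returns elements whose `.text` must be read:

```python
from lxml import etree

root = etree.HTML('<table><tr class="calendar_row">'
                  '<td class="time">2:00am</td></tr></table>')

# Full XPath via .xpath() -- returns a list of strings here:
via_xpath = root.xpath('.//td[@class="time"]/text()')

# The ElementPath subset via .findall() looks almost identical,
# but returns elements rather than strings:
via_findall = [td.text for td in root.findall('.//td[@class="time"]')]

print(via_xpath)
print(via_findall)
```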

Going through it line by line:

Initialization:

from lxml import etree

# analogue of xpath(".../text()").extract() for lxml etree nodes
def extract_text(elem):
    if elem is None:
        return None
    else:
        return ''.join(elem.itertext())

data = wgetUrl(url + date)  # wgetUrl, url, date as you defined them in your question
tree = etree.HTML(data)
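One behavioral note on the helper: `itertext()` gathers text from nested child elements too, whereas an XPath `text()` step returns only the direct text nodes. A quick self-contained check:

```python
from lxml import etree

def extract_text(elem):
    if elem is None:
        return None
    return ''.join(elem.itertext())

# a cell whose text is split across a nested <span>
td = etree.fromstring('<td class="time">2:<span>00am</span></td>')

print(extract_text(td))    # nested text is included
print(td.xpath('text()'))  # direct text nodes only
print(extract_text(None))  # a missing cell yields None
```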

Line 1

# original
for row in response.css("div#flexBox_flex_calendar_mainCal table tr.calendar_row"):

# ported
for row in tree.findall(r'.//div[@id="flexBox_flex_calendar_mainCal"]//table/tr[@class="calendar_row"]'):

Line 2

print "================" 

Line 3

# original
print row.xpath(".//td[@class='time']/text()").extract()
# ported
print extract_text(row.find(r'.//td[@class="time"]'))

Line 4

# original
print row.xpath(".//td[@class='currency']/text()").extract()
# ported
print extract_text(row.find(r'.//td[@class="currency"]'))

Line 5

# original
print row.xpath(".//td[@class='impact']/span/@title").extract()
# ported
td = row.find(r'.//td[@class="impact"]/span')
if td is not None and 'title' in td.attrib:
    print td.attrib['title']
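The attribute lookup can also be written with `Element.get`, which returns `None` when the attribute is missing, folding the explicit `'title' in td.attrib` check into one call. A sketch on a stand-in cell (the title text is illustrative):

```python
from lxml import etree

row = etree.fromstring(
    '<tr><td class="impact"><span title="Low Impact Expected"/></td></tr>')

span = row.find('.//td[@class="impact"]/span')
# .get() returns None for absent attributes instead of raising KeyError
print(span.get('title'))
print(span.get('nonexistent'))
```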

Line 6

# original
print row.xpath(".//td[@class='event']/span/text()").extract()
# ported
print extract_text(row.find(r'.//td[@class="event"]/span'))

Line 7

# original
print row.xpath(".//td[@class='actual']/text()").extract()
# ported
print extract_text(row.find(r'.//td[@class="actual"]'))

Line 8

# original
print row.xpath(".//td[@class='forecast']/text()").extract()
# ported
print extract_text(row.find(r'.//td[@class="forecast"]'))

Line 9

# original
print row.xpath(".//td[@class='previous']/text()").extract()
# ported
print extract_text(row.find(r'.//td[@class="previous"]'))

Line 10

print "================" 

Now, putting it all together:

from lxml import etree

def wgetUrl(target):
    # same as you defined it in the question
    pass

# analogue of xpath(".../text()").extract() for lxml etree nodes
def extract_text(elem):
    if elem is None:
        return None
    else:
        return ''.join(elem.itertext())

content = wgetUrl(your_url)  # wgetUrl as the function you defined in your question
node = etree.HTML(content)


for row in node.findall(r'.//div[@id="flexBox_flex_calendar_mainCal"]//table/tr[@class="calendar_row"]'):
    print "================" 
    print extract_text(row.find(r'.//td[@class="time"]'))
    print extract_text(row.find(r'.//td[@class="currency"]'))
    td = row.find(r'.//td[@class="impact"]/span')
    if td is not None and 'title' in td.attrib:
        print td.attrib['title']
    print extract_text(row.find(r'.//td[@class="event"]/span'))
    print extract_text(row.find(r'.//td[@class="actual"]'))
    print extract_text(row.find(r'.//td[@class="forecast"]'))
    print extract_text(row.find(r'.//td[@class="previous"]'))
    print "================"
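Since the URL in the question dates from 2011 and may no longer serve the same markup, the loop can be exercised offline against a stand-in snippet. Everything in the sample HTML below (event names, values) is illustrative, not real data:

```python
from lxml import etree

def extract_text(elem):
    if elem is None:
        return None
    return ''.join(elem.itertext())

# Illustrative stand-in for the calendar page's markup
SAMPLE = """
<div id="flexBox_flex_calendar_mainCal">
  <table>
    <tr class="calendar_row">
      <td class="time">2:00am</td>
      <td class="currency">EUR</td>
      <td class="impact"><span title="Low Impact Expected"></span></td>
      <td class="event"><span>German Retail Sales m/m</span></td>
      <td class="actual">0.7%</td>
      <td class="forecast">0.5%</td>
      <td class="previous">0.3%</td>
    </tr>
  </table>
</div>
"""

node = etree.HTML(SAMPLE)
rows = []
for row in node.findall(
        './/div[@id="flexBox_flex_calendar_mainCal"]//table/tr[@class="calendar_row"]'):
    impact = row.find('.//td[@class="impact"]/span')
    rows.append({
        'time':     extract_text(row.find('.//td[@class="time"]')),
        'currency': extract_text(row.find('.//td[@class="currency"]')),
        'impact':   impact.get('title') if impact is not None else None,
        'event':    extract_text(row.find('.//td[@class="event"]/span')),
        'actual':   extract_text(row.find('.//td[@class="actual"]')),
    })
print(rows)
```

Collecting the cells into a dict per row, rather than printing each field, also makes the scraped data easier to use downstream.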

Regarding python - Converting scrapy to lxml, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29657575/
