python - Scrapy pipeline not inserting data into MySQL

Tags: python mysql scrapy

I am writing a Scrapy pipeline to store scraped data in a MySQL database. When the spider runs in the terminal it works fine, and the pipeline is even opened, but no data is sent to the database. Any help is appreciated! :)

Here is the pipeline code:

import datetime
import logging

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

import MySQLdb
import MySQLdb.cursors

from en_movie.items import EnMovie



class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['image_urls'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['image_urls'])
            return item



class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = ', '.join(image_paths)
        return item



class EnMovieStorePipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(host="localhost", user="root",
                                    passwd="pass", db="passdb",
                                    charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # parse_item already stores UTF-8-encoded strings, so indexing a
        # field with [0] and re-encoding would grab only its first
        # character; use the fields directly. image_urls is still a list.
        # The query compares ten columns, so ten values must be bound --
        # the original tuple was missing item['image_paths'].
        values = (item['Content_ID'], item['release_date'],
                  item['running_time'], item['Actress'], item['Series'],
                  item['Studio'], item['Director'], item['Label'],
                  item['image_paths'], item['image_urls'][0])
        # Must be self.cursor: a bare `cursor` raises NameError here,
        # which is why nothing reached the database.
        self.cursor.execute(
            """SELECT * FROM dmmactress_enmovielist
               WHERE Content_ID = %s AND release_date = %s
                 AND running_time = %s AND Actress = %s AND Series = %s
                 AND Studio = %s AND Director = %s AND Label = %s
                 AND image_paths = %s AND image_urls = %s""",
            values)
        result = self.cursor.fetchone()

        if result:
            print("data already exists")
        else:
            try:
                self.cursor.execute(
                    """INSERT INTO dmmactress_enmovielist
                       (Content_ID, release_date, running_time, Actress,
                        Series, Studio, Director, Label,
                        image_paths, image_urls)
                       VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)""",
                    values)
                self.conn.commit()
            except MySQLdb.Error as e:
                print("Error %d: %s" % (e.args[0], e.args[1]))
        # Always return the item so later pipelines receive it.
        return item
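The check-then-insert flow the pipeline attempts can be sketched in a self-contained way. The snippet below uses sqlite3 purely as a stand-in for MySQL so it runs without a server (note that sqlite uses `?` placeholders where MySQLdb uses `%s`), and the two-column table is a hypothetical simplification of `dmmactress_enmovielist`:

```python
import sqlite3

# In-memory stand-in for the MySQL table used above.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE movies (Content_ID TEXT, Actress TEXT)")

def store(item):
    # Check-then-insert, as in process_item: skip rows that already exist.
    cursor.execute("SELECT * FROM movies WHERE Content_ID = ? AND Actress = ?",
                   (item["Content_ID"], item["Actress"]))
    if cursor.fetchone():
        return False          # duplicate, nothing inserted
    cursor.execute("INSERT INTO movies VALUES (?, ?)",
                   (item["Content_ID"], item["Actress"]))
    conn.commit()
    return True

print(store({"Content_ID": "abc123", "Actress": "A"}))  # True (inserted)
print(store({"Content_ID": "abc123", "Actress": "A"}))  # False (duplicate)
```

Because every parameter is bound, the placeholder count must match the tuple length exactly, which is one of the mismatches in the pipeline above.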

Edit:

  def parse_item(self, response):
    for sel in response.xpath('//*[@id="contents"]/div[10]/section/section[1]/section[1]'):
        item = EnMovie()
        Content_ID = sel.xpath('normalize-space(div[2]/dl/dt[contains (.,"Content ID:")]/following-sibling::dd[1]/text())').extract()
        item['Content_ID'] = Content_ID[0].encode('utf-8')
        release_date = sel.xpath('normalize-space(div[2]/dl[1]/dt[contains (.,"Release Date:")]/following-sibling::dd[1]/text())').extract()
        item['release_date'] = release_date[0].encode('utf-8')
        running_time = sel.xpath('normalize-space(div[2]/dl[1]/dt[contains (.,"Runtime:")]/following-sibling::dd[1]/text())').extract()
        item['running_time'] = running_time[0].encode('utf-8')
        Series = sel.xpath('normalize-space(div[2]/dl[2]/dt[contains (.,"Series:")]/following-sibling::dd[1]/text())').extract()
        item['Series'] = Series[0].encode('utf-8')
        Studio = sel.xpath('normalize-space(div[2]/dl[2]/dt[contains (.,"Studio:")]/following-sibling::dd[1]/a/text())').extract()
        item['Studio'] = Studio[0].encode('utf-8')
        Director = sel.xpath('normalize-space(div[2]/dl[2]/dt[contains (.,"Director:")]/following-sibling::dd[1]/text())').extract()
        item['Director'] = Director[0].encode('utf-8')
        Label = sel.xpath('normalize-space(div[2]/dl[2]/dt[contains (.,"Label:")]/following-sibling::dd[1]/text())').extract()
        item['Label'] = Label[0].encode('utf-8')
        item['image_urls'] = sel.xpath('div[1]/img/@src').extract()

        actresses = sel.xpath("//*[@itemprop='actors']//*[@itemprop='name']/text()").extract()
        actress = [x.strip() for x in actresses]
        item['Actress'] = ", ".join(actress)
        yield item
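One pitfall worth noting: `parse_item` stores most fields as UTF-8-encoded byte strings (via `.encode('utf-8')`), so indexing such a field with `[0]` in a later pipeline yields a single byte, not the whole value. A tiny illustration (the value is a made-up example):

```python
# What parse_item puts into item['Content_ID']: an encoded byte string.
value = u"abc123".encode("utf-8")

# Slicing/indexing an encoded string grabs one byte, not the field:
print(value[:1])   # first byte only
print(value)       # the full value -- pass this to the pipeline instead
```

This is why the pipeline should use the stored field directly rather than `item['Content_ID'][0].encode('utf-8')`.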

Best answer

I got this working a few weeks ago, though I have since switched to a different database. At the time I based my code on this gist: https://gist.github.com/tzermias/6982723

If you use the code above, also remember to update your settings.py file...
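As a sketch of what that settings.py change typically looks like (the module path, priorities, and `MYSQL_*` values are assumptions based on this project's `en_movie` package and the gist's conventions, not the asker's actual file):

```python
# settings.py (sketch; paths and priorities are assumptions)
ITEM_PIPELINES = {
    "en_movie.pipelines.DuplicatesPipeline": 100,
    "en_movie.pipelines.MyImagesPipeline": 200,
    "en_movie.pipelines.EnMovieStorePipeline": 300,
}

# Connection settings read by a gist-style pipeline (hypothetical values).
MYSQL_HOST = "localhost"
MYSQL_DBNAME = "passdb"
MYSQL_USER = "root"
MYSQL_PASSWD = "pass"

# Required by ImagesPipeline to know where to save downloaded images.
IMAGES_STORE = "images"
```

If a pipeline class is missing from `ITEM_PIPELINES`, Scrapy never calls its `process_item`, which produces exactly the symptom described: the spider runs but nothing is stored.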

Regarding python - Scrapy pipeline not inserting data into MySQL, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38818846/
