项目管道

译者：OSGeo 中国

在一个项目被 Spider 抓取之后，它被发送到项目管道，该管道通过几个按顺序执行的组件来处理它。

每个项管道组件（有时称为“项管道”）都是一个实现简单方法的Python类。它们接收一个项目并对其执行操作，还决定该项目是否应继续通过管道，或者是否应删除并不再处理。

项目管道的典型用途有：

清理HTML数据
验证抓取的数据（检查项目是否包含某些字段）
检查重复项（并删除它们）
将刮下的项目存储在数据库中

编写自己的项目管道

每个item pipeline组件都是一个python类，必须实现以下方法：

process_item(self, item, spider)

对每个项管道组件调用此方法。 process_item() 必须：返回包含数据的dict，返回 Item （或任何后代类）对象，返回 Twisted Deferred or raise DropItem 例外。删除的项不再由其他管道组件处理。

| 参数: |

item (Item object or a dict) -- 物品被刮掉了
spider (Spider object) -- 刮掉物品的 Spider

| | --- | --- |

此外，它们还可以实现以下方法：

open_spider(self, spider)

当spider打开时调用此方法。

参数:	spider (`Spider` object) -- 打开的 Spider

close_spider(self, spider)

当spider关闭时调用此方法。

参数:	spider (`Spider` object) -- 关闭的 Spider

from_crawler(cls, crawler)

如果存在，则调用此ClassMethod从 Crawler . 它必须返回管道的新实例。爬虫对象提供对所有零碎核心组件（如设置和信号）的访问；它是管道访问它们并将其功能连接到零碎的一种方式。

参数:	crawler (`Crawler` object) -- 使用此管道的爬虫程序

项目管道示例

无价格的价格验证和删除项目

让我们看看下面的假设管道，它调整了 price 不包括增值税的项目的属性（ price_excludes_vat 属性），并删除不包含价格的项目：

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

将项目写入JSON文件

下面的管道将所有刮掉的项目（从所有 Spider ）存储到一个单独的管道中 items.jl 文件，每行包含一个以JSON格式序列化的项：

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

注解

jsonWriterPipeline的目的只是介绍如何编写项管道。如果您真的想将所有的抓取项存储到JSON文件中，那么应该使用 Feed exports .

将项目写入MongoDB

在本例中，我们将使用pymongo_uu将项目写入mongodb_u。在Scrapy设置中指定MongoDB地址和数据库名称；MongoDB集合以item类命名。

这个例子的要点是演示如何使用 from_crawler() 方法和如何正确清理资源。：：

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

项目截图

这个例子演示了如何返回延迟的 process_item() 方法。它使用splash_u呈现项目url的屏幕截图。管道向本地运行的splash_uuu实例发出请求。在下载请求并触发延迟回调之后，它将项目保存到文件中，并将文件名添加到项目中。

import scrapy
import hashlib
from urllib.parse import quote

class ScreenshotPipeline(object):
    """Pipeline that uses Splash to render screenshot of
 every Scrapy item."""

    SPLASH_URL = "http://localhost:8050/render.png?url={}"

    def process_item(self, item, spider):
        encoded_item_url = quote(item["url"])
        screenshot_url = self.SPLASH_URL.format(encoded_item_url)
        request = scrapy.Request(screenshot_url)
        dfd = spider.crawler.engine.download(request, spider)
        dfd.addBoth(self.return_item, item)
        return dfd

    def return_item(self, response, item):
        if response.status != 200:
            # Error happened, return item.
            return item

        # Save screenshot to file, filename will be hash of url.
        url = item["url"]
        url_hash = hashlib.md5(url.encode("utf8")).hexdigest()
        filename = "{}.png".format(url_hash)
        with open(filename, "wb") as f:
            f.write(response.body)

        # Store filename in item.
        item["screenshot_filename"] = filename
        return item

重复筛选器

查找重复项并删除已处理的项的筛选器。假设我们的项目有一个唯一的ID，但是我们的spider返回具有相同ID的多个项目：

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

激活项目管道组件

若要激活项管道组件，必须将其类添加到 ITEM_PIPELINES 设置，如以下示例中所示：

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

在此设置中分配给类的整数值决定了它们的运行顺序：项从低值类传递到高值类。习惯上把这些数字定义在0-1000范围内。

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Files

14.md

14.md

项目管道

编写自己的项目管道

项目管道示例

无价格的价格验证和删除项目

将项目写入JSON文件

将项目写入MongoDB

项目截图

重复筛选器

激活项目管道组件

Collapse file tree

Files

14.md

Latest commit

History

14.md

File metadata and controls

项目管道

编写自己的项目管道

项目管道示例

无价格的价格验证和删除项目

将项目写入JSON文件

将项目写入MongoDB

项目截图

重复筛选器

激活项目管道组件