elasticsearch - Best settings for StormCrawler -> Elasticsearch if crawl politeness is not an issue?

Tags: elasticsearch web-crawler stormcrawler

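Our university web system has roughly 1,200 sites comprising a few million pages. We have StormCrawler installed and configured on a machine that runs Apache locally, with a drive mapped to the web environment's file system. This means we can let StormCrawler crawl as fast as it wants without generating any network traffic and without affecting the public web presence. We have the Tika parser running to index .doc, .pdf, etc.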

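  • All of the websites are under the *.example.com domain.
  • We have one Elasticsearch instance running with plenty of CPU.
  • The "index" index has 4 shards.
  • The "metrics" index has 1 shard.
  • The "status" index has 10 shards.

With all that in mind, what is the best crawl configuration to make the crawler ignore politeness and blast its way through the local web environment, crawling everything as fast as possible?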

Here are the current spout and bolt settings in es-crawler.flux:

    name: "www-all-crawler"

    includes:
        - resource: true
          file: "/crawler-default.yaml"
          override: false
        - resource: false
          file: "crawler-conf.yaml"
          override: true
        - resource: false
          file: "es-conf.yaml"
          override: true

    spouts:
      - id: "spout"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
        parallelism: 10

    bolts:
      - id: "partitioner"
        className: "com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt"
        parallelism: 1
      - id: "fetcher"
        className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
        parallelism: 2
      - id: "sitemap"
        className: "com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt"
        parallelism: 1
      - id: "parse"
        className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
        parallelism: 1
      - id: "index"
        className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
        parallelism: 1
      - id: "status"
        className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
        parallelism: 1
      - id: "status_metrics"
        className: "com.digitalpebble.stormcrawler.elasticsearch.metrics.StatusMetricsBolt"
        parallelism: 1
      - id: "redirection_bolt"
        className: "com.digitalpebble.stormcrawler.tika.RedirectionBolt"
        parallelism: 1
      - id: "parser_bolt"
        className: "com.digitalpebble.stormcrawler.tika.ParserBolt"
        parallelism: 1

    streams:
      - from: "spout"
        to: "partitioner"
        grouping:
          type: SHUFFLE
      - from: "spout"
        to: "status_metrics"
        grouping:
          type: SHUFFLE
      - from: "partitioner"
        to: "fetcher"
        grouping:
          type: FIELDS
          args: ["key"]
      - from: "fetcher"
        to: "sitemap"
        grouping:
          type: LOCAL_OR_SHUFFLE
      - from: "sitemap"
        to: "parse"
        grouping:
          type: LOCAL_OR_SHUFFLE
      - from: "parse"
        to: "index"
        grouping:
          type: LOCAL_OR_SHUFFLE
      - from: "fetcher"
        to: "status"
        grouping:
          type: FIELDS
          args: ["url"]
          streamId: "status"
      - from: "sitemap"
        to: "status"
        grouping:
          type: FIELDS
          args: ["url"]
          streamId: "status"
      - from: "parse"
        to: "status"
        grouping:
          type: FIELDS
          args: ["url"]
          streamId: "status"
      - from: "index"
        to: "status"
        grouping:
          type: FIELDS
          args: ["url"]
          streamId: "status"
      - from: "parse"
        to: "redirection_bolt"
        grouping:
          type: LOCAL_OR_SHUFFLE
      - from: "redirection_bolt"
        to: "parser_bolt"
        grouping:
          type: LOCAL_OR_SHUFFLE
      - from: "redirection_bolt"
        to: "index"
        grouping:
          type: LOCAL_OR_SHUFFLE
      - from: "parser_bolt"
        to: "index"
        grouping:
          type: LOCAL_OR_SHUFFLE
      - from: "redirection_bolt"
        to: "parser_bolt"
        grouping:
          type: LOCAL_OR_SHUFFLE
          streamId: "tika"

crawler-conf.yaml:

    # Custom configuration for StormCrawler
    # This is used to override the default values from crawler-default.yaml and provide additional ones
    # for your custom components.
    # Use this file with the parameter -conf when launching your extension of ConfigurableTopology.
    # This file does not contain all the key values but only the most frequently used ones. See crawler-default.yaml for an extensive list.

    config:
      topology.workers: 2
      topology.message.timeout.secs: 300
      topology.max.spout.pending: 100
      topology.debug: false
      fetcher.threads.number: 50
      # give 2gb to the workers
      worker.heap.memory.mb: 2048
      # mandatory when using Flux
      topology.kryo.register:
        - com.digitalpebble.stormcrawler.Metadata
      # metadata to transfer to the outlinks
      # used by Fetcher for redirections, sitemapparser, etc...
      # these are also persisted for the parent document (see below)
      # metadata.transfer:
      # - customMetadataName
      # lists the metadata to persist to storage
      # these are not transferred to the outlinks
      metadata.persist:
       - _redirTo
       - error.cause
       - error.source
       - isSitemap
       - isFeed
      http.agent.name: "Storm Crawler"
      http.agent.version: "1.0"
      http.agent.description: "built with StormCrawler Archetype 1.13"
      http.agent.url: "http://example.com/"
      http.agent.email: "noreply@example"
      # The maximum number of bytes for returned HTTP response bodies.
      # The fetched page will be trimmed to about 2 MB in this case
      # Set -1 to disable the limit.
      http.content.limit: 2000000
      jsoup.treat.non.html.as.error: false
      # FetcherBolt queue dump => uncomment to activate
      # if a file exists on the worker machine with the corresponding port number
      # the FetcherBolt will log the content of its internal queues to the logs
      # fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"
      parsefilters.config.file: "parsefilters.json"
      urlfilters.config.file: "urlfilters.json"
      # revisit a page every 2 days (value in minutes)
      # set it to -1 to never refetch a page
      fetchInterval.default: 2880
      # revisit a page with a fetch error after 2 hours (value in minutes)
      # set it to -1 to never refetch a page
      fetchInterval.fetch.error: 120
      # never revisit a page with an error (or set a value in minutes)
      ### Currently set to check back in 1 month.
      fetchInterval.error: 40320
      # text extraction for JSoupParserBolt
      textextractor.include.pattern:
       - DIV[id="block-edu-bootstrap-subtheme-content" class="block block-system block-system-main-block"]
       - MAIN[role="main"]
       - DIV[id="content--news"]
       - DIV[id="content--person"]
       - ARTICLE[class="node container node--type-facility facility-full node-101895 node--promoted node--view-mode-full py-5"]
       - ARTICLE[class="node container node--type-spotlight spotlight-full node-90543 node--promoted node--view-mode-full py-5"]
       - DIV[class="field field--name-field-content field--type-entity-reference-revisions field--label-hidden field__items"]
       - ARTICLE
       - BODY
    #   - DIV[id="maincontent"]
    #   - DIV[itemprop="articleBody"]
    #   - ARTICLE
      textextractor.exclude.tags:
       - STYLE
       - SCRIPT
       - FOOTER
      # custom fetch interval to be used when a document has the key/value in its metadata
      # and has been fetched successfully (value in minutes)
      # fetchInterval.FETCH_ERROR.isFeed=true: 30
      # fetchInterval.isFeed=true: 10
      # configuration for the classes extending AbstractIndexerBolt
      # indexer.md.filter: "someKey=aValue"
      indexer.url.fieldname: "url"
      indexer.text.fieldname: "content"
      indexer.canonical.name: "canonical"
      indexer.md.mapping:
      - parse.title=title
      - parse.keywords=keywords
      - parse.description=description
      - domain=domain
      # Metrics consumers:
      topology.metrics.consumer.register:
         - class: "org.apache.storm.metric.LoggingMetricsConsumer"
           parallelism.hint: 1

And es-conf.yaml:

    # configuration for Elasticsearch resources

    config:
      # ES indexer bolt
      # addresses can be specified as a full URL
      # if not we assume that the protocol is http and the port 9200
      es.indexer.addresses: "https://example.com:9200"
      es.indexer.index.name: "www-all-index"
      # es.indexer.pipeline: "_PIPELINE_"
      #### Check the document type thoroughly it needs to match with the elastic search index mapping ####
      es.indexer.doc.type: "doc"
      es.indexer.user: "{username}"
      es.indexer.password: "{password}"
      es.indexer.create: false
      #### Change the Cluster Name ####
      es.indexer.settings:
        cluster.name: "edu-web"
      # ES metricsConsumer
      es.metrics.addresses: "https://example.com:9200"
      es.metrics.index.name: "www-all-metrics"
      #### Check the document type thoroughly it needs to match with the elastic search index mapping ####
      es.metrics.doc.type: "datapoint"
      es.metrics.user: "{username}"
      es.metrics.password: "{password}"
      #### Change the Cluster Name ####
      es.metrics.settings:
        cluster.name: "edu-web"
      # ES spout and persistence bolt
      es.status.addresses: "https://example.com:9200"
      es.status.index.name: "www-all-status"
      #### Check the document type thoroughly it needs to match with the elastic search index mapping ####
      es.status.doc.type: "status"
      es.status.user: "{username}"
      es.status.password: "{password}"
      # the routing is done on the value of 'partition.url.mode'
      es.status.routing: true
      # stores the value used for the routing as a separate field
      # needed by the spout implementations
      es.status.routing.fieldname: "metadata.hostname"
      es.status.bulkActions: 500
      es.status.flushInterval: "5s"
      es.status.concurrentRequests: 1
      #### Change the Cluster Name ####
      es.status.settings:
        cluster.name: "edu-web"
      # spout config #
      # positive or negative filter parsable by the Lucene Query Parser
      # es.status.filterQuery: "-(metadata.hostname:stormcrawler.net)"
      # time in secs for which the URLs will be considered for fetching after an ack or fail
      spout.ttl.purgatory: 30
      # Min time (in msecs) to allow between 2 successive queries to ES
      spout.min.delay.queries: 1000
      # Delay since previous query date (in secs) after which the nextFetchDate value will be reset to the current time
      # Setting this to -1 or a large value means that the ES will cache the results but also that less and less results
      # might be returned.
      spout.reset.fetchdate.after: 120
      es.status.max.buckets: 50
      es.status.max.urls.per.bucket: 20
      # field to group the URLs into buckets
      es.status.bucket.field: "metadata.hostname"
      # field to sort the URLs within a bucket
      es.status.bucket.sort.field: "nextFetchDate"
      # field to sort the buckets
      es.status.global.sort.field: "nextFetchDate"
      # CollapsingSpout : limits the deep paging by resetting the start offset for the ES query
      es.status.max.start.offset: 500
      # AggregationSpout : sampling improves the performance on large crawls
      es.status.sample: false
      # AggregationSpout (expert): adds this value in mins to the latest date returned in the results and
      # use it as nextFetchDate
      es.status.recentDate.increase: -1
      es.status.recentDate.min.gap: -1
      topology.metrics.consumer.register:
           - class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.MetricsConsumer"
             parallelism.hint: 1
             #whitelist:
             #  - "fetcher_counter"
             #  - "fetcher_average.bytes_fetched"
             #blacklist:
             #  - "__receive.*"

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
        implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
        <!-- The filters below are necessary if you want to include the Tika module -->
        <!-- https://issues.apache.org/jira/browse/STORM-2428 -->


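OK, so you are actually dealing with a small number of distinct hostnames. You could really have it all on a single ES shard with a single ES spout. The main point is that the fetcher will enforce politeness based on the hostname, so the crawl will be relatively slow. You probably don't need more than one FetcherBolt instance either.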
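Because the per-hostname politeness described above is what limits throughput, one way to relax it on sites you own is to shorten the delay between requests to the same host and allow several threads per host queue. The fragment below is a minimal sketch for crawler-conf.yaml, not part of the original answer: the keys are the ones defined in StormCrawler's crawler-default.yaml, while the values are illustrative assumptions that would need tuning against what the local Apache server can sustain.

```yaml
# crawler-conf.yaml (sketch; all values are assumptions, not measured settings)
config:
  # total number of fetch threads per FetcherBolt instance (already 50 in the question)
  fetcher.threads.number: 50
  # allow several threads to pull from the same hostname queue concurrently
  fetcher.threads.per.queue: 5
  # delay in seconds between requests to the same queue when only one thread per queue is allowed
  fetcher.server.delay: 0.1
  # delay in seconds that, to my understanding, applies instead once
  # fetcher.threads.per.queue is greater than 1
  fetcher.server.min.delay: 0.0
```

How URLs are grouped into queues follows partition.url.mode, so with every site under the same domain, keeping the grouping by hostname should spread the load over many queues rather than one.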



Also retrieve more URLs from ES with each query.
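For the AggregationSpout used in the question, the number of URLs returned per query is bounded by the number of buckets multiplied by the number of URLs per bucket, both of which already appear in es-conf.yaml (currently 50 and 20). The fragment below is a sketch of how they might be raised; the specific numbers are assumptions chosen for illustration, not recommendations from the original answer.

```yaml
# es-conf.yaml (sketch; values are illustrative assumptions)
config:
  # number of hostname buckets returned by each query (50 in the question)
  es.status.max.buckets: 200
  # maximum number of URLs returned per bucket, i.e. per hostname (20 in the question)
  es.status.max.urls.per.bucket: 50
  # minimum time in msecs between two successive queries to ES (1000 in the question)
  spout.min.delay.queries: 500
```

Larger batches only help if the topology can keep them in flight, so topology.max.spout.pending (100 in the question's crawler-conf.yaml) may need to be raised alongside these values.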



By the way: if you would be happy to be listed on https://github.com/DigitalPebble/storm-crawler/wiki/Powered-By, please send me an email.


Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/55281184/

