python - Scrapy does not filter results according to allowed_domains

Note: First of all, I'm new to Scrapy, and I don't have enough reputation to comment on this question. So I decided to ask a new one!

Problem Statement:

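I was using BeautifulSoup to scrape email addresses from a particular website. It works fine if the email address is available on that particular page (i.e. example.com), but obviously not if it is only available on example.com/contact-us!

For that reason, I decided to use Scrapy. Although I am using allowed_domains to get only the links related to the domain, it also gives me all the off-site links. I then tried another approach, suggested by @agstudy in this question, which uses SgmlLinkExtractor in the rules.

Then I got this error: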

Traceback (most recent call last):     
    File "/home/msn/Documents/email_scraper/email_scraper/spiders/emails_spider.py", line 14, in <module>
        from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor  
    File "/home/msn/Documents/scrapy/lib/python3.5/site-packages/scrapy/contrib/linkextractors/sgml.py", line 7, in <module>  
      from scrapy.linkextractors.sgml import *  
    File "/home/msn/Documents/scrapy/lib/python3.5/site-packages/scrapy/linkextractors/sgml.py", line 7, in <module>  
      from sgmllib import SGMLParser  
ImportError: No module named 'sgmllib'

Basically, the ImportError is caused by the removal of sgmllib (the simple SGML parser) from Python 3.x.

What I've tried so far:

class EmailsSpiderSpider(scrapy.Spider):
    name = 'emails'
    # allowed_domains = ['example.com']
    start_urls = [
        'http://example.com/'
    ]

    rules = [
        Rule(SgmlLinkExtractor(allow_domains=("example.com"),), callback='parse_url'),
    ]

    def parse_url(self, response):
        hxs = HtmlXPathSelector(response)
        urls = hxs.select("//a/@href").extract()
        print(set(urls))  # sanity check

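I also tried LxmlLinkExtractor and CrawlSpider, but I still get off-site links.

What should I do to get this working? Or is my approach to the problem wrong?

Any help would be appreciated!

Another note: the website I scrape emails from will be different each time, so I can't use specific HTML or CSS selectors!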

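Best Answer

You are using the XPath expression hxs.select('//a/@href'), which extracts the href attribute value from every a tag on the page, so you get all links, including the off-site ones. You can use LinkExtractor instead; it would look like this: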

from scrapy.linkextractors import LinkExtractor

def parse_url(self, response):
    # Extract only the links that point to example.com.
    urls = [l.url for l in LinkExtractor(allow_domains='example.com').extract_links(response)]
    print(set(urls))  # sanity check

That is what LinkExtractor is really for (I guess).
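
To make the difference concrete, here is a minimal standard-library sketch of the idea behind domain filtering. It is not Scrapy's implementation; the names HrefCollector, is_on_domain, onsite_links and the sample HTML are hypothetical, and it assumes that "on-site" means the domain itself plus its subdomains. Collecting every href returns off-site links too, while the filter keeps only the on-domain ones.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag, much like the XPath //a/@href."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def is_on_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def onsite_links(html, base_url, domain):
    """Resolve every href against base_url and keep only on-domain links."""
    collector = HrefCollector()
    collector.feed(html)
    absolute = (urljoin(base_url, href) for href in collector.hrefs)
    return sorted({url for url in absolute if is_on_domain(url, domain)})


# Hypothetical page content, for illustration only.
SAMPLE_HTML = """
<a href="/contact-us">Contact</a>
<a href="http://blog.example.com/post">Blog</a>
<a href="http://other-site.org/">Partner</a>
<a href="http://notexample.com/">Lookalike</a>
"""

links = onsite_links(SAMPLE_HTML, "http://example.com/", "example.com")
print(links)
```

Note that the check compares whole host labels, so a lookalike host such as notexample.com is rejected even though its name ends with "example.com".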

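By the way, keep in mind that most of the Scrapy examples you can find on the Internet (including on Stack Overflow) refer to earlier versions that are not fully compatible with Python 3.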
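
Since the question rules out site-specific HTML or CSS selectors, one possible approach is to run a regular expression over the whole page text once the on-site pages have been fetched. The sketch below is only an illustration under that assumption: EMAIL_RE, extract_emails and the sample body are hypothetical names, and the pattern is deliberately simple rather than a complete address validator.

```python
import re

# Deliberately simple pattern: it catches common addresses but is not a full
# RFC 5322 validator, and it misses obfuscated forms such as "name [at] site".
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the unique email-like strings in text, lowercased and sorted."""
    return sorted({match.lower() for match in EMAIL_RE.findall(text)})


# Hypothetical page body, for illustration only.
SAMPLE_BODY = """
<p>Write to <a href="mailto:Info@example.com">info@example.com</a></p>
<footer>Sales: sales@example.com | Press: press@example.co.uk</footer>
"""

emails = extract_emails(SAMPLE_BODY)
print(emails)
```

Inside a Scrapy callback, the same function could presumably be applied to response.text, which holds the decoded body of an HTML response.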

Source: a similar question on Stack Overflow, "Scrapy does not filter results according to allowed_domains": https://stackoverflow.com/questions/41923550/
