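python - How to get good results from Scrapy

Tags: python html web-scraping scrapy

I am trying to scrape details from Wikipedia using Scrapy. The scrape itself works, but the results I get are confusing and messy. I am new to both Python and Scrapy, so I am having a hard time fixing this.

Here is my code: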

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from wikipedia.items import WikipediaItem

class WikipediaSpider(BaseSpider):
    name = "wiki"
    allowed_domains = ["wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//table[@id="mp-upper"]/tr')
        items = []
        for site in sites:
            item = WikipediaItem()
            item['title'] = site.select('.//a/text()').extract()
            item['link'] = site.select('.//a/@href').extract()
            item['details'] = site.select('.//p/text()').extract()
            items.append(item)
        return items

Here is the result:

2013-04-19 02:18:48+0800 [wiki] DEBUG: Scraped from <200 http://en.wikipedia.org/wiki/Main_Page>

    {'details': [u' is a fungal species found in moist habitats in ',

                 u'. The species produces brown ',
                 u' with ',

                 u' of varying shapes up to 40 millimetres (1.6\xa0in) across, and tall, thin ',

                 u' up to 62 millimetres (2.4\xa0in) long, at the base of which is a large and well-defined "bulb". The stem varies in colour, with whitish, pale yellow-brown, pale red-brown, pale brown and grey-brown all observed. The species produces unusually shaped, irregular ',

                 u', each with a few thick protrusions. This feature helps differentiate it from other species that would otherwise be similar in appearance and ',

                 u'. It grows in ',

                 u' association with ',

                 u', and it is for this that the species is named. However, particular species favoured by the fungus are unclear and may include ',

                 u' and ',

                 u' taxa. The mushrooms grow from the ground, often among mosses or ',

                 u'. The species was first described in 2009, and within the genus ',

                 u', it is a part of the ',

                 u' ',

                 u'. The ',

                 u' ',

                 u' was collected from the shore of a lake near ',

                 u', Finland. The species has also been recorded in Sweden and, at least in some areas, it is relatively common. (',

                 u')',

                 u'Recently featured: ',

                 u'\xa0\u2013 ',

                 u'\xa0\u2013 ',

                 u': ',

                 u' ',

                 u' ',

                 u'More anniversaries: ',

                 u' ',

                 u' '],

     'link': [u'/wiki/File:Inocybe_saliceticola.jpg',

              u'/wiki/Inocybe_saliceticola',

              u'/wiki/Nordic_countries',

              u'/wiki/Mushrooms',

              u'/wiki/Pileus_(mycology)',

              u'/wiki/Stipe_(mycology)',

              u'/wiki/Spore',

              u'/wiki/Habit_(biology)',

              u'/wiki/Mycorrhizal',

              u'/wiki/Willow',

              u'/wiki/Beech',

              u'/wiki/Alder',

              u'/wiki/Detritus',

              u'/wiki/Section_(botany)',

              u'/wiki/Holotype',

              u'/wiki/Nurmes',

              u'/wiki/Inocybe_saliceticola',

              u'/wiki/Thistle,_Utah',

              u'/wiki/Be_Here_Now_(album)',

              u'/wiki/Sumatran_rhinoceros',

              u'/wiki/Wikipedia:Today%27s_featured_article/April_2013',

              u'https://lists.wikimedia.org/mailman/listinfo/daily-article-l',

              u'/wiki/Wikipedia:Featured_articles',

              u'/wiki/Wikipedia:Recent_additions',

              u'/wiki/File:Ezra_Meeker_1921_crop.jpg',

              u'/wiki/Ezra_Meeker',

              u'/wiki/Oregon_Trail',

              u'/wiki/Bullock_cart',

              u'/wiki/Italy_at_the_2009_Mediterranean_Games',

              u'/wiki/2009_Mediterranean_Games_medal_table',

              u'/wiki/Cossack_hetman',

              u'/wiki/Ivan_Petrizhitsky-Kulaga',

              u'/wiki/Cossacks',

              u'/wiki/Fokus_(magazine)',

              u'/wiki/Amir_Garrett',

              u'/wiki/College_basketball',


              u'/wiki/Fastball',

              u'/wiki/Armenian_Genocide',

              u'/wiki/Karin_dialect',

              u'/wiki/Scottish_American',

              u'/wiki/Daniel_Pennie_House',

              u'/wiki/Wikipedia:Recent_additions',

              u'/wiki/Wikipedia:Your_first_article',

              u'/wiki/Template_talk:Did_you_know',

              u'/wiki/Slang',

              u'/wiki/Hammer',

              u'/wiki/Church_(building)',

              u'/wiki/Wikipedia:Today%27s_articles_for_improvement',

              u'/wiki/File:2013_Boston_Marathon_aftermath_people.jpg',

              u'/wiki/West_fertilizer_plant_explosion',

              u'/wiki/West,_Texas',

              u'/wiki/Texas',

              u'/wiki/Moment_magnitude_scale',

              u'/wiki/2013_Sistan_and_Baluchestan_earthquake',

              u'/wiki/Sistan_and_Baluchestan_Province',

              u'/wiki/15_April_2013_Iraq_attacks',

              u'/wiki/Boston_Marathon_bombings',

              u'/wiki/2013_Boston_Marathon',

              u'/wiki/Death_and_state_funeral_of_Hugo_Ch%C3%A1vez',

              u'/wiki/Nicol%C3%A1s_Maduro',

              u'/wiki/Venezuelan_presidential_election,_2013',

              u'/wiki/List_of_Presidents_of_Venezuela',

              u'/wiki/Adam_Scott_(golfer)',

              u'/wiki/2013_Masters_Tournament',

              u'/wiki/Government_of_India',

              u'/wiki/Bollywood',

              u'/wiki/Pran',

              u'/wiki/Dadasaheb_Phalke_Award',

              u'/wiki/Deaths_in_2013',

              u'/wiki/Colin_Davis',

              u'/wiki/Maria_Tallchief',

              u'/wiki/Jonathan_Winters',

              u'//en.wikinews.org/wiki/Main_Page',

              u'/wiki/Portal:Current_events',

              u'/wiki/April_18',

              u'/wiki/File:Stpetes.JPG',

              u'/wiki/1506',

              u'/wiki/St._Peter%27s_Basilica',

              u'/wiki/Vatican_City',

              u'/wiki/Old_St._Peter%27s_Basilica',

              u'/wiki/1689',

              u'/wiki/Militia_(United_States)',

              u'/wiki/Boston',

              u'/wiki/1689_Boston_revolt',

              u'/wiki/Dominion_of_New_England',

              u'/wiki/1923',

              u'/wiki/New_York_Yankees',

              u'/wiki/Major_League_Baseball',

              u'/wiki/Yankee_Stadium_(1923)',

              u'/wiki/1938',

              u'/wiki/Superman',

              u'/wiki/Jerry_Siegel',

              u'/wiki/Joe_Shuster',

              u'/wiki/Action_Comics_1',

              u'/wiki/Superhero',

              u'/wiki/Comic_book',

              u'/wiki/1947',

              u'/wiki/List_of_the_largest_artificial_non-nuclear_explosions',

              u'/wiki/Royal_Navy',

              u'/wiki/Tonne',

              u'/wiki/Ammunition',

              u'/wiki/Heligoland',

              u'/wiki/1949',

              u'/wiki/Republic_of_Ireland',

              u'/wiki/Commonwealth_of_Nations',

              u'/wiki/1996',

              u'/wiki/1996_shelling_of_Qana',

              u'/wiki/Qana',

              u'/wiki/Operation_Grapes_of_Wrath',

              u'/wiki/United_Nations_Interim_Force_in_Lebanon',

              u'/wiki/April_17',

              u'/wiki/April_18',

              u'/wiki/April_19',

              u'/wiki/Wikipedia:Selected_anniversaries/April',

              u'https://lists.wikimedia.org/mailman/listinfo/daily-article-l',

              u'/wiki/List_of_historical_anniversaries',

              u'/wiki/Coordinated_Universal_Time',

              u'//en.wikipedia.org/w/index.php?title=Main_Page&action=purge'],
     'title': [u'Inocybe saliceticola',

               u'Nordic countries',

               u'mushrooms',

               u'caps',

               u'stems',

               u'spores',

               u'habit',

               u'mycorrhizal',

               u'willow',

               u'beech',

               u'alder',

               u'detritus',

               u'section',

               u'holotype',

               u'Nurmes',

               u'Thistle, Utah',

               u'Be Here Now',

               u'Sumatran rhinoceros',

               u'Archive'

               u'List of historical anniversaries',

               u'UTC',

               u'Reload this page']}

Best answer

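I cannot access the same version of the page that you scraped, but the fragmented results you are seeing are most likely caused by the fact that Wikipedia text is full of links. When you call site.select('.//p/text()'), you only select the text nodes that sit directly under each <p> node. This means that whatever is inside a child node such as <a href=..>text</a> is not scraped. The link tags split the paragraph text into pieces, so you end up with a strange list of fragments.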
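The splitting effect can be reproduced without Scrapy. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for the Scrapy selector; SNIPPET, direct_text and all_text are illustrative names, and the markup is a shortened, assumed version of the paragraph in the question. direct_text collects only the text nodes that are direct children of <p> (roughly what './/p/text()' returns for one paragraph), while all_text also descends into the <a> children (roughly what './/p//text()' returns).

```python
import xml.etree.ElementTree as ET

# Assumed, simplified markup modelled on the paragraph from the question.
SNIPPET = (
    '<p><a href="/wiki/Inocybe_saliceticola">Inocybe saliceticola</a>'
    ' is a fungal species found in moist habitats in '
    '<a href="/wiki/Nordic_countries">Nordic countries</a>.</p>'
)

p = ET.fromstring(SNIPPET)


def direct_text(elem):
    """Return only the text nodes that are direct children of elem."""
    parts = []
    if elem.text:            # text before the first child element
        parts.append(elem.text)
    for child in elem:
        if child.tail:       # text that follows each child element
            parts.append(child.tail)
    return parts


def all_text(elem):
    """Return every text node under elem, including text inside children."""
    return list(elem.itertext())


fragments = direct_text(p)
full_sentence = ''.join(all_text(p))

print(fragments)
print(full_sentence)
```

The first list is missing the link texts, which is the same pattern as the 'details' list in the question; joining the descendant text nodes gives back the complete sentence.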

If you want to retrieve every node, you can use:

contents = site.select('.//p/node()').extract()
item['details'] = ''.join(contents)

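This way you get everything inside the <p> tags, including the <a> tags. If you only want the text without the link tags, you can then use strip_html(item['details']). (In fact, contents = site.select('.//p//text()').extract() might also work and is more XPath-oriented.)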
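The answer does not say where strip_html comes from, so the version below is only a hypothetical sketch of such a helper, built on the standard library's html.parser. The class name _TextCollector and the whitespace normalisation are assumptions made for this example, not part of Scrapy. If w3lib is available (it is normally installed alongside Scrapy), its remove_tags function in w3lib.html may be a ready-made alternative.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text content of a fragment and ignores all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Hypothetical helper: return markup with its tags removed.

    Runs of whitespace (including non-breaking spaces such as the
    u'\\xa0' seen in the scraped output) collapse to a single space.
    """
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return ' '.join(''.join(collector.parts).split())


# Usage with a joined node() result similar to the one built above.
details = (
    '<a href="/wiki/Pileus_(mycology)">caps</a> with '
    '<a href="/wiki/Stipe_(mycology)">stems</a>'
)
print(strip_html(details))
```

Selecting './/p//text()' directly, as the answer suggests, avoids the need for such a helper altogether; the joined result can still be passed through the same whitespace normalisation.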

This page on "python - How to get good results from Scrapy" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/16090497/
