python - Iterating over a list and stripping tags from text inside nested tags with Scrapy

Tags: python python-2.7 xpath web-scraping scrapy

I am testing Scrapy and cannot figure out how to retrieve plain text, without the tags, when it is nested inside tags. This is the URL I am testing with: http://www.tripadvisor.com/ShowTopic-g293915-i3686-k8824646-What_s_the_coolest_thing_you_saw_or_did_in_Thailand-Thailand.html

Desired output: the content of the posts as separate elements in the item['body'] object

My code:

import scrapy

from tripadvisor.items import TripadvisorItem

class TripadvisorSpider(scrapy.Spider):
[...]

def parse_thread_contents(self, response):
    url = response.url
    item = TripadvisorItem()
    for sel in response.xpath('//div[@class="balance"]'):
        item['body'] = sel.xpath('//div[@class="postBody"]//p').extract()
    yield item

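Best answer

You need to get the text() of the p elements. There is also a problem with the loop: it overwrites item['body'] on every iteration, and the inner expression starts with //, so it searches the whole page each time. Instead, iterate over the posts one by one, get each post's body, and collect the results in a list: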

item['body'] = ["".join(post.xpath('.//div[@class="postBody"]/p/text()').extract()) 
                for post in response.xpath('//div[@class="postcontent"]')]

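Also note that the dot at the beginning of the expression matters: it makes the search relative to the current context node (the post being processed) instead of the whole document.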

Demo:

In [1]: for post in response.xpath('//div[@class="postcontent"]'):
   ...:     print("".join(post.xpath('.//div[@class="postBody"]/p/text()').extract()))
   ...:      
What's that memory you'll carry forever with you? Maybe you stayed on a floating hut in Khao Sok Lake, or you washed elephants in a sanctuary, or....I have no idea. Please share if you like, I'd love to hear!
The heat when you you go to  for the first time, my blessing ceremony with my husband on Bottle Beach is up there, as is the first time I met him in Samui. Phang Nga Bay on the west coast is stunning and took my breath away, I overnighted on a friend's boat and watched the stars come out. Hong Island was amazing and arriving at Koh Racha before it had hotels on it. Early morning mist on the river at Amphawa whilst looking across to a beautiful temple, the Chao Praya River in Bangkok, the Reclining Buddha at Wat Pho - I could go on and on. : )
First trip to  few years back. Not very informed, no smart phone, no google earth....rent a bike, with my wife and we just ride the bike "till the road ends"...ended up at their local uni, watch student going in and out of the uni gate, sat on the road side having a coke. No worries...just me and my wife.Cassnu, pls...go on and on...we dont mind.
...

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/32528943/
