I am experimenting with Scrapy and can't figure out how to retrieve plain text, without the markup, when it is nested inside tags. This is the URL I'm testing against: http://www.tripadvisor.com/ShowTopic-g293915-i3686-k8824646-What_s_the_coolest_thing_you_saw_or_did_in_Thailand-Thailand.html
Expected output: the content of the posts as separate elements in item['body']
My code:
import scrapy
from tripadvisor.items import TripadvisorItem


class TripadvisorSpider(scrapy.Spider):
    [...]

    def parse_thread_contents(self, response):
        url = response.url
        item = TripadvisorItem()
        for sel in response.xpath('//div[@class="balance"]'):
            item['body'] = sel.xpath('//div[@class="postBody"]//p').extract()
            yield item
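Best answer
You need to get the text() of the p elements. There is also a problem in the loop: you need to iterate over the posts one by one, extract each post's body, and collect the results in a list: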
item['body'] = ["".join(post.xpath('.//div[@class="postBody"]/p/text()').extract())
for post in response.xpath('//div[@class="postcontent"]')]
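Also note that the dot at the beginning of the expression matters: it makes the search context-specific, i.e. relative to the current post rather than the whole document.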
Demo:
In [1]: for post in response.xpath('//div[@class="postcontent"]'):
   ...:     print("".join(post.xpath('.//div[@class="postBody"]/p/text()').extract()))
   ...:
What's that memory you'll carry forever with you? Maybe you stayed on a floating hut in Khao Sok Lake, or you washed elephants in a sanctuary, or....I have no idea. Please share if you like, I'd love to hear!
The heat when you you go to for the first time, my blessing ceremony with my husband on Bottle Beach is up there, as is the first time I met him in Samui. Phang Nga Bay on the west coast is stunning and took my breath away, I overnighted on a friend's boat and watched the stars come out. Hong Island was amazing and arriving at Koh Racha before it had hotels on it. Early morning mist on the river at Amphawa whilst looking across to a beautiful temple, the Chao Praya River in Bangkok, the Reclining Buddha at Wat Pho - I could go on and on. : )
First trip to few years back. Not very informed, no smart phone, no google earth....rent a bike, with my wife and we just ride the bike "till the road ends"...ended up at their local uni, watch student going in and out of the uni gate, sat on the road side having a coke. No worries...just me and my wife.Cassnu, pls...go on and on...we dont mind.
...
This page is based on a similar question found on Stack Overflow (python: using Scrapy to iterate over a list and strip tags from text in nested tags): https://stackoverflow.com/questions/32528943/