python - Scrapy selector XPath: extract with a matching regular expression or slice the string

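I am new to Scrapy and know a little Python.

I want to populate item['rating']. The rating comes as the string "rating is 4", but I only want the number. How can I get it?

I came up with the two attempts below, but I don't know whether they make sense. Neither of them worked.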

> item_pub['rating'] = review.xpath('/html/body//*/div[@class="details"]/table[@class="detailtoptable"]/tbody/tr[1]/td/img/@alt').re(r'\d+') #to extract only the number since the result with extract() would be "rating is 4"

or

> item_pub['rating'] = review.xpath('/html/body//*/div[@class="details"]/table[@class="detailtoptable"]/tbody/tr[1]/td/img/@alt')[-1:].extract() #to extract only the number since the result with extract() would be "rating is 4"

Thanks a lot for your help, and sorry for my English. I hope my question is clear.

Best answer

Your way of thinking is fine: use a regular expression. You just have a bad XPath.
Here are some tips:

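There is no need to write /html/body//; you can simply write //
There is no need to select every element with //* only to narrow down to a single element afterwards. You can go straight to the element you want: //div
If you found this XPath using a browser, there is most likely no tbody element in the actual page source, because browsers often add these elements themselves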

Try this:

item_pub['rating'] = review.xpath('//div[@class="details"]/table[@class="detailtoptable"]/tr[1]/td/img/@alt').re_first(r'\d+')
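
To see why the regular expression is the better of the two ideas in the question, here is a small sketch in plain Python that needs no Scrapy at all. It assumes the extracted value really is "rating is 4", as stated in the question. The list named extracted only imitates the list of strings that extract() returns.

```python
import re

alt_text = "rating is 4"   # value the question says extract() yields
extracted = [alt_text]      # extract() on a selector list gives a list of strings

# Slicing the list keeps the last element, not the last character.
last_element = extracted[-1:]

# Slicing the string itself works, but only while the rating is one digit.
last_char = extracted[0][-1:]

# A regular expression copes with any number of digits.
match = re.search(r"\d+", alt_text)
rating = int(match.group()) if match else None

print(last_element, last_char, rating)
```

Note that re_first returns a string, or None when nothing matches, so a conversion with int() like the one above is still needed if the item should hold a number.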

Regarding "python - Scrapy selector XPath: extract with a matching regular expression or slice the string", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29973752/
