python - How do I get the job description using Scrapy?

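I am new to Scrapy and XPath, but I have been programming in Python for some time. I would like to get the email, the name of the person making the offer, and the phone number from the page https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ using Scrapy. As you can see, the email and phone are provided as text inside the <p> tags, which makes them harder to extract.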

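My idea is to first get the text inside the Job Overview, or at least all the text describing this particular job, and then use a regex to get the email, the phone number, and, if possible, the name of the person.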
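As a rough illustration of the regex step described above, here is a minimal sketch using only the standard library. The sample text, the contact details in it, and both patterns are assumptions made up for the example; they are deliberately simple and will not cover every real-world email or phone format. Extracting a person's name reliably with a regex is much harder and is not attempted here.

```python
import re

# Hypothetical sample text standing in for the scraped job description.
sample_text = (
    "Please send your application to Jane Doe at jobs@example.com "
    "or call +49 30 1234567 for questions."
)

# Deliberately simple patterns (assumptions, not a complete grammar).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Optional leading "+", then digits mixed with spaces, brackets, slashes or hyphens.
PHONE_RE = re.compile(r"\+?\d[\d\s()/-]{6,}\d")

emails = EMAIL_RE.findall(sample_text)
phones = [p.strip() for p in PHONE_RE.findall(sample_text)]

print(emails)
print(phones)
```

In a real run, sample_text would be replaced by the joined strings returned from the XPath query.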

So I started the Scrapy shell with the command scrapy shell https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ and got the response from there.

Now, when I try to get all the text from the div job_description, I get essentially nothing. I used

full_des = response.xpath('//div[@class="job_description"]/text()').extract()

It returns [u'\t\t\t\n\t\t ']

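How do I get all the text from the page mentioned? Obviously, the next task will be to extract the attributes mentioned above, but first things first.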

Update: this selection returns only []: response.xpath('//div[@class="job_description"]/div[@class="container"]/div[@class="row"]/text()').extract()

Best answer

You are very close with

full_des = response.xpath('//div[@class="job_description"]/text()').extract()

The div tag does not actually contain any text of its own besides what you are already getting.

<div class="job_description" (...)>
    "This is the text you are getting"
    <p>"This is the text you want"</p>
</div>

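As you can see, the text you get with response.xpath('//div[@class="job_description"]/text()').extract() is the text directly between the div tags, not the text between the tags nested inside the div. For that, you need: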

response.xpath('//div[@class="job_description"]//*/text()').extract()

What this does is select all descendant elements of div[@class="job_description"] and return their text (see an XPath reference for what the different expressions do).

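You will see that this also returns a lot of useless text, because you still get all the \n characters and so on. For that reason, I suggest narrowing your XPath to the element you want instead of taking a broad approach.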

For example, the whole job description would be in

response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract()
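Even with a narrower XPath, whitespace-only strings such as '\n' can remain in the result. One possible cleanup, sketched here with a hypothetical list shaped like the output of .extract(), is to strip each string and drop the empty ones before joining. The sample values are made up for the example.

```python
# Hypothetical list shaped like what .extract() returns for a text() query.
raw = [u'\t\t\t\n\t\t ', 'Working Student', '\n', '  Offpage SEO  ', '\n\t']

# Strip surrounding whitespace and discard strings that are empty afterwards.
cleaned = [s.strip() for s in raw if s.strip()]

# Join the remaining pieces into one string, ready for regex matching.
description = " ".join(cleaned)

print(cleaned)
print(description)
```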

This answer to "python - How do I get the job description using Scrapy?" comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/41178659/
