python - Web scraping in Python, confusion with XPath

I have a basic knowledge of scraping. Here is a basic example:

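import requests
from lxml import html

page = requests.get('http://some_website.com')  # requests needs a scheme in the URL
tree = html.fromstring(page.text)
desc = tree.xpath('//div[@class = "my class"]/text()')  # the lxml method is xpath(), not path()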

My desc will return the text inside the div. But what should I do if my HTML is more complex, like this:

<tr>
    <th class="my class">some text</th>
    <td>some text</td>
</tr>

I need only the part that is inside <td></td>, which is itself inside <tr></tr>. And how would I proceed if the <tr> were inside a <div>?

Best answer

You should probably read an XPath tutorial to get a better understanding.

I need only the part that is inside <td></td> that is inside <tr></tr> And how would I proceed if the <tr> would be inside a <div>

In your case, that would be:

//div[@class = "my class"]//tr/td/text()
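As a minimal sketch of how that expression behaves with lxml, the snippet below runs it against an inline HTML string instead of a live page. The sample markup (the wrapping table and the "cell value" text) is hypothetical and only stands in for the asker's real page.

```python
from lxml import html

# Hypothetical markup: the question's table row wrapped in a div.
SAMPLE = """
<div class="my class">
  <table>
    <tr>
      <th class="my class">some text</th>
      <td>cell value</td>
    </tr>
  </table>
</div>
"""

tree = html.fromstring(SAMPLE)

# "//tr" matches a <tr> at any depth below the div; "/td" then selects
# its direct <td> children, and "text()" returns their text nodes.
cells = tree.xpath('//div[@class = "my class"]//tr/td/text()')
print(cells)  # ['cell value']
```

The double slash before tr is what lets the row sit anywhere under the div, for example inside a table or tbody, without those intermediate elements having to be spelled out.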

If you know "some text" in advance, you can move sideways with following-sibling:

//div[@class = "my class"]//th[. = "some text"]/following-sibling::td/text()
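A small sketch of the sibling lookup, again on hypothetical inline markup: two rows are used here so that it is visible that only the cell next to the matching header is returned. The row contents ("name", "Alice", "wanted value") are made up for illustration.

```python
from lxml import html

# Hypothetical markup with two rows; only the second header is "some text".
SAMPLE = """
<div class="my class">
  <table>
    <tr><th>name</th><td>Alice</td></tr>
    <tr><th>some text</th><td>wanted value</td></tr>
  </table>
</div>
"""

tree = html.fromstring(SAMPLE)

# Find the <th> whose text is exactly "some text", then step to the
# <td> elements that follow it within the same parent <tr>.
wanted = tree.xpath(
    '//div[@class = "my class"]//th[. = "some text"]'
    '/following-sibling::td/text()'
)
print(wanted)  # ['wanted value']
```

If the header text on the real page has leading or trailing whitespace, the exact comparison will not match; writing the predicate as normalize-space(.) = "some text" is one way to handle that.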

Regarding python - Web scraping in Python, confusion with XPath, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33129713/
