Python web scraping with a changing href

I have been scraping some websites with Python 2.7:

    import requests
    from lxml import html

    page = requests.get(URL)
    tree = html.fromstring(page.content)

    prices = tree.xpath('//span[@class="product-price"]/text()')
    titles = tree.xpath('//span[@class="product-title"]/text()')

This works well for sites that have such clearly labeled tags, but many of the sites I run into use the following HTML layout:

<a href="https://www.retronintendokopen.nl/gameboy/games/gameboy-classic/populous" class="product-name"><strong>Populous</strong></a>

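(I am trying to extract the title: Populous.) Because the href is different for every title I extract, I tried the following on the example above, hoping that matching on the class alone would be enough, but it does not work: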

titles = tree.xpath('//a[@class="product-name"]/text()')

I was looking for a character that works like *, meaning "I don't care what is in here, just grab everything that has an href=", but I could not find one:

titles = tree.xpath('//a[@href="*"]/text()')

Also, do I need to specify that the a tag has a class= as well?

titles = tree.xpath('//a[@href="*" @class="product-name"]/text()')

Edit: I also found a workaround that relies only on the tags in the path:

titles = tree.xpath('//h3/a/@title')

An example of such a tag:

<h3><a href="http://www.a-retrogame.nl/index.php?id_product=5843&amp;controller=product&amp;id_lang=7" title="4 in 1 fun pack">4 in 1 fun pack</a></h3>

Best answer

Try this:

titles = tree.xpath('//a[@class="product-name"]//text()')

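Note the // after the class predicate. The title text sits inside the nested <strong> element, so /text(), which returns only the text nodes that are direct children of <a>, finds nothing, while //text() returns the text of all descendants.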
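
The difference can be checked on the markup from the question. The following is a minimal sketch, not a definitive solution: the wrapping <div> and the second link (example.com, class "nav-link") are made up for illustration. It also touches the wildcard part of the question: in XPath, a bare [@href] predicate matches any element that has an href attribute regardless of its value, and several conditions are joined with `and` rather than written side by side.

```python
from lxml import html

# The first link is copied from the question; the <div> wrapper and the
# second link are hypothetical additions for illustration.
SNIPPET = """
<div>
  <a href="https://www.retronintendokopen.nl/gameboy/games/gameboy-classic/populous" class="product-name"><strong>Populous</strong></a>
  <a href="https://example.com/other" class="nav-link">Home</a>
</div>
"""

tree = html.fromstring(SNIPPET)

# /text() returns only text nodes that are direct children of <a>.
# Here the only child of <a> is <strong>, so nothing is found.
direct = tree.xpath('//a[@class="product-name"]/text()')

# //text() also descends into child elements such as <strong>.
nested = tree.xpath('//a[@class="product-name"]//text()')

# [@href] means "has an href attribute, whatever its value";
# `and` combines it with the class condition.
any_href = tree.xpath('//a[@href and @class="product-name"]//text()')

print(direct)
print(nested)
print(any_href)
```

Since every <a> that matters here already carries the class, the [@href] condition is optional; the class predicate alone selects the same links.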

This question about Python web scraping with a changing href comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/43608620/
