java - XPath //*[@href] grabs more than just links

Tags: java html selenium xpath

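I am using the XPath //*[@href] to grab links from a web page. However, I noticed that besides the actual links on the page, Selenium also picks up javascript:void(0).

Why does this happen?

If you run this test against http://google.com you will find 45 links, 3 of which are not actually links but javascript:void(0).

How can I fix this?

Update: the expected output for http://google.com (as of this writing) is as follows: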

1 = https://www.google.com/images/branding/product/ico/googleg_lodp.ico
2 = https://www.google.com/
3 = https://www.google.com/setprefs?suggon=2&prev=https://www.google.com/?gws_rd%3Dssl&sig=0_OQUUDCX_hZxBr1qNxxxxxxxxxxxEH_4%3D
4 = https://mail.google.com/mail/?tab=wm
5 = https://www.google.com/imghp?hl=en&tab=wi&ei=3NfSWL2xxxxxxxBg&ved=0EKouCBgoAQ
6 = https://www.google.com/intl/en/options/
7 = https://myaccount.google.com/?utm_source=OGB
8 = https://www.google.com/webhp?tab=ww&ei=3NfSWL2DKpxxxxxxxg&ved=0EKkuCAIoAQ
9 = https://maps.google.com/maps?hl=en&tab=wl
10 = https://www.youtube.com/
11 = https://play.google.com/?hl=en&tab=w8
12 = https://news.google.com/nwshp?hl=en&tab=wn&ei=3NfSWL2xxxxxxxxxxxBg&ved=0EKkuCAYoBQ
13 = https://mail.google.com/mail/?tab=wm
14 = https://drive.google.com/?tab=wo
15 = https://www.google.com/calendar?tab=wc
16 = https://plus.google.com/?gpsrc=ogpy0&tab=wX
17 = https://translate.google.com/?hl=en&tab=wT
18 = https://photos.google.com/?tab=wq&pageId=none
19 = https://www.google.com/intl/en/options/
20 = http://www.google.com/shopping?hl=en&tab=wf&ei=3NxxxxxxxxxTYBg&ved=0EKkuCA0oDA
21 = https://wallet.google.com/?tab=wa
22 = https://www.google.com/finance?tab=we
23 = https://docs.google.com/document/?usp=docs_alc
24 = https://books.google.com/bkshp?hl=en&tab=wp&ei=3NfSWL2xxxxxxxxxxxBg&ved=0EKkuCBEoEA
25 = https://www.blogger.com/?tab=wj
26 = https://www.google.com/contacts/?hl=en&tab=wC
27 = https://hangouts.google.com/
28 = https://keep.google.com/
29 = https://www.google.com/intl/en/options/
30 = https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/%3Fgws_rd%3Dssl
----------------------- (removed)
32 = https://www.google.com/url?q=https://www.google.com/intl/en_us/homepage/search/sp-firefox.html%3Futm_source%3Dgoogle.com%26utm_medium%3Dpushdown%26utm_content%3Dswitch%26utm_campaign%3Dffdse&source=hpp&id=190xx319&ct=7&usg=AFxxxxxxxxbZR_QouKfSxxxxxxxuQ&cot=2
33 = https://www.google.com/webhp?hl=en&sa=X&ved=0ahUKxxxxxy8-rSAxxxxxxxx8QPAgD
34 = https://support.google.com/websearch/answer/186645?hl=en
35 = https://www.google.com/intl/en/policies/privacy/?fg=1
36 = https://www.google.com/intl/en/policies/terms/?fg=1
37 = https://www.google.com/preferences?hl=en
38 = https://www.google.com/preferences?hl=en&fg=1
39 = https://www.google.com/advanced_search?hl=en&fg=1
40 = https://www.google.com/history/optout?hl=en&fg=1
41 = https://support.google.com/websearch/?p=ws_results_help&hl=en&fg=1
------------------ (removed)
43 = https://www.google.com/intl/en/ads/?fg=1
44 = https://www.google.com/services/?fg=1
45 = https://www.google.com/intl/en/about.html?fg=1

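Best answer

The expression //*[@href] matches every element that has an "href" attribute, whatever its value, so elements whose href is javascript:void(0) are matched as well. If you really need all elements with an "href" attribute except those two links, you can use the following XPath: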

//*[@href][not(contains(@href,'javascript:void'))]
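To see what the extra predicate does, here is a minimal sketch that evaluates both expressions with the JDK's built-in javax.xml.xpath API instead of Selenium, so it runs without a browser. The class name HrefFilterSketch, the helper hrefs, and the SAMPLE markup are assumptions made up for illustration; they are not part of the original question or answer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class HrefFilterSketch {

    // Hypothetical, well-formed sample markup (not copied from google.com).
    static final String SAMPLE =
        "<html><head><link rel='icon' href='/favicon.ico'/></head><body>"
        + "<a href='https://www.google.com/'>Home</a>"
        + "<a href='javascript:void(0)'>Menu</a>"
        + "<a href='https://mail.google.com/mail/?tab=wm'>Mail</a>"
        + "</body></html>";

    // Evaluates an XPath expression against the markup and returns the
    // href attribute value of every matched element.
    static List<String> hrefs(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(((Element) nodes.item(i)).getAttribute("href"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Matches the <link> element and all three anchors.
        System.out.println(hrefs(SAMPLE, "//*[@href]"));
        // Same expression with the answer's predicate: the javascript:void(0) anchor is dropped.
        System.out.println(hrefs(SAMPLE, "//*[@href][not(contains(@href,'javascript:void'))]"));
    }
}
```

In a Selenium test the same expression string would typically be passed to driver.findElements(By.xpath(...)). Note that the filtered expression still matches non-anchor elements such as link; if only anchors are wanted, starting the expression with //a instead of //* is one option.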

Source: the original Stack Overflow question, https://stackoverflow.com/questions/42961167/
