Python: Selenium and PhantomJS

Tags: python selenium web-scraping beautifulsoup phantomjs

I am trying to scrape the following page: https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0

The text I want to get is:

Showing 114,877 results

The HTML:

<div class="jobs-search-results__count-sort pt3">
            <div class="jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4">
                Showing 114,877 results
            </div>

My Python code is:

from bs4 import BeautifulSoup
from selenium import webdriver

index_url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0'

# JavaScript snippet that registers a "visibilitychange" listener on the document
java = '!function(i,n){void 0!==i.addEventListener&&void 0!==i.hidden&&(n.liVisibilityChangeListener=function(){i.hidden&&(n.liHasWindowHidden=!0)},i.addEventListener("visibilitychange",n.liVisibilityChangeListener))}(document,window);'
browser = webdriver.PhantomJS()
browser.get(index_url)
browser.execute_script(java)
soup = BeautifulSoup(browser.page_source, "html.parser")
link = "jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4"
div = soup.find('div', {"class": link})
text = div.text

So far my code does not seem to work. I think it has something to do with how the JavaScript is executed.

I get the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-33-7cdc1c4e0894> in <module>()
      6 link = "jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4"
      7 div = soup.find('div', {"class":link})
----> 8 text = div.text

AttributeError: 'NoneType' object has no attribute 'text'

Output of soup:

<html><head>\n<script type="text/javascript">\nwindow.onload = function() {\n  // Parse the tracking code from cookies.\n  var trk = "bf";\n  var trkInfo = "bf";\n  var cookies = document.cookie.split("; ");\n  for (var i = 0; i < cookies.length; ++i) {\n    if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {\n      trk = cookies[i].substring(8);\n    }\n    else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {\n      trkInfo = cookies[i].substring(8);\n    }\n  }\n\n  if (window.location.protocol == "http:") {\n    // If "sl" cookie is set, redirect to https.\n    for (var i = 0; i < cookies.length; ++i) {\n      if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {\n        window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);\n        return;\n      }\n    }\n  }\n\n  // Get the new domain. For international domains such as\n  // fr.linkedin.com, we convert it to www.linkedin.com\n  var domain = "www.linkedin.com";\n  if (domain != location.host) {\n    var subdomainIndex = location.host.indexOf(".linkedin");\n    if (subdomainIndex != -1) {\n      domain = "www" + location.host.substring(subdomainIndex);\n    }\n  }\n\n  window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +\n      "&originalReferer=" + document.referrer.substr(0, 200) +\n      "&sessionRedirect=" + encodeURIComponent(window.location.href);\n}\n</script>\n</head><body></body></html>
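
The soup output above contains no job results at all: it is a stub whose script redirects to an `/authwall` URL, which would explain why `soup.find` returns `None`. A minimal sketch of a check for that situation follows. The helper name `is_authwall` and the shortened `stub` string are made up for illustration, and the check assumes the redirect target keeps the `/authwall` path seen in the output.

```python
from urllib.parse import urlparse


def is_authwall(page_source, current_url=""):
    """Return True if the page looks like the login redirect stub (hypothetical helper)."""
    # The stub in the question builds a redirect to ".../authwall?trk=...".
    if "/authwall" in page_source:
        return True
    # Once the redirect has run, the browser URL itself points at /authwall.
    return urlparse(current_url).path.startswith("/authwall")


# Shortened stand-in for the stub shown in the soup output (assumption, not the full page).
stub = (
    '<html><head><script>window.location.href = "https://" + domain + '
    '"/authwall?trk=" + trk;</script></head><body></body></html>'
)

print(is_authwall(stub))
print(is_authwall("<html><body><div>Showing 114,877 results</div></body></html>"))
```

With Selenium, the same function could be called as `is_authwall(browser.page_source, browser.current_url)` before parsing, so that the script fails with a clear message instead of an `AttributeError`.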

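Best answer

My solution uses webdriver.Chrome, since I have never used PhantomJS. There are two cases to consider if you want the results text: either you are logged in to LinkedIn from the driver instance, or you are not.

Suppose you are not logged in. Then the following code will do the job: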

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
text = soup.find('div',{'class':'results-context'}).text
print(text)

Suppose you are logged in:

from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
url = 'https://www.linkedin.com/jobs/search/?keywords=coach%20&location=United%20States&locationId=us%3A0'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')

# `class` is a reserved word in Python, so the variable needs another name
class_name = 'jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4'
text = soup.find('div', {'class': class_name}).text.split('\n')[1].lstrip()
print(text)
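
Matching on the full class string is brittle, and calling `.text` directly on the result of `soup.find` raises the `AttributeError` from the question whenever the element is missing. As a rough sketch (not part of the original answer), the lookup can match on a single class and return `None` when nothing is found. The function names `result_count_text` and `result_count` are hypothetical, and the sample HTML is the fragment from the question with its outer `div` closed.

```python
import re

from bs4 import BeautifulSoup


def result_count_text(html):
    """Return the results text, or None if the element is not in the page."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any one of the element's classes, so a single
    # distinctive class is enough.
    div = soup.find("div", class_="results-count-string")
    if div is None:
        return None
    return div.get_text(strip=True)


def result_count(text):
    """Pull the number out of a string such as 'Showing 114,877 results'."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


sample = """
<div class="jobs-search-results__count-sort pt3">
    <div class="jobs-search-results__count-string results-count-string Sans-15px-black-55% pb0 pl5 pr4">
        Showing 114,877 results
    </div>
</div>
"""

text = result_count_text(sample)
print(text)
print(result_count(text))
```

In the Selenium scripts above, `driver.page_source` would be passed in place of `sample`. A `None` result then signals that the page did not contain the element, for example because the browser was not logged in.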

This question and answer, Python: Selenium and PhantomJS, come from a similar question on Stack Overflow: https://stackoverflow.com/questions/45451285/
