python - Web scraping LinkedIn isn't giving me the HTML... What am I doing wrong?

Tags: python html selenium web-scraping beautifulsoup

I am trying to scrape the "Specialties" section from the About page of certain companies on LinkedIn. When I first tried scraping LinkedIn with Beautiful Soup it gave me an access-denied error, so I added headers to spoof a real browser. However, instead of the page's HTML, the request returns this output:

\n\nwindow.onload = function() {\n // Parse the tracking code from cookies.\n var trk = "bf";\n var trkInfo = "bf";\n var cookies = document.cookie.split("; ");\n for (var i = 0; i < cookies.length; ++i) {\n if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {\n trk = cookies[i].substring(8);\n }\n else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {\n trkInfo = cookies[i].substring(8);\n }\n }\n\n if (window.location.protocol == "http:") {\n // If "sl" cookie is set, redirect to https.\n for (var i = 0; i < cookies.length; ++i) {\n if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {\n window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);\n return;\n }\n }\n }\n\n // Get the new domain. For international domains such as\n // fr.linkedin.com, we convert it to www.linkedin.com\n var domain = "www.linkedin.com";\n if (domain != location.host) {\n var subdomainIndex = location.host.indexOf(".linkedin");\n if (subdomainIndex != -1) {\n domain = "www" + location.host.substring(subdomainIndex);\n }\n }\n\n window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +\n "&originalReferer=" + document.referrer.substr(0, 200) +\n "&sessionRedirect=" + encodeURIComponent(window.location.href);\n}\n\n'

import requests
from bs4 import BeautifulSoup as BS


url = 'https://www.linkedin.com/company/biotech/'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}

response = requests.get(url, headers=headers)
print(response.content) 

What am I doing wrong? I think it is trying to check cookies. Is there any way to add that to my code?

Best Answer

LinkedIn actually performs some interesting cookie setting and subsequent redirects, which prevents your code from working as-is. This is apparent from the JavaScript returned in response to your initial request. Essentially, the web server sets HTTP cookies for tracking information, and the JavaScript you are seeing parses those cookies before the final redirect happens. If you reverse-engineer the JavaScript, you will find that the final redirect looks like this (at least for me, based on my location and tracking info):

url = 'https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F'
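If you would rather build that URL programmatically than hard-code it, here is a minimal sketch that mirrors the redirect logic in the JavaScript above. The helper name build_authwall_url is mine, not part of any API; trk and trkInfo simply fall back to "bf" when the tracking cookies are absent, the referrer is truncated to 200 characters, and the original URL is URL-encoded as sessionRedirect:

from urllib.parse import quote

# Hypothetical helper mirroring LinkedIn's client-side redirect: with no
# trkCode/trkInfo cookies set, both values default to "bf", and the original
# URL is passed along URL-encoded as sessionRedirect.
def build_authwall_url(original_url, trk="bf", trk_info="bf", referer=""):
    return (
        "https://www.linkedin.com/authwall"
        f"?trk={trk}&trkInfo={trk_info}"
        f"&originalReferer={referer[:200]}"
        f"&sessionRedirect={quote(original_url, safe='')}"
    )

print(build_authwall_url('https://www.linkedin.com/company/biotech/'))
# https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F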

Alternatively, you can let Python's requests module maintain a session for you; it automatically manages cookies and other HTTP headers so you don't have to worry about them. The following should give you the HTML source you are looking for. I will leave it to you to bring in BeautifulSoup and parse out the content you want.

import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F'


with requests.Session() as s:
    response = s.get(url)
    print(response.content)
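Once the session request returns real HTML, parsing the specialties out of it might look roughly like the sketch below. Note that the dt/dd structure and the "Specialties" label are assumptions about the page markup, which LinkedIn can change at any time, so inspect the HTML you actually get back and adjust the selectors accordingly:

import requests
from bs4 import BeautifulSoup as BS

url = ('https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer='
       '&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F')

with requests.Session() as s:
    response = s.get(url)

soup = BS(response.content, 'html.parser')

# Assumed markup: a <dt> label containing "Specialties" followed by a <dd>
# holding the value. Adjust after inspecting the returned HTML.
label = soup.find('dt', string=lambda t: t and 'Specialties' in t)
if label:
    value = label.find_next_sibling('dd')
    print(value.get_text(strip=True) if value else 'Specialties value not found')
else:
    print('Specialties section not found in the returned HTML')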

Regarding "python - Web scraping LinkedIn isn't giving me the HTML... What am I doing wrong?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55762784/
