python - Web scraping LinkedIn isn't giving me the HTML... What am I doing wrong?

Tags: python html selenium web-scraping beautifulsoup

I am trying to scrape the "Specialties" section from the About page of certain companies on LinkedIn. When I first tried scraping LinkedIn with Beautiful Soup it gave me an access-denied error, so I added headers to spoof a real browser. However, instead of the page's HTML, the request returns this output:

\n\nwindow.onload = function() {\n // Parse the tracking code from cookies.\n var trk = "bf";\n var trkInfo = "bf";\n var cookies = document.cookie.split("; ");\n for (var i = 0; i < cookies.length; ++i) {\n if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {\n trk = cookies[i].substring(8);\n }\n else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {\n trkInfo = cookies[i].substring(8);\n }\n }\n\n if (window.location.protocol == "http:") {\n // If "sl" cookie is set, redirect to https.\n for (var i = 0; i < cookies.length; ++i) {\n if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {\n window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);\n return;\n }\n }\n }\n\n // Get the new domain. For international domains such as\n // fr.linkedin.com, we convert it to www.linkedin.com\n var domain = "www.linkedin.com";\n if (domain != location.host) {\n var subdomainIndex = location.host.indexOf(".linkedin");\n if (subdomainIndex != -1) {\n domain = "www" + location.host.substring(subdomainIndex);\n }\n }\n\n window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +\n "&originalReferer=" + document.referrer.substr(0, 200) +\n "&sessionRedirect=" + encodeURIComponent(window.location.href);\n}\n\n'

import requests
from bs4 import BeautifulSoup as BS


url = 'https://www.linkedin.com/company/biotech/'
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}

response = requests.get(url, headers=headers)
print(response.content) 

What am I doing wrong? I think it is trying to check cookies. Is there any way to add that to my code?

Best Answer

LinkedIn actually performs some interesting cookie setting and subsequent redirects, which prevents your code from working as-is. This is apparent from the JavaScript returned in response to your initial request. Essentially, the web server sets HTTP cookies for tracking information, and the JavaScript you are seeing parses those cookies before the final redirect happens. If you reverse-engineer the JavaScript, you will find that the final redirect looks like this (at least for me, based on my location and tracking info):

url = 'https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F'
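If you would rather build that URL programmatically than hard-code it, here is a minimal sketch that mirrors the redirect logic in the JavaScript above. The helper name build_authwall_url is mine, not part of any API; trk and trkInfo simply fall back to "bf" when the tracking cookies are absent, the referrer is truncated to 200 characters, and the original URL is URL-encoded as sessionRedirect:

from urllib.parse import quote

# Hypothetical helper mirroring LinkedIn's client-side redirect: with no
# trkCode/trkInfo cookies set, both values default to "bf", and the original
# URL is passed along URL-encoded as sessionRedirect.
def build_authwall_url(original_url, trk="bf", trk_info="bf", referer=""):
    return (
        "https://www.linkedin.com/authwall"
        f"?trk={trk}&trkInfo={trk_info}"
        f"&originalReferer={referer[:200]}"
        f"&sessionRedirect={quote(original_url, safe='')}"
    )

print(build_authwall_url('https://www.linkedin.com/company/biotech/'))
# https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F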

Alternatively, you can let Python's requests module maintain a session for you; it automatically manages cookies and other HTTP headers so you don't have to worry about them. The following should give you the HTML source you are looking for. I will leave it to you to bring in BeautifulSoup and parse out the content you want.

import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F'


with requests.Session() as s:
    response = s.get(url)
    print(response.content)
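Once the session request returns real HTML, parsing the specialties out of it might look roughly like the sketch below. Note that the dt/dd structure and the "Specialties" label are assumptions about the page markup, which LinkedIn can change at any time, so inspect the HTML you actually get back and adjust the selectors accordingly:

import requests
from bs4 import BeautifulSoup as BS

url = ('https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer='
       '&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F')

with requests.Session() as s:
    response = s.get(url)

soup = BS(response.content, 'html.parser')

# Assumed markup: a <dt> label containing "Specialties" followed by a <dd>
# holding the value. Adjust after inspecting the returned HTML.
label = soup.find('dt', string=lambda t: t and 'Specialties' in t)
if label:
    value = label.find_next_sibling('dd')
    print(value.get_text(strip=True) if value else 'Specialties value not found')
else:
    print('Specialties section not found in the returned HTML')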

Regarding "python - Web scraping LinkedIn isn't giving me the HTML... What am I doing wrong?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55762784/
