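I am trying to scrape LinkedIn's About pages to get the "specialties" of certain companies. When I tried scraping LinkedIn with Beautiful Soup, I got an access denied error, so I am sending headers that imitate a browser. However, instead of the corresponding HTML, I get this output: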
\n\nwindow.onload = function() {\n // Parse the tracking code from cookies.\n var trk = "bf";\n var trkInfo = "bf";\n var cookies = document.cookie.split("; ");\n for (var i = 0; i < cookies.length; ++i) {\n if ((cookies[i].indexOf("trkCode=") == 0) && (cookies[i].length > 8)) {\n trk = cookies[i].substring(8);\n }\n else if ((cookies[i].indexOf("trkInfo=") == 0) && (cookies[i].length > 8)) {\n trkInfo = cookies[i].substring(8);\n }\n }\n\n if (window.location.protocol == "http:") {\n // If "sl" cookie is set, redirect to https.\n for (var i = 0; i < cookies.length; ++i) {\n if ((cookies[i].indexOf("sl=") == 0) && (cookies[i].length > 3)) {\n window.location.href = "https:" + window.location.href.substring(window.location.protocol.length);\n return;\n }\n }\n }\n\n // Get the new domain. For international domains such as\n // fr.linkedin.com, we convert it to www.linkedin.com\n var domain = "www.linkedin.com";\n if (domain != location.host) {\n var subdomainIndex = location.host.indexOf(".linkedin");\n if (subdomainIndex != -1) {\n domain = "www" + location.host.substring(subdomainIndex);\n }\n }\n\n window.location.href = "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trkInfo +\n "&originalReferer=" + document.referrer.substr(0, 200) +\n "&sessionRedirect=" + encodeURIComponent(window.location.href);\n}\n\n'
import requests
from bs4 import BeautifulSoup as BS

url = 'https://www.linkedin.com/company/biotech/'
# Browser-like headers so the request is not rejected outright
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}
response = requests.get(url, headers=headers)
print(response.content)
What am I doing wrong? I think it is trying to check for cookies. Is there a way I can add that to my code?
Best answer
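LinkedIn sets some cookies and then performs a redirect, which is why your code does not work as written. This is evident from the JavaScript returned by your initial request. Basically, the web server sets HTTP cookies that carry tracking information, and the JavaScript you received parses those cookies before the final redirect happens. If you reverse engineer the JavaScript, you will find that the final redirect goes here (at least for me, based on my location and tracking information):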
url = 'https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F'
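To make the reverse engineering concrete, here is a minimal sketch that mirrors the logic of the JavaScript shown in the question. The function name `build_authwall_url` and its parameters are hypothetical, the cookies are passed in as a plain dictionary, and `quote` with extra safe characters is only an approximation of `encodeURIComponent`. It is an illustration of what the script computes, not something LinkedIn documents.

```python
from urllib.parse import quote, urlsplit


def build_authwall_url(original_url, cookies=None, referrer=""):
    """Rebuild the redirect target the way the returned JavaScript does."""
    cookies = cookies or {}
    # The script falls back to "bf" when a tracking cookie is missing or empty.
    trk = cookies.get("trkCode") or "bf"
    trk_info = cookies.get("trkInfo") or "bf"

    # International hosts such as fr.linkedin.com are mapped to www.linkedin.com.
    host = urlsplit(original_url).netloc
    domain = "www.linkedin.com"
    if host != domain:
        idx = host.find(".linkedin")
        if idx != -1:
            domain = "www" + host[idx:]

    # quote() with these extra safe characters approximates encodeURIComponent.
    session_redirect = quote(original_url, safe="!*'()")
    return (
        "https://" + domain + "/authwall?trk=" + trk + "&trkInfo=" + trk_info
        + "&originalReferer=" + referrer[:200]
        + "&sessionRedirect=" + session_redirect
    )


print(build_authwall_url("https://www.linkedin.com/company/biotech/"))
```

With no cookies and no referrer, this prints the same URL as the one above.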
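Also, you can let Python's requests module maintain a session for you. A session automatically manages cookies and other HTTP headers, so you don't have to handle them yourself. The following should give you the HTML source you are looking for. I'll leave it to you to add BeautifulSoup and parse out what you want.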
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.linkedin.com/authwall?trk=bf&trkInfo=bf&originalReferer=&sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Fbiotech%2F'
with requests.Session() as s:
    # The session stores cookies set by the server and resends them on later requests
    response = s.get(url)
    print(response.content)
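Before handing the response to BeautifulSoup, it may help to check whether you received real page HTML or the JavaScript redirect stub from the question. The helper below is a hypothetical sketch: the name `is_redirect_stub` is made up, and the two marker strings are taken from the output shown in the question, so the check is a heuristic that could stop working if LinkedIn changes that script.

```python
def is_redirect_stub(body):
    """Heuristic check for the JavaScript redirect page shown in the question."""
    # Both markers appear in the script that builds the authwall redirect.
    return "window.location.href" in body and "/authwall?trk=" in body


# A shortened stand-in for the body returned by the original request.
stub = 'window.location.href = "https://" + domain + "/authwall?trk=" + trk;'
print(is_redirect_stub(stub))
print(is_redirect_stub("<html><body><h1>Biotech</h1></body></html>"))
```

In the session code above, you could call `is_redirect_stub(response.text)` and only parse the page when it returns `False`.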
Regarding "python - Web scraping LinkedIn doesn't give me the HTML... What am I doing wrong?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55762784/