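I am developing a scraper that runs through a Chrome extension. It grabs all the HTML on the page and sends it to Python code that filters and saves the data. I am scraping this way because the site uses Distil Networks, which blocks "traditional" scrapers.
I have established a successful connection between the two programs, but whenever I try to send 'Test.' to the Python server, it only prints the browser's headers: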
b'GET / HTTP/1.1 Host: localhost:18364 Connection: Upgrade Pragma: no-cache Cache-Control: no-cache User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36 Upgrade: websocket Origin: chrome-extension://ocplnbpkkcpcomkjioockgnlohhkdeic Sec-WebSocket-Version: 13 Accept-Encoding: gzip, deflate, br Accept-Language: nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7 Sec-WebSocket-Key: SDC7zPgHK/eV+QRSJy0DZQ== Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits'
JavaScript code (client):
chrome.runtime.onMessage.addListener(function(request, sender) {
    if (request.action == "getSource") {
        var pageAmount = parseInt(request.source, 10)
        var allHTML = ""
        var BaseURL = "https://www.funda.nl/huur/rotterdam/p"

        function encode_utf8(s) {
            return unescape(encodeURIComponent(s));
        }

        var websocket = new WebSocket('ws://localhost:18364');
        websocket.onopen = function () {
            data = encode_utf8('Test.')
            websocket.send('Test.');
        };

        message.innerText = request.source;
    }
});

function onWindowLoad() {
    var message = document.querySelector('#message');

    chrome.tabs.executeScript(null, {
        file: "getPageContent.js"
    }, function() {
        // If you try and inject into an extensions page or the webstore/NTP you'll get an error
        if (chrome.runtime.lastError) {
            message.innerText = 'There was an error injecting script : \n' + chrome.runtime.lastError.message;
        }
    });
}

window.onload = onWindowLoad;
Python code (server):
import socket

LocalSocket = socket.socket()
allHTML = ''

try:  # Connecting the Socket
    LocalSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    LocalSocket.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    LocalSocket.bind(('localhost', 18364))
    print("Connected.")
except socket.error as err:
    print("ConnectionError: %s" % err)


def main():
    LocalSocket.listen(1)
    c, addr = LocalSocket.accept()
    print('Got connection from', addr)
    print(c.recv(1024))
    c.close()


if __name__ == "__main__":
    main()
Best answer
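WebSockets are layered on top of HTTP, so this is expected behavior. A WebSocket client always opens the connection with an HTTP upgrade request, and that request is what your raw socket is printing. You need a web server (or something that speaks HTTP) to handle the Connection: Upgrade and Upgrade: websocket parts, and then complete the rest of the handshake, before you get a valid connection that supports bidirectional communication.
You could look at the websockets package for Python, which wraps all of this up nicely.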
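To make the handshake described above concrete, here is a minimal standard-library sketch of what a server has to do before 'Test.' can arrive. It is not the asker's code and not how the websockets package is implemented. The function names (accept_key, handshake_response, decode_text_frame, serve_once) are hypothetical, and the sketch assumes the whole upgrade request and the whole first frame each arrive in a single recv call, and that the first frame is a single unfragmented, masked text frame. A real server should use a library instead.

```python
import base64
import hashlib
import socket
import struct

# Fixed GUID that RFC 6455 defines for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def accept_key(client_key: str) -> str:
    """Compute Sec-WebSocket-Accept from the client's Sec-WebSocket-Key."""
    digest = hashlib.sha1((client_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def handshake_response(request: bytes) -> bytes:
    """Build the 101 response for a raw HTTP upgrade request."""
    headers = {}
    # Skip the request line, then collect "Name: value" header pairs.
    for line in request.decode("latin-1").split("\r\n")[1:]:
        if ":" in line:
            name, value = line.split(":", 1)
            headers[name.strip().lower()] = value.strip()
    key = headers["sec-websocket-key"]
    return (
        "HTTP/1.1 101 Switching Protocols\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Accept: {accept_key(key)}\r\n"
        "\r\n"
    ).encode("ascii")


def decode_text_frame(frame: bytes) -> str:
    """Decode one complete, masked, unfragmented client text frame.

    Simplification: the opcode and mask bit are not validated.
    """
    length = frame[1] & 0x7F
    offset = 2
    if length == 126:  # 16-bit extended payload length
        (length,) = struct.unpack(">H", frame[2:4])
        offset = 4
    elif length == 127:  # 64-bit extended payload length
        (length,) = struct.unpack(">Q", frame[2:10])
        offset = 10
    mask = frame[offset:offset + 4]
    payload = frame[offset + 4:offset + 4 + length]
    # Client-to-server payloads are XOR-masked with the 4-byte key.
    return bytes(b ^ mask[i % 4] for i, b in enumerate(payload)).decode("utf-8")


def serve_once(host: str = "localhost", port: int = 18364) -> str:
    """Accept one client, finish the handshake, return its first text message."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        conn, _addr = server.accept()
        with conn:
            conn.sendall(handshake_response(conn.recv(4096)))
            return decode_text_frame(conn.recv(4096))
```

Calling print(serve_once()) and then triggering the extension should print the message sent by websocket.send instead of the request headers, under the assumptions above. Because the response does not echo Sec-WebSocket-Extensions, the permessage-deflate extension offered by the browser is not negotiated, so the payload arrives uncompressed.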
Regarding "javascript - (Web)socket connection sends headers instead of string", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59292777/