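I have a large CSV file that I need to compare against the URLs listed in a txt file.
How can I group multiple values under the same name in Python and get only unique results? Is there a faster way to compare the two files? The CSV file is at least 1 GB.
file1.csv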
[01/Nov/2019:09:54:26 +0900] ","","102.12.14.22","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","164.16.37.75","52.222.194.116","200","CONNECT","http://www.google.com:443","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","192.10.77.95","21.323.12.96","200","CONNECT","http://www.wakers.com/sg/wew/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","197.99.94.32","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","157.87.34.72","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
file2.txt
1 www.amazon.com shop
1 wakers.com shop
Script:
import csv

with open("file1.csv", "r") as f:
    reader = csv.reader(f)
    for k in reader:
        ko = set()  # re-created for every row, so it never holds more than one pair
        srcip = k[2]
        url = k[6]
        lines = url.replace(":443", "").replace(":8080", "")
        war = lines.split("//")[-1].split("/")[0].split("?")[0]
        ko.add((war, srcip))
        for to in ko:
            # file2.txt is reopened and rescanned for every CSV row
            with open("file2.txt", "r") as f2:
                all_val = set()
                for i in f2:
                    val = i.strip().split(" ")[1]
                    if val in to[0]:
                        all_val.add(to)
                for ki in all_val:
                    print(ki)
My output:
('www.amazon.com', '102.12.14.22')
('www.amazon.com', '167.27.14.62')
('www.wakers.com', '192.10.77.95')
('www.amazon.com', '167.27.14.62')
('www.amazon.com', '197.99.94.32')
('www.amazon.com', '157.87.34.72')
How do I check whether the URLs are the same and collect all the unique values for each one?
How can I get a result like this?
amazon.com 102.12.14.22
           167.27.14.62
           197.99.94.32
           157.87.34.72
wakers.com 192.10.77.95
Best answer
Short answer: you can't do this directly. Well, you can, but performance will be poor.
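Even the direct approach can do less work than the question's script. Here is a minimal sketch, assuming the column positions from the question (source IP in column 2, URL in column 6) and that the domain is the second field of each file2.txt line. The helper names `load_domains`, `host_of` and `group_ips` are made up for illustration. It reads the domain list once, makes a single pass over the CSV rows, and keeps one set of IPs per domain so repeats are dropped. The result is keyed by the entries exactly as they appear in file2.txt.

```python
import csv
from collections import defaultdict


def load_domains(lines):
    """Collect the second whitespace-separated field of each file2.txt line."""
    domains = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            domains.append(parts[1])
    return domains


def host_of(url):
    """Strip scheme, path, query and port from a URL, leaving the host name."""
    return url.split("//", 1)[-1].split("/")[0].split("?")[0].split(":")[0]


def group_ips(csv_lines, domains):
    """Make one pass over the CSV rows and collect unique source IPs per domain."""
    ips_by_domain = defaultdict(set)
    for row in csv.reader(csv_lines):
        if len(row) < 7:
            continue  # skip rows that lack the URL column
        host = host_of(row[6])
        for domain in domains:
            if domain in host:  # same substring test as the question's script
                ips_by_domain[domain].add(row[2])  # a set drops repeated IPs
    return dict(ips_by_domain)


# Inline sample shaped like the question's data; real code would pass open files.
sample_csv = [
    '"[01/Nov/2019:09:54:26 +0900] ","","102.12.14.22","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"',
    '"[01/Nov/2019:09:54:26 +0900] ","","164.16.37.75","52.222.194.116","200","CONNECT","http://www.google.com:443","555976","1508"',
    '"[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"',
    '"[01/Nov/2019:09:54:26 +0900] ","","192.10.77.95","21.323.12.96","200","CONNECT","http://www.wakers.com/sg/wew/","555976","1508"',
    '"[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"',
]
sample_txt = ["1 www.amazon.com shop", "1 wakers.com shop"]

for domain, ips in group_ips(sample_csv, load_domains(sample_txt)).items():
    print(domain, ", ".join(sorted(ips)))
```

With real files, the same functions accept open file objects, for example `group_ips(f1, load_domains(f2))` with `f1` opened as `open("file1.csv", newline="")` and `f2` as `open("file2.txt")`. Because the CSV is streamed row by row, memory use depends on the number of unique (domain, IP) pairs rather than on the 1 GB file size.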
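CSV is a fine storage format, but for this kind of work you may want to store everything in a separate, custom data file. You could first parse the file so that it holds only unique IDs instead of long strings (for example amazon = 0, wakers = 1, and so on), which improves performance and lowers the cost of each comparison.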
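The ID idea can be sketched as follows. This is only an illustration: the `Interner` class is a hypothetical helper, not part of any library, and the sample pairs are taken from the question's output. Each distinct string gets a small integer the first time it is seen, so the deduplicated data becomes a set of integer pairs that can be decoded back to strings when needed. How much this gains in pure Python is something to measure on your own data.

```python
class Interner:
    """Assign each distinct string a small integer ID (hypothetical helper)."""

    def __init__(self):
        self._ids = {}      # string -> ID
        self._values = []   # ID -> string

    def id_of(self, value):
        """Return the ID for value, allocating the next free one if it is new."""
        ident = self._ids.get(value)
        if ident is None:
            ident = len(self._values)
            self._ids[value] = ident
            self._values.append(value)
        return ident

    def value_of(self, ident):
        """Map an ID back to the original string."""
        return self._values[ident]


hosts = Interner()
ips = Interner()

# (host, source IP) pairs as they would come out of the CSV, repeats included.
observed = [
    ("www.amazon.com", "102.12.14.22"),
    ("www.amazon.com", "167.27.14.62"),
    ("www.wakers.com", "192.10.77.95"),
    ("www.amazon.com", "167.27.14.62"),  # repeated row
]

# Store only integer pairs; the set removes the repeated row.
pairs = {(hosts.id_of(h), ips.id_of(ip)) for h, ip in observed}

# Decode back to strings only when producing output.
decoded = sorted((hosts.value_of(h), ips.value_of(i)) for h, i in pairs)
print(decoded)
```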
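The catch is that these techniques work poorly on variable CSV data. Memory-mapping the file, or building a database from the CSV, may also work well (make your changes in the database and dump it back to CSV only when needed).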
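The database option could look roughly like this. It is a sketch using Python's built-in `sqlite3` module, with assumptions of my own: the table name `hits`, its two columns, and the helpers `build_db` and `ips_for` are invented for illustration, and the column positions are those from the question. A `UNIQUE` constraint combined with `INSERT OR IGNORE` drops repeated (host, IP) pairs at load time. Memory mapping is not shown here.

```python
import csv
import sqlite3


def build_db(csv_lines, db_path=":memory:"):
    """Load (host, source IP) pairs from CSV rows into SQLite, skipping repeats."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits ("
        "host TEXT, src_ip TEXT, UNIQUE (host, src_ip))"
    )
    rows = (
        # Host name = URL without scheme, path, query and port.
        (row[6].split("//", 1)[-1].split("/")[0].split("?")[0].split(":")[0], row[2])
        for row in csv.reader(csv_lines)
        if len(row) >= 7
    )
    # OR IGNORE silently skips rows that violate the UNIQUE constraint.
    conn.executemany("INSERT OR IGNORE INTO hits (host, src_ip) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def ips_for(conn, domain):
    """Return the unique source IPs whose host contains the given domain."""
    cur = conn.execute(
        "SELECT src_ip FROM hits WHERE instr(host, ?) > 0 ORDER BY src_ip",
        (domain,),
    )
    return [r[0] for r in cur]


# Inline sample shaped like the question's data; real code would pass an open file.
sample_csv = [
    '"[01/Nov/2019:09:54:26 +0900] ","","102.12.14.22","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"',
    '"[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"',
    '"[01/Nov/2019:09:54:26 +0900] ","","192.10.77.95","21.323.12.96","200","CONNECT","http://www.wakers.com/sg/wew/","555976","1508"',
    '"[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"',
]

conn = build_db(sample_csv)
for domain in ("www.amazon.com", "wakers.com"):
    print(domain, ips_for(conn, domain))
```

Passing a file path instead of `":memory:"` keeps the database on disk, so the 1 GB CSV only has to be parsed once and later lookups run against the database.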
See "How do quickly search through a .csv file in Python" for a more complete answer.
Source: the Stack Overflow question "python - How to get the same name with multiple values and get unique results in Python": https://stackoverflow.com/questions/59819281/