python - How to group multiple values under the same name and get unique results in Python

Tags: python

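I have a large CSV file whose URLs I compare against the entries in a text file.

How can I group the multiple values that share the same name and keep only the unique ones in Python? Also, is there a faster way to compare the two files? The CSV file is at least 1 GB.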

file1.csv

[01/Nov/2019:09:54:26 +0900] ","","102.12.14.22","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","164.16.37.75","52.222.194.116","200","CONNECT","http://www.google.com:443","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","192.10.77.95","21.323.12.96","200","CONNECT","http://www.wakers.com/sg/wew/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","167.27.14.62","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","197.99.94.32","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"
[01/Nov/2019:09:54:26 +0900] ","","157.87.34.72","34.122.104.106","200","CONNECT","http://www.amazon.com/asdd/asd/","555976","1508"

file2.txt

1 www.amazon.com shop
1 wakers.com shop

Script:

import csv
with open("file1.csv", 'r') as f: 
    reader = csv.reader(f)
    for k in reader:
        ko = set()
        srcip = k[2]
        url = k[6]
        lines = url.replace(":443", "").replace(":8080", "")
        war = lines.split("//")[-1].split("/")[0].split('?')[0]
        ko.add((war,srcip))
        for to in ko:
            with open("file2.txt", "r") as f:
                all_val = set()
                for i in f:
                    val = i.strip().split(" ")[1]
                    if val in to[0]:
                        all_val.add(to)
                for ki in all_val:
                  print(ki)

My output:

('www.amazon.com', '102.12.14.22')
('www.amazon.com', '167.27.14.62')
('www.wakers.com', '192.10.77.95')
('www.amazon.com', '167.27.14.62')
('www.amazon.com', '197.99.94.32')
('www.amazon.com', '157.87.34.72')

How can I check whether the URLs are the same and collect all of the unique values for each one?

How do I get a result like this?

amazon.com    102.12.14.22 
              167.27.14.62 
              197.99.94.32
              157.87.34.72
wakers.com    192.10.77.95

Best answer

Short answer: you can't do this directly. Well, you can, but the performance will be poor.

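CSV is a fine storage format, but for this kind of work you may want to keep everything in a separate custom data file. You could first parse the file so that it holds only unique IDs instead of long strings (for example amazon = 0, wakers = 1, and so on), which improves performance and makes comparisons cheaper.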
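The ID idea above can be sketched roughly as follows. This is only an illustration: the `Interner` class and its method names are hypothetical, not part of any library. In CPython the main benefit is likely to be lower memory use, since string hashing is already fast, so any speed gain would need to be measured.

```python
class Interner:
    """Assign a small integer ID to each distinct string, in first-seen order.

    Hypothetical helper for illustration; not a standard-library class.
    """

    def __init__(self):
        self._ids = {}    # string -> integer ID
        self.names = []   # integer ID -> string

    def intern(self, name):
        ident = self._ids.get(name)
        if ident is None:
            ident = len(self.names)
            self._ids[name] = ident
            self.names.append(name)
        return ident


hosts = Interner()
ips = Interner()

# Store each (host, source IP) pair as two small integers instead of two strings.
pairs = set()
for host, ip in [
    ("amazon.com", "102.12.14.22"),
    ("amazon.com", "167.27.14.62"),
    ("wakers.com", "192.10.77.95"),
    ("amazon.com", "167.27.14.62"),  # duplicate, collapses in the set
]:
    pairs.add((hosts.intern(host), ips.intern(ip)))

# Translate the IDs back to strings only when reporting.
readable = sorted((hosts.names[h], ips.names[i]) for h, i in pairs)
```

The set holds three pairs here because the repeated amazon.com row maps to the same two IDs.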

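The problem is that this kind of lookup works very poorly on CSV files, whose rows are variable-length. Memory-mapping the file, or building a database from the CSV, could also work well (make changes in the database and dump back to CSV only when needed).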
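As a rough sketch of the database route, the standard-library `sqlite3` module can load the two columns the question uses (source IP at index 2, URL at index 6) and answer the lookup with `SELECT DISTINCT`. The table name `requests`, the column names, and the helper functions are assumptions made for this example. The `LIKE '%...%'` match imitates the substring test in the question's script; it cannot use an index, and `_` or `%` in a domain would act as wildcards.

```python
import csv
import sqlite3


def build_db(csv_lines, db_path=":memory:"):
    """Load (source IP, URL) pairs from CSV lines into a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS requests (src_ip TEXT, url TEXT)")
    # Stream rows lazily so the whole CSV never has to sit in memory.
    rows = ((r[2], r[6]) for r in csv.reader(csv_lines) if len(r) >= 7)
    conn.executemany("INSERT INTO requests VALUES (?, ?)", rows)
    conn.commit()
    return conn


def ips_for(conn, domain):
    """Return the distinct source IPs whose URL contains `domain`, sorted as text."""
    cur = conn.execute(
        "SELECT DISTINCT src_ip FROM requests WHERE url LIKE ? ORDER BY src_ip",
        ("%" + domain + "%",),
    )
    return [row[0] for row in cur]
```

With a file on disk, something like `build_db(open("file1.csv", newline=""), "requests.db")` would build the database once, after which repeated lookups no longer need to re-read the CSV.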

See "How do quickly search through a .csv file in Python" for a more complete answer.
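For the specific grouping asked about in the question, a single streaming pass over the CSV may already be enough, provided file2.txt is read only once instead of once per row. The sketch below is one possible way to do that, not the answer's own method. The function names are hypothetical, and two choices are assumptions: a leading `www.` is stripped from the file2.txt entries so the keys look like the desired output, and a host matches a domain when it equals it or is a subdomain of it (stricter than the substring test in the question).

```python
import csv
from collections import defaultdict
from urllib.parse import urlsplit


def load_domains(lines):
    """Read the second whitespace-separated field of each line as a domain."""
    domains = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            domain = parts[1].lower()
            if domain.startswith("www."):  # assumption: ignore a leading "www."
                domain = domain[4:]
            domains.append(domain)
    return domains


def host_of(url):
    """Extract the host name from a URL, dropping scheme, port and path."""
    return urlsplit(url if "//" in url else "//" + url).hostname or ""


def group_ips_by_domain(csv_lines, domains):
    """Map each domain to its unique source IPs, in first-seen order."""
    grouped = defaultdict(dict)  # inner dict acts as an insertion-ordered set
    for row in csv.reader(csv_lines):
        if len(row) < 7:
            continue  # skip malformed rows
        host = host_of(row[6])
        for domain in domains:
            if host == domain or host.endswith("." + domain):
                grouped[domain][row[2]] = None
                break
    return {domain: list(ips) for domain, ips in grouped.items()}


def format_grouped(grouped):
    """Render the grouping with the domain shown only on its first line."""
    lines = []
    for domain, ips in grouped.items():
        for i, ip in enumerate(ips):
            label = domain if i == 0 else ""
            lines.append(f"{label:<14}{ip}")
    return "\n".join(lines)


def run(csv_path, domains_path):
    """Process the two files from the question; the CSV is streamed row by row."""
    with open(domains_path) as f:
        domains = load_domains(f)
    with open(csv_path, newline="") as f:
        print(format_grouped(group_ips_by_domain(f, domains)))
```

Calling `run("file1.csv", "file2.txt")` would process the files named in the question. Memory use grows with the number of distinct (domain, IP) pairs rather than with the size of the CSV, and the inner loop over `domains` is only reasonable when file2.txt is short.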

This question, "python - How to group multiple values under the same name and get unique results in Python", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/59819281/
