I have a CSV file generated by Splunk, formatted similar to the following:
Category,URL,Hash,ID,"__mv_Hash","_mkv_ID"
binary,somebadsite.com/file.exe,12345abcdef,123,,,
callback,bad.com,,567,,,
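What I need to do is iterate over the CSV file, preserving the header order, and take a different action depending on whether a row's Category is binary or callback. For this example, if the result is a binary I return an arbitrary "Clean" or "Dirty" rating, and if it is a callback I just print out the details.
Below is the code I currently plan to use, but I am new to Python and would like feedback on it, including whether there is a better way to accomplish this. I am also not entirely clear on the difference between how I handle a result when it is a binary: for k in (k for k in r.fieldnames if (not k.startswith("""__mv_""") and not k.startswith("""_mkv_""")))
and how I handle it when it is not. Both achieve the same result, so what is the benefit of one over the other?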
import gzip
import csv
import json

csv_file = 'test_csv.csv.gz'


class GZipCSVReader:
    def __init__(self, filename):
        # Text mode so csv.DictReader receives str rather than bytes (Python 3)
        self.gzfile = gzip.open(filename, 'rt', newline='')
        self.reader = csv.DictReader(self.gzfile)
        self.fieldnames = self.reader.fieldnames

    def __next__(self):
        return next(self.reader)

    def close(self):
        self.gzfile.close()

    def __iter__(self):
        return iter(self.reader)


def get_rating(hash):
    if hash == "12345abcdef":
        rating = "Dirty"
    else:
        rating = "Clean"
    return hash, rating


def print_callback(result):
    print(json.dumps(result, sort_keys=True, indent=4, separators=(',', ':')))


def process_results_content(r):
    for row in r:
        values = {}
        values_misc = {}
        if row["Category"] == "binary":
            # Iterate through key:value pairs and add to dictionary
            for k in (k for k in r.fieldnames if (not k.startswith("""__mv_""") and not k.startswith("""_mkv_"""))):
                v = row[k]
                values[k] = v
            rating = get_rating(row["Hash"])
            if rating[1] == "Dirty":
                print(rating)
        else:
            for k in r.fieldnames:
                if not k.startswith("""__mv_""") and not k.startswith("""_mkv_"""):
                    v = row[k]
                    values_misc[k] = v
            print_callback(values_misc)
    r.close()


if __name__ == '__main__':
    r = GZipCSVReader(csv_file)
    process_results_content(r)
Finally, would a for...else loop be better than doing something like if row["Category"] == "binary"? For example, could I do something like this:
def process_results_content(r):
    for row in r:
        values = {}
        values_misc = {}
        for k in (k for k in r.fieldnames if (not row["Category"] == "binary")):
            v = row[k]
            ...
        else:
            v = row[k]
            ...
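This seems like the same logic: the first clause would catch anything that is not a binary and the second would catch everything else. But it does not seem to produce the correct result.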
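For reference, here is a small sketch of how Python's for...else behaves, which may explain the wrong result. The else block of a loop is not a second branch of the filter; it runs once after the loop finishes without hitting break. The function name first_binary and the sample rows are hypothetical, made up for illustration only.

```python
def first_binary(rows):
    """Return the first row whose Category is 'binary', or None."""
    for row in rows:
        if row["Category"] == "binary":
            found = row
            break
    else:
        # Runs only when the loop ended without `break`.
        found = None
    return found


# Hypothetical sample rows, loosely based on the CSV in the question.
rows = [
    {"Category": "callback", "URL": "bad.com"},
    {"Category": "binary", "URL": "somebadsite.com/file.exe"},
]

print(first_binary(rows))      # the binary row
print(first_binary(rows[:1]))  # None: no binary row, so the else block ran

# The filter in the question tests the row, not the key, so for a binary
# row it rejects every field name and the loop body never executes.
fieldnames = ["Category", "URL"]
binary_row = rows[1]
kept = [k for k in fieldnames if not binary_row["Category"] == "binary"]
print(kept)  # []
```

Because the loop in the question contains no break, its else block would run for every row, binary or not, which is why an explicit if/else on row["Category"] is the clearer choice there.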
Best answer
I would use the pandas library.
Code:
import pandas as pd

csv_file = 'test_csv.csv'
df = pd.read_csv(csv_file)
df = df[["Category", "URL", "Hash", "ID"]]  # Remove the other columns.
get_rating = lambda x: "Dirty" if x == "12345abcdef" else "Clean"
df["Rating"] = df["Hash"].apply(get_rating)  # Assign a value to each row based on Hash value.
print(df)
j = df.to_json()  # Self-explanatory. :)
print(j)
Result:
   Category                       URL         Hash   ID Rating
0    binary  somebadsite.com/file.exe  12345abcdef  123  Dirty
1  callback                   bad.com          NaN  567  Clean
{"Category":{"0":"binary","1":"callback"},"URL":{"0":"somebadsite.com\/file.exe","1":"bad.com"},"Hash":{"0":"12345abcdef","1":null},"ID":{"0":123,"1":567},"Rating":{"0":"Dirty","1":"Clean"}}
If this is your desired result, just substitute the above into your GZipCSVReader, since I did not emulate opening the gzip file.
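If pandas is not an option, the gzip side can be sketched with the standard library alone. The following is a minimal sketch, not the answerer's method: it opens the gzip file in text mode, decides once which columns to keep (which also shows that the generator-expression filter and the loop-plus-if filter in the question are equivalent), and dispatches on Category. The names process_results, HIDDEN_PREFIXES, and SAMPLE are hypothetical, and the sample rows are trimmed to six fields so they match the header width.

```python
import csv
import gzip
import json
import os
import tempfile

# Splunk helper-column prefixes taken from the question.
HIDDEN_PREFIXES = ("__mv_", "_mkv_")


def get_rating(hash_value):
    # Placeholder rule copied from the question, not a real lookup.
    return "Dirty" if hash_value == "12345abcdef" else "Clean"


def process_results(path):
    """Yield one (category, result) pair per CSV row."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle)
        # Filter the header once; list order preserves the header order.
        keep = [k for k in reader.fieldnames if not k.startswith(HIDDEN_PREFIXES)]
        for row in reader:
            values = {k: row[k] for k in keep}
            if row["Category"] == "binary":
                yield "binary", (row["Hash"], get_rating(row["Hash"]))
            else:
                yield row["Category"], json.dumps(values, indent=4)


# Assumed sample data: the question's rows with one trailing comma removed.
SAMPLE = (
    'Category,URL,Hash,ID,"__mv_Hash","_mkv_ID"\n'
    "binary,somebadsite.com/file.exe,12345abcdef,123,,\n"
    "callback,bad.com,,567,,\n"
)

path = os.path.join(tempfile.mkdtemp(), "test_csv.csv.gz")
with gzip.open(path, "wt", newline="") as handle:
    handle.write(SAMPLE)

results = list(process_results(path))
for category, result in results:
    print(category, result)
```

Using a with block closes the file even if a row raises an error, so the separate close() method is no longer needed. As a side note, pandas.read_csv infers gzip compression from a .gz filename by default, so the pandas version above may be able to read the compressed file directly.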
Source: "python - Most efficient way to parse a CSV and take action based on row content", a similar question found on Stack Overflow: https://stackoverflow.com/questions/31283953/