python - Most efficient way to use a Pandas pivot table on a large file

Tags: python multithreading python-3.x pandas multiprocessing

I am iterating over a number of exported security event logs pulled from Windows hosts. A sample of the data looks like this:

"MachineName","EventID","EntryType","Source","TimeGenerated","TimeWritten","UserName","Message"
"mycompname","5156","SuccessAudit","Microsoft-Windows-Security-Auditing","4/26/2017 10:47:41 AM","4/26/2017 10:47:41 AM",,"The Windows Filtering Platform has permitted a connection.    Application Information:   Process ID:  4   Application Name: System    Network Information:   Direction:  %%14592   Source Address:  192.168.10.255   Source Port:  137   Destination Address: 192.168.10.238   Destination Port:  137   Protocol:  17    Filter Information:   Filter Run-Time ID: 83695   Layer Name:  %%14610   Layer Run-Time ID: 44"
"mycompname","4688","SuccessAudit","Microsoft-Windows-Security-Auditing","4/26/2014 10:47:03 AM","4/26/2014 10:47:03 AM",,"A new process has been created.    Subject:   Security ID:  S-1-5-18   Account Name:  mycompname$   Account Domain:  mydomain   Logon ID:  0x3e7    Process Information:   New Process ID:  0x1b04   New Process Name: C:\Windows\SysWOW64\Macromed\Flash\FlashPlayerUpdateService.exe   Token Elevation Type: %%1936   Creator Process ID: 0x300   Process Command Line: C:\windows\SysWOW64\Macromed\Flash\FlashPlayerUpdateService.exe    Token Elevation Type indicates the type of token that was assigned to the new process in accordance with User Account Control policy.    Type 1 is a full token with no privileges removed or groups disabled.  A full token is only used if User Account Control is disabled or if the user is the built-in Administrator account or a service account.    Type 2 is an elevated token with no privileges removed or groups disabled.  An elevated token is used when User Account Control is enabled and the user chooses to start the program using Run as administrator.  An elevated token is also used when an application is configured to always require administrative privilege or to always require maximum privilege, and the user is a member of the Administrators group.    Type 3 is a limited token with administrative privileges removed and administrative groups disabled.  The limited token is used when User Account Control is enabled, the application does not require administrative privilege, and the user does not choose to start the program using Run as administrator."
"mycompname","4673","SuccessAudit","Microsoft-Windows-Security-Auditing","4/26/2014 10:47:00 AM","4/26/2014 10:47:00 AM",,"A privileged service was called.    Subject:   Security ID:  S-1-5-18   Account Name:  mycompname$   Account Domain:  mydomain   Logon ID:  0x3e7    Service:   Server: NT Local Security Authority / Authentication Service   Service Name: LsaRegisterLogonProcess()    Process:   Process ID: 0x308   Process Name: C:\Windows\System32\lsass.exe    Service Request Information:   Privileges:  SeTcbPrivilege"

I transform it by extracting the key:value pairs from the "Message" column and turning the keys into columns, as shown below:

def myfunc(folder):
    file = ''.join(glob2.glob(folder + "\\*security*"))
    df = pd.read_csv(file) 
    df.message = df.message.replace(["[ ]{6}", "[ ]{3}"],[","," ||"], regex=True)
    message_results = df.message.str.extractall(r"\|([^\|]*?):(.*?)\|").reset_index()
    message_results.columns = ["org_index", "match", "keys", "vals"]
    # PART THAT TAKES THE LONGEST
    p = pd.pivot_table(message_results, values="vals", columns=['keys'], index=["org_index"], aggfunc=np.sum)
    df = df.join(p).fillna("NONE")

Output of the above function:

MachineName,EventID,EntryType,Source,TimeGenerated,TimeWritten,UserName,Message, Application Information, Filter Information, Network Information, Process, Process Information, Service, Service Request Information, Subject,Account Domain,Account Name,Application Name,Creator Process ID,Destination Address,Destination Port,Direction,Filter Run-Time ID,Layer Name,Logon ID,New Process ID,New Process Name,Process Command Line,Process ID,Process Name,Protocol,Security ID,Server,Service Name,Source Address,Source Port,Token Elevation Type
mycompname,5156,SuccessAudit,Microsoft-Windows-Security-Auditing,4/26/2017 10:47:41 AM,4/26/2017 10:47:41 AM,NONE,The Windows Filtering Platform has permitted a connection. || Application Information: ||Process ID:  4 ||Application Name: System || Network Information: ||Direction:  %%14592 ||Source Address:  192.168.10.255 ||Source Port:  137 ||Destination Address: 192.168.10.238 ||Destination Port:  137 ||Protocol:  17 || Filter Information: ||Filter Run-Time ID: 83695 ||Layer Name:  %%14610 ||Layer Run-Time ID: 44, , , ,NONE,NONE,NONE,NONE,NONE,NONE,NONE, System ,NONE, 192.168.10.238 ,  137 ,  %%14592 , 83695 ,  %%14610 ,NONE,NONE,NONE,NONE,  4 ,NONE,  17 ,NONE,NONE,NONE,  192.168.10.255 ,  137 ,NONE
mycompname,4688,SuccessAudit,Microsoft-Windows-Security-Auditing,4/26/2017 10:47:03 AM,4/26/2017 10:47:03 AM,NONE,"A new process has been created. || Subject: ||Security ID:  S-1-5-18 ||Account Name:  mycompname$ ||Account Domain:  mydomain ||Logon ID:  0x3e7 || Process Information: ||New Process ID:  0x1b04 ||New Process Name: C:\Windows\SysWOW64\Macromed\Flash\FlashPlayerUpdateService.exe ||Token Elevation Type: %%1936 ||Creator Process ID: 0x300 ||Process Command Line: C:\windows\SysWOW64\Macromed\Flash\FlashPlayerUpdateService.exe || Token Elevation Type indicates the type of token that was assigned to the new process in accordance with User Account Control policy. || Type 1 is a full token with no privileges removed or groups disabled.  A full token is only used if User Account Control is disabled or if the user is the built-in Administrator account or a service account. || Type 2 is an elevated token with no privileges removed or groups disabled.  An elevated token is used when User Account Control is enabled and the user chooses to start the program using Run as administrator.  An elevated token is also used when an application is configured to always require administrative privilege or to always require maximum privilege, and the user is a member of the Administrators group. || Type 3 is a limited token with administrative privileges removed and administrative groups disabled.  The limited token is used when User Account Control is enabled, the application does not require administrative privilege, and the user does not choose to start the program using Run as administrator.",NONE,NONE,NONE,NONE, ,NONE,NONE, ,  mydomain ,  MEADWK4216DC190$ ,NONE, 0x300 ,NONE,NONE,NONE,NONE,NONE,  0x3e7 ,  0x1b04 , C:\Windows\SysWOW64\Macromed\Flash\FlashPlayerUpdateService.exe , C:\windows\SysWOW64\Macromed\Flash\FlashPlayerUpdateService.exe ,NONE,NONE,NONE,  S-1-5-18 ,NONE,NONE,NONE,NONE, %%1936 
mycompname,4673,SuccessAudit,Microsoft-Windows-Security-Auditing,4/26/2017 10:47:00 AM,4/26/2017 10:47:00 AM,NONE,A privileged service was called. || Subject: ||Security ID:  S-1-5-18 ||Account Name:  mycompname$ ||Account Domain:  mydomain ||Logon ID:  0x3e7 || Service: ||Server: NT Local Security Authority / Authentication Service ||Service Name: LsaRegisterLogonProcess() || Process: ||Process ID: 0x308 ||Process Name: C:\Windows\System32\lsass.exe || Service Request Information: ||Privileges:  SeTcbPrivilege,NONE,NONE,NONE, ,NONE, , , ,  mydomain ,  mycompname$ ,NONE,NONE,NONE,NONE,NONE,NONE,NONE,  0x3e7 ,NONE,NONE,NONE, 0x308 , C:\Windows\System32\lsass.exe ,NONE,  S-1-5-18 , NT Local Security Authority / Authentication Service , LsaRegisterLogonProcess() ,NONE,NONE,NONE

The function works, but on larger data sets (around 150,000 rows) the p = pd.pivot_table(...) part of the code is very slow.

I am currently using concurrent.futures.ThreadPoolExecutor(max_workers=1000) to iterate over the reading of each file, as shown below:

with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as pool:
    for folder in path:
        if os.path.isdir(folder):
            try:
                print(folder)
                pool.submit(myfunc(folder), 1000)
            except:
                print('error') 

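How can I speed up the pivot table part of the function?

Also, is there any way to speed up the pandas pivot_table call in general?

Any help with this would be greatly appreciated. Thank you.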

Best answer

Syntax errors

Your code has a number of syntax errors:

pool.submit(myfunc(folder), 1000)

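The pool.submit method takes a function as its first argument.

From what I can see, your function myfunc does not return anything, so what you are passing to submit is the result of calling myfunc(folder), which is definitely not a function.

Even so, as far as I understand, you are trying to start 1000 workers that all read the same folder and then create a data frame.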
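A minimal sketch of the difference, using a stand-in function (process_folder is hypothetical and is not the asker's myfunc). submit takes the callable and its arguments separately and returns a Future; anything raised in the worker is stored on that Future rather than printed, which is one reason the try/except in the question would not report the problem.

```python
from concurrent.futures import ThreadPoolExecutor


def process_folder(folder):
    # Hypothetical stand-in for myfunc: it returns a value the caller can use.
    return f"processed {folder}"


with ThreadPoolExecutor(max_workers=2) as pool:
    # Wrong: process_folder("a") runs right here in the calling thread,
    # and its return value (a str, not a callable) is what gets submitted.
    bad = pool.submit(process_folder("a"), 1000)

    # Right: pass the function itself, followed by its arguments.
    good = pool.submit(process_folder, "a")

# Errors raised inside a worker are stored on the Future and only
# surface when result() or exception() is called.
bad_error = bad.exception()
good_value = good.result()
```

Here bad_error is a TypeError (the submitted object is not callable), while good_value holds the string returned by process_folder.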

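Parallelization problems

In any threaded scenario, the number of workers should be close to the number of cores available on the machine you are running on. This is common knowledge and I am not going to cite anything.

Spawning 1000 workers creates a lot of overhead and is probably the cause of your slowness. In addition, all of your workers seem to be doing exactly the same thing, which of course means you are doing the same work 1000 times.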
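As a rough sketch of that sizing advice, the pool below is capped near the core count and receives exactly one task per folder. The folder names and process_folder are illustrative assumptions, not taken from the question. For CPU-bound pandas work, threads may also be held back by Python's global interpreter lock; concurrent.futures.ProcessPoolExecutor exposes the same submit interface if that turns out to matter.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_folder(folder):
    # Hypothetical stand-in for the per-folder work (read the CSV, pivot, ...).
    return folder.upper()


folders = ["host_a", "host_b", "host_c"]  # illustrative folder names

# Cap the pool near the core count instead of using a fixed 1000.
workers = min(len(folders), os.cpu_count() or 1)

results = {}
with ThreadPoolExecutor(max_workers=workers) as pool:
    # One task per folder, so no piece of work is repeated.
    futures = {pool.submit(process_folder, f): f for f in folders}
    for fut in as_completed(futures):
        # result() re-raises any exception from the worker at this point.
        results[futures[fut]] = fut.result()
```

Collecting the futures in a dict keyed by Future keeps track of which folder each result belongs to, since as_completed yields them in completion order rather than submission order.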

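My guess at the actual pivot problem

From what you wrote (apart from the code), I understand that you are trying to create a huge key space that lets you slice by any metric and drill down into the data set.

From what I can see, you are doing this with a single column. You should split it out into separate columns. As a commenter hinted, pandas has categorical columns that can be used, but even without them the index of the key space will be much smaller if the key parts are in separate columns. Your current data set most probably has a separate key for almost every row, so no more than a few rows are ever aggregated together, which makes the pivot table about the same size as the original data set.

TLDR;

Split your key column into multiple columns, preferably categorical ones.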
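One possible reading of this advice, offered as an assumption rather than as the answerer's stated method: store the repeated key names as a categorical, and, since almost nothing is actually aggregated, reshape with set_index and unstack instead of running an aggregation function. The tiny frame below is an illustrative stand-in for message_results from the question. The sketch assumes each key should appear at most once per event and keeps the first occurrence, whereas aggfunc=np.sum would concatenate duplicate string values.

```python
import pandas as pd

# Tiny stand-in for message_results from the question (values are illustrative).
message_results = pd.DataFrame({
    "org_index": [0, 0, 1, 1],
    "keys": ["Process ID", "Protocol", "Process ID", "Logon ID"],
    "vals": ["4", "17", "0x308", "0x3e7"],
})

# Store the repeated key names as a categorical instead of Python strings.
message_results["keys"] = message_results["keys"].astype("category")

# Assumption: one value per (event, key). Keep the first occurrence,
# then reshape the keys into columns without any aggregation step.
wide = (
    message_results
    .drop_duplicates(["org_index", "keys"])
    .set_index(["org_index", "keys"])["vals"]
    .unstack("keys")
)
```

The result has one row per org_index and one column per key, with NaN where an event lacks a key, so it can be joined back to the original frame the same way p is in the question. Whether this is faster on 150,000 rows would need to be measured on the real data.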

Regarding "python - Most efficient way to use a Pandas pivot table on a large file", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43728304/
