python - Looping or hierarchical dictionary-like data structures in Python (or CS in general)?

Tags: python pandas dataframe dictionary data-structures

I'm currently at a loss: I have a huge Pandas DataFrame (over 1 million rows), and I'm looking at 3 columns, namely the following:

Company_Name     Business_ID    Location
ABC, Inc.         BY2389AS        MTV
ABC, Inc.          100020         LMD
XYZW               010012         MTV
XYZW               010012         LMD
XYZW              AB23JKF         QAT
                  BA23F3              
SomethingCo        2342
SomethingCo                       ALD

As you can see, some fields are sometimes missing. I want to check against a given registry (which contains millions of (Company_Name, Business_ID, Location) triples in CSV format) and, if a unique match exists, fill in the missing fields.

The registry looks like this in CSV format:

Company_Name, Business_ID, Location
ABC, Inc., BY2389AS, MTV
ABC, Inc., 100020, LMD
XYZW, 010012, MTV
XYZW, 010012, LMD
XYZW, AB23JKF, QAT
DLCComp, BA23F3, PLT
DLCComp, 234XYZ, QAT            
SomethingCo, 2342, COD
SomethingCo, 2020 , ALD

As shown above, nothing is missing in this CSV file.

It's worth noting that dataframe groupbys, pivot tables, stack/unstack, and even logical lookups that select a subset of the dataframe all slow things down, because scanning the whole registry takes a long time. The logic I want to apply is: if some fields are missing, check the registry for a unique match and fill in the missing fields; otherwise, if no unique match can be identified, return the row as-is.

A dictionary lookup seems ideal, but since any combination of these 3 fields can be missing, I can't build a dictionary from this huge registry dataframe (which I read into memory for current purposes) keyed on just one of the columns.

I'm trying to figure out what data structure would handle this best, and I've considered looping or hierarchical dictionaries (if such things exist), but that also gets complicated, because the lookup has to adapt to different key-value pairings depending on which values are missing from my data versus the registry. Rearranging or subsetting the dataframe based on whichever column values are available takes a while each time, so I'm trying to find out whether there is a good data structure or algorithm for this. I'd also like to know whether Pandas has a really efficient way to subset a dataframe through logical operations (which is what I'm currently doing), like this:

# Example for if Business_ID is populated, but both Company_Name and Location are not:
def specific_case_func_for_demo_purposes(company_name, business_id, location):
    if not company_name and business_id and not location:
        subset_df = registry_df[registry_df['Business_ID'] == business_id]

        if len(subset_df) == 0:
            return company_name, business_id, location
        elif len(subset_df) == 1:
            return subset_df['Company_Name'].iloc[0], business_id, subset_df['Location'].iloc[0]
        else:
            # handle case when there are multiple business_id matches by seeing if company name is unique, since company name can be identified by business ID:
            if len(subset_df['Company_Name'].unique()) == 1:
                return subset_df['Company_Name'].iloc[0], business_id, location
            else:
                # can't possibly know if there is a unique match - so just return the empty company_name and location
                return company_name, business_id, location

This is just one function handling this particular case, where Business_ID is populated but Company_Name and Location are not. As you can see, this can get complicated. I'm currently handling all 8 cases (some of which reduce to duplicates or are essentially the same, so in total there are about 4 cases plus a couple of sub-cases), but this seems very inefficient, both in design and in performance. Using a subset of the registry CSV with 800,000 rows and running this logic on about 400 data points, %timeit reports 35 seconds with a std dev of 128 ms; I timed df.apply with the main function I designed.
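For what it's worth, one way to keep the eight cases organized without eight near-identical functions is a dispatch table keyed on which fields are present. A minimal sketch (assuming missing fields arrive as NaN/None or empty strings; the handler mapping below is hypothetical, reusing specific_case_func_for_demo_purposes from above):

import pandas as pd

def is_present(value):
    # Treat NaN/None and empty strings as missing (an assumption about the data).
    return (not pd.isna(value)) and bool(value)

# Map each (company?, business_id?, location?) presence pattern to a handler;
# only the patterns that need a registry lookup get an entry.
handlers = {
    (False, True, False): specific_case_func_for_demo_purposes,
    # ... the remaining patterns go here
}

def fill_missing(company_name, business_id, location):
    pattern = (is_present(company_name), is_present(business_id), is_present(location))
    handler = handlers.get(pattern)
    if handler is None:
        # Fully populated (or nothing recoverable): return the row unchanged.
        return company_name, business_id, location
    return handler(company_name, business_id, location)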

I'm looking for a data structure that offers good lookup and set times when applied to larger dataframes. I know there must be a more efficient way, whether that's using Pandas methods effectively or a different data structure such as mutual-lookup dictionaries (or looping or hierarchical dictionaries, if those concepts exist), possibly a tree-based approach, or something else. I welcome any input on this.

Best Answer

Solution 1 - Optimising the dataframe search

import sys
import random
import pandas as pd
# Creating dummy data for 1 million rows
number_of_rows = 1000000 # 1 million
list_Temp = [('abcdefghi_'+str(i), 'jklmnopqrs_'+str(i), 'tuvwxyz_'+str(i)) for i in range(number_of_rows)]

# This is your registry_df
df = pd.DataFrame(list_Temp, columns = ['A' , 'B', 'C']) 
print('Dummy registry_df size =', (sys.getsizeof(df)/1000000), 'MB')
df.head()

Output:

Dummy registry_df size = 217.666774 MB

        A               B             C
0   abcdefghi_0   jklmnopqrs_0    tuvwxyz_0
1   abcdefghi_1   jklmnopqrs_1    tuvwxyz_1
2   abcdefghi_2   jklmnopqrs_2    tuvwxyz_2
3   abcdefghi_3   jklmnopqrs_3    tuvwxyz_3
4   abcdefghi_4   jklmnopqrs_4    tuvwxyz_4


# Creating 'n' random numbers for searching in registry_df
n = 100
list_random_numbers = [random.randrange(0, number_of_rows) for i in range(n)]
# Time taken to search for n values in registry_df; pick the fastest of the options below

%timeit for number in list_random_numbers: df_Temp = df.loc[df['A'] == 'abcdefghi_'+str(number)]
%timeit for number in list_random_numbers: df_Temp = df[df['A'] == 'abcdefghi_'+str(number)]
%timeit for number in list_random_numbers: df_Temp = df[df.A.eq('abcdefghi_'+str(number))]
%timeit for number in list_random_numbers: df_Temp = df.query("A == 'abcdefghi_" + str(number) + "'")
%timeit for number in list_random_numbers: df_Temp = df[df.eval("A == 'abcdefghi_" + str(number) + "'")]
%timeit for number in list_random_numbers: df_Temp = df[df.A.isin(['abcdefghi_'+str(number)])]

Output:

11.9 s ± 338 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
11.4 s ± 441 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
13 s ± 756 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.57 s ± 384 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
5.22 s ± 122 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.99 s ± 140 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
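
One more option worth noting, not among those timed above: if the lookups always hit the same column, moving that column into the index makes each lookup hash-based instead of a full column scan. A minimal sketch, assuming the dummy df above:

# Build the hashed index once; drop=False keeps column 'A' in the frame too.
df_indexed = df.set_index('A', drop=False)

key = 'abcdefghi_42'
if key in df_indexed.index:
    # Passing a list to .loc always returns a DataFrame, which also covers
    # the case where the same key appears on several rows.
    subset_df = df_indexed.loc[[key]]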

Solution 2 - If you have a bit of extra memory, store row indexes in a dictionary per column

# Searching the dataframe using indexes
# For demo: creating a list of 'n' random numbers to search registry_df by index

n = 100
list_random_numbers= [random.randrange(0, number_of_rows) for i in range(n)]

# Fetching rows based on index

%timeit for number in list_random_numbers: row = df[df.index.isin([number])]
%timeit for number in list_random_numbers: row = df.iloc[[number]]

Output:

948 ms ± 64.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
27.7 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Creating dictionaries that hold the row indexes

dict_Company_Name ={}
dict_Business_ID = {}
dict_Location = {}

def Create_Dictionary_From_DataFrame(df, dict_Company_Name, dict_Business_ID, dict_Location):
    for row_Index in range(len(df)):
        company_name, business_id, location = df.iloc[row_Index]
        dict_Company_Name.setdefault(company_name, []).append(row_Index)
        dict_Business_ID.setdefault(business_id, []).append(row_Index)
        dict_Location.setdefault(location, []).append(row_Index)

%time Create_Dictionary_From_DataFrame(df, dict_Company_Name, dict_Business_ID, dict_Location)

Wall time: 2min 12s
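
Most of those 2+ minutes go into the row-wise df.iloc access. A minimal sketch of the same construction with itertuples, which avoids building a pandas object per row and usually iterates much faster (assuming, as above, a default RangeIndex so the tuple's first element matches the positional index used by df.iloc):

# Same three dictionaries, but iterating with itertuples instead of df.iloc.
# With name=None each row arrives as a plain (index, col1, col2, col3) tuple.
def Create_Dictionary_From_DataFrame_Fast(df, dict_Company_Name, dict_Business_ID, dict_Location):
    for row_Index, company_name, business_id, location in df.itertuples(name=None):
        dict_Company_Name.setdefault(company_name, []).append(row_Index)
        dict_Business_ID.setdefault(business_id, []).append(row_Index)
        dict_Location.setdefault(location, []).append(row_Index)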

# Approximate size in MB (sys.getsizeof counts only the dict object itself,
# not the index lists it holds). For you this will vary, as I created dummy data
print((sys.getsizeof(dict_Company_Name))/1000000)
print((sys.getsizeof(dict_Business_ID))/1000000)
print((sys.getsizeof(dict_Location))/1000000)

41.943144
41.943144
41.943144
# Searching based on indexes

n = 10000
list_random_numbers = [random.randrange(0, number_of_rows) for i in range(n)]

def specific_case_func_business_id():

    for row_Index in list_random_numbers:
        business_id = 'jklmnopqrs_' + str(row_Index)

        if business_id in dict_Business_ID:
            # Fetch all row indexes for this Business_ID, then subset the dataframe
            list_Indexes_Business_ID = dict_Business_ID[business_id]
            subset_df = df.iloc[list_Indexes_Business_ID]
            """
            Your case specific code
            """
        else:
            raise Exception('No matching data found for Business ID = ' + business_id)

%timeit specific_case_func_business_id()

Output:

2.99 s ± 278 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

As the dataframe-search timings show, even the most optimised method takes about 4 seconds just to search for 100 values, whereas searching by index handles all 10,000 values in about 3 seconds.

Also, if you have 2 known values and 1 missing value to look up, fetch the index lists from the respective dictionaries, take the common indexes with the following function, and finally build subset_df from the list of common indexes:

def Get_Common_List_Values(list1, list2):
    # Returns the intersection of the two index lists; if only one list is
    # available, that list is returned unchanged (two Nones fall through to None)
    if (list1 is None) and (list2 is not None): return list2
    if (list2 is None) and (list1 is not None): return list1

    if (list2 is not None) and (list1 is not None):
        return [row_Index for row_Index in list1 if row_Index in list2]
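
For example, if Company_Name and Location are known but Business_ID is missing, the lookup might look like this (a sketch reusing the dictionaries built above; the example values come from the question's registry, and the registry column is assumed to be named 'Business_ID'):

# Hypothetical lookup: Company_Name and Location known, Business_ID missing.
list1 = dict_Company_Name.get('ABC, Inc.')  # None when the name is not in the registry
list2 = dict_Location.get('MTV')

common_indexes = Get_Common_List_Values(list1, list2)
if common_indexes:
    subset_df = df.iloc[common_indexes]
    if len(subset_df) == 1:
        # Exactly one registry row matches, so the missing field is recoverable.
        business_id = subset_df['Business_ID'].iloc[0]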

Note: all timings were measured on dummy data; your actual times will vary.

Regarding python - Looping or hierarchical dictionary-like data structures in Python (or CS in general)?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59872154/
