python - Pandas - Matching reference numbers to find the earliest date

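I would appreciate your input on optimization. I am still learning more and more about Python and using it in my day-to-day role as an operations analyst. One of my tasks is to go through roughly 60k unique record identifiers and look each one up in another data frame of roughly 120k interaction records, which also holds the employee who authored each interaction and the time it occurred.

For reference, the two data frames currently look like this:

main_data = unique identifiers only
nok_data = author name, unique identifier (called Case File Identifier), note text, created on

My current setup processes about 2,500 rows per minute, so a full run takes roughly 25-30 minutes. What I am curious about is whether any of the steps I perform are:

redundant or inefficient, slowing down the process overall
poor use of syntax due to my lack of knowledge

Below is my code: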

import csv

import pandas as pd

nok_data = pd.read_csv("raw nok data.csv")  # Data set from warehouse

main_data = pd.read_csv("exampledata.csv")  # Data set taken from iTx ids from referral view

row_count = 0
error_count = 0
print(nok_data.columns.values.tolist())
print(main_data.columns.values.tolist())  # Used to grab header titles if needed.
data_length = len(main_data)  # Used for counting how many records are left.
earliest_nok = {}
nok_data["Created On"] = pd.to_datetime(nok_data["Created On"])  # Convert all dates to datetime at the beginning.


for row in main_data["iTx Case ID"]:
    list_data = []
    nok = nok_data["Case File Identifier"] == row  # Boolean mask: scans all of nok_data for every ID
    matching_dates = nok_data.loc[nok, ["Created On", "Authored By Name"]]  # Keep only the rows where the mask is True
    if len(matching_dates) > 0:
        try:
            min_dates = matching_dates.min(axis=0)  # Column-wise minimum of each column
            earliest_nok[row] = [min_dates["Created On"], min_dates["Authored By Name"]]
        except ValueError:
            error_count += 1
            earliest_nok[row] = None

    row_count += 1
    print("{} out of {} records".format(row_count, data_length))


with open("finaloutput.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for key, value in earliest_nok.items():
        writer.writerow([key, value])

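I am looking for any advice or expertise from those who have been writing this kind of code far longer than I have. Thank you all for taking the time to read this. Happy Tuesday,

Andy M.

**** Edited on request to show data. Sorry, it was a rookie move on my part not to include any sample data.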

main_data example

ITX Case ID
2017-023597
2017-023594
2017-023592
2017-023590

nok_data, a.k.a. "raw nok data.csv"

Authored By:   Case File Identifier:   Note Text:    Authored on
John Doe         2017-023594           Random Text     4/1/2017 13:24:35
John Doe         2017-023594           Random Text     4/1/2017 13:11:20
Jane Doe         2017-023590           Random Text     4/3/2017 09:32:00
Jane Doe         2017-023590           Random Text     4/3/2017 07:43:23
Jane Doe         2017-023590           Random Text     4/3/2017 7:41:00
John Doe         2017-023592           Random Text     4/5/2017 23:32:35
John Doe         2017-023592           Random Text     4/6/2017 00:00:35

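Best answer

It looks like you want to group by the Case File Identifier and get the minimum date together with the corresponding author.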

# Sort the data by `Case File Identifier:` and `Authored on` date
# so that you can easily get the author corresponding to the min date using `first`.

nok_data.sort_values(['Case File Identifier:', 'Authored on'], inplace=True)
df = (
    nok_data[nok_data['Case File Identifier:'].isin(main_data['ITX Case ID'])]
    .groupby('Case File Identifier:')[['Authored on', 'Authored By:']].first()
)
d = {k: [v['Authored on'], v['Authored By:']] for k, v in df.to_dict('index').items()}

>>> d
{'2017-023590': ['4/3/17 7:41', 'Jane Doe'],
 '2017-023592': ['4/5/17 23:32', 'John Doe'],
 '2017-023594': ['4/1/17 13:11', 'John Doe']}

>>> df
                        Authored on Authored By:
Case File Identifier:                           
2017-023590             4/3/17 7:41     Jane Doe
2017-023592            4/5/17 23:32     John Doe
2017-023594            4/1/17 13:11     John Doe
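
One caveat worth flagging: the snippet above sorts `Authored on` exactly as it sits in the frame. If that column is still text, the order is lexicographic, and with the raw sample strings from the question `'4/3/2017 09:32:00'` would sort ahead of `'4/3/2017 7:41:00'`. Below is a minimal sketch of a variant that parses the timestamps first and then uses `idxmin` to pull the whole earliest row per case. The sample frames are rebuilt from the question's data, the month/day/year format string is an assumption about how the dates are written, and the names `matched` and `earliest` are purely illustrative.

```python
import pandas as pd

# Sample frames rebuilt from the question's example data.
nok_data = pd.DataFrame({
    "Authored By:": ["John Doe", "John Doe", "Jane Doe", "Jane Doe",
                     "Jane Doe", "John Doe", "John Doe"],
    "Case File Identifier:": ["2017-023594", "2017-023594", "2017-023590",
                              "2017-023590", "2017-023590", "2017-023592",
                              "2017-023592"],
    "Authored on": ["4/1/2017 13:24:35", "4/1/2017 13:11:20",
                    "4/3/2017 09:32:00", "4/3/2017 07:43:23",
                    "4/3/2017 7:41:00", "4/5/2017 23:32:35",
                    "4/6/2017 00:00:35"],
})
main_data = pd.DataFrame(
    {"ITX Case ID": ["2017-023597", "2017-023594", "2017-023592", "2017-023590"]}
)

# Assumption: dates are month/day/year. Parsing makes comparisons chronological.
nok_data["Authored on"] = pd.to_datetime(
    nok_data["Authored on"], format="%m/%d/%Y %H:%M:%S"
)

# Keep only notes whose case ID appears in main_data.
matched = nok_data[nok_data["Case File Identifier:"].isin(main_data["ITX Case ID"])]

# idxmin returns, per case, the row label of the earliest timestamp.
earliest_idx = matched.groupby("Case File Identifier:")["Authored on"].idxmin()

# Select those rows so the author stays paired with its own timestamp.
earliest = (
    matched.loc[earliest_idx, ["Case File Identifier:", "Authored on", "Authored By:"]]
    .set_index("Case File Identifier:")
)
print(earliest)
```

Selecting by `idxmin` keeps each author attached to the note it belongs to, which a column-wise `min` over both columns would not guarantee.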

Using df.to_csv(...) would probably be easier.
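
As a rough illustration of that suggestion, the sketch below rebuilds the small result frame shown above by hand (so it runs on its own) and serializes it with `to_csv`. The variable name `csv_text` and the output file name in the comment are assumptions for the example, not part of the original answer.

```python
import pandas as pd

# Hand-built copy of the result frame shown in the answer's output.
df = pd.DataFrame(
    {
        "Authored on": ["4/3/17 7:41", "4/5/17 23:32", "4/1/17 13:11"],
        "Authored By:": ["Jane Doe", "John Doe", "John Doe"],
    },
    index=pd.Index(
        ["2017-023590", "2017-023592", "2017-023594"],
        name="Case File Identifier:",
    ),
)

# With no path argument, to_csv returns the CSV as a string.
csv_text = df.to_csv()
print(csv_text)

# To write a file instead, pass a path (file name here is hypothetical):
# df.to_csv("finaloutput.csv")
```

This replaces the manual `csv.writer` loop, and the index (the case identifier) is written as the first column.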

Items in main_data['ITX Case ID'] that have no matching record have been ignored, but they could be included if needed.
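
One possible way to include them, sketched under the assumption that the IDs in `main_data` are unique (as the question states), is to reindex the grouped result by the full ID list. The frames below are hand-built copies of the sample data, and the name `full` is illustrative.

```python
import pandas as pd

# Hand-built copy of the grouped result from the answer.
df = pd.DataFrame(
    {
        "Authored on": ["4/3/17 7:41", "4/5/17 23:32", "4/1/17 13:11"],
        "Authored By:": ["Jane Doe", "John Doe", "John Doe"],
    },
    index=pd.Index(
        ["2017-023590", "2017-023592", "2017-023594"],
        name="Case File Identifier:",
    ),
)
main_data = pd.DataFrame(
    {"ITX Case ID": ["2017-023597", "2017-023594", "2017-023592", "2017-023590"]}
)

# Reindexing by every ID keeps matched rows and adds NaN rows for IDs
# that had no notes at all (2017-023597 in the sample).
full = df.reindex(main_data["ITX Case ID"])
print(full)
```

A left merge from `main_data` onto the grouped result would be another reasonable route to the same table.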

Regarding "python - Pandas - Matching reference numbers to find the earliest date", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45825286/
