Apologies if this is a very basic question, and thank you for taking the time to look at it.
I have CSV data in the following format.
2019-05-10 13:00:00 some_data,some_more_data,...
2019-05-10 16:20:10 some_data,some_more_data,...
2019-05-10 19:21:10 some_data,some_more_data,...
2019-05-11 01:10:10 some_data,some_more_data,...
2019-05-11 12:24:10 some_data,some_more_data,...
2019-05-12 01:10:10 some_data,some_more_data,...
2019-05-12 12:24:10 some_data,some_more_data,...
2019-05-12 23:10:10 some_data,some_more_data,...
2019-05-12 12:24:10 some_data,some_more_data,...
From the data listed above, how can I keep only the row with the last timestamp of each day?
I got the result I wanted with some string parsing, but I am looking for a more efficient way or an alternative.
The desired output is:
2019-05-10 19:21:10 some_data,some_more_data,...
2019-05-11 12:24:10 some_data,some_more_data,...
2019-05-12 23:10:10 some_data,some_more_data,...
I tried some really ugly string splitting and datetime comparison:
import csv
from datetime import datetime

monday_morning_report_data = 'C:\\Users\\a071927\\Dropbox\\monday_morning_report\\monday_morning_report_data\\test.csv'
# Open the CSV file to read data from it.
open_report_file_to_read = open(monday_morning_report_data, 'r', newline='')
monday_morning_report_generation = csv.reader(open_report_file_to_read)
# Create an empty list which will gather a list of all dates only - %Y-%m-%d
list_of_all_dates = list()
# From each row of the csv file, which is a list with ONE string.
for each_timestamp_info in monday_morning_report_generation:
    # Split the string into a list.
    time_stamp_all_data = each_timestamp_info[0].split(',')
    # From the split list, get the index 0 which is the complete timestamp.
    time_stamp_info_date_time_str = time_stamp_all_data[0]
    # gather only %Y-%m-%d by splitting at ' '
    time_stamp_info_date_time_str_date_only = time_stamp_info_date_time_str.split(' ')[0]
    # if that day is not in list_of_all_dates append it.
    if time_stamp_info_date_time_str_date_only not in list_of_all_dates:
        list_of_all_dates.append(time_stamp_info_date_time_str_date_only)
# now list_of_all_dates has the list of all unique days.
for each_day in list_of_all_dates:
    open_report_file_to_read = open(monday_morning_report_data, 'r', newline='')
    monday_morning_report_generation = csv.reader(open_report_file_to_read)
    # Gather TIMES within each unique day.
    list_of_times_in_the_given_day = list()
    # From each row of the csv file, which is a list with ONE string.
    for each_timestamp_info in monday_morning_report_generation:
        # Split the string into a list.
        time_stamp_all_data = each_timestamp_info[0].split(',')
        # From the split list, get the index 0 which is the complete timestamp.
        time_stamp_info_date_time_str = time_stamp_all_data[0]
        # gather only %Y-%m-%d by splitting at ' ' - index 0
        time_stamp_info_date_time_str_date_only = time_stamp_info_date_time_str.split(' ')[0]
        # gather only '%H:%M:%S' splitting at ' ' - index 1
        time_stamp_info_date_time_str_time_only = time_stamp_info_date_time_str.split(' ')[1]
        if each_day == time_stamp_info_date_time_str_date_only:
            list_of_times_in_the_given_day.append(time_stamp_info_date_time_str_time_only)
            #print(time_stamp_info_date_time_str_time_only)
    # initialize a max timestamp default of 00:00:00
    max_time_stamp_within_a_day = datetime.strptime('00:00:00', '%H:%M:%S')
    # initialize string with '' - this will be populated later.
    max_time_stamp_within_a_day_str = ''
    # Now from the list of unique times within a given day.
    for each_time in list_of_times_in_the_given_day:
        if datetime.strptime(each_time, '%H:%M:%S') >= max_time_stamp_within_a_day:
            # update the max time - date time value
            max_time_stamp_within_a_day = datetime.strptime(each_time, '%H:%M:%S')
            # update the string.
            max_time_stamp_within_a_day_str = each_time
    # once the max time / last time within a day is calculated.
    final_timestamp = each_day + ' ' + max_time_stamp_within_a_day_str
    # Print given unique day.
    print(each_day)
    # print list of times data was gathered during this day
    print(list_of_times_in_the_given_day)
    # print the final and latest timestamp.
    print(final_timestamp)
    open_report_file_to_read = open(monday_morning_report_data, 'r', newline='')
    monday_morning_report_generation = csv.reader(open_report_file_to_read)
    for each_timestamp_info in monday_morning_report_generation:
        time_stamp_all_data = each_timestamp_info[0].split(',')
        time_stamp_info_date_time_str = time_stamp_all_data[0]
        # From the final timestamp get the data.
        if time_stamp_info_date_time_str == final_timestamp:
            print(each_timestamp_info)
    print('---------')
open_report_file_to_read.close()
Is there a more efficient way to achieve the same result?
Best answer
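You can use pandas to do this. One thing to note is that your CSV data has no comma between the date and some_data, so I preprocessed the data to split them. Also note that the solution below only works if the data is sorted by date. If it is not, you can add df.sort_index(inplace=True) after the set_index call below.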
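The preprocessing step is mentioned but not shown, so here is a minimal sketch of what it might look like. It assumes every raw line starts with a fixed-width `YYYY-MM-DD HH:MM:SS` timestamp followed by a space. The helper names `split_timestamp` and `preprocess` and the two sample rows are made up for illustration.

```python
import csv
import io

# Assumption: the timestamp always occupies the first 19 characters.
TIMESTAMP_WIDTH = len("2019-05-10 13:00:00")


def split_timestamp(line):
    """Split 'YYYY-MM-DD HH:MM:SS rest' into (timestamp, rest)."""
    line = line.rstrip("\n")
    return line[:TIMESTAMP_WIDTH], line[TIMESTAMP_WIDTH:].lstrip()


def preprocess(raw_lines):
    """Return CSV text in which the timestamp is its own first column."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in raw_lines:
        if not line.strip():
            continue  # skip blank lines
        timestamp, rest = split_timestamp(line)
        writer.writerow([timestamp] + rest.split(","))
    return out.getvalue()


# Hypothetical sample rows in the same shape as the question's data.
raw = [
    "2019-05-10 13:00:00 some_data,some_more_data\n",
    "2019-05-10 16:20:10 some_data,some_more_data\n",
]
print(preprocess(raw))
```

The resulting text can be written to a file, or wrapped in `io.StringIO` and passed straight to `pd.read_csv`.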
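import pandas as pd

# Assumes the preprocessed file has no header row, as in the sample data.
df = pd.read_csv('path_to_csv.csv', header=None)
date_col = df.columns[0]
# Parse the first column into datetime values.
df[date_col] = pd.to_datetime(df[date_col])
df.set_index(date_col, inplace=True)
# Truncate each timestamp to the start of its day.
indices = df.index.floor('D')
# Keep only the last row of each day.
new_df = df[~indices.duplicated(keep='last')]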
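Essentially, what we are doing here is parsing the date column into datetime objects and setting it as the DataFrame's index. We then take that index floored to the day. This effectively produces a sequence of dates, from which we can drop the duplicates while keeping the position of the last value in each group of duplicates.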
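For comparison, the same "group by day, keep the last row" logic can be sketched with only the standard library, in a single pass over the rows. This is an illustration rather than part of the pandas solution. The function name `last_row_per_day` and the sample rows are made up, and each row is assumed to already have the timestamp as its own first field.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def last_row_per_day(rows):
    """Return the row with the latest timestamp for each day, ordered by day.

    Each row is a list whose first item is a 'YYYY-MM-DD HH:MM:SS' string.
    """
    latest = {}  # maps date -> (timestamp, row)
    for row in rows:
        stamp = datetime.strptime(row[0], TIMESTAMP_FORMAT)
        day = stamp.date()  # plays the role of flooring to the day
        # Keep this row if the day is new or the timestamp is not earlier.
        if day not in latest or stamp >= latest[day][0]:
            latest[day] = (stamp, row)
    return [latest[day][1] for day in sorted(latest)]


# Hypothetical rows, deliberately not sorted within each day.
rows = [
    ["2019-05-10 13:00:00", "a"],
    ["2019-05-10 19:21:10", "b"],
    ["2019-05-11 12:24:10", "c"],
    ["2019-05-11 01:10:10", "d"],
    ["2019-05-12 23:10:10", "e"],
    ["2019-05-12 12:24:10", "f"],
]

for row in last_row_per_day(rows):
    print(row)
```

Because it compares timestamps instead of relying on row order, this sketch does not need the input to be sorted, at the cost of parsing every timestamp in Python.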
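Regarding "python - From a list of datetime timestamps spanning multiple days, how do I find the last timestamp of each day?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56118123/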