python - How to clean datetime strings in a DataFrame after exporting from an Excel sheet?

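I have an Excel spreadsheet in which one column contains datetime data. I used Pandas to load the data into a DataFrame. However, some blocks of dates in that column have the month and day swapped, while other blocks of dates in the same column are correct. Here is an example -

Figure 1: Day and month incorrectly swapped

The figure above shows the swapped day and month. The date appears as 2016-01-10, but it should be 2016-10-01. Compare this with another set of datetime values in the same column -

Figure 2: Day and month represented correctly

In the example in Figure 2, the month is correctly shown as 12 and the day as 31.

I used the solution from this question - How to swap months and days in a datetime object?

I also tried this solution - Python Pandas - Day and Month mix up

I also tried writing my own function to map over the entries, but that did not help either -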

def dm_swap(day, month):
    if(month != 10 or month != 11 or month != 12):
        temp = day
        day = month
        month = temp

t2016Q4.start.map(dmswap, t2016Q4.dt.day, t2016Q4.dt.month)

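However, both solutions change every datetime value in the column. So when the incorrect values are corrected, the correct values become incorrect.

For convenience, I have also linked the Excel file. It is an open dataset.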

https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#343faeaa-c920-57d6-6a75-969181b6cbde

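Please select the last dataset, Bikeshare Ridership (2016 Q4). The "start" and "end" columns have the problem described above.

Is there a more efficient way to clean the datetime data?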

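Best answer

OK.

Edited again. I ran the loop-based code further below and it took a very long time! I aborted it in the end, but this version definitely works as well, and in a reasonable amount of time. Good luck!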

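import pandas as pd

f = r"string\to\file\here.xlsx"  # raw string, so \t and \f are not read as escapes
df = pd.read_excel(f)

def alter_date(timestamp):
    try:
        # Format as year-day-month, then re-parse as year-month-day to swap them.
        date_time = timestamp.strftime("%Y-%d-%m %H:%M:%S")
        time_stamp = pd.Timestamp(date_time)
        return time_stamp
    except (ValueError, AttributeError):
        # Not a date, or the swapped string could not be parsed: keep the value.
        return timestamp

new_starts = df["trip_start_time"].apply(alter_date)
df["trip_start_time"] = new_starts
new_ends = df["trip_stop_time"].apply(alter_date)
df["trip_stop_time"] = new_ends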
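The same format-then-reparse idea can probably be applied to a whole column at once instead of value by value. The following is only a sketch under assumptions: `swap_day_month` is a hypothetical helper name, the column is assumed to already hold datetimes, and every value whose day is 12 or less is assumed to have been misread (which is the premise of this answer, not something checked against the data).

```python
import pandas as pd

def swap_day_month(series: pd.Series) -> pd.Series:
    """Swap day and month wherever the swapped value is a valid date.

    Hypothetical helper: assumes `series` holds datetime values and that
    every value that can be swapped was misread in the first place.
    """
    # Write each value as year-day-month text ...
    as_text = series.dt.strftime("%Y-%d-%m %H:%M:%S")
    # ... and read it back as year-month-day. Values whose day is above 12
    # cannot be read this way and become NaT because of errors="coerce".
    flipped = pd.to_datetime(as_text, format="%Y-%m-%d %H:%M:%S", errors="coerce")
    # Wherever the swap failed, fall back to the original value.
    return flipped.fillna(series)

# Small made-up example, not taken from the real dataset.
sample = pd.Series(pd.to_datetime(["2016-01-10 08:30:00", "2016-12-31 23:59:00"]))
fixed = swap_day_month(sample)
```

With the answer's column names this would be used as `df["trip_start_time"] = swap_day_month(df["trip_start_time"])`, and likewise for `trip_stop_time`. Fractions of a second are dropped by the text round trip, just as in the `apply` version.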

Edit: I did some digging and, based on what I did before, this looks possible. Here is the new code:

import pandas as pd

f = r"string\to\file\here.xlsx"  # raw string, so \t and \f are not read as escapes
df = pd.read_excel(f)

for idx in df.index:
    # idx is already an index label, so use it directly with .loc
    trip_start = df.loc[idx, "trip_start_time"]
    trip_end = df.loc[idx, "trip_stop_time"]
    try:
        # Format as year-day-month, then re-parse as year-month-day to swap them.
        start_dt_string = trip_start.strftime("%Y-%d-%m %H:%M:%S")
        end_dt_string = trip_end.strftime("%Y-%d-%m %H:%M:%S")
        start_ts = pd.Timestamp(start_dt_string)
        end_ts = pd.Timestamp(end_dt_string)
        df.loc[idx, "trip_start_time"] = start_ts
        df.loc[idx, "trip_stop_time"] = end_ts
    except ValueError:
        pass  # the swapped string could not be parsed; leave the row unchanged

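It is a bit slow (there are a lot of rows), but my computer seems to be working through it. I will update again if it fails.

Old reply: So, what is happening is that every date/time that cannot be ambiguous is stored in the original dataset in the format DD/MM/YYYY HH:MM:SS.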

If a value could be coerced to MM/DD/YY HH:MM:SS, then it has been.
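To make that behaviour concrete, here is a small model of it. This is an illustration only: `read_month_first_if_possible` is a hypothetical name, and the code does not claim to reproduce what Excel or pandas actually do internally. It just tries a month-first reading and falls back to day-first when that is impossible.

```python
from datetime import datetime

def read_month_first_if_possible(text):
    """Model of the mix-up (hypothetical): prefer month-first, else day-first."""
    try:
        # Succeeds whenever the first number is a valid month (1 to 12).
        return datetime.strptime(text, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        # The first number cannot be a month, so it must be the day.
        return datetime.strptime(text, "%d/%m/%Y %H:%M:%S")

# Both strings are written day-first: 1 October 2016 and 31 December 2016.
ambiguous = read_month_first_if_possible("01/10/2016 08:30:00")
unambiguous = read_month_first_if_possible("31/12/2016 23:59:00")
```

In this model the first value comes back as 10 January (wrong) and the second as 31 December (right), which matches the pattern in the two figures of the question: only dates whose day is 12 or less can end up swapped.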

What I would do is iterate over each row of the column:

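from datetime import datetime

for row in df.index:
    value = str(df.loc[row, "trip_start_time"])  # e.g. "2016-01-10 00:00:00"
    try:
        # Read the text as year-day-month, which swaps day and month.
        new_dt = datetime.strptime(value, "%Y-%d-%m %H:%M:%S")
        df.loc[row, "trip_start_time"] = new_dt  # write back to the df here
    except ValueError:
        pass  # ignore anything that cannot be converted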

Regarding "python - How to clean datetime strings in a DataFrame after exporting from an Excel sheet?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53500872/
