python - JSON / weird column conversion

Tags: python pandas

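I'm pulling some data from a Mongo database. The table contains several columns, and some of those columns are stored in a very weird format.

Example of one row of the column/Series: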

'[{idEvento.$oid=63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]'

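As far as my limited knowledge goes, this is not JSON. I'm struggling with how to convert each "event" (an item enclosed in {}) into a list.

After that, how can I query/filter the data based on what each event contains? Should I pd.explode the events into new rows and query them as strings?

Best answer

You can try to "convert" the string into proper JSON (using re) and then use the standard json.loads (Regex101 demo):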

import re
import json
import pandas as pd


s = "[{idEvento.$oid=63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]"

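# Quote every key=value pair so the text becomes valid JSON:
# key=value -> "key":"value" (every value ends up as a string)
s = re.sub(r"([^ =,\[\]\{\}]+)=([^ =,\[\]\{\}]+)", r'"\g<1>":"\g<2>"', s)
data = json.loads(s)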

df = pd.DataFrame(data)
print(df)

Prints:

              idEvento.$oid dataHoraEvento.$date codigoTipoEvento mesAnoReferenciaContabilizacao                                           _class
0  63ffaec3cdc01e6352729bad        1677690003377                1                         032023                                              NaN
1  63ffb5c8cdc01e6352729bae        1677691800676                3                         032023                                              NaN
2  6405cc8711c78c20369b4033        1678090851560                8                         032023                                              NaN
3  6422b4c97e45dd75abb4f831        1679985307560                6                         032023  br.com.bb.rcp.model.vantagens.HistoricoContabil
4  6422b4c97e45dd75abb4f832        1679985309584                6                         032023  br.com.bb.rcp.model.vantagens.HistoricoContabil

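Note: this works for this example, but the pattern may need adjusting for your real data (for instance, if keys or values contain spaces, commas, or equals signs).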
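To make the note above concrete, here is a minimal sketch of where the pattern stops working. The `descricao` field and its value are hypothetical, invented only to show a value that contains a space; they do not appear in the question's data. The helper name `to_json_text` is likewise just for illustration.

```python
import json
import re

# Same pattern as in the answer: key and value are runs of characters
# that are not a space, '=', ',', or a bracket/brace.
PAIR = re.compile(r"([^ =,\[\]\{\}]+)=([^ =,\[\]\{\}]+)")


def to_json_text(raw):
    """Rewrite key=value pairs as "key":"value" (hypothetical helper)."""
    return PAIR.sub(r'"\g<1>":"\g<2>"', raw)


# A row in the same shape as the question's data converts cleanly.
ok = to_json_text("[{codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}]")
print(json.loads(ok))

# Hypothetical row: the value "pagamento mensal" contains a space,
# so only "pagamento" is captured and " mensal" is left unquoted.
bad = to_json_text("[{descricao=pagamento mensal, codigoTipoEvento=1}]")
print(bad)

try:
    json.loads(bad)
    failed = False
except json.JSONDecodeError as exc:
    failed = True
    print("not valid JSON:", exc)
```

If your real rows contain values like this, the character classes in the pattern would have to be widened or replaced by a small hand-written parser.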


EDIT: To apply this to a dataframe:

Consider the following dataframe:

df = pd.DataFrame(
    {
        "col1": [
            "[{idEvento.$oid=01_63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]",
            "[{idEvento.$oid=02_63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]",
            "[{idEvento.$oid=03_63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]",
        ]
    }
)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               col1
0  [{idEvento.$oid=01_63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]
1  [{idEvento.$oid=02_63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]
2  [{idEvento.$oid=03_63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6405cc8711c78c20369b4033, dataHoraEvento.$date=1678090851560, codigoTipoEvento=8, mesAnoReferenciaContabilizacao=032023}, {idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}, {idEvento.$oid=6422b4c97e45dd75abb4f832, dataHoraEvento.$date=1679985309584, codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, _class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]

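Then:

def fn(x):
    # Quote each key=value pair, then parse the result as JSON
    x = re.sub(r"([^ =,\[\]\{\}]+)=([^ =,\[\]\{\}]+)", r'"\g<1>":"\g<2>"', x)
    return json.loads(x)

# Parse each cell into a list of dicts, give every event its own row,
# then expand each dict into columns
out = df["col1"].apply(fn).explode().apply(pd.Series)
print(out)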

Prints:

                 idEvento.$oid dataHoraEvento.$date codigoTipoEvento mesAnoReferenciaContabilizacao                                           _class
0  01_63ffaec3cdc01e6352729bad        1677690003377                1                         032023                                              NaN
0     63ffb5c8cdc01e6352729bae        1677691800676                3                         032023                                              NaN
0     6405cc8711c78c20369b4033        1678090851560                8                         032023                                              NaN
0     6422b4c97e45dd75abb4f831        1679985307560                6                         032023  br.com.bb.rcp.model.vantagens.HistoricoContabil
0     6422b4c97e45dd75abb4f832        1679985309584                6                         032023  br.com.bb.rcp.model.vantagens.HistoricoContabil
1  02_63ffaec3cdc01e6352729bad        1677690003377                1                         032023                                              NaN
1     63ffb5c8cdc01e6352729bae        1677691800676                3                         032023                                              NaN
1     6405cc8711c78c20369b4033        1678090851560                8                         032023                                              NaN
1     6422b4c97e45dd75abb4f831        1679985307560                6                         032023  br.com.bb.rcp.model.vantagens.HistoricoContabil
1     6422b4c97e45dd75abb4f832        1679985309584                6                         032023  br.com.bb.rcp.model.vantagens.HistoricoContabil
2  03_63ffaec3cdc01e6352729bad        1677690003377                1                         032023                                              NaN
2     63ffb5c8cdc01e6352729bae        1677691800676                3                         032023                                              NaN
2     6405cc8711c78c20369b4033        1678090851560                8                         032023                                              NaN
2     6422b4c97e45dd75abb4f831        1679985307560                6                         032023  br.com.bb.rcp.model.vantagens.HistoricoContabil
2     6422b4c97e45dd75abb4f832        1679985309584                6                         032023  br.com.bb.rcp.model.vantagens.HistoricoContabil
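On the second part of the question (querying by what each event contains): once the events are one per row, ordinary boolean filtering works, so there is no need to query them as strings. The following is only a sketch under some assumptions: the two input rows are shortened versions of the sample data, the name `parse_events` is made up for this example, and converting `codigoTipoEvento` to an integer and `dataHoraEvento.$date` to a timestamp (treating it as epoch milliseconds) is my reading of the data rather than something stated in the question.

```python
import json
import re

import pandas as pd

PAIR = re.compile(r"([^ =,\[\]\{\}]+)=([^ =,\[\]\{\}]+)")


def parse_events(raw):
    """Turn one cell into a list of dicts (hypothetical helper name)."""
    return json.loads(PAIR.sub(r'"\g<1>":"\g<2>"', raw))


# Shortened rows in the same format as the question's column.
df = pd.DataFrame({"col1": [
    "[{idEvento.$oid=63ffaec3cdc01e6352729bad, dataHoraEvento.$date=1677690003377, "
    "codigoTipoEvento=1, mesAnoReferenciaContabilizacao=032023}, "
    "{idEvento.$oid=6422b4c97e45dd75abb4f831, dataHoraEvento.$date=1679985307560, "
    "codigoTipoEvento=6, mesAnoReferenciaContabilizacao=032023, "
    "_class=br.com.bb.rcp.model.vantagens.HistoricoContabil}]",
    "[{idEvento.$oid=63ffb5c8cdc01e6352729bae, dataHoraEvento.$date=1677691800676, "
    "codigoTipoEvento=3, mesAnoReferenciaContabilizacao=032023}]",
]})

# One event per row; the index still points back to the original row.
exploded = df["col1"].apply(parse_events).explode()
events = pd.DataFrame(exploded.tolist(), index=exploded.index)

# The regex quotes everything, so values arrive as strings; convert as needed.
# Assumption: dataHoraEvento.$date is a Unix timestamp in milliseconds.
events["codigoTipoEvento"] = events["codigoTipoEvento"].astype("int64")
events["dataHoraEvento.$date"] = pd.to_datetime(
    events["dataHoraEvento.$date"].astype("int64"), unit="ms"
)

# Filter on the contents of the events.
type_6 = events[events["codigoTipoEvento"] == 6]
print(type_6)

# Original rows that contain at least one event of type 6.
rows_with_type_6 = df.loc[type_6.index.unique()]
```

Because `explode` keeps the original index, the filtered events can always be mapped back to the rows they came from, as in the last line.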

Regarding python - JSON / weird column conversion, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/76830940/
