python - Pandas : merge on column of ByteArray

Tags: python merge arrays teradata

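Any ideas on how to join two pandas DataFrames on a commonly named bytearray field? The field in the source (Teradata) is an actual ByteArray, and on the Teradata side it cannot be cast to a character type or to anything usable outside Teradata.

The Teradata export reads into pandas DataFrames perfectly. However, I cannot merge two tables that share a commonly named field (DatabaseId) when that field is a byte array.

(pandas is imported as pd, and itertools is imported as well)

When I try a simple merge: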

merge1 = pd.merge(tvm, dbase, on="DatabaseId")

I get the following error:

TypeError: type object argument after * must be a sequence, not itertools.imap

I searched Stack Overflow and found a similar problem for joining on a cell containing a collection, so I tried the approach suggested there:

dbase['DBID'] = dbase.DatabaseId.apply(lambda r: type(sorted(r.iteritems())))

But I get this error:

AttributeError: 'bytearray' object has no attribute 'iteritems'

Update

Data sample

The data was collected via pandas:
dbase = pd.read_sql('select databaseid, databasename from ud812.dbase sample 10', conn)
where conn is a connection to a Teradata database.

The data types coming from Teradata are Varchar for all columns, except:

DatabaseID = bytearray (Byte(4))
TVMID = bytearray (Byte(4))

>>> dbase.dtypes
DatabaseId      object
DatabaseName    object
dtype: object
>>> dbase
         DatabaseId         DatabaseName
0  [2, 0, 243, 185]  PCDW_CRS_BBCONV3_TB
1  [2, 0, 168, 114]            PAMLIF_TB
2  [2, 0, 133, 153]        PADW_PRESN_TB
3   [2, 0, 29, 184]       CEDW_MOBILE_TB
4  [2, 0, 190, 183]  CEDW_MODEL_SCORE_TB
5    [2, 0, 71, 55]            PBBBAM_TB
6  [2, 0, 169, 183]          CEDW_OCC_TB
7  [2, 0, 201, 183]    CCDW_DGTL_DEAL_TB
8    [0, 0, 139, 8]           PRECDSS_TB
9  [2, 0, 142, 203]             CDBDW_TB
>>>
>>>
>>> tvm.dtypes
TVMId         object
DatabaseId    object
TVMName       object
TableKind     object
CreateText    object
dtype: object
>>> tvm
                      TVMId        DatabaseId                        TVMName  \
0    [230, 1, 41, 11, 0, 0]   [2, 0, 67, 183]               JCP_03538_112002
1   [214, 1, 60, 133, 0, 0]   [2, 0, 186, 52]        STL_AUTHNCTD_RULE_EXECN
2    [193, 2, 59, 48, 0, 0]  [2, 0, 225, 150]       uye177_Xsell_EM_OPCL_TB2
3    [0, 2, 235, 154, 0, 0]  [2, 0, 244, 181]  PL_CALCD_INVSTR_MTHLY_HIST_ST
4   [255, 1, 131, 76, 0, 0]   [2, 0, 110, 63]            IMH867_AVA0803_SNAP
5  [125, 1, 217, 138, 0, 0]  [2, 0, 237, 153]            FD_ACCT_STMT_ADR_ST
6   [224, 0, 80, 233, 0, 0]  [2, 0, 243, 127]             EXP_SRCH_RSLT_DESC
7    [208, 1, 72, 15, 0, 0]     [2, 0, 8, 57]      SGI_PAY_DENIED_SEP_112012
8    [246, 0, 27, 61, 0, 0]  [2, 0, 143, 130]                      CR_INDIVD
9  [186, 1, 242, 167, 0, 0]   [0, 0, 244, 18]                 wzu448_sb_apps

  TableKind                                         CreateText
0         T                                               None
1         V  CREATE VIEW  ... ... ... ... ... ... ... ... ...
2         T                                               None
3         V  CREATE VIEW  ... ... ... ... ... ... ... ... ...
4         T                                               None
5         V  CREATE VIEW  ... ... ... ... ... ... ... ... ...
6         V  CREATE VIEW  ... ... ... ... ... ... ... ... ...
7         V  CREATE VIEW  ... ... ... ... ... ... ... ... ...
8         V  CREATE VIEW  ... ... ... ... ... ... ... ... ...
9         T                                               None
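
Both frames report dtype object for every column, so the dtypes output alone does not show which Python type sits in each cell. The sketch below is one way to check, assuming a small stand-in frame built from two of the sample rows above (the stand-in data and the name cell_types are illustrative, not part of the original session):

```python
import pandas as pd

# Stand-in for a frame read from Teradata; values copied from the sample above
dbase = pd.DataFrame({
    'DatabaseId': [bytearray([2, 0, 243, 185]), bytearray([2, 0, 168, 114])],
    'DatabaseName': ['PCDW_CRS_BBCONV3_TB', 'PAMLIF_TB'],
})

# For each column, collect the names of the Python types found in its cells
cell_types = {col: set(dbase[col].map(lambda v: type(v).__name__))
              for col in dbase.columns}
print(cell_types)
```

A column whose set contains 'bytearray' is a candidate for conversion before it is used as a merge key.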

Best answer

Convert your bytearray to its immutable cousin, bytes:
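
A likely reason this helps is that join keys generally need to be hashable, and a mutable bytearray is not, while bytes is. The following pure-Python sketch illustrates the difference, using one DatabaseId value from the question (the variable names are illustrative only):

```python
key = bytearray([2, 0, 243, 185])

# bytearray is mutable, so Python refuses to hash it
try:
    hash(key)
    bytearray_hashable = True
except TypeError:
    bytearray_hashable = False

# bytes holds the same data but is immutable and hashable,
# so it can serve as a dictionary key
frozen = bytes(key)
lookup = {frozen: 'PCDW_CRS_BBCONV3_TB'}

print(bytearray_hashable)    # False
print(frozen == key)         # True: equal contents compare equal
print(lookup[bytes(key)])    # PCDW_CRS_BBCONV3_TB
```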

import pandas as pd

# Create your example `dbase`
DatabaseId_dbase = list(map(bytearray, [[2, 0, 243, 185], [2, 0, 168, 114],
    [2, 0, 133, 153], [2, 0, 29, 184], [2, 0, 190, 183], [2, 0, 71, 55],
    [2, 0, 169, 183], [2, 0, 201, 183], [0, 0, 139, 8], [2, 0, 142, 203]]))
DatabaseName = ['PCDW_CRS_BBCONV3_TB', 'PAMLIF_TB', 'PADW_PRESN_TB',
    'CEDW_MOBILE_TB', 'CEDW_MODEL_SCORE_TB', 'PBBBAM_TB', 'CEDW_OCC_TB',
    'CCDW_DGTL_DEAL_TB', 'PRECDSS_TB', 'CDBDW_TB']
dbase = pd.DataFrame({'DatabaseId': DatabaseId_dbase,
                      'DatabaseName': DatabaseName})

# Create your example `tvm`
DatabaseId_tvm = list(map(bytearray, [[2, 0, 67, 183], [2, 0, 186, 52],
    [2, 0, 225, 150], [2, 0, 244, 181], [2, 0, 110, 63], [2, 0, 237, 153],
    [2, 0, 243, 127], [2, 0, 243, 185], [2, 0, 143, 130], [0, 0, 244, 18]]))
TVMId = list(map(bytearray, [[230, 1, 41, 11, 0, 0], [214, 1, 60, 133, 0, 0],
    [193, 2, 59, 48, 0, 0], [0, 2, 235, 154, 0, 0], [255, 1, 131, 76, 0, 0],
    [125, 1, 217, 138, 0, 0], [224, 0, 80, 233, 0, 0], [208, 1, 72, 15, 0, 0],
    [246, 0, 27, 61, 0, 0], [186, 1, 242, 167, 0, 0]]))
TVMName = ['JCP_03538_112002', 'STL_AUTHNCTD_RULE_EXECN',
    'uye177_Xsell_EM_OPCL_TB2', 'PL_CALCD_INVSTR_MTHLY_HIST_ST',
    'IMH867_AVA0803_SNAP', 'FD_ACCT_STMT_ADR_ST', 'EXP_SRCH_RSLT_DESC',
    'SGI_PAY_DENIED_SEP_112012', 'CR_INDIVD', 'wzu448_sb_apps']
TableKind = ['T', 'V', 'T', 'V', 'T', 'V', 'V', 'V', 'V', 'T']
tvm = pd.DataFrame({'DatabaseId': DatabaseId_tvm, 'TVMId': TVMId,
                    'TVMName': TVMName, 'TableKind': TableKind})

# This line would fail with the following error
#     TypeError: type object argument after * must be a sequence, not map
# merge = pd.merge(tvm, dbase, on='DatabaseId')

# Apply the `bytes` constructor to the `bytearray` columns    
dbase['DatabaseId'] = dbase['DatabaseId'].apply(bytes)
tvm['DatabaseId'] = tvm['DatabaseId'].apply(bytes)
tvm['TVMId'] = tvm['TVMId'].apply(bytes)

# Now it works!
merge = pd.merge(tvm, dbase, on='DatabaseId')

The resulting merge:

   DatabaseId                     TVMId                    TVMName  \
0  b'\x02\x00\xf3\xb9'  b'\xd0\x01H\x0f\x00\x00'  SGI_PAY_DENIED_SEP_112012   

  TableKind         DatabaseName  
0         V  PCDW_CRS_BBCONV3_TB  

(I had to change the DatabaseId field in one row of your tvm, otherwise the merge would have been empty. I also left out the CreateText column; it was too awkward for SO.)
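
If many columns hold bytearray values, the per-column apply(bytes) calls can be wrapped in a small helper. The sketch below is one possible way to do that, not part of the original answer: bytearrays_to_bytes and the DatabaseIdHex column are hypothetical names, and the two-row frames reuse values from the example above.

```python
import pandas as pd

def bytearrays_to_bytes(df):
    """Return a copy of df with every bytearray cell replaced by bytes."""
    out = df.copy()
    for col in out.columns:
        # Only touch columns that actually contain bytearray values
        if out[col].map(lambda v: isinstance(v, bytearray)).any():
            out[col] = out[col].map(
                lambda v: bytes(v) if isinstance(v, bytearray) else v)
    return out

# Two-row stand-ins for the frames in the example above
tvm = pd.DataFrame({
    'DatabaseId': [bytearray([2, 0, 243, 185]), bytearray([0, 0, 244, 18])],
    'TVMName': ['SGI_PAY_DENIED_SEP_112012', 'wzu448_sb_apps'],
})
dbase = pd.DataFrame({
    'DatabaseId': [bytearray([2, 0, 243, 185]), bytearray([2, 0, 168, 114])],
    'DatabaseName': ['PCDW_CRS_BBCONV3_TB', 'PAMLIF_TB'],
})

merged = pd.merge(bytearrays_to_bytes(tvm), bytearrays_to_bytes(dbase),
                  on='DatabaseId')

# Optional: a hex string is easier to read than the raw bytes repr
merged['DatabaseIdHex'] = merged['DatabaseId'].map(lambda b: b.hex())
print(merged)
```

The helper returns copies, so the original frames keep their bytearray values in case they are needed elsewhere.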

Regarding python - Pandas : merge on column of ByteArray, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38245661/
