python - Processing data with row headers in a Python CSV

Tags: python python-2.7 csv pandas

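I have a csv file in which the first row is a product name, the second row is the data header, and the actual data (each user's status) starts on the third row.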

The csv file looks like this:

adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
b212345,nurhanani,Check
b843432,nasirah,Call
b712345,ibrahim,Check
nike,,
USER_ID,USER_NAME
b842134,khalee,Call
h123455,shabree,Process
b777345,ibrahim,Process
b012345,zaihan,Check
b843432,nasirah,Call
b312451,nurhanani,Process

I want to split the data by product and rearrange the header and data as follows:

From header like this

   adidas,,
   USER_ID,USER_NAME
   b012345,zaihan,Process

To header like this

  USER_ID,USER_NAME,adidas
  b012345,zaihan,Process
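
The header rearrangement described above can be sketched with the standard `csv` module. This is only an illustration written for Python 3; the helper names `split_by_product` and `with_fixed_header` are hypothetical, and it assumes a product line always has a non-empty first cell followed only by empty cells (for example `adidas,,`).

```python
import csv
import io

SAMPLE = """adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
nike,,
USER_ID,USER_NAME
b842134,khalee,Call
"""

def split_by_product(lines):
    """Group data rows under the product line that precedes them."""
    blocks = {}
    product = None
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if row[0] and not any(row[1:]) and len(row) > 1:
            # A line such as "adidas,," starts a new product block.
            product = row[0]
            blocks[product] = []
        elif row[0] == "USER_ID":
            continue  # the two-column header is rebuilt later
        elif product is not None:
            blocks[product].append(row)
    return blocks

def with_fixed_header(product, rows):
    """Prepend a three-column header whose last column is the product."""
    return [["USER_ID", "USER_NAME", product]] + rows

blocks = split_by_product(io.StringIO(SAMPLE))
adidas_table = with_fixed_header("adidas", blocks["adidas"])
```

Each entry of `blocks` can then be written out or loaded into its own DataFrame, with the product name used as the status column.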

Then create a DataFrame for each product and merge them like this:

[image: merged result]

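I have been working on the code for a while. I think I have to hard-code the headers (e.g. "adidas" and "nike"), because what I understood from reading SO answers is that I need unique header names. The following code does not give me what I want.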

My Python code is:

import csvkit
import sys
import os
from csvkit import convert

with open('/tmp/csvdata.csv', 'rb') as q:
    reader = csvkit.reader(q)
    with open('/tmp/csvdata2.csv', 'wb') as s:
        data = csvkit.writer(s)
        data.writerow(['Name', 'Userid', 'adidas', 'nike'])
        for row in reader:
            row_data = [row[0], row[1], row[2], '']
            data = csvkit.writer(s)
            data.writerow(row_data)

Edit

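So I got a solution from @piRSquared. It is correct when each product has a unique set of records, but a user may have more than one status for the same product. In that case the solution raises ValueError: Index contains duplicate entries, cannot reshape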

An example of input CSV data that has multiple statuses and triggers this problem:

adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
h003455,shabree,Check
b212345,nurhanani,Check
b843432,nasirah,Call
b712345,ibrahim,Check
b712345,ibrahim,Process
nike,,
USER_ID,USER_NAME
b842134,khalee,Call
h123455,shabree,Process
b777345,ibrahim,Process
b012345,zaihan,Check
b843432,nasirah,Call
b312451,nurhanani,Process
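
The error can be reproduced in isolation. The following is a minimal sketch for Python 3, assuming the data has already been flattened into a long table; the column names `product` and `status` are hypothetical and do not come from the original post. Two rows share the same user and product, so pandas cannot decide which status belongs in the single cell for that pair.

```python
import pandas as pd

# Two statuses for the same user under the same product.
rows = pd.DataFrame(
    {
        "USER_ID": ["b712345", "b712345"],
        "USER_NAME": ["ibrahim", "ibrahim"],
        "product": ["adidas", "adidas"],
        "status": ["Check", "Process"],
    }
)

try:
    # unstack needs each (USER_ID, USER_NAME, product) combination to be unique.
    rows.set_index(["USER_ID", "USER_NAME", "product"])["status"].unstack("product")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

After this runs, `error_message` holds the text of the `ValueError` raised by the reshape.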

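I would like to achieve something like this: a user within the same brand can appear more than once with the same id and name, carrying both a Process and a Check status.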

USER_ID,USER_NAME,adidas,nike
b012345,zaihan,Process
h003455,shabree,Check,Process
b212345,nurhanani,Check,Process
b843432,nasirah,Call,Call
b712345,ibrahim,Check
b712345,ibrahim,Process 
b777345,ibrahim,,Process
b842134,khalee,,Call

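For users who have both "Check" and "Process" within the same brand, the final result should contain an additional row as shown above (in this example, user ibrahim, b712345).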

Best answer

OK, this is complicated.

Solution

from StringIO import StringIO
import re
import pandas as pd

text = """adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
b212345,nurhanani,Check
b451234,nasirah,Call
c234567,ibrahim,Check
nike,,
USER_ID,USER_NAME
b842134,khalee,Call
h123455,shabree,Process
c234567,ibrahim,Process
c143322,zaihan,Check
b451234,nasirah,Call
"""

m = re.findall(r'(.*,,\n(.*([^,]|,[^,])\n)*)', text)

dfs = [None] * len(m)
keys = [None] * len(m)
for i, f in enumerate(m):
    lines = f[0].split('\n')
    lines[1] += ','
    keys[i] = lines[0].split(',')[0]
    dfs[i] = pd.read_csv(StringIO('\n'.join(lines[1:])))

df = pd.concat(dfs, keys=keys)
df = df.set_index(['USER_ID', 'USER_NAME'], append=True).unstack(0)

df.index = df.index.droplevel(0)
df.columns = df.columns.droplevel(0)

df = df.stack().unstack()

Demo

print(df.to_csv())

USER_ID,USER_NAME,adidas,nike
b012345,zaihan,Process,
b212345,nurhanani,Check,
b451234,nasirah,Call,Call
b842134,khalee,,Call
c143322,zaihan,,Check
c234567,ibrahim,Check,Process
h123455,shabree,,Process
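
The solution above targets Python 2.7 (for example, it imports from the `StringIO` module). As a rough alternative for Python 3, the same reshaping can be sketched by first building a long table and then pivoting it. This is not the answerer's method; the helper `to_long_frame` and the column names `product` and `status` are hypothetical, and the sketch assumes each user appears at most once per product.

```python
import io
import pandas as pd

TEXT = """adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
b212345,nurhanani,Check
b451234,nasirah,Call
c234567,ibrahim,Check
nike,,
USER_ID,USER_NAME
b842134,khalee,Call
h123455,shabree,Process
c234567,ibrahim,Process
c143322,zaihan,Check
b451234,nasirah,Call
"""

def to_long_frame(text):
    """Return one row per (user, product) with the status as a column."""
    records = []
    product = None
    for line in io.StringIO(text):
        cells = line.rstrip("\n").split(",")
        if len(cells) == 3 and cells[0] and not cells[1] and not cells[2]:
            product = cells[0]  # a product line such as "adidas,,"
        elif len(cells) == 3 and product is not None:
            records.append((cells[0], cells[1], product, cells[2]))
        # Two-cell lines are the USER_ID,USER_NAME headers and are skipped.
    return pd.DataFrame(
        records, columns=["USER_ID", "USER_NAME", "product", "status"]
    )

long_df = to_long_frame(TEXT)

# Pivot: one row per user, one column per product.
wide = (
    long_df.set_index(["USER_ID", "USER_NAME", "product"])["status"]
    .unstack("product")
)
```

Calling `print(wide.to_csv())` writes the pivoted table as CSV, with empty cells where a user has no status for a product.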

Explanation

# regular expression to match line with a single value identified
# by having two commas at the end of the line.
# This grabs nike and adidas.
# It also grabs all lines after that until the next single valued line.
m = re.findall(r'(.*,,\n(.*([^,]|,[^,])\n)*)', text)

# placeholder list for the sub dataframes
dfs = [None] * len(m)
# placeholder list for the keys.  In this example these will be adidas and nike
keys = [None] * len(m)

# Loop through each regex match.  This example will only have 2.
for i, f in enumerate(m):
    # split on new line so I can grab and fix stuff
    lines = f[0].split('\n')
    # Fix that header row only has 2 columns and data has 3
    lines[1] += ','
    # Grab nike or adidas or other single value
    keys[i] = lines[0].split(',')[0]
    # Create dataframe by reading in rest of lines
    dfs[i] = pd.read_csv(StringIO('\n'.join(lines[1:])))

# Concat dataframes with appropriate keys and pivot stuff
df = pd.concat(dfs, keys=keys)
df = df.set_index(['USER_ID', 'USER_NAME'], append=True).unstack(0)

df.index = df.index.droplevel(0)
df.columns = df.columns.droplevel(0)

df = df.stack().unstack()
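
For the case raised in the question's edit, where one user has several statuses under the same product, one possible workaround is to number the repeated rows before pivoting so that every index entry is unique. This is a sketch for Python 3 under assumptions, not part of the original answer: it starts from a long table with hypothetical columns `product` and `status`, and the helper column `occurrence` is introduced only for illustration.

```python
import pandas as pd

long_df = pd.DataFrame(
    [
        ("b012345", "zaihan", "adidas", "Process"),
        ("b712345", "ibrahim", "adidas", "Check"),
        ("b712345", "ibrahim", "adidas", "Process"),
        ("b012345", "zaihan", "nike", "Check"),
        ("b777345", "ibrahim", "nike", "Process"),
    ],
    columns=["USER_ID", "USER_NAME", "product", "status"],
)

# Number repeated (user, product) rows 0, 1, 2, ... in order of appearance.
keys = ["USER_ID", "USER_NAME", "product"]
long_df["occurrence"] = long_df.groupby(keys).cumcount()

# With the counter in the index every entry is unique, so unstack succeeds.
wide = (
    long_df.set_index(["USER_ID", "USER_NAME", "occurrence", "product"])["status"]
    .unstack("product")
    .reset_index("occurrence", drop=True)
)
```

A user with two statuses under one product then occupies two rows, and the other product's column is filled only on the rows where a matching occurrence exists.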

Regarding "python - Processing data with row headers in a Python CSV", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37372311/
