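I have a CSV file in which the first line is the product name, the second line is the column header, and the actual data for each user's status starts on the third line.
The CSV file looks like this: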
adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
b212345,nurhanani,Check
b843432,nasirah,Call
b712345,ibrahim,Check
nike,,
USER_ID,USER_NAME
b842134,khalee,Call
h123455,shabree,Process
b777345,ibrahim,Process
b012345,zaihan,Check
b843432,nasirah,Call
b312451,nurhanani,Process
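I want to split the data by product and rearrange the header and the data as follows.
From a header like this:
adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
To a header like this:
USER_ID,USER_NAME,adidas
b012345,zaihan,Process
Then I want to create a DataFrame for each product and merge them into one table with one status column per product.
I have been working on the code for a while, and I think I have to hard-code the headers (e.g. "adidas" and "nike"), because what I understood from reading SO answers is that I need unique header names. The code below does not give me what I want.
My Python code is: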
import csvkit
import sys
import os
from csvkit import convert

with open('/tmp/csvdata.csv', 'rb') as q:
    reader = csvkit.reader(q)
    with open('/tmp/csvdata2.csv', 'wb') as s:
        data = csvkit.writer(s)
        data.writerow(['Name', 'Userid', 'adidas', 'nike'])
        for row in reader:
            row_data = [row[0], row[1], row[2], '']
            data = csvkit.writer(s)
            data.writerow(row_data)
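Edit
I got a solution from @piRSquared. It is correct if each product has a unique set of records, but a user may have more than one status for the same product. In that case the solution raises ValueError: Index contains duplicate entries, cannot reshape
An example of input CSV data with multiple statuses that triggers this error: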
adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
h003455,shabree,Check
b212345,nurhanani,Check
b843432,nasirah,Call
b712345,ibrahim,Check
b712345,ibrahim,Process
nike,,
USER_ID,USER_NAME
b842134,khalee,Call
h123455,shabree,Process
b777345,ibrahim,Process
b012345,zaihan,Check
b843432,nasirah,Call
b312451,nurhanani,Process
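I would like to get a result like the one below, where a user can appear more than once under the same brand with the same id and name, once with Process and once with Check.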
USER_ID,USER_NAME,adidas,nike
b012345,zaihan,Process
h003455,shabree,Check,Process
b212345,nurhanani,Check,Process
b843432,nasirah,Call,Call
b712345,ibrahim,Check
b712345,ibrahim,Process
b777345,ibrahim,,Process
b842134,khalee,,Call
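For a user who has both "Check" and "Process" under the same brand, the final result should contain an additional row as shown above (in this case, user ibrahim, who has both statuses under adidas in the sample input).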
Best answer
Okay, this is complicated.
Solution
from io import StringIO  # Python 3; the original used the Python 2 StringIO module
import re
import pandas as pd

text = """adidas,,
USER_ID,USER_NAME
b012345,zaihan,Process
b212345,nurhanani,Check
b451234,nasirah,Call
c234567,ibrahim,Check
nike,,
USER_ID,USER_NAME
b842134,khalee,Call
h123455,shabree,Process
c234567,ibrahim,Process
c143322,zaihan,Check
b451234,nasirah,Call
"""

m = re.findall(r'(.*,,\n(.*([^,]|,[^,])\n)*)', text)
dfs = [None] * len(m)
keys = [None] * len(m)
for i, f in enumerate(m):
    lines = f[0].split('\n')
    lines[1] += ','
    keys[i] = lines[0].split(',')[0]
    dfs[i] = pd.read_csv(StringIO('\n'.join(lines[1:])))
df = pd.concat(dfs, keys=keys)
df = df.set_index(['USER_ID', 'USER_NAME'], append=True).unstack(0)
df.index = df.index.droplevel(0)
df.columns = df.columns.droplevel(0)
# dropna() is a no-op where stack() already drops missing values and
# keeps the reshape working on pandas versions where it does not.
df = df.stack().dropna().unstack()
Demo
print(df.to_csv())
USER_ID,USER_NAME,adidas,nike
b012345,zaihan,Process,
b212345,nurhanani,Check,
b451234,nasirah,Call,Call
b842134,khalee,,Call
c143322,zaihan,,Check
c234567,ibrahim,Check,Process
h123455,shabree,,Process
Explanation
# Regular expression to match a line with a single value, identified
# by having two commas at the end of the line.
# This grabs nike and adidas.
# It also grabs all lines after that until the next single-valued line.
m = re.findall(r'(.*,,\n(.*([^,]|,[^,])\n)*)', text)
# Placeholder for the list of sub-DataFrames
dfs = [None] * len(m)
# Placeholder for the list of keys. In this example these will be nike and adidas
keys = [None] * len(m)
# Loop through each regex match. This example will only have 2.
for i, f in enumerate(m):
    # Split on newline so I can grab and fix stuff
    lines = f[0].split('\n')
    # Fix that the header row only has 2 columns while the data has 3
    lines[1] += ','
    # Grab nike or adidas or other single value
    keys[i] = lines[0].split(',')[0]
    # Create a DataFrame by reading in the rest of the lines
    dfs[i] = pd.read_csv(StringIO('\n'.join(lines[1:])))
# Concat the DataFrames with the appropriate keys and pivot
df = pd.concat(dfs, keys=keys)
df = df.set_index(['USER_ID', 'USER_NAME'], append=True).unstack(0)
df.index = df.index.droplevel(0)
df.columns = df.columns.droplevel(0)
# dropna() is a no-op where stack() already drops missing values and
# keeps the reshape working on pandas versions where it does not.
df = df.stack().dropna().unstack()
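The edit to the question reports "Index contains duplicate entries, cannot reshape" when a user has more than one status under the same product. The following is one possible way around that, not the method used in the answer above: it reads the file line by line with the standard `csv` module instead of a regular expression, builds a long table, and numbers repeated (user, product) pairs with `groupby(...).cumcount()` so that the index is unique before reshaping. The shortened sample and the names `records`, `long_df`, `occurrence` and `wide` are assumptions made for illustration.

```python
import csv
from io import StringIO

import pandas as pd

# Shortened sample with a duplicate (ibrahim has two statuses under adidas).
text = """adidas,,
USER_ID,USER_NAME
b712345,ibrahim,Check
b712345,ibrahim,Process
b843432,nasirah,Call
nike,,
USER_ID,USER_NAME
b843432,nasirah,Call
b777345,ibrahim,Process
"""

records = []
product = None
for row in csv.reader(StringIO(text)):
    if not row:
        continue                      # ignore blank lines
    if len(row) == 3 and row[1] == '' and row[2] == '':
        product = row[0]              # a "name,," line starts a new product block
    elif row[0] == 'USER_ID':
        continue                      # per-block header line
    else:
        records.append((row[0], row[1], product, row[2]))

long_df = pd.DataFrame(records,
                       columns=['USER_ID', 'USER_NAME', 'product', 'status'])

# Number repeated (user, product) pairs 0, 1, 2, ... so that the index
# used for reshaping contains no duplicates.
long_df['occurrence'] = (
    long_df.groupby(['USER_ID', 'USER_NAME', 'product']).cumcount()
)

wide = (long_df
        .set_index(['USER_ID', 'USER_NAME', 'occurrence', 'product'])['status']
        .unstack('product')
        .reset_index()
        .drop(columns='occurrence'))
wide.columns.name = None

print(wide.to_csv(index=False))
```

With this layout a user with two statuses under one product produces two output rows, which is the shape the question's edit asks for. Pairing the n-th status under one product with the n-th status under another is a design choice of this sketch and may not suit every data set.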
Regarding python - handling data with row headers in a CSV file, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37372311/