I am using vertica_python to pull data from a database. The column I pull is a string in the following format:
[{"id":0,"prediction_type":"CONV_PROBABILITY","calibration_factor":0.906556,"intercept":-2.410414,"advMatchTypeId":-0.239877,"atsId":-0.135568,"deviceTypeId":0.439130,"dmaCode":-0.251728,"keywordId":0.442240}]
I then split and parse this string and load it into Excel in the following format, with each item in its own cell:
prediction_type CONV_PROBABILIT calibration_factor 0.90655 intercept -2.41041 advMatchTypeId -0.23987 atsId 1.44701 deviceTypeId 0.19701 dmaCode -0.69982 keywordId 0.44224
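Here is my problem. The string has no fixed format, which means that sometimes features are missing from it, and that throws off my column layout. Here is an example: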
intercept -2.41041 advMatchTypeId -0.23987 deviceTypeId 0.37839 dmaCode -0.53552 keywordId 0.44224
intercept -2.41041 advMatchTypeId -0.23987 atsId 0.80708 deviceTypeId -0.19573 dmaCode -0.69982 keywordId 0.44224
How can I keep the layout I want, so that the example above looks like this, with a blank where a feature is missing:
intercept -2.41041 advMatchTypeId -0.23987               deviceTypeId 0.37839  dmaCode -0.53552 keywordId 0.44224
intercept -2.41041 advMatchTypeId -0.23987 atsId 0.80708 deviceTypeId -0.19573 dmaCode -0.69982 keywordId 0.44224
Here is the code I am using:
data_all = cur.fetchall()
for i in range(len(data_all)):
    col = 0
    data_one = ''.join(data_all[i])
    raw_coef = data_one.split(',')[1:len(data_all)]
    for j in range(len(raw_coef)):
        raw = ''.join(raw_coef[j])
        raw = re.sub('"|}|{|[|]|', '', raw)[:-1]
        raw = raw.split(":")
        for k in range(len(raw)):
            worksheet.write(i, col, raw[k], align_left)
            feature.append(raw[0]) # for unique values
            col+=1
My query:
cur.execute(
"""
select MODEL_COEF
from
dcf_funnel.ADV_BIDDER_PRICING_LOG
where MODEL_ID = 8960
and DATE(AMP_QUERY_TIMESTAMP) = '11-02-2016'
"""
)
Best answer
You can skip all the parsing and use pandas:
import pandas
If the query result is already a list of dicts in Python, this reads it into a DataFrame:
data_all_list = [{"id":0,"prediction_type":"CONV_PROBABILITY","calibration_factor":0.906556,"intercept":-2.410414,"advMatchTypeId":-0.239877,"atsId":-0.135568,"deviceTypeId":0.439130,"dmaCode":-0.251728,"keywordId":0.442240}]
df = pandas.DataFrame(data_all_list)
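To tie this back to the question: when the dicts do not all carry the same keys, DataFrame uses the union of the keys as columns and fills the gaps with NaN, so each feature stays in its own column. A minimal sketch, using the two rows from the question's example (the names `rows` and `df_rows` are made up for illustration):

```python
import pandas

# Two records taken from the question's example; the first has no atsId.
rows = [
    {"intercept": -2.41041, "advMatchTypeId": -0.23987,
     "deviceTypeId": 0.37839, "dmaCode": -0.53552, "keywordId": 0.44224},
    {"intercept": -2.41041, "advMatchTypeId": -0.23987, "atsId": 0.80708,
     "deviceTypeId": -0.19573, "dmaCode": -0.69982, "keywordId": 0.44224},
]

df_rows = pandas.DataFrame(rows)

# Every feature gets its own column; the missing atsId shows up as NaN.
print(df_rows)
```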
If you really do have a string, you can use read_json:
data_all_str = """[{"id":0,"prediction_type":"CONV_PROBABILITY","calibration_factor":0.906556,"intercept":-2.410414,"advMatchTypeId":-0.239877,"atsId":-0.135568,"deviceTypeId":0.439130,"dmaCode":-0.251728,"keywordId":0.442240}]"""
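from io import StringIO
# Wrap the literal in StringIO: newer pandas versions deprecate passing a raw JSON string.
df = pandas.read_json(StringIO(data_all_str))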
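If you would rather parse the strings yourself, the standard json module can do the same job one row at a time. The sketch below assumes that each row returned by cur.fetchall() is a sequence whose first element is the JSON string. `fetched_rows` is a hard-coded, shortened stand-in for that result, not real query output.

```python
import json

# Stand-in for cur.fetchall(): one single-column row per record (an assumption).
fetched_rows = [
    ['[{"id":0,"intercept":-2.410414,"atsId":-0.135568,"keywordId":0.442240}]'],
    ['[{"id":0,"intercept":-2.410414,"keywordId":0.442240}]'],
]

records = []
for row in fetched_rows:
    # Each string holds a JSON array of objects; collect them in one flat list.
    records.extend(json.loads(row[0]))

print(records[1])
```

The resulting `records` list is a plain list of dicts, so it can be handed to pandas.DataFrame like data_all_list above.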
On further thought, I realize that your data_all is actually a list of lists of dicts, like this:
data_all_lol = [data_all_list, data_all_list]
In that case, you need to concatenate the inner lists before passing them to DataFrame:
df = pandas.DataFrame(sum(data_all_lol, []))
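sum(..., []) works, but it builds a new list at every step. As an alternative sketch, itertools.chain.from_iterable flattens the same kind of nested list in a single pass. The sample data and the names `nested` and `flat` are made up for illustration.

```python
from itertools import chain

# Hypothetical nested result: a list of lists of dicts.
nested = [
    [{"id": 0, "intercept": -2.410414}],
    [{"id": 0, "intercept": -2.410414, "atsId": -0.135568}],
]

# Flatten one level; the result can be passed to pandas.DataFrame as before.
flat = list(chain.from_iterable(nested))
print(flat)
```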
This writes the data out in the usual header + values format:
df.to_csv('filename.csv') # you can also use to_excel
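If you would rather not depend on pandas for the export, csv.DictWriter from the standard library gives the same fixed-column layout: list the columns once and let restval fill in any missing feature. This is a sketch with a column order chosen for illustration and in-memory output, not a drop-in replacement for the worksheet code in the question.

```python
import csv
import io

# Fixed column order (chosen here for illustration).
columns = ["intercept", "advMatchTypeId", "atsId",
           "deviceTypeId", "dmaCode", "keywordId"]

# The two rows from the question's example; the first has no atsId.
rows = [
    {"intercept": -2.41041, "advMatchTypeId": -0.23987,
     "deviceTypeId": 0.37839, "dmaCode": -0.53552, "keywordId": 0.44224},
    {"intercept": -2.41041, "advMatchTypeId": -0.23987, "atsId": 0.80708,
     "deviceTypeId": -0.19573, "dmaCode": -0.69982, "keywordId": 0.44224},
]

# Swap in open('filename.csv', 'w', newline='') to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The first data row comes out with an empty atsId cell, which is the blank the question asks for.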
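If your end goal is just to get the mean of every feature, pandas can do that directly, for any number of columns, and it handles missing values correctly: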
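df.mean(numeric_only=True)  # skips the string column prediction_type, which recent pandas versions no longer drop silently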
which gives:
advMatchTypeId -0.239877
atsId -0.135568
calibration_factor 0.906556
deviceTypeId 0.439130
dmaCode -0.251728
id 0.000000
intercept -2.410414
keywordId 0.442240
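To see what handling missing values means in practice, here is a small sketch using the two rows from the question's example, one of which has no atsId. By default mean skips NaN, so the atsId average is taken over the single row that has a value. The names `rows` and `means` are made up for illustration.

```python
import pandas

# The two rows from the question's example; the first has no atsId.
rows = [
    {"intercept": -2.41041, "advMatchTypeId": -0.23987,
     "deviceTypeId": 0.37839, "dmaCode": -0.53552, "keywordId": 0.44224},
    {"intercept": -2.41041, "advMatchTypeId": -0.23987, "atsId": 0.80708,
     "deviceTypeId": -0.19573, "dmaCode": -0.69982, "keywordId": 0.44224},
]

# NaN entries are ignored per column, so no row has to be dropped.
means = pandas.DataFrame(rows).mean(numeric_only=True)
print(means)
```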
A note on ambiguity
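In the original post it is hard to tell what type data_all is, because the snippet you show looks like a list of dicts in literal syntax, but you say "The column I pull is a string".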
Note the difference in how the two are represented in the following IPython session:
In [15]: data_all_str
Out[15]: '[{"id":0,"prediction_type":"CONV_PROBABILITY","calibration_factor":0.906556,"intercept":-2.410414,"advMatchTypeId":-0.239877,"atsId":-0.135568,"deviceTypeId":0.439130,"dmaCode":-0.251728,"keywordId":0.442240}]'
In [16]: data_all_list
Out[16]:
[{'advMatchTypeId': -0.239877,
'atsId': -0.135568,
'calibration_factor': 0.906556,
'deviceTypeId': 0.43913,
'dmaCode': -0.251728,
'id': 0,
'intercept': -2.410414,
'keywordId': 0.44224,
'prediction_type': 'CONV_PROBABILITY'}]
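If you are unsure which of the two you have, a type check settles it. The helper below, `to_records`, is a hypothetical name for a small sketch that accepts either form and always returns a list of dicts. The sample values are shortened from the example above.

```python
import json

def to_records(value):
    """Return a list of dicts from either a JSON string or an already-parsed list."""
    if isinstance(value, str):
        return json.loads(value)
    return list(value)

as_str = '[{"id":0,"prediction_type":"CONV_PROBABILITY","intercept":-2.410414}]'
as_list = [{"id": 0, "prediction_type": "CONV_PROBABILITY", "intercept": -2.410414}]

print(type(as_str).__name__, type(as_list).__name__)
print(to_records(as_str) == to_records(as_list))
```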
This question, "python - Organizing data into proper columns with Python", comes from Stack Overflow: https://stackoverflow.com/questions/40407866/