python - Converting json.dumps output to a Python dataframe

Tags: python json pyspark ibm-cloud

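I am using IBM Watson's Natural Language Understanding API. I used the following code from the API documentation to return sentiment analysis for some tweets about Nike that are stored in a dataframe: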

import json
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 \
  import Features, EntitiesOptions, KeywordsOptions

naturalLanguageUnderstanding = NaturalLanguageUnderstandingV1(
   version='2018-09-21',
   iam_apikey='[KEY HIDDEN]',
   url='https://gateway.watsonplatform.net/natural-language-understanding/api')

for tweet in nikedf["text"]:
    response = naturalLanguageUnderstanding.analyze(
      text=tweet,
      features=Features(
        entities=EntitiesOptions(
          emotion=False,
          sentiment=True,
          limit=2),
        keywords=KeywordsOptions(
          emotion=False,
          sentiment=True,
          limit=2))).get_result()
    print(json.dumps(response, indent=2))

This prints a series of JSON string dumps, one per tweet, like the following.

{
  "usage": {
    "text_units": 1,
    "text_characters": 140,
    "features": 2
  },
  "language": "en",
  "keywords": [
    {
      "text": "Kaepernick7 Kapernick",
      "sentiment": {
        "score": 0.951279,
        "label": "positive"
      },
      "relevance": 0.965894,
      "count": 1
    },
    {
      "text": "campaign",
      "sentiment": {
        "score": 0.951279,
        "label": "positive"
      },
      "relevance": 0.555759,
      "count": 1
    }
  ],
  "entities": [
    {
      "type": "Company",
      "text": "nike",
      "sentiment": {
        "score": 0.899838,
        "label": "positive"
      },
      "relevance": 0.92465,
      "disambiguation": {
        "subtype": [],
        "name": "Nike, Inc.",
        "dbpedia_resource": "http://dbpedia.org/resource/Nike,_Inc."
      },
      "count": 2
    },
    {
      "type": "Company",
      "text": "Kapernick",
      "sentiment": {
        "score": 0.899838,
        "label": "positive"
      },
      "relevance": 0.165888,
      "count": 1
    }
  ]
}
{
  "usage": {
    "text_units": 1,
    "text_characters": 140,
    "features": 2
  },
  "language": "en",
  "keywords": [
    {
      "text": "ORIGINS PAY",
      "sentiment": {
        "score": 0.436905,
        "label": "positive"
      },
      "relevance": 0.874857,
      "count": 1
    },
    {
      "text": "RT",
      "sentiment": {
        "score": 0.436905,
        "label": "positive"
      },
      "relevance": 0.644407,
      "count": 1
    }
  ],
  "entities": [
    {
      "type": "Company",
      "text": "Nike",
      "sentiment": {
        "score": 0.0,
        "label": "neutral"
      },
      "relevance": 0.922792,
      "disambiguation": {
        "subtype": [],
        "name": "Nike, Inc.",
        "dbpedia_resource": "http://dbpedia.org/resource/Nike,_Inc."
      },
      "count": 1
    },
    {
      "type": "TwitterHandle",
      "text": "@IcySoleOnline",
      "sentiment": {
        "score": 0.0,
        "label": "neutral"
      },
      "relevance": 0.922792,
      "count": 1
    }
  ]
}
{
  "usage": {
    "text_units": 1,
    "text_characters": 137,
    "features": 2
  },
  "language": "en",
  "keywords": [
    {
      "text": "RT",
      "sentiment": {
        "score": 0.946834,
        "label": "positive"
      },
      "relevance": 0.911909,
      "count": 2
    },
    {
      "text": "SPOTS",
      "sentiment": {
        "score": 0.946834,
        "label": "positive"
      },
      "relevance": 0.533273,
      "count": 1
    }
  ],
  "entities": [
    {
      "type": "TwitterHandle",
      "text": "@dropssupreme",
      "sentiment": {
        "score": 0.0,
        "label": "neutral"
      },
      "relevance": 0.01,
      "count": 1
    }
  ]
}
{
  "usage": {
    "text_units": 1,
    "text_characters": 140,
    "features": 2
  },
  "language": "en",
  "keywords": [
    {
      "text": "Golden Touch' boots",
      "sentiment": {
        "score": 0,
        "label": "neutral"
      },
      "relevance": 0.885418,
      "count": 1
    },
    {
      "text": "RT",
      "sentiment": {
        "score": 0,
        "label": "neutral"
      },
      "relevance": 0.765005,
      "count": 1
    }
  ],
  "entities": [
    {
      "type": "Company",
      "text": "Nike",
      "sentiment": {
        "score": 0.0,
        "label": "neutral"
      },
      "relevance": 0.33,
      "disambiguation": {
        "subtype": [],
        "name": "Nike, Inc.",
        "dbpedia_resource": "http://dbpedia.org/resource/Nike,_Inc."
      },
      "count": 1
    },
    {
      "type": "Person",
      "text": "Luka Modri\u0107",
      "sentiment": {
        "score": 0.0,
        "label": "neutral"
      },
      "relevance": 0.33,
      "disambiguation": {
        "subtype": [
          "Athlete",
          "FootballPlayer"
        ],
        "name": "Luka Modri\u0107",
        "dbpedia_resource": "http://dbpedia.org/resource/Luka_Modri\u0107"
      },
      "count": 1
    }
  ]
}

How can I convert this into a dataframe with the columns text, score, and label (taken from the JSON dump)?

Thanks in advance!!

Best answer

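The printed JSON text is not easy to parse, because it is several JSON documents printed back to back rather than a single one. One option is to collect the responses in a list instead of printing them, and to build the dataframe from that list.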

import json
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 \
import Features, EntitiesOptions, KeywordsOptions

naturalLanguageUnderstanding = NaturalLanguageUnderstandingV1(
   version='2018-09-21',
   iam_apikey='[KEY HIDDEN]',
   url='https://gateway.watsonplatform.net/natural-language-understanding/api')

responses = []
for tweet in nikedf["text"]:
    response = naturalLanguageUnderstanding.analyze(
      text=tweet,
      features=Features(
        entities=EntitiesOptions(
          emotion=False,
          sentiment=True,
          limit=2),
        keywords=KeywordsOptions(
          emotion=False,
          sentiment=True,
          limit=2))).get_result()
    responses.append(response)
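
If only the printed text is available (for example, copied from a notebook cell) and the API calls cannot be rerun, the concatenated dumps can still be split apart. The following is a minimal sketch using only the standard library; the helper name iter_json_objects and the two tiny sample documents are made up for illustration and are not part of the Watson SDK.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so skip it here
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # raw_decode returns the parsed value and the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Two pretty-printed dumps back to back, shaped like the printed output above
# (hypothetical, heavily trimmed sample data)
dumped = (
    json.dumps({"language": "en", "keywords": [{"text": "campaign"}]}, indent=2)
    + "\n"
    + json.dumps({"language": "en", "keywords": [{"text": "RT"}]}, indent=2)
)

# json.loads rejects the combined text because it expects a single document
try:
    json.loads(dumped)
    parsed_as_one = True
except json.JSONDecodeError:
    parsed_as_one = False

parsed = list(iter_json_objects(dumped))
```

Here parsed ends up as a list of dicts, the same shape as the responses list built in the loop above, so either can feed the conversion step.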

Create an RDD from the list of responses and parse each response to produce the required columns:

from pyspark.sql import Row

#Row: text, score, and label 
def convert_to_row(response):
    rows = []
    for keyword in response['keywords']:
        row_dict = {}
        row_dict['text'] = keyword['text']
        row_dict['score'] = keyword['sentiment']['score']
        row_dict['label'] = keyword['sentiment']['label']
        row = Row(**row_dict)
        rows.append(row)
    return rows

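# sc is assumed to be an existing SparkContext (as in the pyspark shell);
# toDF() also needs an active SparkSession
sc.parallelize(responses) \
    .flatMap(convert_to_row) \
    .toDF().show()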
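
The conversion above only reads the keywords section. If the entities should be included as well, or if a plain pandas dataframe is enough, the same flattening can be sketched without Spark. The names flatten_response and to_pandas and the extra source column are assumptions for illustration, not part of the original answer; the sample is a trimmed copy of the third response shown in the question.

```python
def flatten_response(response, sections=("keywords", "entities")):
    """Return one dict per keyword/entity with text, score, label and source."""
    rows = []
    for section in sections:
        # .get() tolerates responses that lack a section or a sentiment block
        for item in response.get(section, []):
            sentiment = item.get("sentiment", {})
            rows.append({
                "text": item.get("text"),
                "score": sentiment.get("score"),
                "label": sentiment.get("label"),
                "source": section,
            })
    return rows


def to_pandas(responses):
    """Build a pandas DataFrame from a list of response dicts."""
    # Imported here so the flattening step above runs without pandas installed
    import pandas as pd

    rows = [row for response in responses for row in flatten_response(response)]
    return pd.DataFrame(rows, columns=["text", "score", "label", "source"])


# Trimmed copy of the third response from the question
sample = {
    "language": "en",
    "keywords": [
        {"text": "RT", "sentiment": {"score": 0.946834, "label": "positive"}},
        {"text": "SPOTS", "sentiment": {"score": 0.946834, "label": "positive"}},
    ],
    "entities": [
        {"type": "TwitterHandle", "text": "@dropssupreme",
         "sentiment": {"score": 0.0, "label": "neutral"}},
    ],
}

rows = flatten_response(sample)
```

Calling to_pandas(responses) on the collected list should then give a frame with the text, score and label columns asked for, plus a source column that tells keywords and entities apart.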

This question, "python - Converting json.dumps output to a Python dataframe", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/53687099/
