python - Removing unnecessary details from the Twitter extended_tweet column in JSON/Python

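I used a Twitter scraper to download tweets about a recent sporting event. Unfortunately, given the nature of the research, I can't go back and change my scraper, because the event will not happen again. Each tweet is split into several fields, such as the timestamp, the creation date, and so on.

The tweets are stored in a JSON file, and I am currently exporting them to pandas.

What I care about are the text and extended_tweet fields in each tweet's details.

A while ago Twitter started allowing users to post longer tweets. When scraping Twitter data, if a tweet is under the original character limit (140, I believe), the whole tweet shows up in the text field without any problem, which is exactly what I need for my future research.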

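However, any tweet over the character limit shows up in the text field like this:

@thedamon @getify I worry adding new terms add complexity and may make it harder for people to learn JavaScript. A… <url> (Stack Overflow won't let me show the short URL that follows, but as I said, it is essentially the short Twitter URL for the full post.)

As you can see, the text is cut off with "…", followed by a link. To see the full text I need to look at the extended_tweet field, which presents the information like this: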

{'full_text': '@thedamon @getify I worry adding new terms add complexity and may make it harder for people to learn JavaScript. A sort function is a function you send to sort. Learning a new acronym to abstract that adds unnecessary complexity.', 'display_text_range': [18, 229], 'entities': {'hashtags': [], 'urls': [], 'user_mentions': [{'screen_name': 'thedamon', 'name': 'Damon Muma', 'id': 29938474, 'id_str': '29938474', 'indices': [0, 9]}, {'screen_name': 'getify', 'name': 'getify', 'id': 16686076, 'id_str': '16686076', 'indices': [10, 17]}], 'symbols': []}}

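As you can see, this is far more detailed than the text field.

I am currently using Python and trying to get my head around regular expressions. I can easily slice a string from index [i] to index [j], but because the tweets all differ in length, I need to make sure the slice covers exactly the tweet itself, that is, the part between 'full_text': and 'display_text_range'.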

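I'm not asking anyone to do my homework for me, but I have been stuck on this for a while, and what I first thought would be easy has turned out to be much harder than I expected.

Can anyone offer pointers or suggestions that would help me work this out myself?

Thanks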

Best answer

Why not parse the JSON and read the full_text attribute?

import json

data = '''
{"full_text": "@thedamon @getify I worry adding new terms add complexity and may make it harder for people to learn JavaScript. A sort function is a function you send to sort. Learning a new acronym to abstract that adds unnecessary complexity.", "display_text_range": [18, 229], "entities": {"hashtags": [], "urls": [], "user_mentions": [{"screen_name": "thedamon", "name": "Damon Muma", "id": 29938474, "id_str": "29938474", "indices": [0, 9]}, {"screen_name": "getify", "name": "getify", "id": 16686076, "id_str": "16686076", "indices": [10, 17]}], "symbols": []}}'''

parsed_data = json.loads(data)
print(parsed_data['full_text']) # prints full tweet '@thedamon @getify I worry .... unnecessary complexity.'
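The same idea can be applied to every tweet at once, falling back to the plain text field for short tweets that have no extended_tweet. The sketch below is only an illustration under some assumptions: the helper name tweet_text and the sample records are hypothetical, and it assumes the extended_tweet value may arrive as a dict, as a JSON string, as a Python dict repr with single quotes (the form printed in the question), or be missing altogether.

```python
import ast
import json


def tweet_text(text, extended_tweet=None):
    """Return the full tweet text, preferring extended_tweet['full_text']."""
    ext = extended_tweet
    if isinstance(ext, str):
        # The value may be JSON (double quotes) or a Python dict repr
        # (single quotes, as printed in the question); try both.
        try:
            ext = json.loads(ext)
        except ValueError:
            try:
                ext = ast.literal_eval(ext)
            except (ValueError, SyntaxError):
                ext = None
    if isinstance(ext, dict) and "full_text" in ext:
        return ext["full_text"]
    # Short tweets have no extended_tweet (None or NaN), so keep the plain text.
    return text


# Hypothetical records shaped like the scraped tweets in the question.
tweets = [
    {"text": "A short tweet.", "extended_tweet": None},
    {"text": "A long tweet that was cut... <url>",
     "extended_tweet": {"full_text": "A long tweet that was cut off in the text field."}},
    {"text": "Another cut... <url>",
     "extended_tweet": "{'full_text': 'Another cut tweet, stored as a string.'}"},
]

full_texts = [tweet_text(t["text"], t.get("extended_tweet")) for t in tweets]
for line in full_texts:
    print(line)
```

If the tweets are already in a pandas DataFrame df with text and extended_tweet columns (an assumption about how the JSON was loaded), something like df.apply(lambda row: tweet_text(row["text"], row["extended_tweet"]), axis=1) should give a single column of full texts, with no regular expressions or manual slicing needed.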

Source: this question, "python - Removing unnecessary details from the Twitter extended_tweet column in JSON/Python", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/60401164/
