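Hello, I am trying to read several files, build a DataFrame of the specific key information I need from each one, and then append each file's DataFrame to a main DataFrame called topics. I have tried the following code.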
import pandas as pd
import numpy as np
from lxml import etree
import os
topics = pd.DataFrame()
for filename in os.listdir('./topics'):
    if not filename.startswith('.'):
        #print(filename)
        tree = etree.parse('./topics/'+filename)
        root = tree.getroot()
        childA = []
        elementT = []
        ElementA = []
        for child in root:
            elementT.append(str(child.tag))
            ElementA.append(str(child.attrib))
            childA.append(str(child.attrib))
            for element in child:
                elementT.append(str(element.tag))
                #childA.append(child.attrib)
                ElementA.append(str(element.attrib))
                childA.append(str(child.attrib))
                for sub in element:
                    #print('***', child.attrib , ':' , element.tag, ':' , element.attrib, '***')
                    #childA.append(child.attrib)
                    elementT.append(str(sub.tag))
                    ElementA.append(str(sub.attrib))
                    childA.append(str(child.attrib))
        df = pd.DataFrame()
        df['c'] = np.array(childA)
        df['t'] = np.array(ElementA)
        df['a'] = np.array(elementT)
        file = df['t'].str.extract(r'([A-Z][A-Z].*[words.xml])#')
        start = df['t'].str.extract(r'words([0-9]+)')
        stop = df['t'].str.extract(r'.*words([0-9]+)')
        tags = df['a'].str.extract(r'.*([topic]|[pointer]|[child])')
        rootTopic = df['c'].str.extract(r'rdhillon.(\d+)')
        df['f'] = file
        df['start'] = start
        df['stop'] = stop
        df['tags'] = tags
        # c = topic
        # r = pointer
        # d = child
        df['topicID'] = rootTopic
        df = df.iloc[:,3:]
        topics.append(df)
However, when I call topics, I get the following output:
topics
Out[19]:_
Can someone tell me where I went wrong? Any suggestions for improving my confusing code would also be appreciated.
Best answer
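Unlike a list, a DataFrame is not modified in place when you append to it: the call returns a new object. So topics.append(df) returns an object that you never store anywhere, and topics remains the empty DataFrame you declared on line 6 (topics = pd.DataFrame()). You can fix this by assigning the result back: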
topics = topics.append(df)
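To make the difference between the two kinds of append concrete, here is a minimal sketch. It uses pd.concat rather than DataFrame.append, because DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; the point about discarding the returned object is the same. The column names and values are made up for illustration.

```python
import pandas as pd

# A list is modified in place by append.
items = []
items.append(1)

# A DataFrame is not: combining frames returns a new object.
topics = pd.DataFrame()
df = pd.DataFrame({"start": [1, 2], "stop": [3, 4]})  # made-up sample data

pd.concat([topics, df])            # result discarded, topics is still empty
rows_without_assignment = len(topics)

topics = pd.concat([topics, df])   # result stored back in topics
rows_with_assignment = len(topics)

print(rows_without_assignment, rows_with_assignment)
```

On pandas versions before 2.0, topics = topics.append(df) behaves the same way as the assigned pd.concat call here.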
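However, appending to a DataFrame inside a loop is a very expensive operation, because every call copies all the rows accumulated so far into a new object. Instead, append each DataFrame to a list inside the loop and call pd.concat() on that list of DataFrames once after the loop.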
import os
import pandas as pd

topics_list = []
for filename in os.listdir('./topics'):
    # All of your code that builds df for this file
    topics_list.append(df)  # Lists are modified in place by append

# After the loop, a single call to concat
topics = pd.concat(topics_list)
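For a version that can be run end to end, the sketch below applies the same collect-then-concat pattern to two small made-up XML files in a temporary folder. Several things here are assumptions for illustration and not part of the original question: the helper names frame_from_file and load_topics, the column names, the sample XML, and the use of the standard library's xml.etree.ElementTree in place of lxml so that the example has no extra dependency.

```python
import os
import tempfile
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

import pandas as pd


def frame_from_file(path):
    """Hypothetical helper: one row per element below the root of an XML file."""
    root = ET.parse(path).getroot()
    rows = [
        {"file": os.path.basename(path), "tag": elem.tag, "attrib": str(elem.attrib)}
        for elem in root.iter()
        if elem is not root
    ]
    return pd.DataFrame(rows)


def load_topics(folder):
    """Build one DataFrame per file, then concatenate once after the loop."""
    frames = []
    for filename in sorted(os.listdir(folder)):
        if filename.startswith("."):
            continue  # skip hidden files
        frames.append(frame_from_file(os.path.join(folder, filename)))
    # ignore_index=True renumbers the rows instead of repeating each file's index
    return pd.concat(frames, ignore_index=True)


# Demo on two made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.xml", "b.xml"):
        with open(os.path.join(folder, name), "w") as fh:
            fh.write('<root><topic id="1"><child href="x"/></topic></root>')
    topics = load_topics(folder)

print(topics)
```

Note that pd.concat raises a ValueError when given an empty list, so a folder with no usable files would need to be handled separately.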
Regarding python - appending multiple pandas DataFrames read from files, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50653097/