python - How to flatten/normalize a list of tables read with pandas read_html?

Tags: python python-3.x pandas

I am reading about the pandas read_html function because I am extracting some tables from the web. When I run this:

import pandas as pd
url_mcc = 'link.com.html'
dfs = pd.read_html(url_mcc)
dfs

I get the following list:

[                                        Presentation  \
 0  0.4 mg/mL, 1 mL single-dose vial, package of 2...   
 1  1 mg/mL, 1 mL single-dose vial, package of 25 ...   

   Availability and Estimated Shortage Duration  \
 0             Available for NDC 00517-0401-25.   
 1                                    Available   

                                  Related Information  \
 0  American Regent is currently releasing the 0.4...   
 1  American Regent is currently releasing the 1mg...   

    Shortage Reason (per FDASIA)  
 0  Demand increase for the drug  
 1                         Other  ,
                                         Presentation  \
 0  0.1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N...   

   Availability and Estimated Shortage Duration  Related Information  \
 0                            Product available                  NaN   

    Shortage Reason (per FDASIA)  
 0  Demand increase for the drug  ,
                                         Presentation  \
 0  0.1 mg/mL; 10 mL Ansyr syringe (NDC 0409-1630-10)   
 1  0.05 mg/mL; 5 mL Ansyr syringe (NDC 0409-9630-05)   
 2  0.1 mg/mL; 5 mL Lifeshield syringe (NDC 0409-4...   
 3  0.1 mg/mL; 10 mL Lifeshield syringe (NDC 0409-...   

         Availability and Estimated Shortage Duration  \
 0  Next delivery: Late October. Estimated recover...   
 1         Next delivery: TBD Estimated recovery: TBD   
 2                                          Available   
 3                                          Available   

                                  Related Information  \
 0  Please check with your wholesaler for availabl...   
 1  Please check with your wholesaler for availabl...   
 2               Shortage per Manufacturer: Available   
 3               Shortage per Manufacturer: Available   

   Shortage Reason (per FDASIA)  
 0                        Other  
 1                        Other  
 2                        Other  
 3                        Other  ,
                                Presentation  \
 0  0.4 mg/mL, 20 mL vial (NDC 0641-6006-10)   

   Availability and Estimated Shortage Duration  \
 0           West-Ward has available inventory.   

                                  Related Information  \
 0  Additional lots are scheduled to be manufactur...   

    Shortage Reason (per FDASIA)  
 0  Demand increase for the drug  ]

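As you can see, the list (or tables?) has repeated columns: Presentation, Availability and Estimated Shortage Duration, Related Information, and Shortage Reason (per FDASIA), because the site has 3 different tables with the same columns. So my question is: how can I flatten or normalize all of these tables or lists into a single table or list, more or less like this: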

[                                        Presentation  \
 0  0.4 mg/mL, 1 mL single-dose vial, package of 2...   
 1  1 mg/mL, 1 mL single-dose vial, package of 25 ...   
 2  1 mg/mL; 10 mL Luer-Jet Prefilled Syringe (N... 
 3  0.1 mg/mL; 10 mL Ansyr syringe (NDC 0409-1630-10)   
 4  0.05 mg/mL; 5 mL Ansyr syringe (NDC 0409-9630-05)   
 5  0.1 mg/mL; 5 mL Lifeshield syringe (NDC 0409-4...   
 6  0.1 mg/mL; 10 mL Lifeshield syringe (NDC 0409-...   



   Availability and Estimated Shortage Duration  \
 0             Available for NDC 00517-0401-25.   
 1                                    Available  
 2                            Product available                  NaN   
 0  Next delivery: Late October. Estimated recover...   
 1         Next delivery: TBD Estimated recovery: TBD   
 2                                          Available   
 3                                          Available  
 0  0.4 mg/mL, 20 mL vial (NDC 0641-6006-10)   

   Availability and Estimated Shortage Duration  \
 0           West-Ward has available inventory.   


    Shortage Reason (per FDASIA)  
 0  Demand increase for the drug  


                                  Related Information  \
 0  American Regent is currently releasing the 0.4...   
 1  American Regent is currently releasing the 1mg...   
 0  Please check with your wholesaler for availabl...   
 1  Please check with your wholesaler for availabl...   
 2               Shortage per Manufacturer: Available   
 3               Shortage per Manufacturer: Available   
 0  Additional lots are scheduled to be manufactur...   


    Shortage Reason (per FDASIA)  
 0  Demand increase for the drug  
 1                         Other  ,



    Shortage Reason (per FDASIA)  
 0  Demand increase for the drug  ,
 0                        Other  
 1                        Other  
 2                        Other  
 3                        Other  ,

Best answer

I think you need concat, if dfs is a list of DataFrames:

df = pd.concat(dfs)

You can also pass the parameter ignore_index=True to avoid duplicate values in the index:

df = pd.concat(dfs, ignore_index=True) 
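
If it matters which source table each row came from, the keys parameter of concat can label the pieces before the index is flattened. This is a minimal sketch, not part of the original answer: the two small frames are made-up stand-ins for the list returned by read_html, and the level names 'table' and 'row' are arbitrary choices.

```python
import pandas as pd

# Hypothetical stand-ins for the DataFrames returned by pd.read_html.
dfs = [
    pd.DataFrame({'A': [1, 2], 'B': [3, 4]}),
    pd.DataFrame({'A': [5], 'B': [6]}),
]

# Label each piece with its position in the list; this builds a
# two-level index (table, row) on the concatenated frame.
tagged = pd.concat(dfs, keys=list(range(len(dfs))), names=['table', 'row'])

# Move the 'table' level into an ordinary column, then renumber the rows.
flat = tagged.reset_index(level='table').reset_index(drop=True)
print(flat)
```

The resulting frame has a 'table' column next to the original columns, so rows can still be grouped or filtered by their table of origin.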

Example:

df1 = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

#print (df1)

df2 = pd.DataFrame({'A':[3,4,6],
                   'B':[2,3,4],
                   'C':[3,6,0]})

#print (df2)

df3 = pd.DataFrame({'A':[4,7,9],
                   'B':[3,4,5],
                   'C':[5,1,9]})

#print (df3)

dfs = [df1,df2,df3]
print (dfs)
[   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9,    A  B  C
0  3  2  3
1  4  3  6
2  6  4  0,    A  B  C
0  4  3  5
1  7  4  1
2  9  5  9]
df = pd.concat(dfs)
print (df)
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
0  3  2  3
1  4  3  6
2  6  4  0
0  4  3  5
1  7  4  1
2  9  5  9

df1 = pd.concat(dfs, ignore_index=True) 
print (df1)
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
3  3  2  3
4  4  3  6
5  6  4  0
6  4  3  5
7  7  4  1
8  9  5  9
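
read_html returns every table it finds on the page, so the list may also contain tables whose columns differ from the ones you want. Concatenating those would introduce extra columns filled with NaN. A minimal sketch of one way to guard against that, assuming the first table has the layout you care about; the data below is invented for illustration and only stands in for real read_html output.

```python
import pandas as pd

# Hypothetical stand-ins for the list returned by pd.read_html.
dfs = [
    pd.DataFrame({'Presentation': ['vial A', 'vial B'],
                  'Availability': ['Available', 'Available']}),
    pd.DataFrame({'Menu': ['Home', 'About']}),  # unrelated table on the page
    pd.DataFrame({'Presentation': ['syringe C'],
                  'Availability': ['Next delivery: TBD']}),
]

# Keep only the tables whose column names match the first table exactly.
expected = list(dfs[0].columns)
matching = [t for t in dfs if list(t.columns) == expected]

# Stack the matching tables and renumber the rows from 0.
combined = pd.concat(matching, ignore_index=True)
print(combined)
```

Comparing against an explicit list of column names instead of dfs[0] would work the same way if the first table on the page is not the one you want.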

Regarding "python - How to flatten/normalize a list of tables read with pandas read_html?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40366105/
