python - 从excel中提取多个表

标签 python arrays excel dataframe parsing

对于我的问题,我已经看到了许多相关的帖子,但找不到合适的解决方案。
read excel sheet containing multiple tables, tables that have headers with a non white background cell color
这是该excel数据的链接:https://docs.google.com/spreadsheets/d/1m4v_wbIJCGWBigJx53BRnBpMDHZ9CwKf/edit?usp=sharing&ouid=107579116880049687042&rtpof=true&sd=true
到目前为止,我已经尝试过:

import pandas as pd
df = pd.read_excel("dell.xlsx")
df =df.dropna()
上面的代码删除了想要的数据,因为它有 nan。
df.iloc[1,2:5]=['Description','Qty','Price']
    print(df)

nul_rows = list(df[df.isnull().all(axis=1)].index)

list_of_dataframes = []
for i in range(len(nul_rows) - 1):
    list_of_dataframes.append(df.iloc[nul_rows[i]+1:nul_rows[i+1],:])

cleaned_tables = []
for _df in list_of_dataframes:
    cleaned_tables.append(_df.dropna(axis=1, how='all'))

for p in cleaned_tables:
    print(p.dropna())
无法获得我想要的数据,因为它们不是标题格式,它是未命名的。
我想从链接中的excel中提取这些数据“Sku Description”“Qty”“Price”“Total”。
希望我能得到一些回应。
笔记!文件内容和格式总是不同的,因此基于一个文件的解决方案不能用于下一个文件,但标题的名称不会改变,例如数量、描述、总数。

最佳答案

这是从该 excel 文件创建数据框的一种方法。请参阅代码本身中的注释。
您可以取消注释某些 print() 以查看它是如何开发的。
代码

import pandas as pd
import numpy as np


def get_dataframes(fn):
    """
    Returns 3 dataframes from the 3 tables given inside excel file fn.
    """

    # (1) Read file
    df_start = pd.read_excel(fn)
    df = df_start.copy()  # preserve df_start, just in case we need it later
    # print(df.to_string())  # study layout

    # (2) Clean df
    df = pd.DataFrame(np.vstack([df.columns, df]))  # reset column index, so we can see 0, 1, 2 ...
    df = df.drop([0], axis=1)  # drop index column 0, it is not needed
    df = df.dropna(how='all')  # drop rows if all cells are empty
    df = df.reset_index(drop=True)  # reset row index so we can see the correct row no. 0, 1, ...
    df = pd.DataFrame(np.vstack([df.columns, df]))  # reset col index so we can see the correct col no. 0, 1, ...

    # (3) Tasks
    # We have 3 dataframes to create:
    # 1) df1 is for the first table the one with "Description".
    # 2) df2 is the one with the first "Sku Description:".
    # 3) df3 is the one with the second "Sku Description:".

    # (4) Get all the indexes (row and col) of "Description" and "Sku Description:" texts,
    # we will use it to extract data in order to create dataframes df1, df2 and df3.

    # Get the index of "Description" we will get row index and col index from it, by knowing
    # the row index, we will be able to extract at correct row index on this particular table.
    i_desc = df.where(df=='Description').dropna(how='all').dropna(how='all', axis=1)

    # Assume that there is only one "Description" and this is true according to the given sample excel files.
    i_desc_row = i_desc.index[0]
    # print(f'row index of "Description": {i_desc_row}')

    # Get the index of "Sku Description:".
    i_sku = df.where(df=='Sku Description:').dropna(how='all').dropna(how='all', axis=1)

    # There are 2 "Sku Description:", get the row indexes of each.
    i_sku_row = i_sku.index
    # print(f'row indexes of "Sku Description:": {i_sku_row}')

    i_sku_row_1 = i_sku_row[0]
    i_sku_row_2 = i_sku_row[1]
    # print(f'i_sku_row_1: {i_sku_row_1}, i_sku_row_2: {i_sku_row_2}')

    # There are 2 "Sku Description:", get the col indexes of each.
    i_sku_col = i_sku.columns
    # print(f'col indexes of "Sku Description:": {i_sku_col}')
    
    if len(i_sku_col) == 2:
        i_sku_col_1 = i_sku_col[0]
        i_sku_col_2 = i_sku_col[1]
    else:
        i_sku_col_1 = i_sku_col[0]
        i_sku_col_2 = i_sku_col_1
        
    # print(f'i_sku_col_1: {i_sku_col_1}, i_sku_col_2: {i_sku_col_2}')

    # (5) Create df1
    cols = ['Description', 'Qty', 'Price', 'Total']
    df1 = df.iloc[i_desc_row+1 : i_sku_row_1-2, 0:4]  # [start_row:end_row, start_col:end_col]
    df1.columns = cols
    # print(df1)
    df1 = df1.reset_index(drop=True)  # reset the row index so we can see 0, 1, ...
    # print(df1)

    # (6) Create df2
    cols = ['Sku Description:', 'Qty:']
    df2 = df.iloc[i_sku_row_1+1 : i_sku_row_2, i_sku_col_1:i_sku_col_1+2]
    df2.columns = cols
    df2 = df2.reset_index(drop=True)
    # print(df2)

    # (7) Create df3
    cols = ['Sku Description:', 'Qty:']
    df3 = df.iloc[i_sku_row_2+1:, i_sku_col_2:i_sku_col_2+2]
    df3.columns = cols
    df3 = df3.reset_index(drop=True)
    # print(df3)
    
    return df1, df2, df3


def process_file():
    # fn = 'F:\\Downloads\\dell.xlsx'
    fn = 'F:\\Downloads\\dell2.xlsx'
    desc_df, sku1_df, sku2_df = get_dataframes(fn)
    
    print(f'file: {fn}')    
    print(f'desc_df:\n{desc_df}\n')
    print(f'sku1_df:\n{sku1_df}\n')
    print(f'sku2_df:\n{sku2_df}\n')
    
    
# Start
process_file()
输出
file: F:\Downloads\dell2.xlsx
desc_df:
             Description  Qty   Price  Total
0  PowerEdge R640 Server  2.0  6390.0  12780
1  PowerEdge R640 Server  8.0  4360.0  34880

sku1_df:
                                     Sku Description: Qty:
0                               PowerEdge R640 Server  1.0
1                      PowerEdge R640 MLK Motherboard  1.0
2   Intel Xeon Silver 4216 2.1G, 16C/32T, 9.6GT/s,...  2.0
3   Intel Xeon Silver 4216 2.1G, 16C/32T, 9.6GT/s,...  2.0
4                        iDRAC Group Manager, Enabled  1.0
5                    iDRAC,Factory Generated Password  1.0
6                       Additional Processor Selected  1.0
7   2.5 Chassis with up to 10 Hard Drives and 3PCI...  1.0
8                                      Standard Bezel  1.0
9                             Riser Config 2, 3x16 LP  1.0
10  PowerEdge R640 Shipping Material for 4 and 10 ...  1.0
11  PowerEdge R640 Shipping(ICC), for 1300W below, V2  1.0
12                                      No Quick Sync  1.0
13                       Dell EMC Luggage Tag for x10  1.0
14                              Performance Optimized  1.0
15                                    3200MT/s RDIMMs  1.0
16           DIMM Blanks for System with 2 Processors  1.0
17       32GB RDIMM, 3200MT/s, Dual Rank 16Gb BASE x8  4.0
18                                  iDRAC9,Enterprise  1.0
19  2.4TB 10K RPM SAS 12Gbps 512e 2.5in Hot-plug H...  8.0
20  3.84TB SSD SATA Read Intensive 6Gbps 512 2.5in...  2.0
21  BOSS controller card + with 2 M.2 Sticks 480GB...  1.0
22                     PERC H750 Adapter, Low Profile  1.0
23                           8 Standard Fans for R640  1.0
24                          Performance BIOS Settings  1.0
25                               Standard 1U Heatsink  2.0
26                          No Internal Optical Drive  1.0
27  Dual, Hot-plug, Redundant Power Supply (1+1), ...  1.0
28               Jumper Cord - C13/C14, 2M, 250V, 10A  2.0
29                 Power Cord - C13, 1.8M, 250V, 10A   2.0
30                     Trusted Platform Module 2.0 V3  1.0
31  PowerEdge R640 No CE Marking, ICC, for 1300W P...  1.0
32          Broadcom 5720 Quad Port 1GbE BASE-T, rNDC  1.0
33                                No Operating System  1.0
34    No Systems Documentation, No OpenManage DVD Kit  1.0
35        Basic Deployment Dell Server R Series 1U/2U  1.0
36  PowerEdge-SE02 Handling n Insurance Charges(In...  1.0
37  ReadyRails Sliding Rails With Cable Management...  1.0
38                                  Unconfigured RAID  1.0
39             UEFI BIOS Boot Mode with GPT Partition  1.0
40                     OpenManage Enterprise Advanced  1.0
41                  Basic Next Business Day 36 Months  1.0
42  ProSupport and Next Business Day Onsite Servic...  1.0
43  ProSupport and Next Business Day Onsite Servic...  1.0
44                  INFO: Thank you for choosing Dell  1.0
45                                    Mod Specs Info   1.0

sku2_df:
                                     Sku Description: Qty:
0                               PowerEdge R640 Server  1.0
1                      PowerEdge R640 MLK Motherboard  1.0
2   Intel Xeon Silver 4216 2.1G, 16C/32T, 9.6GT/s,...  1.0
3                    iDRAC,Factory Generated Password  1.0
4                        iDRAC Group Manager, Enabled  1.0
5   2.5 Chassis with up to 10 Hard Drives and 3PCI...  1.0
6                                      Standard Bezel  1.0
7                             Riser Config 4, 2x16 LP  1.0
8   PowerEdge R640 Shipping(ICC), for 1300W below, V2  1.0
9   PowerEdge R640 Shipping Material for 4 and 10 ...  1.0
10                                      No Quick Sync  1.0
11                       Dell EMC Luggage Tag for x10  1.0
12                              Performance Optimized  1.0
13       32GB RDIMM, 3200MT/s, Dual Rank 16Gb BASE x8  2.0
14                       Blank for 1CPU Configuration  1.0
15                                    3200MT/s RDIMMs  1.0
16                       Blank for 1CPU Configuration  1.0
17                            No Additional Processor  1.0
18                                  iDRAC9,Enterprise  1.0
19  2.4TB 10K RPM SAS 12Gbps 512e 2.5in Hot-plug H...  3.0
20  BOSS controller card + with 2 M.2 Sticks 480GB...  1.0
21                     PERC H750 Adapter, Low Profile  1.0
22                          Performance BIOS Settings  1.0
23                           5 Standard Fans for R640  1.0
24                               Standard 1U Heatsink  1.0
25  1.92TB SSD vSAS Read Intensive 12Gbps 512e 2.5...  2.0
26                          No Internal Optical Drive  1.0
27  Dual, Hot-plug, Redundant Power Supply (1+1), ...  1.0
28                 Power Cord - C13, 1.8M, 250V, 10A   2.0
29              Jumper Cord - C13/C14, 2M, 250V, 10A   2.0
30                     Trusted Platform Module 2.0 V3  1.0
31  PowerEdge R640 No CE Marking, ICC, for 1300W P...  1.0
32          Broadcom 5720 Quad Port 1GbE BASE-T, rNDC  1.0
33                                No Operating System  1.0
34    No Systems Documentation, No OpenManage DVD Kit  1.0
35        Basic Deployment Dell Server R Series 1U/2U  1.0
36        PowerEdge-SE02 Handling n Insurance Charges  1.0
37  ReadyRails Sliding Rails With Cable Management...  1.0
38                                  Unconfigured RAID  1.0
39             UEFI BIOS Boot Mode with GPT Partition  1.0
40                     OpenManage Enterprise Advanced  1.0
41                  Basic Next Business Day 36 Months  1.0
42  ProSupport and Next Business Day Onsite Servic...  1.0
43  ProSupport and Next Business Day Onsite Servic...  1.0
44                  INFO: Thank you for choosing Dell  1.0
45                                    Mod Specs Info   1.0

关于python - 从excel中提取多个表,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/70286615/

相关文章:

python - 在 mongoalchemy 0.21 中指定连接字符串的位置

python按相同键值计数对字典列表进行排序

ios - 如何使用 swift 检查字符串值是否在数组中

excel - VBA:返回超过 2 个过滤条件

python - spdiags() 函数在 Python 中未按预期工作

python - 使用 Pillow/numpy 在任何分割带上执行 .dot 后,无法合并分割图像

php - 在 PHP 中分解字符串

c++ - 什么是数组到指针衰减?

excel - 使用记录集修改 Excel 数据透视表并刷新 : Exception

php - 如何使用 PHP 修改现有的 excel 文件?