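python - How to parse a single-column text file into a table using Python?

Tags: python web-scraping

I'm new to StackOverflow, but I have found many answers on this site. I'm also a programming beginner, so I figured I'd join and finally become part of this community, starting with a problem that has been bothering me for hours.

I logged into a website and scraped a large amount of text from inside b tags, in order to convert it into a proper table. The resulting Output.txt is laid out like this: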

BIN                   STATUS                                                   
8FHA9D8H 82HG9F     RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS          


INVENTORY CODE:   FPBC   *SOUP CANS LENTILS                                 

BIN                   STATUS                                                   
HA8DHW2H HD0138     RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS          
8SHDNADU 00A123     #2956- INVALID STOCK COUPON CODE (MISSING).          
93827548 096DBR     RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS          

There are a bunch of pages with exactly the same kind of block, but I need to combine them into an actual table that looks like this:

      BIN               INV CODE                          STATUS                                                   
HA8DHW2HHD0138     FPBC-*SOUP CANS LENTILS    RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS          
8SHDNADU00A123     FPBC-*SOUP CANS LENTILS    #2956- INVALID STOCK COUPON CODE (MISSING).          
93827548096DBR     FPBC-*SOUP CANS LENTILS    RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS          
8FHA9D8H82HG9F   SSXR-98-20LM NM CORN CREAM  RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS  

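Essentially, all the separate text blocks in this example would become part of this table, with the inventory code repeated alongside each of its BIN values. I would post my attempts at parsing this data (I have already tried Pandas/bs/openpyxl/csv writer), but I admit they are a bit embarrassing, since I couldn't find any information on this specific problem. Is there a kind soul who can help me? :)

(Also, I'm using Python 2.7)

Best answer

A simple custom parser like the one below should do the trick.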

from __future__ import print_function


def parse_body(s):
    """Print one row (bin id, inventory code, status) per bin line in s."""
    line_sep = '\n'
    getting_bins = False
    inv_code = ''
    for raw_line in s.split(line_sep):
        # the scraped lines are padded with trailing spaces, so strip them
        # before testing for headers and blank lines
        line = raw_line.strip()
        if line.startswith('INVENTORY CODE:') and not getting_bins:
            inv_data = line.split()
            # e.g. 'FPBC' + '-' + '*SOUP CANS LENTILS'
            inv_code = inv_data[2] + '-' + ' '.join(inv_data[3:])
        elif line.startswith('INVENTORY CODE:') and getting_bins:
            print("unexpected inventory code while reading bins:", line)
        elif line.startswith('BIN') and line.endswith('STATUS'):
            # header row: the non-blank lines that follow are bins
            getting_bins = True
        elif getting_bins and line:
            bin_data = line.split()
            # need to add exception handling here to make sure:
            # 1) we have an inv_code
            # 2) bin_data is at least 3 items big (assuming two for
            #    bin_id and at least one for the status)
            # 3) maybe some constraint checking to ensure that we have
            #    a valid instance of an inventory code and bin id
            bin_id = ''.join(bin_data[0:2])
            message = ' '.join(bin_data[2:])
            # we now have a bin, an inv_code, and a status to add to our table
            print(bin_id.ljust(20), inv_code.ljust(30), message, sep='\t')
        elif getting_bins and not line:
            # a blank line ends the bins for the current inventory code
            getting_bins = False
            inv_code = ''
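
If the rows need to go into a file or another tool rather than straight to the console, one possible variation is to yield them instead of printing them. The following is only a sketch under assumptions: the names `iter_rows` and `SAMPLE` are made up for illustration, the header line is assumed to start with `BIN` and end with `STATUS` as in the sample in the question, and an `INVENTORY CODE:` line is simply treated as the start of a new block.

```python
from __future__ import print_function


def iter_rows(text):
    """Yield (bin_id, inv_code, status) tuples from the scraped text."""
    inv_code = ''
    in_bins = False
    for raw_line in text.splitlines():
        line = raw_line.strip()  # drop the padding around each line
        if line.startswith('INVENTORY CODE:'):
            parts = line.split()
            inv_code = parts[2] + '-' + ' '.join(parts[3:])
            in_bins = False  # assumption: a new code always starts a new block
        elif line.startswith('BIN') and line.endswith('STATUS'):
            in_bins = True
        elif in_bins and line:
            parts = line.split()
            if len(parts) < 3:
                continue  # too short to be a bin row; skipped in this sketch
            # the first two tokens form the bin id, the rest is the status
            yield ''.join(parts[:2]), inv_code, ' '.join(parts[2:])
        elif in_bins:
            # a blank line ends the bins for the current inventory code
            in_bins = False
            inv_code = ''


# two bin rows copied from the sample in the question
SAMPLE = """\
INVENTORY CODE:   FPBC   *SOUP CANS LENTILS

BIN                   STATUS
HA8DHW2H HD0138     RECEIVED SUCCESSFULLY AWAITING STOCKING PROCESS
8SHDNADU 00A123     #2956- INVALID STOCK COUPON CODE (MISSING).
"""

rows = list(iter_rows(SAMPLE))
for row in rows:
    print('\t'.join(row))
```

Because `rows` is a plain list of tuples, it can be handed to `csv.writer(...).writerows(rows)` or to a DataFrame constructor without changing the parsing logic.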

Regarding "python - How to parse a single-column text file into a table using Python?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/38886546/
