python - Extracting table data from an unevenly spaced text file

Tags: python ubuntu awk sed text-processing

         CLASS RECORD OF THE STUDENT FROM THE PREVIOUS BATCH WHO TOPPED
Name (Roll no) #    Location   Section     Rank (MARKS)     Gender   
Anna (+)            USA        A1          First (100)      Female
(04)                California V
ADDITIONAL RECORDS OF THE STUDENTS FROM THE PREVIOUS BATCH NEXT IN LIST
Name (Roll no) #    Location   Section     Rank (MARKS)     Gender
Bob (-)             USA        A2          First (99)       Male
(07)                Florida    VI
Eva (+)             USA        A4          Second (96)      Female
(12)                Ohio       V           English (99)
                                           Maths(100)
Other records are not available currently.Some records may be present which can be given on request.

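The text file was obtained from a PDF using pdftotext. With the AWK command below I got the data shown above.
The table data is unevenly space-separated.
Lines that are entirely uppercase should be removed.
Every line that follows the table contents should be removed.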

pdftotext -layout INPUTFILE.pdf INPUTFILE.txt
awk '/RESULTS/{flag=1;next}/OTHER DATA/{flag=0}flag' INPUTFILE.txt | column -ts $'\t' -n

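How can I get the table data in a tab-separated format (the format below)?
The solution should be coded generically so that it also works for other kinds of tables.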

Name (Roll no) #    Location    Section     Rank (MARKS)    Gender  
Anna (+)            USA         A1          First (100)     Female
(04)                California  V
Bob (-)             USA         A2          First (99)      Male
(07)                Florida     VI
Eva (+)             USA         A4          Second (96)     Female
(12)                Ohio        V           English (99)
                                            Maths (100)

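Best answer

The method I present here is an awk approach. I make the following assumptions:

  • The header line Name (Roll no) ... Gender can appear multiple times.
  • The rows under a header line have fixed field widths, but the widths are not known in advance. I infer this from the line containing California, because only a single space follows that word.
  • The field widths may change after each header line.

In awk, fixed field widths can be set with the built-in variable FIELDWIDTHS: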

FIELDWIDTHS # A space-separated list of columns that tells gawk how to split input with fixed columnar boundaries. Starting in version 4.2, each field width may optionally be preceded by a colon-separated value specifying the number of characters to skip before the field starts. Assigning a value to FIELDWIDTHS overrides the use of FS and FPAT for field splitting. See Constant Size for more information.

note: this is a gawk extension
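
For readers who arrived via the python tag, the effect of FIELDWIDTHS can be approximated with plain string slicing. This is only a sketch of the idea, not gawk's implementation: the function name `split_fixed` is hypothetical, and the widths 20 11 12 17 6 are an assumption read off the sample table.

```python
def split_fixed(line, widths):
    """Split `line` into consecutive fields of the given character widths.

    Fields that would start beyond the end of the line are dropped, which
    roughly mirrors gawk reporting fewer fields for short records.
    """
    fields, pos = [], 0
    for width in widths:
        if pos >= len(line):
            break
        fields.append(line[pos:pos + width])
        pos += width
    return fields


# Build one sample row with known column widths (assumed values).
widths = [20, 11, 12, 17, 6]
cells = ["Anna (+)", "USA", "A1", "First (100)", "Female"]
row = "".join(cell.ljust(w) for cell, w in zip(cells, widths))

# Each field keeps its trailing padding, just like with FIELDWIDTHS.
print("|".join(split_fixed(row, widths)))
```

As with FIELDWIDTHS, the fields still carry their trailing spaces, so a later step has to strip them.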

To determine the value of FIELDWIDTHS, we use the match() function and RSTART:

RSTART The start index in characters of the substring that is matched by the match() function (see String Functions). RSTART is set by invoking the match() function. Its value is the position of the string where the matched substring starts, or zero if no match was found.
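
The same offset arithmetic can be sketched in Python, where `str.find` plays the role of match() and RSTART. Note that `str.find` is 0-based while RSTART is 1-based, so the `-1` correction disappears. The function name `widths_from_header` and the fixed last width of 6 are assumptions made for illustration.

```python
def widths_from_header(header, names, last_width):
    """Return the field widths implied by where each column name starts.

    `names` lists the column names after the first column; the first
    column is assumed to start at offset 0.
    """
    starts = [0]
    for name in names:
        pos = header.find(name)          # 0-based, unlike awk's RSTART
        if pos < 0:
            raise ValueError(f"column {name!r} not found in header")
        starts.append(pos)
    widths = [b - a for a, b in zip(starts, starts[1:])]
    widths.append(last_width)            # the last column has no right neighbour
    return widths


header = "Name (Roll no) #    Location   Section     Rank (MARKS)     Gender"
widths = widths_from_header(header, ["Location", "Section", "Rank", "Gender"], 6)
print(" ".join(str(w) for w in widths))  # same shape as a FIELDWIDTHS string
```

The printed string has the same form as the value assigned to FIELDWIDTHS in the awk code.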

Combining the two in awk already gives us the following (note that OFS is set to | to demonstrate that the splitting works correctly):

awk 'BEGIN{OFS="|"}
     /^[- A-Z]*$/{next}          # skips all-caps (and empty) lines
     /^Other records/{next}      # skips the note after the table
     /^Name.*$/{                 # find header line
       match($0,"Location");i2=RSTART;
       match($0,"Section"); i3=RSTART;
       match($0,"Rank");    i4=RSTART;
       match($0,"Gender");  i5=RSTART;
       FIELDWIDTHS= i2-1" "i3-i2" "i4-i3" "i5-i4" 6"
       $0=$0                     # reprocess header line
       # print header line only the first time
       if (v==0) {print $1,$2,$3,$4,$5}
       v++; next      
     }
     {print $1,$2,$3,$4,$5}' <file>

This already outputs:

Name (Roll no) #    |Location   |Section     |Rank (MARKS)     |Gender
Anna (+)            |USA        |A1          |First (100)      |Female
(04)                |California |V||
Bob (-)             |USA        |A2          |First (99)       |Male
(07)                |Florida    |VI||
Eva (+)             |USA        |A4          |Second (96)      |Female
(12)                |Ohio       |V           |English (99)|
                    |           |            |Maths(100)|

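Remark: at this point the result already looks "fine", but keep in mind that the columns after each header line need not have the same width (assumption 3).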
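
To illustrate assumption 3, here is a rough Python sketch of the whole pass that recomputes the column offsets at every header line and emits tab-separated fields, which is the format the question asked for. It is an approximation of the awk logic rather than a translation. The names `extract_rows`, `make_line` and `COLUMNS` are hypothetical, and the sample input is shortened and synthetic.

```python
import re

COLUMNS = ["Location", "Section", "Rank", "Gender"]  # assumed names of columns 2..5


def extract_rows(lines, columns=COLUMNS):
    """Yield one list of stripped fields per table line.

    Column start offsets are recomputed at every header line, so each
    block may use different widths. Only the first header is emitted.
    Assumes every header contains all names in `columns`.
    """
    starts = None
    header_emitted = False
    for raw in lines:
        line = raw.rstrip("\n")
        if re.fullmatch(r"[- A-Z]*", line):    # empty or all-caps title line
            continue
        if line.startswith("Other records"):   # note that follows the table
            continue
        if line.startswith("Name"):            # header: learn the offsets
            starts = [0] + [line.find(name) for name in columns]
            if header_emitted:
                continue
            header_emitted = True
        if starts is None:                     # data before any header
            continue
        ends = starts[1:] + [len(line)]
        yield [line[a:b].strip() for a, b in zip(starts, ends)]


def make_line(cells, widths):
    """Hypothetical helper that builds fixed-width sample input."""
    return "".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip()


head = ["Name (Roll no) #", "Location", "Section", "Rank (MARKS)", "Gender"]
wide, narrow = (20, 11, 12, 17, 6), (18, 12, 9, 14, 6)
sample = [
    "CLASS RECORD OF THE STUDENT",
    make_line(head, wide),
    make_line(["Anna (+)", "USA", "A1", "First (100)", "Female"], wide),
    make_line(["(04)", "California", "V"], wide),
    "ADDITIONAL RECORDS",
    make_line(head, narrow),                   # second block uses other widths
    make_line(["Bob (-)", "USA", "A2", "First (99)", "Male"], narrow),
    "Other records are not available currently.",
]
for fields in extract_rows(sample):
    print("\t".join(fields))
```

Because the offsets are taken from whichever header was seen last, the second block is split correctly even though its columns are narrower than those of the first.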

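You asked for a tab-separated column layout, but tabs are evil: everything depends on how your system interprets the width of a tab. Is it 4, 8, or 17? I propose a space-separated layout instead. The best approach is to strip the trailing spaces from every field and then use the column command. This leads to: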

awk 'BEGIN{OFS="|"}
     /^[- A-Z]*$/{next}          # skips all-caps (and empty) lines
     /^Other records/{next}      # skips the note after the table
     /^Name.*$/{                 # find header line
       match($0,"Location");i2=RSTART;
       match($0,"Section"); i3=RSTART;
       match($0,"Rank");    i4=RSTART;
       match($0,"Gender");  i5=RSTART;
       FIELDWIDTHS= i2-1" "i3-i2" "i4-i3" "i5-i4" 6"
       $0=$0                     # reprocess header line
       for(i=1;i<=NF;i++) sub(/ *$/,"",$i);   # strip trailing spaces
       # print header line only the first time
       if (v==0) {print $1,$2,$3,$4,$5}
       v++; next      
     }
     {
       for(i=1;i<=NF;i++) sub(/ *$/,"",$i);
       print $1,$2,$3,$4,$5
     }' <file> | column -t -s '|'

Output:

Name (Roll no) #  Location    Section  Rank (MARKS)  Gender  
Anna (+)          USA         A1       First (100)   Female  
(04)              California  V                              
Bob (-)           USA         A2       First (99)    Male    
(07)              Florida     VI                             
Eva (+)           USA         A4       Second (96)   Female  
(12)              Ohio        V        English (99)          
                                       Maths(100)          
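
What `column -t -s '|'` does to the |-separated rows can be approximated in a few lines of Python. This is a simplified sketch, not the real utility: the name `align_columns` and the two-space gap are assumptions, and trailing padding is stripped here.

```python
def align_columns(rows, gap=2):
    """Pad every column to its widest cell plus `gap` spaces."""
    ncols = max(len(row) for row in rows)
    padded = [list(row) + [""] * (ncols - len(row)) for row in rows]
    widths = [max(len(row[i]) for row in padded) for i in range(ncols)]
    return [
        "".join(cell.ljust(w + gap) for cell, w in zip(row, widths)).rstrip()
        for row in padded
    ]


piped = [
    "Name (Roll no) #|Location|Section|Rank (MARKS)|Gender",
    "Anna (+)|USA|A1|First (100)|Female",
    "(04)|California|V||",
]
for line in align_columns([p.split("|") for p in piped]):
    print(line)
```

The width of each column is derived from the data itself, so nothing has to be known about the table beforehand.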

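Note that column adjusts the columns as needed, so they do not have to be the same width on every run. If you know the column widths in advance, I suggest using a printf statement in awk instead, as follows: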

awk 'BEGIN{format="%-18s%-12s%-9s%-14s%-6s\n"}
     /^[- A-Z]*$/{next}          # skips all-caps (and empty) lines
     /^Other records/{next}      # skips the note after the table
     /^Name.*$/{                 # find header line
       match($0,"Location");i2=RSTART;
       match($0,"Section"); i3=RSTART;
       match($0,"Rank");    i4=RSTART;
       match($0,"Gender");  i5=RSTART;
       FIELDWIDTHS= i2-1" "i3-i2" "i4-i3" "i5-i4" 6"
       $0=$0                     # reprocess header line
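       for(i=1;i<=NF;i++) sub(/ *$/,"",$i);   # strip padding so the format widths hold
       # print header line only the first time
       if (v==0) {printf format,$1,$2,$3,$4,$5}
       v++; next      
     }
     {
       for(i=1;i<=NF;i++) sub(/ *$/,"",$i);
       printf format,$1,$2,$3,$4,$5
     }' <file>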

Output:

Name (Roll no) #  Location    Section  Rank (MARKS)  Gender
Anna (+)          USA         A1       First (100)   Female
(04)              California  V                            
Bob (-)           USA         A2       First (99)    Male  
(07)              Florida     VI                           
Eva (+)           USA         A4       Second (96)   Female
(12)              Ohio        V        English (99)        
                                       Maths(100)          
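
For comparison, the fixed-width printf formatting can be sketched in Python with format specifications. The widths are copied from the awk format string; the name `format_row` is hypothetical. Like `%-18s`, the `<` alignment pads short values but does not truncate long ones.

```python
WIDTHS = (18, 12, 9, 14, 6)  # from the awk format string "%-18s%-12s%-9s%-14s%-6s"


def format_row(fields, widths=WIDTHS):
    """Left-align each stripped field in its fixed-width column."""
    cells = list(fields) + [""] * (len(widths) - len(fields))
    return "".join(f"{cell.strip():<{w}}" for cell, w in zip(cells, widths))


print(format_row(["Name (Roll no) #", "Location", "Section", "Rank (MARKS)", "Gender"]))
print(format_row(["(04)    ", "California ", "V"]))   # short rows are padded out
```

Stripping each field before formatting matters: a field that still carries its original padding would overflow its column and shift everything after it.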

Regarding "python - Extracting table data from an unevenly spaced text file", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49271464/
