CLASS RECORD OF THE STUDENT FROM THE PREVIOUS BATCH WHO TOPPED
Name (Roll no) #  Location   Section  Rank (MARKS)  Gender
Anna (+)          USA        A1       First (100)   Female
(04)              California V
ADDITIONAL RECORDS OF THE STUDENTS FROM THE PREVIOUS BATCH NEXT IN LIST
Name (Roll no) #  Location   Section  Rank (MARKS)  Gender
Bob (-)           USA        A2       First (99)    Male
(07)              Florida    VI
Eva (+)           USA        A4       Second (96)   Female
(12)              Ohio       V        English (99)
                                      Maths(100)
Other records are not available currently.Some records may be present which can be given on request.
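The text file was obtained from a PDF using pdftotext. I got the data above with the AWK command below.
The table data is separated by an uneven number of spaces. I want to:
- Remove the lines that are entirely in uppercase.
- Remove everything after the table content (the last line).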
pdftotext -layout INPUTFILE.pdf INPUTFILE.txt
awk '/RESULTS/{flag=1;next}/OTHER DATA/{flag=0}flag' INPUTFILE.txt | column -ts $'\t' -n
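How can I get the table data in tab-separated format (the format below)?
Please code it in a generic way, so that it also works for other kinds of tables.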
Name (Roll no) #  Location    Section  Rank (MARKS)  Gender
Anna (+)          USA         A1       First (100)   Female
(04)              California  V
Bob (-)           USA         A2       First (99)    Male
(07)              Florida     VI
Eva (+)           USA         A4       Second (96)   Female
(12)              Ohio        V        English (99)
                                       Maths (100)
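Best answer
The approach I present here is an awk approach, in which I make the following assumptions:
- The header line Name (Roll no) ... Gender can occur multiple times.
- The listing under a header line has fixed field widths, but the widths are unknown. I infer this from the line containing California, because that word is followed by only a single space.
- The field widths can change after each header line.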
In awk, we can set fixed field widths with the built-in variable FIELDWIDTHS:
FIELDWIDTHS #
A space-separated list of columns that tells gawk how to split input with fixed columnar boundaries. Starting in version 4.2, each field width may optionally be preceded by a colon-separated value specifying the number of characters to skip before the field starts. Assigning a value to FIELDWIDTHS overrides the use of FS and FPAT for field splitting. See Constant Size for more information.
Note: this is a gawk extension.
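For readers who arrived via the python tag, the effect of FIELDWIDTHS can be imitated with plain string slicing. This is only a minimal sketch of the idea, not gawk's implementation; the function name `split_fixed` and the widths 18, 11, 9, 14, 6 are assumptions chosen for illustration.

```python
def split_fixed(line, widths):
    """Cut `line` into consecutive fields of the given character widths.

    Fields that lie beyond the end of a short line come back as empty
    strings, much like referencing a missing field in awk.
    """
    fields = []
    pos = 0
    for width in widths:
        fields.append(line[pos:pos + width])
        pos += width
    return fields


# Hypothetical widths and a sample row padded to those widths.
widths = [18, 11, 9, 14, 6]
row = ("Anna (+)".ljust(18) + "USA".ljust(11) + "A1".ljust(9)
       + "First (100)".ljust(14) + "Female")

# Trailing padding is still part of each field, so strip it for display.
print([field.rstrip() for field in split_fixed(row, widths)])
```

Note that the padding stays inside each field, exactly as it does with FIELDWIDTHS, so any trimming has to be done separately.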
To determine the value of FIELDWIDTHS, we will use match and RSTART:
RSTART
The start index in characters of the substring that is matched by the match() function (see String Functions). RSTART is set by invoking the match() function. Its value is the position of the string where the matched substring starts, or zero if no match was found.
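As a rough illustration of how start positions turn into field widths, here is a Python sketch in which `str.index` plays the role of match() and RSTART (0-based instead of 1-based). The helper name `widths_from_header` and the spacing of the sample header are assumptions. In this sketch the last width simply runs to the end of the header line.

```python
def widths_from_header(header, names):
    """Derive field widths from where each column name starts in `header`.

    Assumes every name occurs in the header and that its first occurrence
    is the column start; str.index raises ValueError if a name is missing.
    """
    starts = [header.index(name) for name in names]
    # Each width is the distance to the start of the next column.
    widths = [nxt - cur for cur, nxt in zip(starts, starts[1:])]
    # The last column runs to the end of the header line.
    widths.append(len(header) - starts[-1])
    return widths


# Hypothetical header, padded to illustrative widths.
header = ("Name (Roll no) #".ljust(18) + "Location".ljust(11)
          + "Section".ljust(9) + "Rank (MARKS)".ljust(14) + "Gender")
names = ["Name", "Location", "Section", "Rank", "Gender"]
print(widths_from_header(header, names))
```

Searching for the column names rather than for runs of spaces is what lets headers such as "Name (Roll no) #" and "Rank (MARKS)", which contain spaces themselves, stay in one field.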
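Combining FIELDWIDTHS with match in awk already gives us the following (note that OFS is set to | to demonstrate that the splitting works correctly):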
awk 'BEGIN{OFS="|"}
  /^[- A-Z]*$/{next}                    # skip blank and all-caps lines
  /^Other records/{next}                # skip the note after the table
  /^Name.*$/{                           # header line: derive the field widths
    match($0,"Location"); i2=RSTART
    match($0,"Section");  i3=RSTART
    match($0,"Rank");     i4=RSTART
    match($0,"Gender");   i5=RSTART
    FIELDWIDTHS=(i2-1) " " (i3-i2) " " (i4-i3) " " (i5-i4) " 6"
    $0=$0                               # re-split the header with the new widths
    if (v==0) {print $1,$2,$3,$4,$5}    # print the header only the first time
    v++; next
  }
  {print $1,$2,$3,$4,$5}' INPUTFILE.txt
This already outputs:
Name (Roll no) #  |Location   |Section  |Rank (MARKS)  |Gender
Anna (+)          |USA        |A1       |First (100)   |Female
(04)              |California |V||
Bob (-)           |USA        |A2       |First (99)    |Male
(07)              |Florida    |VI||
Eva (+)           |USA        |A4       |Second (96)   |Female
(12)              |Ohio       |V        |English (99)|
                  |           |         |Maths(100)|
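Comment: at this point the result already looks "okay", but keep in mind that the columns do not need to have the same width after each header line (assumption 3).
You want a tab-separated column layout, but tabs are evil. Everything depends on how your system interprets the width of a tab: is it 4, 8 or 17? I propose a space-separated layout here instead. The best way to get it is to remove all trailing spaces from each field and then use the column command. This leads to: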
awk 'BEGIN{OFS="|"}
  /^[- A-Z]*$/{next}                    # skip blank and all-caps lines
  /^Other records/{next}                # skip the note after the table
  /^Name.*$/{                           # header line: derive the field widths
    match($0,"Location"); i2=RSTART
    match($0,"Section");  i3=RSTART
    match($0,"Rank");     i4=RSTART
    match($0,"Gender");   i5=RSTART
    FIELDWIDTHS=(i2-1) " " (i3-i2) " " (i4-i3) " " (i5-i4) " 6"
    $0=$0                               # re-split the header with the new widths
    for(i=1;i<=NF;i++) sub(/ *$/,"",$i) # strip trailing spaces from each field
    if (v==0) {print $1,$2,$3,$4,$5}    # print the header only the first time
    v++; next
  }
  {
    for(i=1;i<=NF;i++) sub(/ *$/,"",$i) # strip trailing spaces from each field
    print $1,$2,$3,$4,$5
  }' INPUTFILE.txt | column -t -s '|'
Output:
Name (Roll no) #  Location    Section  Rank (MARKS)  Gender
Anna (+)          USA         A1       First (100)   Female
(04)              California  V
Bob (-)           USA         A2       First (99)    Male
(07)              Florida     VI
Eva (+)           USA         A4       Second (96)   Female
(12)              Ohio        V        English (99)
                                       Maths(100)
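Note that column adjusts the columns as needed, so they do not have to have the same width every time. If you already know the column widths you want, I suggest using a printf statement in awk instead, as follows: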
awk 'BEGIN{format="%-18s%-12s%-9s%-14s%-6s\n"}
  /^[- A-Z]*$/{next}                    # skip blank and all-caps lines
  /^Other records/{next}                # skip the note after the table
  /^Name.*$/{                           # header line: derive the field widths
    match($0,"Location"); i2=RSTART
    match($0,"Section");  i3=RSTART
    match($0,"Rank");     i4=RSTART
    match($0,"Gender");   i5=RSTART
    FIELDWIDTHS=(i2-1) " " (i3-i2) " " (i4-i3) " " (i5-i4) " 6"
    $0=$0                               # re-split the header with the new widths
    if (v==0) {printf format,$1,$2,$3,$4,$5}   # print the header only the first time
    v++; next
  }
  { printf format,$1,$2,$3,$4,$5 }' INPUTFILE.txt
which produces the output:
Name (Roll no) #  Location    Section  Rank (MARKS)  Gender
Anna (+)          USA         A1       First (100)   Female
(04)              California  V
Bob (-)           USA         A2       First (99)    Male
(07)              Florida     VI
Eva (+)           USA         A4       Second (96)   Female
(12)              Ohio        V        English (99)
                                       Maths(100)
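If a genuinely tab-separated file is still wanted, as the question asks, the same steps (skip the all-caps lines, re-measure the columns at every header line, trim each field) can also be sketched in Python. This is an illustrative sketch under the assumptions listed at the start of this answer, not a drop-in solution: the function name `table_to_tsv` and the `COLUMNS` list are hypothetical and would have to be adapted for other tables.

```python
import re

# Column names to look for in each header line (taken from the sample table).
COLUMNS = ["Name", "Location", "Section", "Rank", "Gender"]


def table_to_tsv(lines, columns=COLUMNS):
    """Yield tab-separated rows from fixed-width table text.

    Column boundaries are re-derived at every header line, and the header
    itself is emitted only the first time it is seen.
    """
    starts = None
    header_seen = False
    for line in lines:
        line = line.rstrip("\n")
        if re.fullmatch(r"[- A-Z]*", line):       # blank or all-caps title line
            continue
        if line.startswith("Other records"):      # note that follows the table
            continue
        is_header = (line.startswith(columns[0])
                     and all(name in line for name in columns))
        if is_header:
            starts = [line.index(name) for name in columns]
            if header_seen:
                continue                          # repeated header: only re-measure
            header_seen = True
        if starts is None:
            continue                              # data before any header: ignore
        # Each field runs from its start to the next start; the last one
        # runs to the end of the line.
        bounds = starts[1:] + [None]
        fields = [line[a:b].strip() for a, b in zip(starts, bounds)]
        yield "\t".join(fields)


# Possible usage, assuming the pdftotext output is in INPUTFILE.txt:
#   with open("INPUTFILE.txt") as handle:
#       for row in table_to_tsv(handle):
#           print(row)
```

Every output row has the same number of tab characters, so empty cells in continuation lines such as the Maths(100) row stay in their own column.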
Regarding "python - Extracting table data from an unevenly spaced text file", the original question can be found on Stack Overflow: https://stackoverflow.com/questions/49271464/