I have to parse a large number of log files, which have the following format:
SOME SQL STATEMENT/QUERY
DB20000I The SQL command completed successfully.
SOME OTHER SQL STATEMENT/QUERY
DB21034E The command was processed as an SQL statement because it was not a
valid Command Line Processor command.
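Edit 1: The first 3 lines (including the blank line) indicate that an SQL statement executed successfully, while the next three lines show a statement and the exception it caused. darioo's reply below suggests using grep instead of Java, and it works very well for single-line SQL statements.
Edit 2: However, the SQL statement/query is not necessarily a single line. Sometimes it is a large CREATE PROCEDURE...END PROCEDURE block. Can this also be solved using only Unix commands?
Now I need to parse the whole log file, select every (SQL statement + error) pair, and write them to a separate file.
Please tell me how to do this!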
Best answer
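My answer will not be Java-based, because this is a classic example of a problem that can be solved in a very, very simple way.
All you need is the tool grep. If you are on Windows, you can find it here.
Assuming your log is in the file log.txt, the solution to your problem is: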
grep -hE --before-context 1 "^DB2[0-9]+E" log.txt > filtered.txt
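Explanation:
-h
- do not print file names
-E
- use extended regular expressions
--before-context 1
- print one line before each matching error line (this works if every SQL query fits on one line)
^DB2[0-9]+E
- match lines that begin with "DB2", followed by one or more digits and then "E"
The expression above will write every line you need to a new file named filtered.txt.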
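To see what this command selects, here is a small sketch run against a made-up log. The sample contents and the variable names demo_log and demo_out are assumptions for illustration only, and -B 1 is used as the short spelling of --before-context 1.

```shell
# Build a tiny sample log (contents are invented for this demo).
demo_log=$(mktemp)
printf '%s\n' \
    'SELECT 1 FROM SYSIBM.SYSDUMMY1' \
    'DB20000I  The SQL command completed successfully.' \
    '' \
    'FOO BAR' \
    'DB21034E  The command was processed as an SQL statement because it was not a' \
    'valid Command Line Processor command.' > "$demo_log"

# Same search as above: each error line plus the one line before it.
demo_out=$(grep -hE -B 1 '^DB2[0-9]+E' "$demo_log")
printf '%s\n' "$demo_out"

rm -f "$demo_log"
```

For this sample, the output is the line FOO BAR followed by the first line of the DB21034E message. The wrapped second line of the message is not matched, and a statement spanning several lines would lose everything except its last line.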
Update: After some tinkering, I managed to get what you need using only standard *nix utilities. Be warned, it is not pretty. The final expression:
grep -nE "^DB2[0-9]+" log.txt | cut -f 1 -d " " | gawk "/E$/{y=$0;print x, y};{x=$0}" | sed -e "s/:DB2[[:digit:]]\+[IE]//g" | gawk "{print \"sed -n \\\"\" $1+1 \",\" $2 \"p\\\" log.txt \"}" | sed -e "s/$/ >> filtered.txt/g" > run.bat
Explanation:
grep -nE "^DB2[0-9]+" log.txt
- prints the lines that start with DB2..., each prefixed with its line number. Example:
6:DB20000I The SQL command completed successfully.
12:DB21034E The command was processed as an SQL statement because it was not a valid Command Line Processor command.
19:DB21034E The command was processed as an SQL statement because it was not a valid Command Line Processor command.
26:DB21034E The command was processed as an SQL statement because it was not a valid Command Line Processor command.
34:DB20000I The SQL command completed successfully.
41:DB20000I The SQL command completed successfully.
47:DB21034E The command was processed as an SQL statement because it was not a valid Command Line Processor command.
54:DB20000I The SQL command completed successfully.
cut -f 1 -d " "
- prints only the "first column", that is, removes everything after the message code. Example:
6:DB20000I
12:DB21034E
19:DB21034E
26:DB21034E
34:DB20000I
41:DB20000I
47:DB21034E
54:DB20000I
gawk "/E$/{y=$0;print x, y};{x=$0}"
- for every line that ends with "E" (an error line), prints the line before it and then the error line. Example:
6:DB20000I 12:DB21034E
12:DB21034E 19:DB21034E
19:DB21034E 26:DB21034E
41:DB20000I 47:DB21034E
sed -e "s/:DB2[[:digit:]]\+[IE]//g"
- removes the colon and the message code, leaving only line numbers. Example:
6 12
12 19
19 26
41 47
gawk "{print \"sed -n \\\"\" $1+1 \",\" $2 \"p\\\" log.txt \"}"
- formats the lines above as sed commands and increments the first line number by one. Example:
sed -n "7,12p" log.txt
sed -n "13,19p" log.txt
sed -n "20,26p" log.txt
sed -n "42,47p" log.txt
sed -e "s/$/ >> filtered.txt/g"
- appends >> filtered.txt to each line, so that every command appends to the final output file. Example:
sed -n "7,12p" log.txt >> filtered.txt
sed -n "13,19p" log.txt >> filtered.txt
sed -n "20,26p" log.txt >> filtered.txt
sed -n "42,47p" log.txt >> filtered.txt
> run.bat
- finally, writes the generated lines to a batch file named run.bat
After you execute this file, the content you wanted will appear in filtered.txt.
Update 2:
Here is another version that works on Ubuntu (previous version was written on Windows):
grep -nE "^DB2[0-9]+" log.txt | cut -f 1 -d " " | gawk '/E/{y=$0;print x, y};{x=$0}' | sed -e "s/:DB2[[:digit:]]\+[IE]//g" | gawk '{print "sed -n \""$1+1" ,"$2 "p\" log.txt" }' | sed -e "s/$/ >> filtered.txt/g" > run.sh
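Two things from the previous version did not work:
- For some reason, gawk '/E$/' did not work properly (it did not recognize that the E was at the end of the line), so I just used /E/, since E does not appear anywhere else.
- Quoting: the " around the gawk programs was changed to ', because gawk did not like double quotes; after that, the quoting inside the last gawk expression was adjusted.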
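As an aside, the same line ranges can probably be produced in a single pass, without generating an intermediate script. The following is only a minimal sketch, not the method used above: the function name extract_db2_errors is hypothetical, it uses plain awk rather than gawk, and it assumes every statement in the log is followed by a line that starts with a DB2 message code.

```shell
# Hypothetical helper: print every block of lines that ends in a DB2 error line.
# Usage: extract_db2_errors LOGFILE
extract_db2_errors() {
    awk '
        /^DB2[0-9]+/ {                  # a DB2 status line closes the current block
            if ($1 ~ /E$/) {            # message code ends in E, so this is an error
                printf "%s", block      # the lines buffered since the last status line
                print                   # the error line itself
            }
            block = ""                  # start collecting the next statement
            next
        }
        { block = block $0 "\n" }       # buffer everything between status lines
    ' "$1"
}
```

Calling extract_db2_errors log.txt > filtered.txt would then write each multi-line statement together with its error line. Like the pipeline above, this selects everything from the line after the previous status line up to the error line, so the wrapped second line of an error message is not printed with its own error.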
Regarding "java - selectively parsing a log file using Java", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/4645432/