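I have a text file made up of HTML code that I need to manipulate to make it more readable. My problem is that each file name appears on two lines that are not unique, but I need to tell them apart: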
EDIT
I'll provide the input here for those who asked for it:
<body>
<tbody>
<tr><td><b>Test Suite</b></td></tr>
<tr><td><a href="HAPPY/3_step_minimal_foundation_no_prefill_HAPPY">3_step_minimal_foundation_no_prefill_HAPPY</a></td></tr>
<tr><td><a href="HAPPY/fullform_no_prefill_HAPPY">fullform_no_prefill_HAPPY</a></td></tr>
<tr><td><a href="HAPPY/fullform_mobile_foundation_no_prefill_HAPPY">fullform_mobile_foundation_no_prefill_HAPPY</a></td></tr>
<tr><td><a href="SAD/3_step_minimal_foundation_SAD">3_step_minimal_foundation_SAD</a></td></tr>
<tr><td><a href="SAD/fullform_SAD">fullform_SAD</a></td></tr>
<tr><td><a href="SAD/fullform_mobile_foundation_SAD">fullform_mobile_foundation_SAD</a></td></tr>
<tr><td><a href="HAPPY_PLUS_OPTIONS/3_step_minimal_foundation_HAPPY_PLUS_OPTIONS">3_step_minimal_foundation_HAPPY_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="HAPPY_PLUS_OPTIONS/fullform_HAPPY_PLUS_OPTIONS">fullform_HAPPY_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="HAPPY_PLUS_OPTIONS/fullform_mobile_foundation_HAPPY_PLUS_OPTIONS">fullform_mobile_foundation_HAPPY_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="SAD_PLUS_OPTIONS/3_step_minimal_foundation_SAD_PLUS_OPTIONS">3_step_minimal_foundation_SAD_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="SAD_PLUS_OPTIONS/fullform_SAD_PLUS_OPTIONS">fullform_SAD_PLUS_OPTIONS</a></td></tr>
<tr><td><a href="SAD_PLUS_OPTIONS/fullform_mobile_foundation_SAD_PLUS_OPTIONS">fullform_mobile_foundation_SAD_PLUS_OPTIONS</a></td></tr>
</tbody></table>
</body>
3_step_minimal_foundation_no_prefill_HAPPY
and
3_step_minimal_foundation_no_prefill_HAPPY
for example, need to become:
3_step_minimal_foundation_no_prefill
and
3_step_minimal_foundation_no_prefill_HAPPY
The current state of my text file:
Here is the code I used to get there:
$ sed -n '/ref/p' EVERYTHING | awk -F'[/"<> ]+' '{sub("", "", $6); print $6, $7, $8}' | tr -s '[[:space:]]' '\n' | awk -v n=3 '1; NR % n == 0 {print ""}' | sed '/^HAPPY/s/^/Flow Type\: /' | sed '/^SAD/s/^/Flow Type\: /' | sed '$d'
Flow Type: HAPPY
3_step_minimal_foundation_no_prefill_HAPPY
3_step_minimal_foundation_no_prefill_HAPPY
Flow Type: HAPPY
fullform_no_prefill_HAPPY
fullform_no_prefill_HAPPY
Flow Type: HAPPY
fullform_mobile_foundation_no_prefill_HAPPY
fullform_mobile_foundation_no_prefill_HAPPY
Flow Type: SAD
3_step_minimal_foundation_SAD
3_step_minimal_foundation_SAD
Flow Type: SAD
fullform_SAD
fullform_SAD
Flow Type: SAD
fullform_mobile_foundation_SAD
fullform_mobile_foundation_SAD
Flow Type: HAPPY_PLUS_OPTIONS
3_step_minimal_foundation_HAPPY_PLUS_OPTIONS
3_step_minimal_foundation_HAPPY_PLUS_OPTIONS
Flow Type: HAPPY_PLUS_OPTIONS
fullform_HAPPY_PLUS_OPTIONS
fullform_HAPPY_PLUS_OPTIONS
The output I want:
Flow Type: HAPPY
Flow Name: 3_step_minimal_foundation_no_prefill
File Name: 3_step_minimal_foundation_no_prefill_HAPPY
Flow Type: HAPPY
Flow Name: fullform_no_prefill
File Name: fullform_no_prefill_HAPPY
Flow Type: HAPPY
Flow Name: fullform_mobile_foundation_no_prefill
File Name: fullform_mobile_foundation_no_prefill_HAPPY
Flow Type: SAD
Flow Name: 3_step_minimal_foundation
File Name: 3_step_minimal_foundation_SAD
Flow Type: SAD
Flow Name: fullform
File Name: fullform_SAD
Flow Type: SAD
Flow Name: fullform_mobile_foundation
File Name: fullform_mobile_foundation_SAD
Flow Type: HAPPY_PLUS_OPTIONS
Flow Name: 3_step_minimal_foundation
File Name: 3_step_minimal_foundation_HAPPY_PLUS_OPTIONS
Flow Type: HAPPY_PLUS_OPTIONS
Flow Name: fullform
File Name: fullform_HAPPY_PLUS_OPTIONS
Is there a way to remove or keep specific text on every Nth line? Once I make each line unique, it will be easy to label each line correctly.
Best answer
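OK, for the basic task of removing everything from the last underscore to the end of the line, on any line that matches the line before it, the process is quite simple. Here are two options, 100% untested.
In awk: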
awk '$0 == last { sub(/_[^_]+$/,""); } { last=$0; } 1' inputfile
In shell:
while IFS= read -r line; do
  if [ "$line" = "$last" ]; then
    line="${line%_*}"   # drop everything from the last underscore onward
  fi
  printf '%s\n' "$line"
  last="$line"
done < inputfile
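But this changes the second of the two lines. For the additional formatting you need, it looks like you want to modify the first of the two lines. That makes things a bit more complicated...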
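To make that concrete, here is a small sketch that runs the awk one-liner on a single duplicated pair. The `sample` variable and the `dedupe_second` function name are made up for illustration; the sample mirrors the intermediate text in the question.

```shell
# Hypothetical sample: one "Flow Type" header followed by a duplicated pair.
sample='Flow Type: SAD
fullform_SAD
fullform_SAD'

# Same logic as the awk one-liner above: when a line equals the previous
# one, strip everything from its last underscore to the end of the line.
dedupe_second() {
  awk '$0 == last { sub(/_[^_]+$/, "") } { last = $0 } 1'
}

printf '%s\n' "$sample" | dedupe_second
# Flow Type: SAD
# fullform_SAD
# fullform
```

The trimmed name lands on the second line of the pair, while the desired output has it on the first.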
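To get from the text you have to the text you want, let's look at the problem differently and assume that the two duplicate lines always appear right after a line that begins with "Flow Type:".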
awk '
/^Flow Type:/ {
print;
getline one; getline two
if (one == two) {
sub(/_[^_]+$/,"",one);
print "Flow Name: " one;
print "File Name: " two;
} else {
print one; print two
}
next;
}
1
' inputfile
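One thing to watch with the `_[^_]+$` pattern: it removes only the last underscore-delimited piece, so a name such as `fullform_HAPPY_PLUS_OPTIONS` would become `fullform_HAPPY_PLUS` rather than `fullform`. A possible variation, sketched below, strips the flow type read from the "Flow Type:" line instead. The `label_flows` function name and the `sample` text are assumptions for illustration, and the sketch assumes the header is always exactly `Flow Type: ` followed by the type.

```shell
# Hypothetical sample in the same shape as the intermediate text file.
sample='Flow Type: SAD
fullform_SAD
fullform_SAD
Flow Type: HAPPY_PLUS_OPTIONS
fullform_HAPPY_PLUS_OPTIONS
fullform_HAPPY_PLUS_OPTIONS'

label_flows() {
  awk '
    /^Flow Type: / {
      print
      type = substr($0, 12)                 # text after "Flow Type: " (11 characters)
      if ((getline one) <= 0) next          # no more input after the header
      if ((getline two) <= 0) { print one; next }
      name = one
      suffix = "_" type
      n = length(one) - length(suffix)
      # Strip "_<type>" only when the line actually ends with it.
      if (n > 0 && substr(one, n + 1) == suffix) {
        name = substr(one, 1, n)
      }
      print "Flow Name: " name
      print "File Name: " two
      next
    }
    1
  '
}

printf '%s\n' "$sample" | label_flows
```

With the sample above this prints `Flow Name: fullform` for both records, including the multi-word `HAPPY_PLUS_OPTIONS` type.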
But we could also just process your original HTML.
In sed, pattern matching is a lot of fun. Here's one for GNU sed:
sed -r 's|<tr><td><a href="([^/]+)/(([^"]+)_[^_]+)".*|Flow Type: \1\nFlow Name: \3\nFile Name: \2|' input.html
It's the newline (\n) in the replacement that requires GNU sed; structurally, this is just plain sed. This solution will not work on *BSD or OSX.
EDIT: Per comments on potong's answer, a variation that would work in OSX would be this:
<input.html sed -n 's/^.*"\([^"\/]*\)\/\(\([^"]*\)_\1\)".*/Flow Type: \1|Flow Name: \3|File Name: \2|/p' | tr '|' '\n'
or if you prefer ERE instead of BRE:
<input.html sed -E 's|<tr><td><a href="([^/]+)/(([^"]+)_[^_]+)".*|Flow Type: \1#Flow Name: \3#File Name: \2#|' | tr '#' '\n'
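This works around the limitation that OSX sed cannot insert a newline into the replacement string of an s command. Instead, we insert an otherwise unused character and use tr to convert it to a newline.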
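As a quick check of the placeholder-plus-tr idea, the sketch below runs it on two sample rows. It adds `-n` and the `p` flag so that rows without a link (such as the "Test Suite" header) are dropped instead of passed through. The `html_to_fields` name and the `rows` variable are made up for illustration, and the sketch assumes `#` never occurs in the input.

```shell
# Hypothetical sample: one header row without a link, one row with a link.
rows='<tr><td><b>Test Suite</b></td></tr>
<tr><td><a href="SAD/fullform_SAD">fullform_SAD</a></td></tr>'

# Emit "#" where a newline is wanted, then let tr turn each "#" into a newline.
# -n plus the trailing p prints only the lines where the substitution matched.
html_to_fields() {
  sed -n -E 's|.*<a href="([^/"]+)/(([^"]+)_[^_"]+)".*|Flow Type: \1#Flow Name: \3#File Name: \2|p' |
    tr '#' '\n'
}

printf '%s\n' "$rows" | html_to_fields
# Flow Type: SAD
# Flow Name: fullform
# File Name: fullform_SAD
```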
To achieve the same thing in awk (i.e. processing the HTML), you could go with something like this:
awk '
/<tr><td><a/ {
type=$0; file=$0;
sub(/^[^"]+"/,"",type); sub(/\/.*/,"",type);
sub(/^[^\/]+\//,"",file); sub(/".*/,"",file);
name=file; sub(/_[^_]+$/,"",name);
printf("Flow Type: %s\nFlow Name: %s\nFile Name: %s\n\n", type, name, file);
}' input.html
OK, here's my last update. Is this what you're looking for?
awk '
/<tr><td><a/ {
type=$0; sub(/^[^"]+"/,"",type); sub(/\/.*/,"",type);
file=$0; sub(/^[^\/]+\//,"",file); sub(/".*/,"",file);
if ( index(file, type) ) {
name=substr(file, 1, index(file, type)-2);
} else {
name=file; sub(/_[^_]+$/,"",name);
}
printf("Flow Type: %s\nFlow Name: %s\nFile Name: %s\n\n", type, name, file);
}' input.html
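For a quick sanity check of this approach, here is a compact variant that can be run on two sample rows. It is a sketch, not the script above verbatim: it searches for "_" followed by the type so the separating underscore is handled in one step, and it prints one pipe-separated line per row to make the result easy to eyeball. The `flows_from_html` name and the `rows` sample are assumptions for illustration.

```shell
# Hypothetical sample rows, one with a single-word type and one with a multi-word type.
rows='<tr><td><a href="HAPPY/fullform_no_prefill_HAPPY">fullform_no_prefill_HAPPY</a></td></tr>
<tr><td><a href="SAD_PLUS_OPTIONS/fullform_SAD_PLUS_OPTIONS">fullform_SAD_PLUS_OPTIONS</a></td></tr>'

flows_from_html() {
  awk '
    /<tr><td><a/ {
      # type: the directory part of the href; file: the part after the slash.
      type = $0; sub(/^[^"]+"/, "", type); sub(/\/.*/, "", type)
      file = $0; sub(/^[^\/]+\//, "", file); sub(/".*/, "", file)
      # name: everything before "_<type>", or the whole file name if absent.
      pos = index(file, "_" type)
      name = (pos > 1) ? substr(file, 1, pos - 1) : file
      printf "%s|%s|%s\n", type, name, file
    }'
}

printf '%s\n' "$rows" | flows_from_html
# HAPPY|fullform_no_prefill|fullform_no_prefill_HAPPY
# SAD_PLUS_OPTIONS|fullform|fullform_SAD_PLUS_OPTIONS
```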
Regarding "bash - How to use AWK or SED to print a string before the Nth line and remove a specific string from the Nth line", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32203437/