html - Efficient searching in HTML/XML files

Tags: html ubuntu awk sed grep

Use case: I have an HTML file in which I need to search for text.

Suppose my file is:

</script> <!-- right side bands --> </div> </div> <div class="fclear"></div> <div class="fk-mainfooter fksk-mainfooter tpadding20 bpadding5 new-vd" id="fk-mainfooter-id"> <div class="fk-content fksk-content bpadding10"> <div class="line tpadding20 bpadding20 footer-dark-top-border"> <div class="unit fk-footer-links-container"> <div class="unit size1of4"> <span class="fk-footer-sub-head fk-footer-unit"><strong>Help</strong></span> <a class="fk-footer-unit fk-footer-link" href="/s/help/payments">Payments</a> <a class="fk-footer-unit fk-footer-link" href="/help/savedcard_how">Saved Cards</a> <a class="fk-footer-unit fk-footer-link" href="/s/help/shipping">Shipping</a> <a class="fk-footer-unit fk-footer-link" href="/s/help/cancellation-returns">Cancellation &amp;
 Returns</a> <a class="fk-footer-unit fk-footer-link" href="/s/help">FAQ</a> <a class="fk-footer-unit fk-footer-link" href="https://seller.flipkart.com/fiv">Report Infringement</a> </div> <div class="unit size1of4"> <span class="fk-footer-sub-head fk-footer-unit"><strong>Flipkart</strong></span> <a class="fk-footer-unit fk-footer-link" href="/s/contact">Contact Us</a> <a class="fk-footer-unit fk-footer-link" href="/about-us">About Us</a> <a class="fk-footer-unit fk-footer-link" target="_blank" href="/ol?link=http%3A%2F%2Fflipkartcareers.com%2F">Careers</a> <a class="fk-footer-unit fk-footer-link" href="/ol?link=http%3A%2F%2Fblog.flipkart.com%2F">Blog</a> <a class="fk-footer-unit fk-footer-link" href="/s/press">Press</a> <a class="fk-footer-unit fk-footer-link" target="_blank" href="http://slashn.flipkart.net/">Slash N</a> </div> <div class="unit size1of4 required-tracking" data-tracking-id="ch_vn"> <span class="fk-footer-sub-head fk-footer-unit"><strong>Flipkart eBooks</strong></span> <a class="fk-footer-unit fk-footer-link" href="/ebooks/gettingstarted" target="_blank">eBooks Quick Start Guide</a> <a class="fk-footer-unit fk-footer-link" href="/help/flyteeBookfaq" target="_blank">eBooks FAQ</a> <a class="fk-footer-unit fk-footer-link" href="/ebooks/apps" target="_blank">eBooks App</a> <a href="/mobile-apps" data-tracking-id="mobile_apps"><div class="footer-mobile-apps lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png"></div></a> </div> <div class="lastUnit size1of4"> <span class="fk-footer-sub-head fk-footer-unit"><strong>Misc</strong></span> <a class="fk-footer-unit fk-footer-link" href="http://www.flipkart.com">Online Shopping</a> <a class="fk-footer-unit fk-footer-link" href="/affiliate/" target="_blank">Affiliate</a> <a class="fk-footer-unit fk-footer-link" href="/buy-gift-voucher">e-Gift Voucher</a> <a class="fk-footer-unit fk-footer-link" href="/?sitevariant=mobile">Flipkart lite</a> <a class="fk-footer-unit 
fk-footer-link" href="/flipkart-first">Flipkart First Subscription</a> <a class="fk-footer-unit fk-footer-link" href="/elearning-faq">eLearning FAQ</a> </div> </div> <div class="lastUnit size10f4 lpadding15 fk-trust-boosters"> <p class="fk-footer-sub-head"><strong>Safe and Secure Shopping</strong></p> <p class="bpadding15 fk-trust-content">All major credit and debit cards are accepted. We also accept payments by <strong>Internet Banking, Cash on Delivery</strong> and <strong>Equated Monthly Installments(EMI).</strong></p> </div> </div> <div class="fk-footer-ssa"> <a href="/account/orders?srcLink=footer" class="login-required"> <ul class="line ssa-block"> <li class="unit size1of3 ssa-unit"><i class="icon track-icon"></i><span class="text">Track your<br /> order</span></li> <li class="unit size1of3 ssa-unit"><i class="icon return-icon"></i><span class="text">Free &amp; easy<br /> returns</span></li> <li class="lastUnit ssa-unit"><i class="icon cancel-icon"></i><span class="text">Online<br /> cancellations</span></li> </ul> </a> </div> <div class="line fk-footer-policy"> <div class="unit tpadding5 tc-links"> <span><span class="policies-title boldtext">Policies:</span> <a href="/s/terms">Terms of use</a> | <a href="/s/paymentsecurity">Security</a> | <a
 href="/s/privacypolicy">Privacy</a> | <a
 href="https://seller.flipkart.com/fiv">Infringement</a></span> <span class="fk-footet-cr">&copy; 2007-2014 <span>Flipkart.com.</span></span> </div> <div class="fk-footer-kit unitExt fk-inline-block"> <strong class="title fk-float-left">Keep in touch</strong> <a class="facebook_icn inner rmargin5 lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png" target="_blank" href="/ol?link=http%3A%2F%2Fwww.facebook.com%2Fflipkart"></a> <a class="twitter_icn inner rmargin5 lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png" target="_blank" href="/ol?link=http%3A%2F%2Fwww.twitter.com%2Fflipkart"></a> <a class="google-plus_icn inner fk-sprite-hf rmargin5 lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png" target="_blank" href="/ol?link=https%3A%2F%2Fplus.google.com%2F109591199284807005836%2Fposts"></a> <a class="youtube_icn inner rmargin5 lazy-bgImage" data-bgImage="//img5a.flixcart.com/www/prod/images/social-sprite-b3c0ada7.png" target="_blank" href="/ol?link=http%3A%2F%2Fwww.youtube.com%2Fflipkart"></a> </div> </div> <div class="line top-brand-links tpadding10 bpadding10"> <div class="line boldtext bpadding10 top-brands-title">
 Top Stores : <a href="/brands">Brand Directory</a> | <a href="/store-directory">Store Directory</a> </div> <div class="line"> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
 Most searched for on Flipkart: </div> <div class="lastUnit"> <a href="/games/call-of-duty~series/pr?sid=4rr,tg9"> Call Of Duty</a>
 | <a href="/androidone"> Android One</a>
 | <a href="/offers"> Diwali Offers</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
 Mobiles: </div> <div class="lastUnit"> <a href="/moto-e/p/itmdvuwsybgnbtha"> Moto E</a>
 | <a href="/q/samsung-mobiles"> Samsung Mobile</a>
 | <a href="/q/micromax-mobiles"> Micromax Mobile</a>
 | <a href="/q/nokia-mobiles"> Nokia Mobile</a>
 | <a href="/q/htc-mobiles"> HTC Mobile</a>
 | <a href="/q/sony-mobiles"> Sony Mobile</a>
 | <a href="/q/apple-mobiles"> Apple Mobile</a>
 | <a href="/q/lg-mobiles"> LG Mobile</a>
 | <a href="/q/karbonn-mobiles"> Karbonn Mobile</a>
 | <a href="/mobiles">View all</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
 Camera: </div> <div class="lastUnit"> <a href="/q/nikon-cameras"> Nikon Camera</a>
 | <a href="/q/canon-cameras"> Canon Camera</a>
 | <a href="/q/sony-cameras"> Sony Camera</a>
 | <a href="/cameras/samsung~brand/pr?sid=jek,p31"> Samsung Camera</a>
 | <a href="/q/canon-dslr-cameras"> Canon DSLR</a>
 | <a href="/q/nikon-dslr-cameras"> Nikon DSLR</a>
 | <a href="/cameras/dslr~type/pr?sid=jek,p31"> DSLR Camera</a>
 | <a href="/camera-accessories/lenses/pr?sid=jek,6l2,e9y"> Camera Lens</a>
 | <a href="/camera-accessories/tripods/pr?sid=jek,6l2,ce6"> Camera Tripod</a>
 | <a href="/cameras">View all</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
 Laptops: </div> <div class="lastUnit"> <a href="/laptops/apple~brand/pr?sid=6bo,b5g"> Apple Laptop</a>
 | <a href="/laptops/acer~brand/pr?sid=6bo,b5g"> Acer Laptop</a>
 | <a href="/laptops/samsung~brand/pr?sid=6bo,b5g"> Samsung Laptop</a>
 | <a href="/q/lenovo-laptops"> Lenovo Laptop</a>
 | <a href="/q/sony-laptops"> Sony Laptop</a>
 | <a href="/q/dell-laptops"> Dell Laptop</a>
 | <a href="/laptops/asus~brand/pr?sid=6bo,b5g"> Asus Laptop</a>
 | <a href="/laptops/toshiba~brand/pr?sid=6bo,b5g"> Toshiba Laptop</a>
 | <a href="/laptops/lg~brand/pr?sid=6bo,b5g"> LG Laptop</a>
 | <a href="/q/hp-laptops"> HP Laptop</a>
 | <a href="/laptops/~notebook/pr?sid=6bo,b5g"> Notebook</a>
 | <a href="/brands/laptops?sid=6bo,b5g">View all</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
 TVs: </div> <div class="lastUnit"> <a href="/home-entertainment/tvs/sony~brand/pr?sid=ckf%2Cczl"> Sony TV</a>
 | <a href="/home-entertainment/tvs/samsung~brand/pr?sid=ckf%2Cczl"> Samsung TV</a>
 | <a href="/home-entertainment/tvs/lg~brand/pr?sid=ckf%2Cczl"> LG TV</a>
 | <a href="/home-entertainment/tvs/panasonic~brand/pr?sid=ckf%2Cczl"> Panasonic TV</a>
 | <a href="/home-entertainment/tvs/onida~brand/pr?sid=ckf%2Cczl"> Onida TV</a>
 | <a href="/home-entertainment/tvs/toshiba~brand/pr?sid=ckf%2Cczl"> Toshiba TV</a>
 | <a href="/home-entertainment/tvs/philips~brand/pr?sid=ckf%2Cczl"> Philips TV</a>

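The problem is that most HTML pages are effectively treated as a single line, so when I search with grep -F "my text of interest" html_file.html:

- if there is a match, I see the whole file dumped,
- I cannot see the context,
- and it is very painful to debug.

Put differently, I am searching for something that would appear on one line if the file were viewed as HTML, but does not otherwise.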

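Example: suppose I need to search this file for "/laptops/samsung~brand/pr?sid=6bo,b5g". With grep, as described above, I get the whole dump (and more). How can I search HTML efficiently for this use case and get only the logically adjacent context (grep -A 4 -B 4 does not apply directly)? Can I somehow have the file interpreted as HTML and then read the adjacent context?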

Sample output for a match:

<a href="/camera-accessories/lenses/pr?sid=jek,6l2,e9y"> Camera Lens</a>
 | <a href="/camera-accessories/tripods/pr?sid=jek,6l2,ce6"> Camera Tripod</a>
 | <a href="/cameras">View all</a> </div> </div> <div class="line brands"> <div class="unit boldtext rpadding10 size1of9">
 Laptops: </div> <div class="lastUnit"> <a href="/laptops/apple~brand/pr?sid=6bo,b5g"> Apple Laptop</a>
 | <a href="/laptops/acer~brand/pr?sid=6bo,b5g"> Acer Laptop</a>
 | <a href="/laptops/samsung~brand/pr?sid=6bo,b5g"> Samsung Laptop</a>
 | <a href="/q/lenovo-laptops"> Lenovo Laptop</a>
 | <a href="/q/sony-laptops"> Sony Laptop</a>
 | <a href="/q/dell-laptops"> Dell Laptop</a>

Preferably with the matched term highlighted, in this case "/laptops/samsung~brand/pr?sid=6bo,b5g".

Best answer

Instead of tools designed for processing text files, use an HTML parser, such as Perl's Mojo::DOM:

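use strict;
use warnings;
use feature ":5.10";
use Mojo::DOM;
use List::Util qw(first max min);

# construct DOM object from file (slurp the whole input at once)
my $d = Mojo::DOM->new(do { local $/; <> });

# get all <a> tags
my $a = $d->find("a");

# find the index of the one we are interested in
# (valid indexes run from 0 to size-1; links without href count as '')
my $href = '/laptops/samsung~brand/pr?sid=6bo,b5g';
my $index = first { ($a->[$_]->attr('href') // '') eq $href } 0 .. $a->size - 1;
die "No link with href $href found\n" unless defined $index;

# print the match with up to 4 links on either side, clamped to the list;
# the collection is an array reference, so a plain array slice works
my $from = max(0, $index - 4);
my $to   = min($a->size - 1, $index + 4);
say join "\n", map { $_->to_string } @$a[$from .. $to];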

Run it as perl script.pl file.html.

Output:

<a href="/camera-accessories/tripods/pr?sid=jek,6l2,ce6"> Camera Tripod</a>
<a href="/cameras">View all</a>
<a href="/laptops/apple~brand/pr?sid=6bo,b5g"> Apple Laptop</a>
<a href="/laptops/acer~brand/pr?sid=6bo,b5g"> Acer Laptop</a>
<a href="/laptops/samsung~brand/pr?sid=6bo,b5g"> Samsung Laptop</a>
<a href="/q/lenovo-laptops"> Lenovo Laptop</a>
<a href="/q/sony-laptops"> Sony Laptop</a>
<a href="/q/dell-laptops"> Dell Laptop</a>
<a href="/laptops/asus~brand/pr?sid=6bo,b5g"> Asus Laptop</a>

Below are some untested suggestions.

To pass options to the script, as with grep, you can use Getopt::Std:

use Getopt::Std;
our($opt_A, $opt_B) = (0, 0);
getopts('A:B:');

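This lets you pass the options -A and -B to the script, e.g. perl script.pl -A 4 -B 4 file.html. You can then replace the hard-coded 4s in the code above with the option values. To mirror grep, where -B counts items before the match and -A counts items after it, the range becomes ($index-$opt_B..$index+$opt_A).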
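One detail the hard-coded range glosses over is what happens near the start or end of the link list: a negative index would wrap around to the end of the array. The following is a minimal sketch of option parsing combined with a clamped range. The helper name context_range is hypothetical and not part of the original answer, and the -A/-B meanings are assumed to mirror grep (after/before).

```perl
use strict;
use warnings;
use Getopt::Std;
use List::Util qw(max min);

# Hypothetical helper: indexes from $before items ahead of $index to
# $after items past it, clamped to the valid range 0 .. $count-1.
sub context_range {
    my ($index, $before, $after, $count) = @_;
    my $from = max(0, $index - $before);
    my $to   = min($count - 1, $index + $after);
    return ($from .. $to);
}

# -A N: items after the match, -B N: items before it (as in grep).
# Defaults stay in the hash when an option is not given.
my %opt = (A => 0, B => 0);
getopts('A:B:', \%opt) or die "usage: $0 [-A n] [-B n] file.html\n";

# Stand-in data: a match at index 1 in a list of 10 links.
print join(',', context_range(1, $opt{B}, $opt{A}, 10)), "\n";
```

getopts removes the options it recognises from @ARGV, so in the full script the file name is still left over for <> to read.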

To pass the pattern, you can specify another option.
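As a sketch of that idea, the pattern could arrive through an extra option, here a hypothetical -e, and be compared against each href as a literal substring (similar in spirit to grep -F). The find_match helper and the sample hrefs are illustrative assumptions; in the real script the list would come from the href attributes of the parsed links.

```perl
use strict;
use warnings;
use Getopt::Std;
use List::Util qw(first);

# Hypothetical helper: index of the first string that contains $pattern
# as a literal substring, or undef when nothing matches.
sub find_match {
    my ($pattern, @hrefs) = @_;
    return first { index($hrefs[$_] // '', $pattern) >= 0 } 0 .. $#hrefs;
}

# -e PATTERN: the text to look for (assumed option name).
my %opt;
getopts('e:A:B:', \%opt)
    or die "usage: $0 -e pattern [-A n] [-B n] file.html\n";
my $pattern = $opt{e} // '/laptops/samsung~brand';

# Stand-in for the href attributes collected from the document.
my @hrefs = (
    '/laptops/acer~brand/pr?sid=6bo,b5g',
    '/laptops/samsung~brand/pr?sid=6bo,b5g',
    '/q/lenovo-laptops',
);

# Index 0 is a valid result, so test with defined rather than truthiness.
my $index = find_match($pattern, @hrefs);
print defined $index ? "match at index $index\n" : "no match\n";
```

A substring match is used instead of eq so that a partial path is enough to locate the link; whether that is desirable depends on how specific the search text is.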

To color the output for the line you are interested in, you can use Term::ANSIColor:

use Term::ANSIColor;    # exports colored() by default

say $_->to_string for @$a[$index-4 .. $index-1];
say colored($a->[$index]->to_string, 'green');    # the matching link
say $_->to_string for @$a[$index+1 .. $index+4];
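The question asked for the matched term itself to be highlighted rather than the whole line. One way to sketch that, assuming a literal search string and a terminal that understands ANSI escape sequences, is to wrap each occurrence with colored from Term::ANSIColor. The helper name highlight is hypothetical.

```perl
use strict;
use warnings;
use Term::ANSIColor qw(colored);

# Hypothetical helper: wrap every literal occurrence of $term in $text
# with ANSI colour codes. \Q...\E keeps characters such as "?" and "~"
# from being interpreted as regex syntax.
sub highlight {
    my ($text, $term, $colour) = @_;
    $text =~ s/(\Q$term\E)/colored($1, $colour)/ge;
    return $text;
}

my $line = '<a href="/laptops/samsung~brand/pr?sid=6bo,b5g"> Samsung Laptop</a>';
print highlight($line, '/laptops/samsung~brand/pr?sid=6bo,b5g', 'bold red'), "\n";
```

In the script above, the same helper could be applied to the string returned by to_string for the matching link instead of colouring the entire line.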

Regarding "html - Efficient searching in HTML/XML files", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27582664/
