I'm writing a simple program to capture image resources from a web page. The image elements in the HTML look like this:
CASE1:<img src="http://www.aaa.com/bbb.jpg" alt="title bbb" width="350" height="385"/>
or
CASE2:<img alt="title ccc" src="http://www.ddd.com/bbb.jpg" width="123" height="456"/>
I know how to handle each case separately. Taking the first one as an example:
String CAPTURE = "<img(?:.*)src=\"http://(.*)\\.jpg\"(?:.*)alt=\"(.*?)\"(?:.*)/>";
DefaultHttpClient client = new DefaultHttpClient();
BasicHttpContext context = new BasicHttpContext();
Scanner scanner = new Scanner(client
        .execute(new HttpGet(uri), context)
        .getEntity().getContent());
Pattern pattern = Pattern.compile(CAPTURE);
while (scanner.findWithinHorizon(pattern, 0) != null) {
    MatchResult r = scanner.match();
    String imageUrl = "http://" + r.group(1) + ".jpg";
    String imageTitle = r.group(2);
    // Do something with the image
}
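The question is: how do I write a single pattern that picks up every image element from page source containing both CASE1 and CASE2? I want to scan the page only once.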
Best answer
Use jsoup:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
...
Document doc;
String userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0";
try {
    // the URL must include the protocol (http://)
    doc = Jsoup.connect("http://domain.tld/images.html").userAgent(userAgent).get();
    // select every <img> element, whatever the order of its attributes
    Elements images = doc.select("img");
    for (Element image : images) {
        // read the src and alt attribute values of each image
        System.out.println("\nImage: " + image.attr("src"));
        System.out.println("Alt  : " + image.attr("alt"));
    }
} catch (IOException e) {
    e.printStackTrace();
}
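jsoup is an HTML parser. Its jQuery-like selector syntax, which also supports regex matching, is easy to use and flexible enough to get whatever you need. Because it reads attributes by name, the order of src and alt inside the tag does not matter.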
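If adding a library is not an option, the attribute-order problem can also be handled with plain java.util.regex by splitting the work in two: find each whole <img> tag first, then look up src and alt inside that tag independently. The following is only a rough sketch, not part of the original answer. The class name ImageTagScanner and the method extract are made up for illustration, and it assumes double-quoted attribute values that contain no ">" character, which real-world HTML does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for this example only.
class ImageTagScanner {
    // Matches one whole <img ...> tag; [^>]* keeps the match inside a single tag.
    private static final Pattern IMG_TAG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Each attribute is searched for on its own, so its position in the tag is irrelevant.
    // The leading \s avoids matching names such as data-src.
    private static final Pattern SRC =
            Pattern.compile("\\ssrc\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern ALT =
            Pattern.compile("\\salt\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns one {src, alt} pair per <img> tag that has a src attribute. */
    static List<String[]> extract(String html) {
        List<String[]> images = new ArrayList<>();
        Matcher tag = IMG_TAG.matcher(html); // single left-to-right pass over the page
        while (tag.find()) {
            String src = firstGroup(SRC, tag.group());
            if (src == null) {
                continue; // skip tags without a src attribute
            }
            String alt = firstGroup(ALT, tag.group());
            images.add(new String[] {src, alt == null ? "" : alt});
        }
        return images;
    }

    // Returns capture group 1 of the first match, or null if there is none.
    private static String firstGroup(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html =
                "<img src=\"http://www.aaa.com/bbb.jpg\" alt=\"title bbb\" width=\"350\" height=\"385\"/>\n"
              + "<img alt=\"title ccc\" src=\"http://www.ddd.com/bbb.jpg\" width=\"123\" height=\"456\"/>";
        for (String[] image : extract(html)) {
            System.out.println(image[0] + " -> " + image[1]);
        }
    }
}
```

The same IMG_TAG pattern should also work with scanner.findWithinHorizon(pattern, 0) from the question, with scanner.match().group() passed to the two attribute lookups. To keep only JPEG images as in the original pattern, a check such as src.endsWith(".jpg") can be added before the pair is stored.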
Regarding "android - Capturing images from a web page using regular expressions", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23179559/