java - jsoup: getting specific tags and the values associated with them

Tags: java regex jsoup

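I'm new to jsoup and want to get more familiar with extracting information from websites. I'm trying something simple: scraping a few values from eBay.

I want to get the item name, HTML link, price and number sold for each item under "Hot this week" (as shown here: http://www.ebay.co.uk/sch/Action-Figures/246/bn_1632128/i.html)

But I'm not sure how to go about it.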

package application;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

import javax.swing.JOptionPane;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GetHotSellers {

    public static void main(String[] args) {
        Document doc =  Jsoup.parse(readURL("http://www.ebay.co.uk/sch/Action-Figures/246/bn_1632128/i.html"));

        Elements sold_items = doc.getElementsMatchingText("sold$");   
        for(Element sold : sold_items) {
                System.out.println(sold.text());
        }
    }


    public static String readURL(String url) {
        StringBuilder fileContents = new StringBuilder();

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(new URL(url).openStream()))) {
            String currentLine;
            // Stop at end of stream so the literal "null" is never appended
            while ((currentLine = reader.readLine()) != null) {
                fileContents.append(currentLine).append("\n");
            }
        } catch (Exception e) {
            JOptionPane.showMessageDialog(null, e.getMessage(), "Error Message", JOptionPane.ERROR_MESSAGE);
            e.printStackTrace();
        }

        return fileContents.toString();
    }

}

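This is what I have so far. Do I need to improve my regular expression, or should I use a different function that is better suited to what I'm after?

My current output looks like this: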

2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold 12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold 12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold
2016 8PC Marvel Avengers DC Super Hero Mini Figure Set Fits Lego FROM UK £6.35 381 sold
381 sold
381 sold
Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold
Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold
Despicable Me Minions Supervillain Jet Playset -From the Argos Shop on ebay £7.99 187 sold
187 sold
187 sold
Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold
Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold
Avengers Marvel Titan 12" figure Spider-man Captain Iron man Wolverine Thor Toy £8.69 174 sold
174 sold
174 sold
Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold
Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold
Imaginext Marvel DC Super Hero Squad Figures and Villains Batman Please select £1.99 129 sold
129 sold
129 sold
Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold
Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold
Star Wars Episode The Force Awakens Electronic Chewbacca Mask IN STOCK NOW! £24.99 101 sold
101 sold
101 sold
Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold
Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold
Jurassic World Indominus Rex Chomping Dinosaur 44cm Figure T-Rex Dino Action Toy £26.99 89 sold
89 sold
89 sold
12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold
12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold
12" Avengers Marvel Titan Figures Spider-Man Captain Iron Man Wolverine Thor Toy £7.45 88 sold
88 sold
88 sold
Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold
87 sold
87 sold
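
Note on the repeated lines above: `getElementsMatchingText` tests each element's combined text, including the text of its descendants, so every ancestor whose text ends in "sold" is returned along with the innermost element. jsoup also has `getElementsMatchingOwnText`, which only tests the text an element owns directly. The sketch below compares the two on a small hand-written fragment; the markup, its class names and the `OwnTextDemo` class are made up for illustration and are not eBay's actual HTML.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class OwnTextDemo {

    // Hypothetical markup: the "sold" text sits inside nested elements.
    static final String HTML =
            "<div class=\"item\"><a href=\"http://example.com/1\">"
            + "<span class=\"title\">Toy</span> "
            + "<span class=\"sold\">381 sold</span></a></div>";

    // Counts elements whose full text (descendants included) matches the regex.
    static int countMatchingText(String regex) {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementsMatchingText(regex).size();
    }

    // Counts elements whose own, directly contained text matches the regex.
    static int countMatchingOwnText(String regex) {
        Document doc = Jsoup.parse(HTML);
        return doc.getElementsMatchingOwnText(regex).size();
    }

    public static void main(String[] args) {
        System.out.println("matching text:     " + countMatchingText("sold$"));
        System.out.println("matching own text: " + countMatchingOwnText("sold$"));
    }
}
```

With own-text matching, only the innermost span is reported, which would remove the duplicated lines but still leaves the name, price and link to be read separately.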

Example of the output I want:

Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay || £7.99 || 87 sold || http://link.com
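
If only the flattened text of one item is available (a line like the ones printed above), the name, price and sold count could be separated with a regular expression. This is a rough sketch that assumes every line has the shape `<name> £<price> <n> sold`; the class name `ItemLineParser` and its `format` method are invented for illustration. The link is not part of the text, so it cannot be recovered this way and would have to come from the element's `href` attribute.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ItemLineParser {

    // Assumed line shape: "<name> £<price> <count> sold"
    private static final Pattern ITEM_LINE =
            Pattern.compile("^(.+?)\\s+(£[\\d,]+(?:\\.\\d{2})?)\\s+(\\d+ sold)$");

    /** Returns "name || price || sold", or null if the line does not match. */
    public static String format(String line) {
        Matcher m = ITEM_LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return m.group(1) + " || " + m.group(2) + " || " + m.group(3);
    }

    public static void main(String[] args) {
        System.out.println(format(
                "Henry Hugglemonster Huggle House Playset. From the Official Argos Shop on ebay £7.99 87 sold"));
    }
}
```

A regex like this is fragile: a title that itself contains a price-like token, or a listing without a sold count, would not match the assumed shape.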

Edit:

I just tried something like this, but no luck.

for(String categoryURL : categoryLinksArray) {
    Document doc = Jsoup.parse(readURL(categoryURL));
    Elements sold_items = doc.getElementsByClass("b-block-info-container");
    for(Element sold : sold_items) {
            System.out.println("NAME: " + sold.attr("b-block-info-container__title b-block-info-container__title__ListingSummary") + "\n" + 
                               "PRICE: " + sold.attr("b-block-info-container__price") + "\n" +
                               "SOLD/week: " + sold.attr("item_quantity__hotness") + "\n" +
                               "URL: " + sold.attr("abs:href"));
            System.out.println("--------------------------------------");
    }
}

Best answer

I got it working, but it isn't very efficient because it is slow.

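// Requires java.util.ArrayList and java.util.HashMap, plus the jsoup imports and readURL() from the question.
public static void main(String[] args) {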

    ArrayList<String> categoryLinksArray = new ArrayList<>();

    Document links = Jsoup.parse(readURL("http://www.ebay.co.uk/sch/allcategories/all-categories"));
    Elements item_categories = links.getElementsByClass("ch");
    for (Element category : item_categories) {
        categoryLinksArray.add(category.attr("abs:href"));
    }

    for (String categoryURL : categoryLinksArray) {
        Document doc = Jsoup.parse(readURL(categoryURL));
        Elements hot_items = doc
                .getElementsByClass("b-module b-module-carousel b-module-deals topSold b-display--portrait");
        for (Element item : hot_items) {

            Elements hot_items_names = item.getElementsByClass(
                    "b-block-info-container__title b-block-info-container__title__ListingSummary");
            Elements hot_items_price = item.getElementsByClass("b-block-info-container__price");
            Elements hot_items_sold = item.getElementsByClass("item_quantity__hotness");
            Elements hot_items_url = item.getElementsByClass("b-block-tile");

            HashMap<String, String> hs_items = new HashMap<>();

            for (Element item_name : hot_items_names) {
                hs_items.put("Name", item_name.text());
            }
            for (Element item_price : hot_items_price) {
                hs_items.put("Price", item_price.text());
            }
            for (Element item_sold : hot_items_sold) {
                hs_items.put("Sold", item_sold.text());
            }
            for (Element item_url : hot_items_url) {
                hs_items.put("URL", item_url.attr("abs:href"));
            }

            System.out.println("Name: " + hs_items.get("Name") + "\n" +
                               "Price: " + hs_items.get("Price") + "\n" +
                               "Sold: " + hs_items.get("Sold") + "\n" +
                               "URL: " + hs_items.get("URL") + "\n" +
                               "----------------------------------");
        }
    }
}
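
One limitation of the loop above: every match inside a carousel is written to the same map keys, so if a carousel holds several items, only the last one is printed. A possible variation, sketched below, walks the item tiles one by one and reads each field relative to its tile with CSS selectors. It reuses class names from the code above, but the nesting of the sample markup (an `<a class="b-block-tile">` wrapping the title, price and sold count), the class name `HotItemsSketch` and the example.com base URL are all assumptions; the real page may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HotItemsSketch {

    // Hand-written sample markup, not copied from eBay.
    static final String SAMPLE =
            "<div class=\"topSold\">"
            + "<a class=\"b-block-tile\" href=\"/itm/1\">"
            + "<span class=\"b-block-info-container__title\">Toy A</span>"
            + "<span class=\"b-block-info-container__price\">£6.35</span>"
            + "<span class=\"item_quantity__hotness\">381 sold</span></a>"
            + "<a class=\"b-block-tile\" href=\"/itm/2\">"
            + "<span class=\"b-block-info-container__title\">Toy B</span>"
            + "<span class=\"b-block-info-container__price\">£7.99</span>"
            + "<span class=\"item_quantity__hotness\">187 sold</span></a>"
            + "</div>";

    /** Returns one "name || price || sold || url" line per item tile. */
    public static List<String> extract(String html, String baseUri) {
        // Passing a base URI lets absUrl() resolve relative links.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> lines = new ArrayList<>();
        for (Element tile : doc.select("a.b-block-tile")) {
            // Each select() call searches only inside the current tile.
            String name = tile.select(".b-block-info-container__title").text();
            String price = tile.select(".b-block-info-container__price").text();
            String sold = tile.select(".item_quantity__hotness").text();
            lines.add(name + " || " + price + " || " + sold + " || " + tile.absUrl("href"));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : extract(SAMPLE, "http://www.example.com/")) {
            System.out.println(line);
        }
    }
}
```

Keeping all four fields tied to one tile also avoids pairing a name from one item with a price from another when a field is missing.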

This post on "java - jsoup: getting specific tags and the values associated with them" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/40853610/
