java - Aggregating data with JSoup

Tags: java css-selectors web-crawler jsoup

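I'm trying to use JSoup to scrape some content from http://dictionary.reference.com/browse/quick. If you go to that page, you'll see that each "word type" (adjective, verb, noun) of the word "quick" is presented as its own section, and each section contains a list of 1+ definitions.

To complicate things further, every word in every definition is a link to another Dictionary.com page: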

quick
    adjective
        1. done, proceeding, or occurring with promptness or rapidity...
        2. that is over or completed within a short interval of time
        ...
        14. Archaic.
            a. endowed with life
            b. having a high degree of vigor, energy, ...
    noun
        1. living persons; the quick and the dead
        2. the tender, sensitive flesh of the living body...
        ...
    adverb
        ...

What I want to do is use JSoup to get the word types and their respective definitions (as lists of strings), like this:

public class Metadata {
    // Ex: "adjective", "noun", etc.
    private String wordType;

    // Ex: String #1: "1. done, proceeding, or occurring with promptness or rapidity..."
    //     String #2: "that is over or completed within a short interval of time..."
    private List<String> definitions;
}

So the page effectively consists of a List<Metadata>, where each Metadata element is a word type paired with 1+ definitions.

I was able to find the list of word types with a very simple API call:

// Contains 1 Element for each word type, like "adjective", "noun", etc.
Document doc = Jsoup.connect("http://dictionary.reference.com/browse/quick").get();
Elements wordTypes = doc.select("div.body div.pbk span.pg");

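But I'm struggling to figure out what other doc.select(...) calls I need in order to obtain each Metadata instance.

Best answer

If you look at the HTML that Jsoup gets from that page, you will see something like this: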

  <div class="body"> 
     <div class="pbk"> 
      <span class="pg">adjective </span> 
      <div class="luna-Ent">
       <span class="dnindex">1.</span>
       <div class="dndata">
        done, proceeding, or occurring with promptness or rapidity, as an action, process, etc.; prompt; immediate: 
        <span class="ital-inline">a quick response.</span> 
       </div>
      </div>
      <div class="luna-Ent">
       <span class="dnindex">2.</span>
       <div class="dndata">
        that is over or completed within a short interval of time: 
        <span class="ital-inline">a quick shower.</span> 
       </div>
      </div>
...
     <div class="pbk"> 
      <span class="pg">adverb </span> 
      <div class="luna-Ent">
       <span class="dnindex">19.</span>
       <div class="dndata">
        <a style="font-style:normal; font-weight:normal;" href="/browse/quickly">quickly</a>.
       </div>
      </div> 
     </div> 

So each section

adjective
    1. done, proceeding, or occurring with promptness or rapidity...
    2. that is over or completed within a short interval of time
    ...
    14. Archaic.
        a. endowed with life
        b. having a high degree of vigor, energy, ...
noun
    1. living persons; the quick and the dead
    2. the tender, sensitive flesh of the living body...
    ...
adverb
    ...

is inside a <div class="pbk">, which contains a <span class="pg">adjective </span> holding the section name, and the definitions in <div class="luna-Ent"> divs. So you can try something like this:

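import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("http://dictionary.reference.com/browse/quick").get();

// One div.pbk per word type (adjective, noun, adverb, ...)
Elements sections = doc.select("div.body div.pbk");
for (Element element : sections) {
    // span.pg holds the section name
    String elementType = element.getElementsByClass("pg").text();
    System.out.println("--------------------");
    System.out.println(elementType);

    // Each div.luna-Ent holds one numbered definition
    for (Element definition : element.getElementsByClass("luna-Ent")) {
        System.out.println(definition.text());
    }
}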

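This code selects all sections, finds each section name with element.getElementsByClass("pg"), and finds the definitions with element.getElementsByClass("luna-Ent"), relying on the fact that they are placed in divs with the class luna-Ent. (If you want to skip the numbering such as 1. and 2., you can select the dndata class instead of luna-Ent.)

Output: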

--------------------
adjective
1. done, proceeding, or occurring with promptness or rapidity, as an action, process, etc.; prompt; immediate: a quick response.
2. that is over or completed within a short interval of time: a quick shower.
3. moving, or able to move, with speed: a quick fox; a quick train.
4. swift or rapid, as motion: a quick flick of the wrist.
5. easily provoked or excited; hasty: a quick temper.
6. keenly responsive; lively; acute: a quick wit.
7. acting with swiftness or rapidity: a quick worker.
8. prompt or swift to do something: quick to respond.
9. prompt to perceive; sensitive: a quick eye.
10. prompt to understand, learn, etc.; of ready intelligence: a quick student.
11. (of a bend or curve) sharp: a quick bend in the road.
12. consisting of living plants: a quick pot of flowers.
13. brisk, as fire, flames, heat, etc.
14. Archaic. a. endowed with life. b. having a high degree of vigor, energy, or activity.
--------------------
noun
15. living persons: the quick and the dead.
16. the tender, sensitive flesh of the living body, especially that under the nails: nails bitten down to the quick.
17. the vital or most important part.
18. Chiefly British. a. a line of shrubs or plants, especially of hawthorn, forming a hedge. b. a single shrub or plant in such a hedge.
--------------------
adverb
19. quickly.
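
To go from printed output to the List<Metadata> the question asks for, the same two lookups can fill a list instead of writing to the console. The following is only a sketch under stated assumptions: the constructor and getters on Metadata, the MetadataExtractor class, and its extract method are hypothetical names introduced here, and the inline HTML is a trimmed copy of the markup shown above so the example runs without a network call. The live page's markup may differ.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetadataExtractor {

    // Hypothetical fleshed-out version of the Metadata class from the question.
    public static class Metadata {
        private final String wordType;
        private final List<String> definitions;

        public Metadata(String wordType, List<String> definitions) {
            this.wordType = wordType;
            this.definitions = definitions;
        }

        public String getWordType() { return wordType; }
        public List<String> getDefinitions() { return definitions; }
    }

    // Builds one Metadata per div.pbk section of an already-parsed document.
    public static List<Metadata> extract(Document doc) {
        List<Metadata> result = new ArrayList<>();
        for (Element section : doc.select("div.body div.pbk")) {
            // span.pg holds the word type, e.g. "adjective"
            String wordType = section.getElementsByClass("pg").text();
            List<String> definitions = new ArrayList<>();
            // Use "dndata" instead of "luna-Ent" to leave out the numbering
            for (Element entry : section.getElementsByClass("luna-Ent")) {
                // text() drops the markup, so linked words come back as plain text
                definitions.add(entry.text());
            }
            result.add(new Metadata(wordType, definitions));
        }
        return result;
    }

    public static void main(String[] args) {
        // Trimmed sample of the page markup (assumption: the structure is unchanged)
        String html =
              "<div class=\"body\">"
            + " <div class=\"pbk\">"
            + "  <span class=\"pg\">adjective </span>"
            + "  <div class=\"luna-Ent\"> <span class=\"dnindex\">1.</span>"
            + "   <div class=\"dndata\"> done, proceeding, or occurring with promptness </div>"
            + "  </div>"
            + " </div>"
            + " <div class=\"pbk\">"
            + "  <span class=\"pg\">adverb </span>"
            + "  <div class=\"luna-Ent\"> <span class=\"dnindex\">19.</span>"
            + "   <div class=\"dndata\"> <a href=\"/browse/quickly\">quickly</a>. </div>"
            + "  </div>"
            + " </div>"
            + "</div>";

        Document doc = Jsoup.parse(html);
        for (Metadata m : extract(doc)) {
            System.out.println(m.getWordType() + " -> " + m.getDefinitions());
        }
    }
}
```

Against the real page, the Document returned by Jsoup.connect(...).get() could be passed to extract in place of the parsed sample string.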

Regarding "java - Aggregating data with JSoup", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/19966209/
