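java - Jsoup: scraping text from the children of a div

Tags: java html css jsoup

I am trying to extract the product reviews from the linked page (Moto X) using JSoup, but it throws a NullPointerException. I also want to extract the text that appears after clicking a review's "Read more" link.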

import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class JSoupEx
{
    public static void main(String[] args) throws IOException
    {
      Document doc = Jsoup.connect("https://www.flipkart.com/moto-x-play-with-turbo-charger-white-16-gb/product-reviews/itmefzwvdejejvth?pid=MOBEFM5HAFRNSJJA").get();
      Element ele = doc.select("div[class=qwjRop] > div").first();
      System.out.println(ele.text());
    }
}

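Is there a way to solve this?

Best answer

As Gherkin suggested, the Network tab of the browser's developer tools shows a request whose response contains the reviews in JSON format: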

https://www.flipkart.com/api/3/product/reviews?productId=MOBEFM5HAFRNSJJA&count=15&ratings=ALL&reviewerType=ALL&sortOrder=MOST_HELPFUL&start=0
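
The `start` parameter in this URL looks like a zero-based item offset rather than a page number, which would explain why the sample code multiplies the page index by 15. A minimal sketch of building the URL for a given page, assuming `count` is the page size and the remaining query parameters stay fixed. The class name `ReviewUrls` and the method `forPage` are hypothetical, not part of any library:

```java
public class ReviewUrls {
    // Endpoint copied from the request seen in the Network tab.
    private static final String BASE = "https://www.flipkart.com/api/3/product/reviews";

    // Builds the URL for a zero-based page index.
    // Assumption: `start` is an item offset, so page N starts at N * count.
    // The product ID is inserted as-is, without URL encoding.
    public static String forPage(String productId, int count, int page) {
        int start = page * count;
        return BASE + "?productId=" + productId
                + "&count=" + count
                + "&ratings=ALL&reviewerType=ALL&sortOrder=MOST_HELPFUL"
                + "&start=" + start;
    }

    public static void main(String[] args) {
        // Print the URLs for the first two pages of reviews.
        for (int page = 0; page < 2; page++) {
            System.out.println(forPage("MOBEFM5HAFRNSJJA", 15, page));
        }
    }
}
```

Keeping the offset arithmetic in one place makes it harder for the page size in the URL and the multiplier in the loop to drift apart.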

With a JSON parser such as JSON.simple, we can extract information such as each review's author, helpfulness count, and text.

Sample code

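// imports needed by this snippet (jsoup and JSON.simple)
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;

String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36";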
String reviewApiCall = "https://www.flipkart.com/api/3/product/reviews?productId=MOBEFM5HAFRNSJJA&count=15&ratings=ALL&reviewerType=ALL&sortOrder=MOST_HELPFUL&start=";
String xUserAgent = userAgent + " FKUA/website/41/website/Desktop";
String referer = "https://www.flipkart.com/moto-x-play-with-turbo-charger-white-16-gb/product-reviews/itmefzwvdejejvth?pid=MOBEFM5HAFRNSJJA";
String host = "www.flipkart.com";
int numberOfPages = 2; // first two pages of results will be fetched

try {
    // loop for multiple review pages
    for (int i = 0; i < numberOfPages; i++) {
        // query reviews
        Response response = Jsoup.connect(reviewApiCall+(i*15)).userAgent(userAgent).referrer(referer).timeout(5000)
                .header("x-user-agent", xUserAgent).header("host", host).ignoreContentType(true).execute();

        System.out.println("Response in JSON format:\n\t" + response.body() + "\n");

        // parse json response
        JSONObject jsonObject = (JSONObject) new JSONParser().parse(response.body()); // body() already returns a String
        jsonObject = (JSONObject) jsonObject.get("RESPONSE");
        JSONArray jsonArray = (JSONArray) jsonObject.get("data");

        for (Object object : jsonArray) {
            jsonObject = (JSONObject) object;
            jsonObject = (JSONObject) jsonObject.get("value");
            System.out.println("Author: " + jsonObject.get("author") + "\thelpful: "
                    + jsonObject.get("helpfulCount") + "\n\t"
                    + jsonObject.get("text").toString().replace("\n", "\n\t") + "\n");
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
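
The loop above assumes that `RESPONSE`, `data`, and `value` are always present. If one of them is missing, it fails with a NullPointerException. A more defensive traversal could look like the sketch below. It uses plain `Map` and `List` as stand-ins for the parsed JSON so that it runs without external libraries. JSON.simple's `JSONObject` and `JSONArray` extend `HashMap` and `ArrayList`, so the same checks should carry over. The class `ReviewFormatter` and its methods are hypothetical names chosen for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReviewFormatter {
    // Returns the nested map stored under `key`, or an empty map if it is absent or not a map.
    private static Map<?, ?> child(Map<?, ?> parent, String key) {
        Object value = parent.get(key);
        if (value instanceof Map) {
            return (Map<?, ?>) value;
        }
        return Collections.emptyMap();
    }

    // Formats one review in the same layout as the println call in the sample code.
    public static String format(Map<?, ?> review) {
        String text = String.valueOf(review.get("text")).replace("\n", "\n\t");
        return "Author: " + review.get("author") + "\thelpful: "
                + review.get("helpfulCount") + "\n\t" + text + "\n";
    }

    // Walks RESPONSE -> data -> [value] and skips entries with an unexpected shape.
    public static List<String> formatAll(Map<?, ?> root) {
        List<String> out = new ArrayList<>();
        Object data = child(root, "RESPONSE").get("data");
        if (!(data instanceof List)) {
            return out;
        }
        for (Object item : (List<?>) data) {
            if (item instanceof Map) {
                Map<?, ?> value = child((Map<?, ?>) item, "value");
                if (!value.isEmpty()) {
                    out.add(format(value));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-built sample data mimicking the response structure (not real API output).
        Map<String, Object> value = new HashMap<>();
        value.put("author", "Flipkart Customer");
        value.put("helpfulCount", 140);
        value.put("text", "A great phone\n-great battery life");
        Map<String, Object> item = new HashMap<>();
        item.put("value", value);
        Map<String, Object> response = new HashMap<>();
        response.put("data", Arrays.asList(item, "unexpected entry"));
        Map<String, Object> root = new HashMap<>();
        root.put("RESPONSE", response);

        for (String line : formatAll(root)) {
            System.out.print(line);
        }
    }
}
```

Returning an empty list for a malformed response lets the paging loop continue or stop cleanly instead of aborting in the middle of a run.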

Output

Response in JSON format:
    {"CACHE_INVALIDATION_TTL":"132568825671","REQUEST":null,"REQUEST-ID": [...] }

Author: Flipkart Customer   helpful: 140
    A great phone at an affordable price with
    -an outstanding camera
    -great battery life
    -an excellent display
    -premium looks
     the flipkart delivery was also fast and perfect.

Author: Vaibhav Yadav   helpful: 518
    I m writing this review after using 2 months..
    First of all ..I must say this is one of the best product ..camera quality is best in natural lights or daytime..but in low light and in the night..camera quality is not so good but it's ok..
    It has good battery backup ..last one day on 3g usage ..while using 4g ..it lasts for about 10-12 hour..
    Turbo charges is good..although ..my charger is not working..
    Only problem in this phone is ..while charging..this phone heats a lot..this may b becoz of turbo charger..if u r using other charger than it does not heat..

Author: KAPIL CHOPRA    helpful: 9
[...]

Note: the output is truncated ([...])

For java - Jsoup: scraping text from the children of a div, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39850547/
