I need to pull down all 9 tables from this page:
https://www.basketball-reference.com/players/c/collijo01.html
My current code only handles one table. I switched .first() to .last(), but that did not work. I also tried selecting the table by name with ("table.totals"), but that failed too.
public static void getData(String url) throws IOException
{
String fileName = "table.csv";
FileWriter writer = new FileWriter(fileName);
Document doc = Jsoup.connect(url).get();
Element tableElement = doc.select("table").first();
System.out.println(doc);
Elements tableHeaderEles = tableElement.select("thead tr th");
for (int i = 0; i < tableHeaderEles.size(); i++) {
writer.append(tableHeaderEles.get(i).text());
if(i != tableHeaderEles.size() -1){
writer.append(',');
}
}
writer.append('\n');
System.out.println();
Elements tableRowElements = tableElement.select(":not(thead) tr");
for (int i = 0; i < tableRowElements.size(); i++) {
Element row = tableRowElements.get(i);
Elements rowItems = row.select("td");
for (int j = 0; j < rowItems.size(); j++) {
writer.append(rowItems.get(j).text());
if(j != rowItems.size() -1){
writer.append(',');
}
}
writer.append('\n');
}
writer.close();
}
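I get the first table from the site perfectly, but I cannot get any further. Does anyone know how to get all the tables, or how to grab a table by its ID?
Edit: in case anyone wants to test the output of this code themselves: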
public static void read(String file) throws IOException
{
Scanner scanner = new Scanner(new File(file));
scanner.useDelimiter(",");
while(scanner.hasNext()){
System.out.print(scanner.next()+"|");
}
scanner.close();
}
Best answer
You have already selected all the tables, but you explicitly take only the first one:
Element tableElement = doc.select("table").first();
Instead, you can simply iterate over all of them:
Elements tableElements = doc.select("table");
for (Element tableElement : tableElements) {
// for each of selected tables
}
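On this particular page the loop alone is not enough: as the comments in the full listing further down explain, the tables are shipped as commented-out HTML and revealed by JavaScript, so they have to be uncommented before parsing. That listing strips every comment delimiter in the document. As a narrower alternative, here is a minimal sketch that unwraps only the comments containing a table and leaves all other comments alone. It uses plain string scanning rather than any Jsoup API, and the names CommentUnwrap and unwrapTableComments are hypothetical, introduced only for this illustration.

```java
final class CommentUnwrap {
    /**
     * Returns html with the "<!--" and "-->" delimiters removed from every
     * comment whose body contains "<table". Other comments are kept as-is.
     */
    static String unwrapTableComments(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0;
        while (true) {
            int open = html.indexOf("<!--", pos);
            if (open < 0) break;
            int close = html.indexOf("-->", open + 4);
            if (close < 0) break; // unterminated comment: copy the rest unchanged
            String body = html.substring(open + 4, close);
            out.append(html, pos, open); // text before the comment
            if (body.contains("<table")) {
                out.append(body); // drop the delimiters, keep the markup
            } else {
                out.append(html, open, close + 3); // keep the comment intact
            }
            pos = close + 3;
        }
        out.append(html, pos, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<!-- note --><table id=\"a\"></table>"
                + "<!--\n<table id=\"b\"></table>\n-->";
        System.out.println(unwrapTableComments(page));
    }
}
```

The result can be passed to Jsoup.parse(...) in the same way as the string produced by the blunt replace.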
Putting it together, with a few changes to give each file a unique name, the code looks like this:
public static void getData(String url) throws IOException {
String html = Jsoup.connect(url).execute().body();
// This page is tricky: it ships its tables as commented-out HTML and reveals them with JavaScript.
// As a quick and dirty fix, strip the comment delimiters before parsing so the tables become visible to Jsoup.
html = html.replace("<!--", "");
html = html.replace("-->", "");
Document doc = Jsoup.parse(html);
Elements tableElements = doc.select("table");
int number = 1;
for (Element tableElement : tableElements) {
String tableId = tableElement.id();
if (tableId.isEmpty()) {
// skip table without id
continue;
}
tableId = " with id " + tableId;
String fileName = "table" + number++ + tableId + ".csv";
FileWriter writer = new FileWriter(fileName);
System.out.println("Writing " + fileName);
Elements tableHeaderEles = tableElement.select("thead tr th");
for (int i = 0; i < tableHeaderEles.size(); i++) {
writer.append(tableHeaderEles.get(i).text());
if (i != tableHeaderEles.size() - 1) {
writer.append(',');
}
}
writer.append('\n');
System.out.println();
Elements tableRowElements = tableElement.select(":not(thead) tr");
for (int i = 0; i < tableRowElements.size(); i++) {
Element row = tableRowElements.get(i);
Elements rowItems = row.select("td");
for (int j = 0; j < rowItems.size(); j++) {
writer.append(rowItems.get(j).text());
if (j != rowItems.size() - 1) {
writer.append(',');
}
}
writer.append('\n');
}
writer.close();
}
}
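One caveat about the output format: the code above joins cells with bare commas, so any cell whose text itself contains a comma, a double quote, or a line break will shift the columns of its row. Below is a minimal sketch of the usual CSV quoting rule (wrap such cells in double quotes and double any embedded quote, in the style of RFC 4180). The class CsvRows and its methods escape and joinRow are hypothetical helpers, not part of Jsoup or of the original answer.

```java
final class CsvRows {
    /** Quotes a cell only when it contains a comma, a double quote, or a line break. */
    static String escape(String cell) {
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        if (!needsQuotes) {
            return cell;
        }
        // Embedded quotes are doubled, then the whole cell is wrapped in quotes.
        return "\"" + cell.replace("\"", "\"\"") + "\"";
    }

    /** Joins already-extracted cell texts into one CSV line (without the line break). */
    static String joinRow(String... cells) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < cells.length; i++) {
            if (i > 0) {
                row.append(',');
            }
            row.append(escape(cells[i]));
        }
        return row.toString();
    }

    public static void main(String[] args) {
        // Made-up cell values, chosen only to exercise the quoting.
        System.out.println(joinRow("x", "y,z", "say \"hi\""));
    }
}
```

In the loop above this would mean writing escape(rowItems.get(j).text()) instead of the raw text. Note that a reader which simply splits on commas, such as a Scanner with "," as its delimiter, does not understand this quoting.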
To answer your second question:
grab tables based on ID
Instead of selecting the first of all the tables:
Element tableElement = doc.select("table").first();
select the first table whose ID is advanced:
Element tableElement = doc.select("table#advanced").first();
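One more tip:
What you pass as the argument to select(...) is a CSS selector. In CSS selector syntax, table.totals matches a table with the class totals, while table#totals matches a table with the ID totals.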
Source: the original Stack Overflow question on using JSoup in Java to get multiple tables from a website: https://stackoverflow.com/questions/54753426/