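I have a program that extracts certain elements (the author names) from many articles on the PubMed website. The program works correctly on my PC (Windows), but when I try to run it on Unix it returns an empty list. I suspect the syntax needs to be somewhat different on a Unix system, but the JSoup documentation says nothing about this. Does anyone know anything about this? My code looks like this: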
Document doc = Jsoup.connect("http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString).timeout(60000).userAgent("Mozilla/25.0").get();
System.out.println("connected");
Elements authors = doc.select("div.auths >*");
System.out.println("number of elements is " + authors.size());
The final System.out.println always reports a size of 0, so the program cannot do anything further.
Thanks in advance.
Full example:
protected static void searchLink(HashMap<String, HashSet<String>> authorsMap,
                                 HashMap<String, HashSet<String>> reverseAuthorsMap,
                                 String fileLine)
        throws IOException, ParseException, InterruptedException
{
    JSONParser parser = new JSONParser();
    JSONObject jsonObj = (JSONObject) parser.parse(fileLine.substring(0, fileLine.length() - 1));
    String pmidString = (String) jsonObj.get("pmid");
    System.out.println(pmidString);
    Document doc = Jsoup.connect("http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString).timeout(60000).userAgent("Mozilla/25.0").get();
    System.out.println("connected");
    Elements authors = doc.select("div.auths >*");
    System.out.println("found the element");
    HashSet<String> authorsList = new HashSet<>();
    System.out.println("authors list hashSet created");
    System.out.println("number of elements is " + authors.size());
    for (int i = 0; i < authors.size(); i++)
    {
        // add the current name to the names list
        authorsList.add(authors.get(i).text());
        // pmidList variable
        HashSet<String> pmidList;
        System.out.println("stage 1");
        // if the author name is new, then create the list, add the current pmid and put it in the map
        if (!authorsMap.containsKey(authors.get(i).text()))
        {
            pmidList = new HashSet<>();
            pmidList.add(pmidString);
            System.out.println("made it to searchLink");
            authorsMap.put(authors.get(i).text(), pmidList);
        }
        // if the author name has been found before, get the list of articles and add the current
        else
        {
            System.out.println("Author exists in map");
            pmidList = authorsMap.get(authors.get(i).text());
            pmidList.add(pmidString);
            authorsMap.put(authors.get(i).text(), pmidList);
            //authorsMap.put((String) authorName, null);
        }
        // finally, add the pmid-authorsList to the map
        reverseAuthorsMap.put(pmidString, authorsList);
        System.out.println("reverseauthors populated");
    }
}
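I have a thread pool, and each thread uses this method to populate the two maps. The fileLine argument is a single line that I parse as JSON, keeping the "pmid" field. I use that string to build the article's URL and parse the HTML to get the author names. The rest should work (and it does on my PC), but because authors.size() is always 0, the for loop directly below the number-of-elements System.out.println is never executed.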
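As a side note on the thread pool described above: if the two maps are shared between pool threads without external synchronization, plain HashMap and HashSet are not safe for concurrent writes. The following is only a minimal sketch of the same bookkeeping using ConcurrentHashMap and computeIfAbsent. The class name AuthorIndex, its method names, and the second PMID are hypothetical and not part of the original program, and this is not offered as an explanation for the empty result.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper class; the names are illustrative, not from the original code.
class AuthorIndex {
    // author name -> PMIDs of the articles listing that author
    private final Map<String, Set<String>> authorsMap = new ConcurrentHashMap<>();
    // PMID -> author names found for that article
    private final Map<String, Set<String>> reverseAuthorsMap = new ConcurrentHashMap<>();

    // Records one article. As in the original loop, nothing is stored
    // when the list of author names is empty.
    public void record(String pmid, List<String> authorNames) {
        for (String name : authorNames) {
            // computeIfAbsent creates the set atomically on first use
            authorsMap.computeIfAbsent(name, k -> ConcurrentHashMap.newKeySet()).add(pmid);
            reverseAuthorsMap.computeIfAbsent(pmid, k -> ConcurrentHashMap.newKeySet()).add(name);
        }
    }

    public Set<String> pmidsOf(String author) {
        return authorsMap.getOrDefault(author, Collections.<String>emptySet());
    }

    public Set<String> authorsOf(String pmid) {
        return reverseAuthorsMap.getOrDefault(pmid, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        AuthorIndex index = new AuthorIndex();
        index.record("24312906", Arrays.asList("Liu W", "Chen D"));
        index.record("11111111", Arrays.asList("Liu W")); // made-up PMID for illustration
        System.out.println("PMIDs for Liu W: " + index.pmidsOf("Liu W").size());
        System.out.println("authors of 24312906: " + index.authorsOf("24312906").size());
    }
}
```

Each pool thread would call record(pmidString, names) once it has extracted the author names, instead of doing the containsKey/put sequence by hand.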
Best answer
I tried an example that does what you are attempting:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class Test {
    public static void main(String[] args) throws IOException {
        String docId = "24312906";
        if (args.length > 0) {
            docId = args[0];
        }
        String url = "http://www.ncbi.nlm.nih.gov/pubmed/" + docId;
        Document doc = Jsoup.connect(url).timeout(60000).userAgent("Mozilla/25.0").get();
        Elements authors = doc.select("div.auths >*");
        System.out.println("os.name=" + System.getProperty("os.name"));
        System.out.println("os.arch=" + System.getProperty("os.arch"));
        // System.out.println("doc=" + doc);
        System.out.println("authors=" + authors);
        System.out.println("authors.length=" + authors.size());
        for (Element a : authors) {
            System.out.println(" author: " + a);
        }
    }
}
My operating system is Linux:
# uname -a
Linux graphene 3.11.0-13-generic #20-Ubuntu SMP Wed Oct 23 07:38:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 13.10
Release: 13.10
Codename: saucy
Running the program produces:
os.name=Linux
os.arch=amd64
authors=<a href="/pubmed?term=Liu%20W%5BAuthor%5D&cauthor=true&cauthor_uid=24312906">Liu W</a>
<a href="/pubmed?term=Chen%20D%5BAuthor%5D&cauthor=true&cauthor_uid=24312906">Chen D</a>
authors.length=2
author: <a href="/pubmed?term=Liu%20W%5BAuthor%5D&cauthor=true&cauthor_uid=24312906">Liu W</a>
author: <a href="/pubmed?term=Chen%20D%5BAuthor%5D&cauthor=true&cauthor_uid=24312906">Chen D</a>
This seems to work. Perhaps the problem lies in fileLine? Can you print out the value of the URL:
System.out.println("url='" + "http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString+ "'");
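Since your code does not throw an exception, I suspect you are getting a document, just not the one your code expects. Printing out the document so you can see what you received might also help.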
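The quoted URL print suggested above can be taken one step further by making invisible characters in the PMID visible. This is only a sketch using the JDK alone. The class PmidCheck, its methods, and the `<0x..>` notation are hypothetical names chosen for illustration, and the trailing carriage return in the sample input is an assumed example of bad input, not a diagnosis of the actual problem.

```java
// Hypothetical diagnostic helper; names and output format are illustrative only.
class PmidCheck {
    static final String BASE_URL = "http://www.ncbi.nlm.nih.gov/pubmed/";

    // Returns s with every non-printable or non-ASCII character (and spaces)
    // replaced by a visible marker such as <0x0D>.
    public static String visible(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c > 0x20 && c < 0x7f) {
                sb.append(c);
            } else {
                sb.append(String.format("<0x%02X>", (int) c));
            }
        }
        return sb.toString();
    }

    // True only when the whole string consists of ASCII digits.
    public static boolean looksLikePmid(String s) {
        return s != null && s.matches("[0-9]+");
    }

    public static String buildUrl(String pmid) {
        return BASE_URL + pmid;
    }

    public static void main(String[] args) {
        // Assumed sample input: a PMID followed by a stray carriage return.
        String pmid = args.length > 0 ? args[0] : "24312906\r";
        System.out.println("url='" + visible(buildUrl(pmid)) + "'");
        System.out.println("looksLikePmid=" + looksLikePmid(pmid));
    }
}
```

Calling visible(...) on pmidString before connecting shows whether the string parsed from fileLine contains anything besides digits, which is one way a request can succeed but return a different page than expected.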
Regarding "java - JSoup select from HTML in unix", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/20476188/