我想使用 JSoup (java) 解析此 HTML 中的数据(公司名称、位置、工作描述...)。尝试迭代职位列表时我陷入困境
从 HTML 中提取的内容是许多“JOBLISTING”div 之一,我想对其进行迭代并从中提取数据。我只是无法处理如何迭代特定的 div 对象。抱歉这个菜鸟问题,但也许有人可以帮助我已经知道要使用哪个函数。选择?
<div class="between_listings"><!-- local.spacer --></div>
<div id="joblisting-2944914" class="joblisting listing-even listing-even company-98028 " itemscope itemtype="http://schema.org/JobPosting">
<div class="company_logo" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
<a href="/stellenangebote-des-unternehmens--Delivery-Hero-Holding-GmbH--98028.html" title="Jobs Delivery Hero Holding GmbH" itemprop="url">
<img src="/upload_de/logo/D/logoDelivery-Hero-Holding-GmbH-98028DE.gif" alt="Logo Delivery Hero Holding GmbH" itemprop="image" width="160" height="80" />
</a>
</div>
<div class="job_info">
<div class="h3 job_title">
<a id="jobtitle-2944914" href="/stellenangebote--Junior-Business-Intelligence-Analyst-CRM-m-f-Berlin-Delivery-Hero-Holding-GmbH--2944914-inline.html?ssaPOP=204&ssaPOR=203" title="Arbeiten bei Delivery Hero Holding GmbH" itemprop="url">
<span itemprop="title">Junior Business Intelligence Analyst / CRM (m/f)</span>
</a>
</div>
<div class="h3 company_name" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
<span itemprop="name">Delivery Hero Holding GmbH</span>
</div>
</div>
<div class="job_location_date">
<div class="job_location target-location">
<div class="job_location_info" itemprop="jobLocation" itemscope itemtype="http://schema.org/Place">
<div class="h3 locality" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="addressLocality"> Berlin</span>
</div>
<span class="location_actions">
<a href="javaScript:PopUp('http://www.stepstone.de/5/standort.html?OfferId=2944914&ssaPOP=203&ssaPOR=203','resultList',800,520,1)" class="action_showlistingonmap showlabel" title="Google Maps" itemprop="maps">
<span class="location-icon"><!-- --></span>
<span class="location-label">Google Maps</span>
</a>
</span>
</div>
</div>
<div class="job_date_added" itemprop="datePosted"><time datetime="2014-07-04">04.07.14</time></div>
</div>
<div class="job_actions">
</div>
</div>
<div class="between_listings"><!-- local.spacer --></div>
文件输入 = new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt");//将文件加载到extraction1 Document ParseResult = Jsoup.parse(input, "UTF-8", "http://example.com/ ");元素 jobListingElements = ParseResult.select(".joblisting"); for (Element jobListingElement: jobListingElements) { jobListingElement.select(".companyName span[itemprop=\"name\"]");//其他元素属性 System.out.println(jobListingElements);
Java代码:
File input = new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt");
// Load file into extraction1
Document ParseResult = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements jobListingElements = ParseResult.select(".joblisting");
for (Element jobListingElement: jobListingElements) {
jobListingElement.select(".companyName span[itemprop=\"name\"]");
// other element properties
System.out.println(jobListingElements);
}
谢谢!
最佳答案
那么你的 Jsoup 文档正确吗?如果 css 类 joblisting
没有出现在其他地方,这似乎很容易。
Document document = Jsoup.parse(new File("d:/bla.html"), "utf-8");
Elements elements = document.select(".joblisting");
for (Element element : elements) {
Elements jobTitleElement = element.select(".job_title span");
Elements companyNameElement = element.select(".company_name spanspan[itemprop=name]");
String companyName = companyNameElement.text();
String jobTitle = jobTitleElement.text();
System.out.println(companyName);
System.out.println(jobTitle);
}
我不知道为什么属性[itemprop*=\"name\"]
选择器找不到跨度(进一步阅读: http://jsoup.org/cookbook/extracting-data/selector-syntax )
明白了:span[itemprop=name] 没有任何引号或转义符。其他属性或值也应该有助于获得更具体的选择。
关于java - 如何使用 JSoup (java) 正确解析数据,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/24820430/