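I'm trying to use jsoup to parse a form whose fields sit inside a table, so that I can get each field together with its label. The problem is that I can't find any pattern or common attribute for the labels. Below is a sample of the HTML page, where the labels are marked Label 1, Label 2, etc. and the fields are marked field_1, field_2, etc.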
<form id="some_form" method="post" action="some_page.do">
<div class="main_div">
<table id="main_table" class="table_class">
<tr>
<td colspan="10" align="center" class="pad_bottom pad_top">
Label 1:
<input type="text" name="field_1" value="Field 1 value" id="field_1"/>
</td>
</tr>
<tr>
<td colspan="10" align="center">
Label 2:
<span class="radio_class"><input type="radio" name="field_2" value="No" checked="checked" class="radio_field" id="field_2"/> No</span>
<span class="radio_class"><input type="radio" name="field_2" value="Yes" class="radio_field" id="field_2"/> Yes</span><br/>
<span class="extra">Some text to ignore</span>
More text to ignore
</td>
</tr>
<tr>
<td colspan="10" align="center">
<table width="90%">
<tr>
<td class="td_class">
Some text to ignore
</td>
<td class="td_class">
Some text to ignore
</td>
</tr>
<tr>
<td align="left" class="another_td_class">
Label 3<br/>
More text for label 3
</td>
<td align="left" class="another_td_class">
<input type="hidden" name="field_3_hidden" value="1" id="field_3"/>
<span class="radio_class"><input type="radio" name="field_3" value="1" id="field_3"/>1</span> 1<br/>
<span class="radio_class"><input type="radio" name="field_3" value="2" checked="checked" onfocus="" id="field_3"/> 2</span> 2<br/>
<br/>
</td>
<tr>
<tr>
</table>
</td>
</tr>
<tr>
<td class="heading" colspan="2" width="50%">Label 4</td>
<td class="heading" width="50%">Label 5</td>
</tr>
<tr>
<td align="center" class="td_class nowrap">
<input type="integer" name="field_4a" maxlength="2" size="2" value="42" class="integer_class" id="field_4"/>
Additional text for label 4
<br/>
<span class="span_class">Text to ignore</span>
</td>
<td class="td_class nowrap">
<input type="radio" name="field_4b" value="A" class="radio_class" id="field_4b"/>A<br/>
<input type="radio" name="field_4b" value="B" checked="checked" class="radio_class" id="field_4b"/>B
<br/>
</td>
<td align="center" class="td_class nowrap">
<input type="radio" name="field_5" value="C" checked="checked" class="radio_class" id="field_5"/>C
<input type="radio" name="field_5" value="D" class="radio_class" id="field_5"/>D
<br/>
</td>
</tr>
</table>
</div>
</form>
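The closest I've come is the code below, but the labels are often in different positions and sometimes come with extra text, so I still run into problems.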
Set<MyElement> myElements = new HashSet<MyElement>();
Element mainDiv = page.select("div.main_div").first();
if (mainDiv != null) {
    Elements children = mainDiv.children();
    Elements tds = children.select("td");
    for (Element td : tds) {
        Elements inputs = td.select("input");
        for (Element input : inputs) {
            String field = input.id();
            if (field != null && !field.isEmpty()) {
                String label = td.text();
                MyElement myElement = new MyElement(field, label);
                myElements.add(myElement);
            }
        }
    }
}
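If the labels have no pattern or common attribute, I suspect what I want to do isn't possible, but this is my first time using jsoup, so I'm hoping there's something I don't know about that would let me do it.
Best answer
There may be a way... this looks like a good start on working out the labels.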
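A rough system so far for matching a label to a field is:
- the text preceding the field in its own <td> cell
- the text in the preceding <td> cell of the same row
- the text in the <td> cell directly above, in the same column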
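Doing this means walking the <table>, working out the text in each cell and storing it by (row, column) so that it can be looked up again from a field. Where a cell has a colspan value, it counts as that many cells.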
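The lookup order can be illustrated on its own, without any HTML parsing. The sketch below is only an illustration under assumptions: LabelGrid, put and labelFor are hypothetical names that do not come from jsoup or from the code in this answer, and the cell texts are typed in by hand to mimic the "Label 4" / "Label 5" heading row of the sample page.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: stores cell text by (row, column) and resolves a label for a field. */
final class LabelGrid {
    private final Map<Integer, Map<Integer, String>> cells = new HashMap<>();

    /** Records the text under every column the cell spans (a colspan below 1 is treated as 1). */
    void put(int row, int column, int colspan, String text) {
        Map<Integer, String> rowCells = cells.computeIfAbsent(row, r -> new HashMap<>());
        for (int i = 0; i < Math.max(1, colspan); i++) {
            rowCells.put(column + i, text == null ? "" : text.trim());
        }
    }

    private String at(int row, int column) {
        Map<Integer, String> rowCells = cells.get(row);
        return rowCells == null ? null : rowCells.get(column);
    }

    /** Tries the field's own cell, then the cell to its left, then the cell above. */
    String labelFor(int row, int column) {
        int[][] candidates = { {row, column}, {row, column - 1}, {row - 1, column} };
        for (int[] c : candidates) {
            String text = at(c[0], c[1]);
            if (text != null && !text.isEmpty()) {
                return text;
            }
        }
        return null; // nothing usable nearby
    }

    public static void main(String[] args) {
        LabelGrid grid = new LabelGrid();
        grid.put(0, 0, 2, "Label 4"); // heading cell spanning columns 0 and 1
        grid.put(0, 2, 1, "Label 5");
        grid.put(1, 0, 1, "");        // field cells with no text of their own
        grid.put(1, 1, 1, "");
        grid.put(1, 2, 1, "");
        System.out.println(grid.labelFor(1, 1));
        System.out.println(grid.labelFor(1, 2));
    }
}
```

Because the heading text is stored under both columns it spans, a field in column 1 of the next row can still find it by looking straight up.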
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableParser {

    // Cell text keyed by row number, then by column number
    private Map<Integer, Map<Integer, String>> cells = new LinkedHashMap<>();

    private void parseTable(Element table) {
        int rowNum = 0;
        for (Element row : table.select("> tbody > tr")) {
            parseRow(rowNum++, row);
        }
    }

    private void parseRow(int rowNum, Element row) {
        int columnNum = 0;
        for (Element cell : row.select("> td")) {
            String colspanText = cell.attr("colspan");
            int colspan = 1;
            if (StringUtils.isNotBlank(colspanText))
                colspan = Integer.parseInt(colspanText);
            addCell(rowNum, columnNum, colspan, cell);
            parseCell(rowNum, columnNum, cell);
            columnNum += colspan;
        }
    }

    // Store the cell's text under every column it spans
    private void addCell(int rowNum, int columnNum, int colspan, Element cell) {
        Map<Integer, String> rowCells = cells.computeIfAbsent(rowNum,
                r -> new LinkedHashMap<>());
        for (int i = 0; i < colspan; i++)
            rowCells.put(columnNum + i, labelIn(cell));
    }

    // The first direct text node of the cell, or "" if the cell has none
    private String labelIn(Element cell) {
        if (cell.textNodes().isEmpty())
            return "";
        return cell.textNodes().get(0).text().trim();
    }

    private String cellAt(int rowNum, int columnNum) {
        Map<Integer, String> rowCells = cells.get(rowNum);
        if (rowCells == null)
            return null;
        return rowCells.get(columnNum);
    }

    private void parseCell(int rowNum, int columnNum, Element cell) {
        // Don't drill down into the nested table yet
        if (!cell.select("table").isEmpty())
            return;
        for (Element input : cell.select("input")) {
            // Own cell first, then the cell to the left, then the cell above
            String label = labelIn(cell);
            if (StringUtils.isBlank(label))
                label = cellAt(rowNum, columnNum - 1);
            if (StringUtils.isBlank(label))
                label = cellAt(rowNum - 1, columnNum);
            System.out.println(String.format("%s->%s at (%d,%d)", label,
                    input.attr("name"), rowNum, columnNum));
        }
    }

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.parse(new java.io.File("/temp/labels.html"),
                java.nio.charset.StandardCharsets.UTF_8.name());
        // Each table, including the nested one, gets its own parser and grid
        for (Element table : doc.select("table")) {
            new TableParser().parseTable(table);
        }
    }
}
More work is needed to combine labels and so on, and probably to read more of the text in each cell, but this is the result so far:
Label 1:->field_1 at (0,0)
Label 2:->field_2 at (1,0)
Label 2:->field_2 at (1,0)
Label 4->field_4a at (4,0)
Label 4->field_4b at (4,1)
Label 4->field_4b at (4,1)
Label 5->field_5 at (4,2)
Label 5->field_5 at (4,2)
Label 3->field_3_hidden at (1,1)
Label 3->field_3 at (1,1)
Label 3->field_3 at (1,1)
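Radio buttons that share a name show up once per button in this output. One possible way to collapse them, sketched here as an assumption rather than as part of the answer's code, is to key the results by field name and keep the first label seen. FieldLabels, record and asMap are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical collector: one label per field name, in the order fields were first seen. */
final class FieldLabels {
    private final Map<String, String> labelsByName = new LinkedHashMap<>();

    /** Keeps the first non-null label recorded for a name; later repeats are ignored. */
    void record(String fieldName, String label) {
        labelsByName.putIfAbsent(fieldName, label);
    }

    Map<String, String> asMap() {
        return labelsByName;
    }

    public static void main(String[] args) {
        FieldLabels labels = new FieldLabels();
        labels.record("field_2", "Label 2:");  // first radio button in the group
        labels.record("field_2", "Label 2:");  // second button, same name
        labels.record("field_4a", "Label 4");
        labels.asMap().forEach((name, label) -> System.out.println(label + "->" + name));
    }
}
```

In TableParser, a call such as record(input.attr("name"), label) could stand in for the System.out.println line, assuming a single FieldLabels instance is shared across the tables being parsed.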
Source: Stack Overflow, "java - JSOUP parsing form fields and labels in a table": https://stackoverflow.com/questions/49096409/