java - Remove all HTML tags

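I have a string containing a complete XML GET request.

The request contains a lot of HTML and some custom commands that I want to remove.

The only way I know to do this is with jsoup.

However, because the site the request comes from also uses custom commands, I can't strip out all of the code that way.

For example, this is the string I want to "clean up":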

\u0027s normal text here\u003c/b\u003e http://a_random_link_here.com\r\n\r\nSome more text here

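As you can see, the custom commands are all preceded by a backslash.

How can I remove these commands with Java?

If I use a regex, how do I write it so that it removes only the command and nothing that follows it? (I don't know the length of the commands in advance, and I don't want to hard-code every command.)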

Best answer

See http://regex101.com/r/gJ2yN2

The regex (\\.\d{3,}.*?\s|(\\r|\\n)+) removes what you pointed out.

Result (replacing each match with a single space):

normal text here http://a_random_link_here.com Some more text here
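
In Java, the same pattern can be applied through `java.util.regex.Pattern`. The following is only a minimal sketch: the class name `EscapeCleaner` and the method `clean` are made up for illustration, and the final `trim()` is an added assumption that drops the space left where the first match was replaced. Note that every backslash in the regex has to be doubled once more inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class EscapeCleaner {

    // The regex from the answer, with each backslash doubled for the Java string literal.
    private static final Pattern COMMANDS =
            Pattern.compile("(\\\\.\\d{3,}.*?\\s|(\\\\r|\\\\n)+)");

    // Replaces each match with a single space, then trims the ends (the trim is an assumption).
    static String clean(String raw) {
        return COMMANDS.matcher(raw).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // The sample from the question; the backslashes are literal characters in the data.
        String raw = "\\u0027s normal text here\\u003c/b\\u003e "
                + "http://a_random_link_here.com\\r\\n\\r\\nSome more text here";
        System.out.println(clean(raw));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which `String.replaceAll` would otherwise do.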

If this is not the result you wanted, please edit your question with the expected result.

Edit: explanation of the regex:

()  - group everything inside the parentheses (later, each match gets replaced with a space)
\\  - an 'escaped' backslash (i.e. an actual backslash; the first one "protects" the second
      so it is not interpreted as a special character)
.   - any character (I saw 'u', but there might be others)
\d  - a digit
{3,} - "at least three"
.*? - any characters, "lazy" (stop as soon as possible)
\s  - until you hit a whitespace character
|   - or
()  - one of these things
\\r - backslash - r (again, with escaped '\')
\\n - backslash - n
+   - one or more times
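
The breakdown above can also be kept next to the pattern in code. As a sketch only, the following rebuilds the same regex with `Pattern.COMMENTS`, a `java.util.regex.Pattern` flag that ignores whitespace and `#` comments inside the pattern. The class name `AnnotatedPattern` and the sample input are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical class name, for illustration only.
class AnnotatedPattern {

    // Same regex as in the answer, one piece per line.
    // Pattern.COMMENTS makes the engine skip whitespace and '#' comments in the pattern.
    static final Pattern COMMANDS = Pattern.compile(
            "(              # whole match, later replaced by a space\n"
          + "  \\\\ .       # a literal backslash, then any one character\n"
          + "  \\d{3,}      # at least three digits\n"
          + "  .*? \\s      # as few characters as possible, up to a whitespace\n"
          + "|              # or\n"
          + "  (\\\\r|\\\\n)+  # one or more backslash-r or backslash-n pairs\n"
          + ")",
            Pattern.COMMENTS);

    // Replaces each match with a single space, without trimming.
    static String clean(String raw) {
        return COMMANDS.matcher(raw).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Literal backslash-r and backslash-n pairs, as in the question's data.
        System.out.println(clean("first line\\r\\n\\r\\nsecond line"));
    }
}
```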

Regarding "java - Remove all HTML tags", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20819988/
