java - Appending allowed and disallowed URL rules to a list in Java

Tags: java regex robots.txt

I am trying to capture the allow and disallow rules of a robots.txt file in Java using the following code:

package robotest;
public class RoboTest {
    public static void main(String[] args) {
    String robo="user-agent:hello user-agent:ppx user-agent:bot allow:/world disallow:/ajax disallow:/posts user-agent:abc allow:/myposts/like disallow:/none user-agent:* allow:/world";
    String[] strarr=robo.split(" ");
    String[] allowed={};
    String[] disallowed={};
    boolean new_block=false;
    boolean a_or_d=false;
    for (String line: strarr){
        if(line!=""){
        if(line.contains("user-agent:pp")==false && a_or_d){
            break;
        }
        if (line.contains("user-agent:ppx")||(new_block )){
            new_block=true;
            System.out.println(line);
            if(line.contains("allow") || line.contains("disallow")){
            a_or_d=true;
            }
            if(line.contains("allow:")){
            //append to allowed
            }
            if(line.contains("disallowed")) {
            //append to disallowed
            }
        }
        }
        System.out.println(allowed);;
    }
    }    
}

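The code does not work as I expected. The rules in the robots.txt string are separated by spaces. I want to capture the rules for the user agent ppx. After finding user-agent:ppx, the code should look for the allow or disallow block that follows and append those rules to a list. But it does not work, and it is confusing as well. I am also new to regular expressions in Java. Is there any way to solve this?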

Best answer

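With a few minimal modifications to your code (the snippet goes inside main and needs imports for java.util.HashSet, java.util.Set, java.util.regex.Matcher and java.util.regex.Pattern):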

String robo = "user-agent:hello user-agent:ppx user-agent:bot allow:/world disallow:/ajax disallow:/posts user-agent:abc allow:/myposts/like disallow:/none user-agent:* allow:/world";
String[] strarr = robo.split(" ");
Set<String> allowed = new HashSet<>();
Set<String> disallowed = new HashSet<>();
Pattern allowPattern = Pattern.compile("^allow:\\s*(.*)");
Pattern disallowPattern = Pattern.compile("^disallow:\\s*(.*)");
boolean isUserAgentPpx = false;
boolean a_or_d = false;
for (String line : strarr) {
  line = line.trim();

  // Skip empty lines
  if (line.isEmpty()) continue;

  if (line.startsWith("user-agent:")) {
    // If previous lines were allowed/disallowed rules, then start a new user-agent block
    if (a_or_d) {
      a_or_d = false;
      isUserAgentPpx = false;
    }
    // Skip block of user-agent if we already found 'user-agent: ppx' or 'user-agent: *'
    if (isUserAgentPpx) continue;
    if (line.matches("^user-agent:\\s*(ppx|\\*)$")) {
      isUserAgentPpx = true;
    }
    continue;
  }

  // Process block of allow/disallow
  a_or_d = true;
  if (isUserAgentPpx) {
    Matcher allowMatcher = allowPattern.matcher(line);
    if (allowMatcher.find()) {
      allowed.add(allowMatcher.group(1));
    }
    Matcher disallowMatcher = disallowPattern.matcher(line);
    if (disallowMatcher.find()) {
      disallowed.add(disallowMatcher.group(1));
    }
  }
}

System.out.println("Allowed rules for Ppx:");
for (String s : allowed) {
  System.out.println(s);
}
System.out.println("Disallowed rules for Ppx:");
for (String s : disallowed) {
  System.out.println(s);
}

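I am using a Set<String> to store the rules to avoid duplicates. Note that the loop also collects the rules listed under user-agent:* (see the ppx|\\* alternative in the user-agent pattern), not only those listed under user-agent:ppx.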
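
For readers who want something they can compile directly, here is a minimal self-contained sketch of the same idea. It is not the answer's exact code: the class name RobotsRules, the Rules holder and the rulesFor method are hypothetical names introduced for illustration, startsWith replaces the regular expressions, and LinkedHashSet replaces HashSet so the rules print in the order they were found. It keeps the answer's assumptions that tokens are separated by whitespace and that rules under user-agent:* are collected too.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class RobotsRules {
    // Hypothetical holder for the two rule sets (not part of the original answer).
    static final class Rules {
        final Set<String> allowed = new LinkedHashSet<>();
        final Set<String> disallowed = new LinkedHashSet<>();
    }

    // Collects the rules of every block that names `agent` or "*".
    static Rules rulesFor(String robo, String agent) {
        Rules rules = new Rules();
        boolean inMatchingBlock = false; // the current block applies to `agent`
        boolean seenRule = false;        // a rule was seen since the last user-agent run

        for (String token : robo.trim().split("\\s+")) {
            if (token.isEmpty()) continue; // isEmpty(), not != "", which compares references

            if (token.startsWith("user-agent:")) {
                // A user-agent token that follows rules starts a new block.
                if (seenRule) {
                    seenRule = false;
                    inMatchingBlock = false;
                }
                String name = token.substring("user-agent:".length());
                if (name.equals(agent) || name.equals("*")) {
                    inMatchingBlock = true;
                }
                continue;
            }

            seenRule = true;
            if (!inMatchingBlock) continue;

            // startsWith keeps "disallow:/x" out of the allow branch;
            // contains("allow") would match both kinds of rule.
            if (token.startsWith("allow:")) {
                rules.allowed.add(token.substring("allow:".length()));
            } else if (token.startsWith("disallow:")) {
                rules.disallowed.add(token.substring("disallow:".length()));
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        String robo = "user-agent:hello user-agent:ppx user-agent:bot allow:/world "
                + "disallow:/ajax disallow:/posts user-agent:abc allow:/myposts/like "
                + "disallow:/none user-agent:* allow:/world";
        Rules rules = rulesFor(robo, "ppx");
        System.out.println("Allowed: " + rules.allowed);       // Allowed: [/world]
        System.out.println("Disallowed: " + rules.disallowed); // Disallowed: [/ajax, /posts]
    }
}
```

The sketch also sidesteps three problems in the loop from the question: line != "" compares object references rather than contents, contains("allow") is also true for disallow: tokens, and contains("disallowed") never matches a disallow: token.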

Regarding java - Appending allowed and disallowed URL rules to a list in Java, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60815556/
