java - Grouping strings into an array

Tags: java regex string split

I have these strings:

wordsExpanded="test |  is |  [(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}] |  test |  [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}] |  [(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]"

interpretation="{<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}"

The output I need is a string like this:

finalOutput="test |  is | thirty four | test | 3 | 1 "

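Basically, the interpretation string carries the information needed to determine which group was used. For the first one, <number_type_0 words> was used, so the correct string is "(thirty four)" and not "( 3  4 )". The second one is "( 3 )", and then "( 1 )".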

Here is my code so far:

package com.test.prova;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Prova {

    public static void main(String[] args) {
        String nlInterpretation="{<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}";
        String inputText="this is 34 test 3 1";
        String grammar="test is [(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}] test [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}] [(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]";

        List<String> matchList = new ArrayList<String>();
        Pattern regex = Pattern.compile("[^\\s\"'\\[]+|\\[([^\\]]*)\\]|'([^']*)'");
        Matcher regexMatcher = regex.matcher(grammar);
        while (regexMatcher.find()) {
            if (regexMatcher.group(1) != null) {
                matchList.add(regexMatcher.group(1));
            } else if (regexMatcher.group(2) != null) {
                matchList.add(regexMatcher.group(2));
            } else {
                matchList.add(regexMatcher.group());
            }
        } 

        String[] xx = matchList.toArray(new String[0]);
        String[] yy = inputText.split(" ");

        matchList = new ArrayList<String>();
        regex = Pattern.compile("[^<]+|<([^>]*)>");
        regexMatcher = regex.matcher(nlInterpretation);
        while (regexMatcher.find()) {
            if (regexMatcher.group(1) != null) {
                matchList.add(regexMatcher.group(1));
            }
        } 
        String[] zz = matchList.toArray(new String[0]);
        System.out.println(String.join(" | ",zz));

        for (int i=0; i<xx.length; i++) {
            if (xx[i].contains("number_type_")) {
                matchList = new ArrayList<String>();
                regex = Pattern.compile("[^\\(]+|<([^\\)]*)>.*[^<]+|<([^>]*)>");
                regexMatcher = regex.matcher(xx[i]);
                while (regexMatcher.find()) {
                    if (regexMatcher.group(1) != null) {
                        matchList.add(regexMatcher.group(1));
                    } else if (regexMatcher.group(2) != null) {
                        matchList.add(regexMatcher.group(2));
                    } else {
                        matchList.add(regexMatcher.group());
                    }
                } 
                System.out.println(String.join(" | ",matchList.toArray(new String[0])));
            }
            System.out.printf("%02d\t%s\t->%s\n", i, yy[i], xx[i]);
        }
    }
}

The output it produces is:

number_type_2 digits | number_type_1 digits | number_type_0 words
00  this    ->test
01  is  ->is
thirty four) {<number_type_0 words>} |  3  4 ) {<number_type_0 digits>}
02  34  ->(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}
03  test    ->test
three) {<number_type_1 words>} |  3 ) {<number_type_1 digits>}
04  3   ->(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}
one) {<number_type_2 words>} |  1 ) {<number_type_2 digits>}
05  1   ->(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}

What I want is more like this:

number_type_2 digits | number_type_1 digits | number_type_0 words
00  this    ->test
01  is      ->is
02  34      ->thirty four
03  test    ->test
04  3       ->3
05  1       ->1

Best answer

I am writing a solution based on the following assumption: the format of the interpretation string stays the same, i.e. {<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}, and it will not change.
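To make that assumption concrete, here is a minimal sketch of how the fixed format is relied on: the braces are stripped, the string is split at '>', and the '>' is put back on every piece. The class name InterpretationTokens and the method tokens are hypothetical and only used for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original code.
class InterpretationTokens {

    // Turns "{<a x> <b y>}" into ["<a x>", "<b y>"], assuming the fixed format.
    static List<String> tokens(String interpretation) {
        String body = interpretation.replace("{", "").replace("}", "");
        List<String> result = new ArrayList<String>();
        for (String piece : body.split(">")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed + ">"); // restore the '>' consumed by split
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens(
            "{<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}"));
    }
}
```

If the format ever changed (for example nested braces or a '>' inside a tag), this splitting would no longer produce clean tokens, which is why the assumption matters.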

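I will describe both a Java 7 and a Java 8 approach. To be clear up front: my algorithm is not efficient. It is a simple brute-force approach that checks every word against every interpretation entry, so the work grows with the product of the two counts. I could not think of anything faster in the short time I had.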

Let's walk through the code:

Java-7 style

    /*
     * STEP 1: Create a method that accepts the wordsExpanded and
     * interpretation Strings
     */
    public static void parseString(String wordsExpanded, String interoperation) {
        /*
         * STEP 2: Remove the leading and trailing curly braces from the
         * interoperation String
         */
        interoperation = interoperation.replaceAll("\\{", "");
        interoperation = interoperation.replaceAll("\\}", "");

        /*
         * STEP 3: Split your interoperation String at '>'
         * because we need individual interoperations  like
         * "<number_type_2 words" to compare. 
         */
        String[] allInterpretations = interoperation.split(">");

        /*
         * STEP 4: Split your wordsExpanded String at '|'
         * to get each word.
         */
        String[] allWordsExpanded = wordsExpanded.split("\\|");

        /*
         * STEP 5: Create a resultant StringBuilder
         */
        StringBuilder resultBuilder = new StringBuilder();

        /*
         * STEP 6: Iterate over each word from wordsExpanded
         * after splitting.
         */
        for(String eachWordExpanded : allWordsExpanded){
            /*
             * STEP 7: Remove leading and trailing spaces
             */
            eachWordExpanded = eachWordExpanded.trim();
            /*
             * STEP 8: Remove the curly braces around each tag
             */
            eachWordExpanded = eachWordExpanded.replaceAll("\\{", "");
            eachWordExpanded = eachWordExpanded.replaceAll("\\}", "");

            /*
             * STEP 9: Now, iterate over each interoperation.
             */
            for(String eachInteroperation : allInterpretations){
                /*
                 * STEP 10: Remove the leading and trailing spaces
                 * from each interoperation.
                 */
                eachInteroperation = eachInteroperation.trim();

                /*
                 * STEP 11: Now append '>' to end of each interoperation
                 * because we'd split each of them at '>' previously.
                 */
                eachInteroperation = eachInteroperation + ">";

                /*
                 * STEP 12: Check whether the current wordExpanded contains
                 * this interoperation.
                 */
                if(eachWordExpanded.contains(eachInteroperation)){

                    /*
                     * STEP 13: If the interoperation contains
                     * 'words', goto STEP 14.
                     * ELSE goto STEP 18.
                     */
                    if(eachInteroperation.contains("words")){
                        /*
                         * STEP 14: Remove that interoperation from the
                         * each wordExpanded String.
                         * 
                         * Ex: if the interoperation is <number_type_2 words>
                         * and it is found in the wordExpanded, remove it.
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInteroperation, "");
                        /*
                         * STEP 15: Now change the interoperation to digits.
                         * Ex: IF the interoperation is <number_type_2 words>,
                         * change that to <number_type_2 digits> and also remove them.
                         */
                        eachInteroperation = eachInteroperation.replaceAll("words", "digits");
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInteroperation, "");

                        /*
                         * STEP 16: Remove the leading and trailing square brackets
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll("\\[", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("\\]", "");

                        /*
                         * STEP 17: Remove any numbers in the form ( 3 ),
                         * since we are dealing with words. "[(0-9)+]" is a
                         * character class, so it deletes every '(', ')', '+'
                         * and digit; the second call collapses the leftover
                         * whitespace into single spaces.
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll("[(0-9)+]", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("(\\s)+", " ");
                    }else{
                        /*
                         * STEP 18: Remove the interoperation just like STEP 14.
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInteroperation, "");
                        /*
                         * STEP 19: Now, change interoperations to words just like STEP 15,
                         * since we are dealing with digits here and then, remove it from the
                         * each wordExpanded String.
                         */
                        eachInteroperation = eachInteroperation.replaceAll("digits", "words");
                        eachWordExpanded = eachWordExpanded.replaceAll(eachInteroperation, "");

                        /*
                         * STEP 20: Remove the leading and trailing square brackets.
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll("\\[", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("\\]", "");
                        /*
                         * STEP 21: Remove the words in the form '(thirty four)'.
                         * "[(A-Za-z)+]" is a character class, so it deletes every
                         * '(', ')', '+' and ASCII letter; the second call removes
                         * all remaining whitespace, turning "3  4" into "34".
                         */
                        eachWordExpanded = eachWordExpanded.replaceAll("[(A-Za-z)+]", "");
                        eachWordExpanded = eachWordExpanded.replaceAll("\\s", "");
                    }
                }else{
                    continue;
                }
            }
            /*
             * STEP 22: Build your result object
             */
            resultBuilder.append(eachWordExpanded + "|");
        }
        /*
         * FINAL RESULT
         */
        System.out.println(resultBuilder.toString());
}
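STEP 17 and STEP 21 above use patterns that look like groups but are really character classes, which is easy to misread. The small sketch below isolates just those two steps so the effect can be checked on its own. The class name CharClassDemo and its method names are hypothetical, and the input is assumed to be the "thirty four" segment after the tags and square brackets have already been removed.

```java
// Hypothetical demo class; isolates the two replaceAll steps from the answer.
class CharClassDemo {

    // Words branch (STEP 17): drop '(', ')', '+' and digits, then collapse whitespace.
    static String keepWords(String segment) {
        return segment.replaceAll("[(0-9)+]", "").replaceAll("(\\s)+", " ");
    }

    // Digits branch (STEP 21): drop '(', ')', '+' and ASCII letters, then all whitespace.
    static String keepDigits(String segment) {
        return segment.replaceAll("[(A-Za-z)+]", "").replaceAll("\\s", "");
    }

    public static void main(String[] args) {
        String segment = "(thirty four) ( 3  4 ) ";
        System.out.println("[" + keepWords(segment) + "]");  // prints "[thirty four ]"
        System.out.println("[" + keepDigits(segment) + "]"); // prints "[34]"
    }
}
```

Note the trailing space left by the words branch; it is the same trailing space that shows up before each '|' in the Java-7 output.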

The equivalent Java-8 style is as follows:

// Needs: java.util.Arrays, java.util.Set, java.util.stream.Collectors
public static void parseString(String wordsExpanded, String interoperation) {
    // Strip the curly braces from the interpretation string.
    interoperation = interoperation.replaceAll("\\{", "");
    interoperation = interoperation.replaceAll("\\}", "");

    // Split at '>' and put the '>' back, giving e.g. "<number_type_2 digits>".
    Set<String> allInterOperations = Arrays.stream(interoperation.split(">"))
        .map(eachInterOperation -> eachInterOperation.trim() + ">")
        .collect(Collectors.toSet());

    String result = Arrays.stream(wordsExpanded.split("\\|"))
        .map(eachWordExpanded -> {
            eachWordExpanded = eachWordExpanded.trim();
            eachWordExpanded = eachWordExpanded.replaceAll("\\{", "");
            eachWordExpanded = eachWordExpanded.replaceAll("\\}", "");

            for (String eachInterOperation : allInterOperations) {
                if (!eachWordExpanded.contains(eachInterOperation)) {
                    continue;
                }
                if (eachInterOperation.contains("words")) {
                    // Keep the words: drop both tags, the brackets and the digits.
                    eachWordExpanded = eachWordExpanded.replaceAll(eachInterOperation, "");
                    eachInterOperation = eachInterOperation.replaceAll("words", "digits");
                    eachWordExpanded = eachWordExpanded.replaceAll(eachInterOperation, "");
                    eachWordExpanded = eachWordExpanded.replaceAll("\\[", "");
                    eachWordExpanded = eachWordExpanded.replaceAll("\\]", "");
                    eachWordExpanded = eachWordExpanded.replaceAll("[(0-9)+]", "");
                    eachWordExpanded = eachWordExpanded.replaceAll("(\\s)+", " ");
                } else {
                    // Keep the digits: drop both tags, the brackets and the letters.
                    eachWordExpanded = eachWordExpanded.replaceAll(eachInterOperation, "");
                    eachInterOperation = eachInterOperation.replaceAll("digits", "words");
                    eachWordExpanded = eachWordExpanded.replaceAll(eachInterOperation, "");
                    eachWordExpanded = eachWordExpanded.replaceAll("\\[", "");
                    eachWordExpanded = eachWordExpanded.replaceAll("\\]", "");
                    eachWordExpanded = eachWordExpanded.replaceAll("[(A-Za-z)+]", "");
                    eachWordExpanded = eachWordExpanded.replaceAll("\\s", "");
                }
            }
            return eachWordExpanded;
        })
        .collect(Collectors.joining("|"));

    System.out.println(result);
}

Running the methods above against different interpretation strings, for example:

{<number_type_2 words> <number_type_1 words> <number_type_0 words>}
{<number_type_2 digits> <number_type_1 words> <number_type_0 words>}
{<number_type_2 digits> <number_type_1 digits> <number_type_0 digits>}
{<number_type_2 words> <number_type_1 digits> <number_type_0 digits>}

produces results like the following (Java-7 result):

test|is|thirty four |test|three |one |
test|is|thirty four |test|three |1|
test|is|34|test|3|1|
test|is|34|test|3|one |

(Java-8 result)

test|is|thirty four|test|three|one
test|is|thirty four|test|three|1
test|is|34|test|3|1
test|is|34|test|3|one
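For comparison, here is a sketch of a different technique from the replaceAll chain used above: a regex Matcher extracts each "(text) {<id kind>}" variant and keeps the one whose kind matches the interpretation. This is only an illustration under the same fixed-format assumption. The class name GroupSelector, the method select and both patterns are my own hypothetical choices, and the output is joined with " | ", so the spacing is normalized rather than copied from the input.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical alternative; class, method and pattern names are illustrative.
class GroupSelector {

    // One entry of the interpretation, e.g. "<number_type_0 words>".
    private static final Pattern TAG =
        Pattern.compile("<(number_type_\\d+) (words|digits)>");

    // One variant inside a bracketed group, e.g. "( 3  4 ) {<number_type_0 digits>}".
    private static final Pattern VARIANT =
        Pattern.compile("\\(([^)]*)\\)\\s*\\{<(number_type_\\d+) (words|digits)>\\}");

    static String select(String wordsExpanded, String interpretation) {
        // Map each id to the kind ("words" or "digits") the interpretation asks for.
        Map<String, String> wanted = new HashMap<String, String>();
        Matcher tag = TAG.matcher(interpretation);
        while (tag.find()) {
            wanted.put(tag.group(1), tag.group(2));
        }

        List<String> out = new ArrayList<String>();
        for (String raw : wordsExpanded.split("\\|")) {
            String token = raw.trim();
            String chosen = token; // plain words pass through unchanged
            Matcher variant = VARIANT.matcher(token);
            while (variant.find()) {
                String kind = variant.group(3);
                if (kind.equals(wanted.get(variant.group(2)))) {
                    String text = variant.group(1);
                    // Digits are written spaced out ("3  4"), so squeeze them together.
                    chosen = kind.equals("digits")
                        ? text.replaceAll("\\s+", "")
                        : text.trim();
                    break;
                }
            }
            out.add(chosen);
        }
        return String.join(" | ", out);
    }

    public static void main(String[] args) {
        String wordsExpanded = "test |  is |  [(thirty four) {<number_type_0 words>}( 3  4 ) {<number_type_0 digits>}]"
            + " |  test |  [(three) {<number_type_1 words>}( 3 ) {<number_type_1 digits>}]"
            + " |  [(one) {<number_type_2 words>}( 1 ) {<number_type_2 digits>}]";
        String interpretation = "{<number_type_2 digits> <number_type_1 digits> <number_type_0 words>}";
        System.out.println(select(wordsExpanded, interpretation));
    }
}
```

With the strings from the question this prints "test | is | thirty four | test | 3 | 1". A group whose id does not appear in the interpretation is left untouched, which is a design choice of this sketch rather than something the question specifies.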

I hope this is what you were trying to achieve.

Regarding java - Grouping strings into an array, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42196533/
