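I want to use Java 8 streams to take a stream of strings (for example, read from a plain-text file) and produce a stream of sentences. I assume that a sentence can span line boundaries.
For example, I want to go from: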
"This is the", "first sentence. This is the", "second sentence."
to:
"This is the first sentence.", "This is the second sentence."
I can see that a stream of sentence fragments can be obtained like this:
Pattern p = Pattern.compile("\\.");
Stream<String> lines =
    Stream.of("This is the", "first sentence. This is the", "second sentence.");
Stream<String> result = lines.flatMap(s -> p.splitAsStream(s));
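But I'm not sure how to produce a stream that joins the fragments back into sentences. I want to do this lazily, so that only as much of the original stream is read as needed. Any ideas?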
Best answer
Breaking text into sentences is not as easy as just looking for dots. For example, you don't want to split in the middle of "Mr. Smith"…
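To make the problem concrete, here is a minimal sketch of what dot-based splitting does to a text containing an abbreviation. The class name `NaiveSplitDemo`, the helper `splitOnDots` and the sample text are made up for illustration; only `Pattern.splitAsStream` comes from the question.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical demo class, not part of the original question or answer.
class NaiveSplitDemo {

    // Splits on every literal dot, the same pattern the question uses.
    static List<String> splitOnDots(String text) {
        return Pattern.compile("\\.")
                .splitAsStream(text)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "Mr." is cut off as if it ended a sentence, and the dots themselves are lost.
        splitOnDots("Mr. Smith went home. He slept.")
                .forEach(part -> System.out.println("[" + part + "]"));
    }
}
```

The abbreviation ends up as a fragment of its own, and every fragment has lost its terminating dot, so the pieces cannot be reliably glued back together afterwards.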
Fortunately, there is already a JRE class that handles this, namely BreakIterator. It has no Stream support, so using it with streams requires some support code:
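import java.nio.CharBuffer;
import java.text.BreakIterator;
import java.util.Locale;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class SentenceStream extends Spliterators.AbstractSpliterator<String>
        implements Consumer<CharSequence> {

    public static Stream<String> sentences(Stream<? extends CharSequence> s) {
        return StreamSupport.stream(new SentenceStream(s.spliterator()), false);
    }

    Spliterator<? extends CharSequence> source;
    CharBuffer buffer;       // text received from the source but not yet emitted
    BreakIterator iterator;

    public SentenceStream(Spliterator<? extends CharSequence> source) {
        super(Long.MAX_VALUE, ORDERED|NONNULL);
        this.source = source;
        iterator = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        buffer = CharBuffer.allocate(100);
        buffer.flip();       // start out empty, in "read" mode
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        for(;;) {
            int next = iterator.next();
            // A boundary at the end of the buffer is not trusted yet:
            // the sentence may continue in the next source element.
            if(next != BreakIterator.DONE && next != buffer.limit()) {
                action.accept(buffer.subSequence(0, next-buffer.position()).toString());
                buffer.position(next);
                return true;
            }
            // Pull one more element from the source; accept() appends it to the buffer.
            if(!source.tryAdvance(this)) {
                // Source exhausted: whatever is left is the last sentence.
                if(buffer.hasRemaining()) {
                    action.accept(buffer.toString());
                    buffer.position(0).limit(0);
                    return true;
                }
                return false;
            }
            iterator.setText(buffer.toString());
        }
    }

    @Override
    public void accept(CharSequence t) {
        buffer.compact();    // drop emitted text and switch to "write" mode
        if(buffer.remaining() < t.length()) {
            // Not enough room: grow to at least double the capacity.
            CharBuffer bigger = CharBuffer.allocate(
                Math.max(buffer.capacity()*2, buffer.position()+t.length()));
            buffer.flip();
            bigger.put(buffer);
            buffer = bigger;
        }
        buffer.append(t).flip();
    }
}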
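To see what BreakIterator contributes independently of the stream plumbing, here is a minimal sketch that applies it to a single, complete string. The class name `BreakIteratorDemo` and the helper `sentencesOf` are made up for illustration; the calls themselves (`getSentenceInstance`, `setText`, `first`, `next`) are the standard `java.text.BreakIterator` API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of the original answer.
class BreakIteratorDemo {

    // Cuts a complete text at the sentence boundaries reported by BreakIterator.
    static List<String> sentencesOf(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        // next() returns the following boundary, or DONE once the end has been passed.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        sentencesOf("This is the first sentence. This is the second sentence.")
                .forEach(s -> System.out.println("[" + s + "]"));
    }
}
```

Two details are relevant for the spliterator: the whitespace after a sentence belongs to that sentence, and the end of the text always counts as a boundary. The second point is why the `tryAdvance` loop skips a boundary equal to `buffer.limit()` and asks the source for more text first.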
With this support class, you can simply write, for example:
Stream<String> lines = Stream.of(
    "This is the ", "first sentence. This is the ", "second sentence.");
SentenceStream.sentences(lines).forEachOrdered(System.out::println);
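Note that the strings in this example end with a space, while the ones in the question do not. Lines obtained from `BufferedReader.lines()` or `Files.lines(...)` come without their line terminators, so feeding them in unchanged would glue the last word of one line to the first word of the next. One possible way around this, sketched here with a made-up helper `withTrailingSpace`, is to re-append a separator to each line before sentence detection:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class, not part of the original answer.
class LineSeparatorDemo {

    // Re-appends a space to every line; map() is lazy, so no line is read ahead of demand.
    static Stream<String> withTrailingSpace(Stream<String> lines) {
        return lines.map(line -> line + " ");
    }

    public static void main(String[] args) {
        // A StringReader stands in for a real file here.
        BufferedReader reader = new BufferedReader(new StringReader(
                "This is the\nfirst sentence. This is the\nsecond sentence."));
        // Joining is only used to make the effect visible.
        String joined = withTrailingSpace(reader.lines()).collect(Collectors.joining());
        System.out.println("[" + joined + "]");
    }
}
```

Instead of joining, the stream returned by `withTrailingSpace` would be handed to the `sentences` method of the support class. This assumes a single space is an acceptable replacement for the line break.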
Regarding Java 8 sentence streams, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31148693/