Java: linear algorithm but non-linear performance drop, where does it come from?

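I am currently running into serious performance problems with an application I am developing for natural language processing. Basically, given a text, it gathers various data and does a bit of number crunching.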

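It does exactly the same thing for every sentence. The algorithms that gather the statistics do not change with the data read previously, so the cost per sentence should stay constant.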

The problem is that the processing time does not scale linearly at all: 1 minute for 10k sentences, 1 hour for 100k, and days for 1M...

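I have tried everything I could think of, from reimplementing basic data structures to object pooling and recycling instances. The behavior does not change. I get a non-linear increase in time that cannot be explained by a few more HashMap collisions, IO waits, or anything else! Java starts to crawl as soon as the data grows, and I feel totally helpless.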

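If you want an example, just try the following: count the number of occurrences of each word in a big file. Some code is shown below. Doing this, it takes me 3 seconds for 100k sentences and 326 seconds for 1.6M sentences... so the multiplier is about 110x instead of 16x. As the data grows, it only gets worse...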
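To put a number on how far from linear those two timings are, one can fit a power law, time ~ n^k, through the two measurements quoted above (100k sentences in 3 s, 1.6M sentences in 326 s). The sketch below is only a back-of-the-envelope estimate; the class and method names are made up for illustration, and two data points cannot distinguish a true power law from, say, a threshold effect.

```java
public class ScalingExponent {
    // Estimates k in time ~ n^k from two (size, time) measurements.
    static double exponent(double n1, double t1, double n2, double t2) {
        return Math.log(t2 / t1) / Math.log(n2 / n1);
    }

    public static void main(String[] args) {
        // Figures quoted in the question: 100k sentences in 3 s, 1.6M sentences in 326 s.
        double k = exponent(100_000, 3, 1_600_000, 326);
        System.out.printf("estimated exponent: %.2f%n", k);
    }
}
```

A linear algorithm would give k close to 1; these figures give roughly 1.7, which is closer to quadratic behavior than to linear.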

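Here is a code sample. Note that I compare strings by reference (for efficiency reasons), which is possible thanks to the "String.intern()" method, which returns a unique reference for each distinct string. The map is never rehashed during the whole process for the numbers given above.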

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DataGathering
{
 SimpleRefCounter<String> counts = new SimpleRefCounter<String>(1000000);

 private void makeCounts(String path) throws IOException
 {

  BufferedReader file_src = new BufferedReader(new FileReader(path));

  String line_src;

  int n = 0;
  // readLine() returns null at end of file; ready() only says whether a read would block.
  while ((line_src = file_src.readLine()) != null)
  {
   n++;

   if (n % 10000 == 0)
    System.out.print(".");

   if (n % 100000 == 0)
    System.out.println("");

   String[] src_tokens = line_src.split("[ ,.;:?!'\"]");

   for (int i = 0; i < src_tokens.length; i++)
   {
    String src = src_tokens[i].intern();
    counts.bump(src);
   }

  }
  file_src.close();
 }

 public static void main(String[] args) throws IOException
 {
  String path = "some_big_file.txt";

  long timestamp = System.currentTimeMillis();

  DataGathering dg = new DataGathering();
  dg.makeCounts(path);


  long time = (System.currentTimeMillis() - timestamp) / 1000;
  System.out.println("\nElapsed time: " + time + "s.");
 }
}

public class SimpleRefCounter<K>
{
 static final double GROW_FACTOR = 2;
 static final double LOAD_FACTOR = 0.5;

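 private int capacity;

 // Not declared in the original post; the types are assumed.
 private int key_count; // number of distinct keys stored
 private long total;    // sum of all increments

 private Object[] keys;
 private int[] counts;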


 public SimpleRefCounter()
 {
  this(1000);
 }

 public SimpleRefCounter(int capacity)
 { 
  this.capacity = capacity;
  keys = new Object[capacity];
  counts = new int[capacity];
 }



 public synchronized int increase(K key, int n)
 {
  int id = System.identityHashCode(key) % capacity;

  while (keys[id] != null && keys[id] != key) // if it's occupied, let's move to the next one!
   id = (id + 1) % capacity;


  if (keys[id] == null)
  {
   key_count++;
   keys[id] = key;

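   if (key_count > LOAD_FACTOR * capacity)
   {
    resize((int) (GROW_FACTOR * capacity));

    // resize() rebuilt both arrays, so the old id is stale: find the key's slot again
    id = System.identityHashCode(key) % capacity;
    while (keys[id] != key)
     id = (id + 1) % capacity;
   }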
  }


  counts[id] += n;

  total += n;

  return counts[id];
 }



 public synchronized void resize(int capacity)
 {
  System.out.println("Resizing counters: " + this);

  this.capacity = capacity;

  Object[] new_keys = new Object[capacity];
  int[] new_counts = new int[capacity];

  for (int i = 0; i < keys.length; i++)
  {
   Object key = keys[i];
   if (key == null) // empty slot, nothing to move
    continue;
   int count = counts[i];

   int id = System.identityHashCode(key) % capacity;

   while (new_keys[id] != null && new_keys[id] != key) // if it's occupied, let's move to the next one!
    id = (id + 1) % capacity;

   new_keys[id] = key;
   new_counts[id] = count;
  }

  this.keys = new_keys;
  this.counts = new_counts;
 }


 public int bump(K key)
 {
  return increase(key, 1);
 }

 public int get(K key)
 {
  int id = System.identityHashCode(key) % capacity;

  while (keys[id] != null && keys[id] != key) // if it's occupied, let's move to the next one!
   id = (id + 1) % capacity;


  if (keys[id] == null)
   return 0;
  else
   return counts[id];
 }
}
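The counter above compares keys with == and hashes them with System.identityHashCode, so it is only correct because every token passes through String.intern() first. The following minimal sketch (class and method names are hypothetical) shows the property being relied on: two String objects with equal contents are different references until they are interned. It says nothing about what intern() costs on a large vocabulary, which would have to be measured.

```java
public class InternIdentityDemo {
    // Returns true when both arguments are the very same object (reference equality).
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        // Two distinct String objects with equal contents.
        String first = new String("word");
        String second = new String("word");

        System.out.println(sameReference(first, second));                   // distinct objects
        System.out.println(first.equals(second));                           // equal contents
        System.out.println(sameReference(first.intern(), second.intern())); // canonical instance
    }
}
```

Without the intern() call, SimpleRefCounter would treat each occurrence of a word as a new key.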

Any explanations? Ideas? Suggestions?

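...and, as I said at the beginning, this is not about this toy example in particular but about the more general case. The same explosive behavior occurs for no apparent reason in more complex and larger programs.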

Best answer

Rather than feeling helpless, use a profiler! It will tell you exactly where in your code all that time is being spent.
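A profiler is the right tool for this. As a rough complement, and not a substitute for one, per-batch timing can show at which input size the slowdown begins. The sketch below is an illustration under assumptions: it swaps in java.util.HashMap for the question's SimpleRefCounter so that it is self-contained, and the class name, method name, and synthetic input are all made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchTimingSketch {
    // Counts words in lines and returns the elapsed nanoseconds for each batch of batchSize lines.
    static List<Long> timeBatches(List<String> lines, int batchSize, Map<String, Integer> counts) {
        List<Long> elapsed = new ArrayList<>();
        long start = System.nanoTime();
        int inBatch = 0;
        for (String line : lines) {
            // Same delimiter set as the question's split() call.
            for (String token : line.split("[ ,.;:?!'\"]")) {
                counts.merge(token, 1, Integer::sum);
            }
            if (++inBatch == batchSize) {
                long now = System.nanoTime();
                elapsed.add(now - start);
                start = now;
                inBatch = 0;
            }
        }
        // Record the final, possibly shorter, batch.
        if (inBatch > 0) {
            elapsed.add(System.nanoTime() - start);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Synthetic input (an assumption); replace with lines read from the real file.
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 50_000; i++) {
            lines.add("the quick brown fox " + i);
        }
        List<Long> perBatch = timeBatches(lines, 10_000, new HashMap<>());
        for (int i = 0; i < perBatch.size(); i++) {
            System.out.println("batch " + i + ": " + perBatch.get(i) / 1_000_000 + " ms");
        }
    }
}
```

If the per-batch times stay flat, the work per line is constant; if they climb from batch to batch, something depends on the amount of data already processed, and a profiler can then be pointed at the later batches.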

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/2275646/
