您好,我在优化 burrows wheeler transform 时遇到一些困难.我正在尝试转换文本文件,但是转换像圣经这样的大型文本文件需要很长时间。
关于如何进行的任何想法?
public BurrowsWheelerTransformEncoder()
{
}
private String originalSuffix(int index, String string)
{
String temp = (string.substring(index,string.length()) + string.substring(0,index));
//this bit just 'compresses' each transformation of text by producing
//a prefix, so 'abracadabra' just becomes 'abrac'
//this is so minimal amount of memory is used when it is stored in an array
return temp.substring(0,5)+
//the last character of the transformation is kept
temp.charAt(temp.length()-1);
}
private String compressedSuffix(String string)
{
//this method just 'compresses' original piece of text by producing
//a prefix, so 'abracadabra' just becomes 'abrac'
//this is so comprisons won't take so long
return string.substring(0,5)+string.charAt(string.length()-1);
}
public static void main(String args[]) throws Exception
{
BurrowsWheelerTransformEncoder encoder = new BurrowsWheelerTransformEncoder();
BufferedReader input = new BufferedReader(new FileReader("src/compressionalgorithm/texts/manifesto.txt"));
String text = "";
//the row in the sorted array where the original text can be found
int originalRow = 0;
//system time when program began
long startTime = System.nanoTime();
//get text from file
while(input.ready())
{
text += input.readLine();
}
//create a new array to hold all transformations
String[] textArray = new String[text.length()];
int length = text.length();
//get individual transformations and put in array
for(int i = 0; i < text.length(); i++)
{
textArray[i] = encoder.originalSuffix(i,text);
//for debugging large text files, prints progress after every 10k'th
//transformation
if(i%10000==0)
System.out.println(i+"/"+length);
}
//uses java's internal methods to sort the array, presumably
//the most efficient way to do the sort (for now)
Arrays.sort(textArray);
String compressedOriginalText = encoder.compressedSuffix(text);
//print the results
for(int i = 0; i < textArray.length; i++)
{
if(textArray[i].equals(compressedOriginalText))
{
originalRow = i;
}
if(i%100==0)
{
System.out.println();
}
System.out.print(textArray[i].charAt(textArray[i].length()-1));
}
System.out.println("\nThe original transformation of the text was found at row " + originalRow + " of the sorted array.");
System.out.println("Time elapsed: " + (System.nanoTime() - startTime));
}
最佳答案
对于编码情况,您不需要实际构建字符串数组 - 使用 int(或 long,具体取决于您的文件大小)数组来存储旋转字符串开始的索引。
- 创建一个初始化为 [0 1 2 3 ... n] 的数组
使用以下 compareTo 对数组进行排序(假设
compareTo()
可以访问原始字符串,original
):int compareTo(int a, int b){ int compare, len = original.length(); do{ char _a = original.charAt(a), _b = original.charAt(b); compare = _a-_b; a++; b++; if(a>=len)a-=len; if(b>=len)b-=len; }while(compare==0); return compare; }
记下数组中“0”的索引,并将其作为“起始”值添加到输出中
对于逆转,我们再次希望避免为像圣经一样大的文本构建整个表格。我们可以利用第一行和最后一行中的相同标记始终处于相同顺序这一事实来做到这一点。这是真的,因为第一行是排序的,token是循环排列的:对于最后一行连续的三个b,后面的token是排序的,所以b是排序的。所以要反转:
- 对输出标记进行排序。除了存储排序后的标记外,还存储每个标记开始的索引。因此,对于未排序的标记“nbnaaa”,您将存储 [3 4 5 2 0 1] 和“aaabnn”。 重要:您必须在此步骤中使用稳定排序。
使用前面提到的“start”值重建字符串:
string decode(string sorted, int[]index, int start){ string answer = ""+sorted.charAt(start); int next = index[start]; while(next!=start){ answer = sorted.charAt(next) + answer; next = index[next]; } return answer; }
关于java - Burrows Wheeler 变换的优化,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/6000328/