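I am trying to compare two CSV files in PHP by loading them into multidimensional arrays and using array_diff to find the differences.
The approach I use is:
1) Read every record of the expected CSV and dump it into arr1
2) Read every record of the actual CSV and dump it into arr2
3) Sort arr1 with array_multisort
4) Sort arr2 with array_multisort
5) Compare each record with array_diff (e.g. arr1[0][1] vs arr2[0][1])
My goal is to compare the files with a PHP script in as little time as possible. The approach above is the fastest I have found (I first tried loading the CSV contents into MySQL and comparing them with database queries, but for some unknown reason the queries were so slow that they timed out and crashed my Apache server).
My CSV files can be as large as 300 MB, but a typical file has 70k records and 20 columns and is about 10 MB.
Here is the code I wrote for the steps above: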
$header='';
$file_handle = fopen($fileExp, "r");
$k=0;
while ($data=fgetcsv($file_handle,0,$_POST['dl1'])) {
if(count($data)==1 && $data[0]=='')
continue;
else
{
$urarr1[$k]='';
for($i=0;$i<count($data);$i++)
{
if(in_array($i,$exclude_cols,true))
$rarr1[$k][$i]='NTBT';
else
$rarr1[$k][$i]=trim($data[$i]);
}
$k++;
}
}
fclose($file_handle);
echo '<br>Exp Record count: '.count($rarr1);
$header.='<br>Exp Record count: '.count($rarr1);
$hrow=$rarr1[0]; //fetch header row and then unset it
unset($rarr1[0]);
array_multisort($rarr1); //need to sort on all 20 columns asc
$rarr1=array_values($rarr1); //re-number the array
//writing the sorted o/p to file...debugging purposes
$fp = fopen($_POST['op'].'/file1.csv', 'w');
foreach ($rarr1 as $fields) {
fputcsv($fp, $fields);
}
fclose($fp);
//Repeat for actual .csv
$file_handle = fopen($fileAct, "r");
$k=0;
while ($data=fgetcsv($file_handle,0,$_POST['dl2'])) {
if(count($data)==1 && $data[0]=='')
continue;
else
{
for($i=0;$i<count($data);$i++)
{
if(in_array($i,$exclude_cols,true))
$rarr2[$k][$i]='NTBT';
else
$rarr2[$k][$i]=trim($data[$i]);
}
$k++;
}
}
fclose($file_handle);
unset($file_handle);
echo '<br>Act Record count: '.count($rarr2);
$header.='<br>Act Record count: '.count($rarr2);
unset($rarr2[0]);
array_multisort($rarr2);
$rarr2=array_values($rarr2);
$fp = fopen($_POST['op'].'/file2.csv', 'w');
foreach ($rarr2 as $fields) {
fputcsv($fp, $fields);
}
fclose($fp);
///Comparison logic
$header.= '<br>';
$header.= '<table>';
$header.= '<th>RECORD_ID</th>';
for($i=0;$i<count($hrow);$i++)
{
$header.= '<th>'.$hrow[$i].'_EXP</th>';
$header.= '<th>'.$hrow[$i].'_ACT</th>';
}
$r=array();
for($i=0;$i<count($rarr1);$i++)
{
if(array_diff($rarr1[$i],$rarr2[$i]) || array_diff($rarr2[$i],$rarr1[$i]))
{
$r[$i]=array_unique(array_merge(array_keys(array_diff($rarr1[$i],$rarr2[$i])),array_keys(array_diff($rarr2[$i],$rarr1[$i]))));
foreach($r[$i] as $key=>$v)
{
if(in_array($v,$calc_cols))
{
if(abs($rarr1[$i][$v]-$rarr2[$i][$v])<0.2)
{
unset($r[$i][$key]);
}
}
elseif(is_numeric($rarr1[$i][$v]) && is_numeric($rarr2[$i][$v]) && !in_array($v,$calc_cols) && ($rarr1[$i][$v]-$rarr2[$i][$v])==0)
{
unset($r[$i][$key]);
}
}
if(empty($r[$i]))
unset($r[$i]);
if(isset($r[$i]))
{
$header.= '<tr>';
$header.= '<td>'.$i.'</td>';
for($j=0;$j<count($rarr1[$i]);$j++)
{
if(in_array($j,$r[$i]))
{
$header.= '<td style="color:orange">'.$rarr1[$i][$j].'</td>';
$header.= '<td style="color:orange">'.$rarr2[$i][$j].'</td>';
}
else
{
$header.= '<td >'.$rarr1[$i][$j].'</td>';
$header.= '<td >'.$rarr2[$i][$j].'</td>';
}
}
$header.= '</tr>';
}
}
}
$header.= '</table>';
//print_r($r);
echo '<br>';
// if(!isset($r))
// $r[0]=0;
echo 'Differences :'.count($r) ;
$header.= '<br>';
$header.= 'Differences :'.count($r) ;
$time_end = microtime(true);
$execution_time = ($time_end - $time_start)/60; //dividing with 60 will give the execution time in minutes other wise seconds
echo '<br><b>Total Execution Time:</b> '.$execution_time.' Mins'; //execution time of the script
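At first this worked for most files, but later I found that for some files array_multisort sorts arr1 and arr2 differently for no apparent reason, even though their contents look identical. I am not sure whether this is caused by a data type mismatch; I also tried type casting, but the two identical arrays are still sorted differently.
Can anyone suggest what might be wrong in the code above? Also, given the requirements above, is there a more convenient way to do this in PHP? Perhaps a PHP plugin for comparing .csv files, or something similar?
Edit: sample data as requested. This is just a snapshot; the real files have more columns and rows. As mentioned above, the .csv files can be far larger than 10 MB! File 1 and file 2: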
236|INPQR|31-AUG-12|200 |INR| 664|AAAAAA,PPPPP |0 |38972944.8 |0 |0 |38972944.8
236|INPQR|31-AUG-12|200 |INR| 6653|AAAAAA,PPPPP |0 |0 |0 |0 |0
236|INPQR|31-AUG-12|200 |USD| 6655|AAAAAA,PPPPP |0 |0 |0 |0 |0
236|INPQR|31-AUG-12|200 |USD| 664|AAAAAA,PPPPP |0 |63919609.97 |0 |0 |63919609.97
225|INPZQ|31-AUG-12|200 |USD| 6653|AAAAAA,PPPPP |0 |0 |0 |0 |0
225|INPZQ|31-AUG-12|200 |USD| 6655|AAAAAA,PPPPP |0 |0 |0 |0 |0
225|INPZQ|31-AUG-12|200 |USD| 6652|AAAAAA,PPPPP |0 |38972944.8 |0 |0 |38972944.8
225|INPZQ|31-AUG-12|200 |INR| 6652|AAAAAA,PPPPP |0 |63919609.97 |0 |0 |63919609.97
225|INPZQ|31-AUG-12|200 |INR| 6654|AAAAAA,PPPPP |0 |0 |0 |0 |0
225|INPZQ|31-AUG-12|200 |INR| 6654|AAAAAA,PPPPP
236|INPQR|31-AUG-12|200 |USD| 664|AAAAAA,PPPPP |0 |63919609.97 |0 |0 |63919609.97
225|INPZQ|31-AUG-12|200 |USD| 6653|AAAAAA,PPPPP |0 |0 |0 |0 |0
225|INPZQ|31-AUG-12|200 |USD| 6655|AAAAAA,PPPPP
236|INPQR|31-AUG-12|200 |INR| 664|AAAAAA,PPPPP |0 |38972944.8 |0 |0 |38972944.8
236|INPQT|31-AUG-12|200 |INR| 6653|AAAAAA,PPPPP |0 |0 |0 |0 |0
236|INPQR|31-AUG-12|200 |USD| 6655|AAAAAA,PPPPP |0 |0 |0 |0 |0
225|INPZQ|31-AUG-12|200 |USD| 6652|AAAAAA,PPPPP |0 |38972944.8 |0 |0 |38972944.8
225|INPZQ|31-AUG-12|200 |INR| 6652|AAAAAA,PPPPP |0 |63919609.97 |0 |0 |63919609.97
225|INPZQ|31-AUG-12|200 |USD| 6654|AAAAAA,PPPPP |0 |0 |0 |0 |0
225|INPZQ|31-AUG-12|200 |INR| 6654|AAAAAA,PPPPP
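Update: the two CSV files can use different date formats, and each file can represent numbers differently. For example, the first row of 1.csv may contain 12-jan-2013 and 0.01, while 2.csv has 01/12/2013 and .01, so I don't think hashing will work.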
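Whether hashing or key-based matching can still work depends on first bringing both files to one canonical form. Below is a minimal sketch of such a normalization step. It assumes that only the two date layouts mentioned above occur (dd-Mon-yyyy and month-first mm/dd/yyyy) and that comparing numbers at four decimal places is acceptable. `normalizeField` and `normalizeRow` are hypothetical helper names, not part of the original code.

```php
<?php
// Hypothetical helper: bring one CSV field to a canonical string form.
// Assumption: only dd-Mon-yyyy and mm/dd/yyyy dates occur in the data.
function normalizeField(string $value): string
{
    $value = trim($value);

    // dd-Mon-yyyy, e.g. 12-jan-2013
    if (preg_match('/^\d{1,2}-[A-Za-z]{3}-\d{4}$/', $value)) {
        $date = DateTime::createFromFormat('!d-M-Y', $value);
        return $date ? $date->format('Y-m-d') : $value;
    }

    // mm/dd/yyyy, e.g. 01/12/2013 (assumed to be month first)
    if (preg_match('#^\d{1,2}/\d{1,2}/\d{4}$#', $value)) {
        $date = DateTime::createFromFormat('!m/d/Y', $value);
        return $date ? $date->format('Y-m-d') : $value;
    }

    // Numbers: "0.01" and ".01" both become "0.0100".
    // Assumption: four decimal places are enough for the comparison.
    if (is_numeric($value)) {
        return number_format((float) $value, 4, '.', '');
    }

    // Anything else is only trimmed.
    return $value;
}

// Hypothetical helper: build one comparable key for a whole row.
function normalizeRow(array $row): string
{
    return implode('|', array_map('normalizeField', $row));
}
```

With this, `normalizeRow(['12-jan-2013', '0.01'])` and `normalizeRow(['01/12/2013', '.01'])` produce the same string, so the two rows can be matched or hashed as equal. Note that numeric-looking identifiers such as 00123 would also be rewritten, so real data may need a per-column rule instead.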
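Best answer
There are many different ways to compare two CSV files. The approach I use checks which rows differ between the two files, and it takes into account that you want to exclude certain columns from each row.
I do not sort, because I check whether a row exists in the other file, not whether it is at the same position. The reason is simple: if a single row does not match and it sorts near the top of the file, every row after it will be reported as different.
Example: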
file1: file2:
1|a 1|a
2|b 2|b
3|c 3|c
4|d 4|d
5|e 1|e
After sorting
file1: file2:
1|a 1|a
2|b 1|e
3|c 2|b
4|d 3|c
5|e 4|d
Now the rows 2, 3, 4, and 5 are all marked as different, because they do not match if you check per line. But in fact only 1 row is different.
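The example above can be reproduced with a few lines of PHP. This is only an illustrative sketch using the five sample rows; the variable names are made up for the demonstration.

```php
<?php
// The two tiny sample files from the example, one string per row.
$file1 = ['1|a', '2|b', '3|c', '4|d', '5|e'];
$file2 = ['1|a', '2|b', '3|c', '4|d', '1|e'];

// Sort both, as the question's approach does.
sort($file1, SORT_STRING);
sort($file2, SORT_STRING);

// Position-by-position comparison after sorting.
$positional = 0;
foreach ($file1 as $i => $row) {
    if ($row !== $file2[$i]) {
        $positional++;
    }
}

// Membership comparison: which rows are missing from the other file?
$onlyIn1 = array_values(array_diff($file1, $file2));
$onlyIn2 = array_values(array_diff($file2, $file1));

echo "Positional differences: $positional\n";            // 4
echo 'Only in file1: ' . implode(', ', $onlyIn1) . "\n"; // 5|e
echo 'Only in file2: ' . implode(', ', $onlyIn2) . "\n"; // 1|e
```

The positional check reports four differing rows, while the membership check isolates the single row that actually changed.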
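In the code below you will find comments explaining why I do things the way I do. I also tested the code on several large CSV files (~45 MB and 100,000 rows) and got the number of differing rows in under 10 seconds per check.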
<?php
set_time_limit(0);
//create a function to create the CSV arrays.
//If you create the code twice like you did, you are bound to make a mistake or change something in one place and not the other.
//Obviously that could lead to sorting two equal files differently.
function CsvToArray($file)
{
$exclude_cols = array(2); //you didn't provide it, so for testing I blank out the date col because it is always the same
//load file contents into variable and trim it
$data = trim(implode('', file($file)));
//strip \r new line to make sure only \n is used
$data = str_replace("\r", "", $data);
//strip all spaces and tabs around each |
$data = preg_replace('/[ \t]*\|[ \t]*/', '|', $data);
//strip whitespace at the start and end of each line (this also drops blank lines)
$data = preg_replace('/\s*\n\s*/', "\n", $data);
//each line to a separate row
$data = explode("\n", $data);
//each col to a separate field
//This is only needed for the comparison if you want to remove certain cols
//if that's not needed, you can skip this part
foreach($data as $k=>$v)
$data[$k] = explode('|', $v);
//get the header. Its always the first row
//array_shift will return the first element and remove it from the dataset
$header = array_shift($data);
//turn the array around, by making the row the key and counting how many times it occurs
$ar = array();
foreach ($data as $row) {
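//blank out unwanted cols
//if you don't want to remove certain cols, skip the explode/implode parts and use the whole line as the key
foreach($exclude_cols as $c)
$row[$c] = '';
//implode the remaining cols; joining with | keeps adjacent fields from running together
$key = implode('|', $row);
//you can use strtolower($key) for case-insensitive matching
//start the counter at 1 the first time a key is seen, to avoid an undefined index notice
$ar[$key] = isset($ar[$key]) ? $ar[$key] + 1 : 1;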
}
return $ar;
}
function CompareTwoCsv($file1, $file2)
{
$start = microtime(true);
$ar1 = CsvToArray($file1);
$ar2 = CsvToArray($file2);
//check for differences.
$diff = 0;
foreach($ar1 as $k=>$v) {
//the second array doesn't contain the key (the row), so there is a difference
if (!array_key_exists($k, $ar2)) {
$diff+=$v; //every occurrence of this row in the first array counts as different
continue;
}
$c2 = $ar2[$k];
if ($v == $c2) //row is in both files an equal number of times
continue;
$diff += max($v, $c2) - min($v, $c2); //add the number of different rows
}
$ar1_count = count($ar1);
$ar2_count = count($ar2);
//if ar2 has more distinct rows than ar1, every extra one counts as a difference
if ($ar2_count>$ar1_count)
$diff += $ar2_count - $ar1_count;
$end = microtime(true);
$difftime = $end - $start;
//debug output
echo "We found ".$diff." differences in the files. it took ".$difftime." seconds<hr>";
}
//test and test2 are two files with ~100,000 rows based on the data you supplied.
//They have many equal rows, so the array returned from CsvToArray is small
CompareTwoCsv("test.txt", "test.txt");
//We found 0 differences in the files. it took 5.6848769187927 seconds
CompareTwoCsv("test.txt", "test2.txt");
//We found 17855 differences in the files. it took 6.6002569198608 seconds
CompareTwoCsv("test2.txt", "test.txt");
//We found 17855 differences in the files. it took 7.5223989486694 seconds
//randomly generated files with 100,000 rows. Very little duplicate data
CompareTwoCsv("largescv1.txt", "largescv2.txt");
//We found 98250 differences in the files. it took 5.4302139282227 seconds
?>
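One caveat about the final count above: it compares the number of distinct keys in the two arrays, so a row that exists only in the second file can be missed when both files happen to have the same number of distinct rows. A minimal sketch of a symmetric alternative follows. It assumes both inputs have the shape returned by CsvToArray (row key => number of occurrences). `countRowDifferences` is a hypothetical helper name, not part of the code above.

```php
<?php
// Hypothetical helper: count differing rows symmetrically.
// Assumption: each array maps a row key to how often that row occurs.
function countRowDifferences(array $counts1, array $counts2): int
{
    $diff = 0;

    // Every key that occurs in either file.
    $allKeys = array_keys($counts1 + $counts2);

    foreach ($allKeys as $key) {
        // A key missing from one side counts as zero occurrences there.
        $c1 = isset($counts1[$key]) ? $counts1[$key] : 0;
        $c2 = isset($counts2[$key]) ? $counts2[$key] : 0;
        $diff += abs($c1 - $c2);
    }

    return $diff;
}

// Row "b" exists only in the first file, row "c" only in the second.
echo countRowDifferences(
    array('a' => 1, 'b' => 1),
    array('a' => 1, 'c' => 1)
); // 2
```

This counts each unmatched occurrence on either side once, so swapping the two files always gives the same result.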
For "php - array_multisort sorts 2 identical multidimensional arrays differently", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/14253568/