I am currently trying to import a 20 GB CSV file (about 64 million rows, 58 columns) into an MS SQL database.
First I tried to do this with SSIS, but it was too slow, so I decided to try PowerShell and found a nice script here:
High performance import of csv
The script is very fast; I insert roughly 1 million rows per minute. However, I need to be able to handle delimiters embedded in quotes, like this: Column1,"Car,plane,boat",Column3
As the author suggests, I switched from:
$null = $datatable.Rows.Add($line.Split($csvdelimiter))
to:
$null = $datatable.Rows.Add($([regex]::Split($line, $csvSplit, $regexOptions)))
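The lookahead in that pattern permits a split only when the remainder of the line contains balanced quotes, so commas inside a quoted field are skipped. Since the pattern itself is engine-agnostic, it can be sanity-checked in any regex implementation; here is a quick Python sketch (illustration only, with the delimiter left uncaptured, matching what ExplicitCapture produces in .NET):

```python
import re

# Split on a comma only when an even number of quote characters
# remains between it and the end of the line (i.e. the comma is
# not inside a quoted field).
pattern = r',(?=(?:[^"]|"[^"]*")*$)'

line = 'Column1,"Car,plane,boat",Column3'
print(re.split(pattern, line))  # → ['Column1', '"Car,plane,boat"', 'Column3']
```

Note that, unlike a real CSV parser, the split keeps the surrounding quotes on the field; they end up in the database unless stripped separately.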
Full script:
# Database variables
$sqlserver = "server"
$database = "database"
$table = "tablename"
# CSV variables
$csvfile = "filepath"
$csvdelimiter = ","
$firstRowColumnNames = $true
$fieldsEnclosedInQuotes = $true
# Handling of regex for comma problem
if ($fieldsEnclosedInQuotes) {
    $csvSplit = "($csvdelimiter)"
    $csvSplit += '(?=(?:[^"]|"[^"]*")*$)'
} else {
    $csvSplit = $csvdelimiter
}
$regexOptions = [System.Text.RegularExpressions.RegexOptions]::ExplicitCapture
################### No need to modify anything below ###################
Write-Host "Script started..."
$elapsed = [System.Diagnostics.Stopwatch]::StartNew()
[void][Reflection.Assembly]::LoadWithPartialName("System.Data")
[void][Reflection.Assembly]::LoadWithPartialName("System.Data.SqlClient")
# 50k worked fastest and kept memory usage to a minimum
$batchsize = 50000
# Build the sqlbulkcopy connection, and set the timeout to infinite
$connectionstring = "Data Source=$sqlserver;Integrated Security=true;Initial Catalog=$database;"
$bulkcopy = New-Object Data.SqlClient.SqlBulkCopy($connectionstring, [System.Data.SqlClient.SqlBulkCopyOptions]::TableLock)
$bulkcopy.DestinationTableName = $table
$bulkcopy.BulkCopyTimeout = 0
$bulkcopy.BatchSize = $batchsize
# Create the datatable, and autogenerate the columns.
$datatable = New-Object System.Data.DataTable
# Open the text file from disk
$reader = New-Object System.IO.StreamReader($csvfile)
$firstline = (Get-Content $csvfile -First 1)
$columns = [regex]::Split($firstline, $csvSplit, $regexOptions)
if ($firstRowColumnNames -eq $true) { $null = $reader.readLine() }
foreach ($column in $columns) {
    $null = $datatable.Columns.Add()
}
# Read in the data, line by line
while (($line = $reader.ReadLine()) -ne $null) {
    $null = $datatable.Rows.Add([regex]::Split($line, $csvSplit, $regexOptions))
    $i++
    if (($i % $batchsize) -eq 0) {
        $bulkcopy.WriteToServer($datatable)
        Write-Host "$i rows have been inserted in $($elapsed.Elapsed.ToString())."
        $datatable.Clear()
    }
}
# add in all the remaining rows since the last clear
if ($datatable.Rows.Count -gt 0) {
    $bulkcopy.WriteToServer($datatable)
    $datatable.Clear()
}
# Clean Up
$reader.Close(); $reader.Dispose()
$bulkcopy.Close(); $bulkcopy.Dispose()
$datatable.Dispose()
Write-Host "Script complete. $i rows have been inserted into the database."
Write-Host "Total Elapsed Time: $($elapsed.Elapsed.ToString())"
# Sometimes the Garbage Collector takes too long to clear the huge datatable.
[System.GC]::Collect()
pause
Using the regex takes considerably longer:
24 seconds per 50,000 rows (with handling of delimiters embedded in quotes)
2 seconds per 50,000 rows (without that handling)
Am I doing something wrong? Is a regex the right approach? Can I improve the performance of the script in any way, or is this a performance hit I simply have to accept?
Update: added the full script
Best Answer
For large CSVs I would use Microsoft.VisualBasic.FileIO.TextFieldParser. All of the parsing (which is quite advanced, see the example) is done there efficiently.
Don't be put off by the "VisualBasic": it is part of .NET. You just have to add the assembly explicitly, that's all.
Here is a working example with some comments:
# temp data
Set-Content "$PSScriptRoot\z.csv" @'
column1,column2,column3
"data,
""1a""",data2a,data3a
data1b, data2b ,data3b
'@
Add-Type -AssemblyName Microsoft.VisualBasic
$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser "$PSScriptRoot\z.csv" #! .NET resolves relative paths against the process directory, so pass a full path
$reader.SetDelimiters(',') # default is none
$reader.TrimWhiteSpace = $false # default is true
while (!$reader.EndOfData) {
    $reader.LineNumber #! counts non-empty lines
    $reader.ReadFields() | ForEach-Object { "data: '$_'" }
}
$reader.Close()
Remove-Item "$PSScriptRoot\z.csv"
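The speed difference relative to the regex split has a simple explanation: the lookahead has to rescan the rest of the line at every delimiter, while TextFieldParser is a stateful, single-pass parser that just remembers whether it is currently inside a quoted field. The same single-pass behavior can be illustrated with Python's standard csv module (chosen here only for compactness; note it also strips the quotes, as TextFieldParser does):

```python
import csv
import io

line = 'Column1,"Car,plane,boat",Column3'

# A state-machine CSV parser reads the line once, tracking quote
# state, instead of re-examining the tail at every comma.
fields = next(csv.reader(io.StringIO(line)))
print(fields)  # → ['Column1', 'Car,plane,boat', 'Column3']
```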
Regarding ".net - Can I improve this CSV import query performance?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60946853/