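I have a very large JSON Lines file with 4,000,000 rows, and I need to convert several events from each row. The resulting CSV file contains 15,000,000 rows. How can I optimize this script?
I'm using PowerShell Core 7, and the conversion takes about 50 hours to complete.
My PowerShell script: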
$stopwatch = [system.diagnostics.stopwatch]::StartNew()
$totalrows = 4000000
$encoding = [System.Text.Encoding]::UTF8
$i = 0
$ig = 0
$output = @()
$Importfile = "C:\file.jsonl"
$Exportfile = "C:\file.csv"
if (test-path $Exportfile) {
    Remove-Item -path $Exportfile
}
foreach ($line in [System.IO.File]::ReadLines($Importfile, $encoding)) {
    $json = $line | ConvertFrom-Json
    foreach ($item in $json.events.items) {
        $CSVLine = [pscustomobject]@{
            Key = $json.Register.Key
            CompanyID = $json.id
            Eventtype = $item.type
            Eventdate = $item.date
            Eventdescription = $item.description
        }
        $output += $CSVLine
    }
    $i++
    $ig++
    if ($i -ge 30000) {
        $output | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
        $i = 0
        $output = @()
        $minutes = $stopwatch.elapsed.TotalMinutes
        $percentage = $ig / $totalrows * 100
        $totalestimatedtime = $minutes * (100/$percentage)
        $timeremaining = $totalestimatedtime - $minutes
        Write-Host "Events: Total minutes passed: $minutes. Total minutes remaining: $timeremaining. Percentage: $percentage"
    }
}
$output | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
Write-Output $ig
$stopwatch.Stop()
This is the structure of the JSON:
{
  "id": "111111111",
  "name": {
    "name": "Test Company GmbH",
    "legalForm": "GmbH"
  },
  "address": {
    "street": "Berlinstr.",
    "postalCode": "11111",
    "city": "Berlin"
  },
  "status": "liquidation",
  "events": {
    "items": [{
      "type": "Liquidation",
      "date": "2001-01-01",
      "description": "Liquidation"
    }, {
      "type": "NewCompany",
      "date": "2000-01-01",
      "description": "Neueintragung"
    }, {
      "type": "ControlChange",
      "date": "2002-01-01",
      "description": "Tested Company GmbH"
    }]
  },
  "relatedCompanies": {
    "items": [{
      "company": {
        "id": "2222222",
        "name": {
          "name": "Test GmbH",
          "legalForm": "GmbH"
        },
        "address": {
          "city": "Berlin",
          "country": "DE",
          "formattedValue": "Berlin, Deutschland"
        },
        "status": "active"
      },
      "roles": [{
        "date": "2002-01-01",
        "name": "Komplementär",
        "type": "Komplementaer",
        "demotion": true,
        "group": "Control",
        "dir": "Source"
      }, {
        "date": "2001-01-01",
        "name": "Komplementär",
        "type": "Komplementaer",
        "group": "Control",
        "dir": "Source"
      }]
    }, {
      "company": {
        "id": "33333",
        "name": {
          "name": "Test2 GmbH",
          "legalForm": "GmbH"
        },
        "address": {
          "city": "Berlin",
          "country": "DE",
          "formattedValue": "Berlin, Deutschland"
        },
        "status": "active"
      },
      "roles": [{
        "date": "2002-01-01",
        "name": "Komplementär",
        "type": "Komplementaer",
        "demotion": true,
        "group": "Control",
        "dir": "Source"
      }, {
        "date": "2001-01-01",
        "name": "Komplementär",
        "type": "Komplementaer",
        "group": "Control",
        "dir": "Source"
      }]
    }]
  }
}
Best answer
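As noted in the comments: try to avoid using the increase assignment operator (+=) to create a collection. PowerShell arrays have a fixed size, so each += allocates a new array and copies every existing element into it, which gets slower as the collection grows.
Use the PowerShell pipeline instead, for example: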
$stopwatch = [system.diagnostics.stopwatch]::StartNew()
$totalrows = 4000000
$encoding = [System.Text.Encoding]::UTF8
$i = 0
$ig = 0
$Importfile = "C:\file.jsonl"
$Exportfile = "C:\file.csv"
if (test-path $Exportfile) {
    Remove-Item -path $Exportfile
}
Get-Content $Importfile -Encoding $encoding | Foreach-Object {
    # Parse the current line once, then emit one object per event
    $json = $_ | ConvertFrom-Json
    $json.events.items | Foreach-Object {
        [pscustomobject]@{
            Key = $json.Register.Key
            CompanyID = $json.id
            Eventtype = $_.type
            Eventdate = $_.date
            Eventdescription = $_.description
        }
    }
    $i++
    $ig++
    # Report progress every 30000 input lines (Write-Host does not enter the pipeline)
    if ($i -ge 30000) {
        $i = 0
        $minutes = $stopwatch.elapsed.TotalMinutes
        $percentage = $ig / $totalrows * 100
        $totalestimatedtime = $minutes * (100/$percentage)
        $timeremaining = $totalestimatedtime - $minutes
        Write-Host "Events: Total minutes passed: $minutes. Total minutes remaining: $timeremaining. Percentage: $percentage"
    }
} | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
Write-Output $ig
$stopwatch.Stop()
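To see why += matters here, the following is a minimal timing sketch, not a benchmark of the script above. The element count `$n` and the variable names are arbitrary choices for illustration, and the measured numbers will vary with machine and PowerShell version.

```powershell
# Illustrative comparison only; $n is an arbitrary size picked for this demo.
$n = 20000

# Pattern from the question: append to an array with += inside the loop.
$slow = Measure-Command {
    $output = @()
    foreach ($k in 1..$n) {
        $output += [pscustomobject]@{ Id = $k }
    }
}

# Alternative 1: let PowerShell collect the loop output in a single assignment.
$fast = Measure-Command {
    $output = foreach ($k in 1..$n) {
        [pscustomobject]@{ Id = $k }
    }
}

# Alternative 2: a generic list, which grows without copying on every Add.
$list = Measure-Command {
    $output = [System.Collections.Generic.List[object]]::new()
    foreach ($k in 1..$n) {
        $output.Add([pscustomobject]@{ Id = $k })
    }
}

"array with +=    : {0:N0} ms" -f $slow.TotalMilliseconds
"loop assignment  : {0:N0} ms" -f $fast.TotalMilliseconds
"generic list     : {0:N0} ms" -f $list.TotalMilliseconds
```

Streaming straight into Export-Csv, as in the answer's script, goes one step further than either alternative because no intermediate collection is held in memory at all.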
Update 2020-05-07
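Based on the comments and the additional information in the question, I have written a small reusable cmdlet that uses the PowerShell pipeline to read through a .jsonl (JSON Lines) file. It collects each line until it finds one that ends with a closing '}' character, then checks whether the collected text is a valid JSON string (using Test-Json, because the closing brace might belong to an embedded object). If it is valid, it releases the object into the pipeline and starts collecting lines again:
Function ConvertFrom-JsonLines {
    [CmdletBinding()][OutputType([Object[]])]Param (
        [Parameter(ValueFromPipeLine = $True, Mandatory = $True)][String]$Line
    )
    Begin { $JsonLines = [System.Collections.Generic.List[String]]@() }
    Process {
        # Buffer the line; only try to parse when it could close an object
        $JsonLines.Add($Line)
        If ( $Line.Trim().EndsWith('}') ) {
            $Json = $JsonLines -Join [Environment]::NewLine
            If ( Test-Json $Json -ErrorAction SilentlyContinue ) {
                $Json | ConvertFrom-Json
                $JsonLines.Clear()
            }
        }
    }
}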
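The reason for the extra Test-Json check can be seen with a small sketch. The two strings below are made-up fragments, not taken from the real file: the first ends with '}' but only closes a nested object, so it should not yet be handed to ConvertFrom-Json.

```powershell
# Hypothetical fragment: the last character is '}', but the outer object is still open.
$partial  = '{ "id": "1", "name": { "name": "Test" }'
# Appending the final closing brace completes the outer object.
$complete = $partial + [Environment]::NewLine + '}'

# -ErrorAction SilentlyContinue suppresses the error record that Test-Json
# writes for invalid input, leaving only the boolean result.
$partialOk  = Test-Json $partial  -ErrorAction SilentlyContinue
$completeOk = Test-Json $complete -ErrorAction SilentlyContinue

"partial is valid JSON:  $partialOk"
"complete is valid JSON: $completeOk"
```

Test-Json is only available in PowerShell 6.1 and later, so this approach assumes PowerShell Core, as stated in the question.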
You can use it like this:
Get-Content .\file.jsonl | ConvertFrom-JsonLines | ForEach-Object { $_.events.items } |
Export-Csv -Path $Exportfile -NoTypeInformation -Encoding UTF8
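Note that this pipeline passes along only the event items, so the CSV gets the type, date and description columns but not the company id. If the id is needed next to each event, as in the question's script, one possible sketch is shown here. It uses a shortened, single-line sample record in place of the real file, and the name `$rows` is just for illustration.

```powershell
# Shortened single-line sample record based on the structure in the question.
$line = '{"id":"111111111","events":{"items":[{"type":"Liquidation","date":"2001-01-01","description":"Liquidation"},{"type":"NewCompany","date":"2000-01-01","description":"Neueintragung"}]}}'

$rows = $line | ConvertFrom-Json | ForEach-Object {
    # Keep a reference to the parent record so its id can be added to every event row.
    $company = $_
    foreach ($item in $company.events.items) {
        [pscustomobject]@{
            CompanyID        = $company.id
            Eventtype        = $item.type
            Eventdate        = $item.date
            Eventdescription = $item.description
        }
    }
}

# Print as CSV text here; for the real file, pipe to Export-Csv instead.
$rows | ConvertTo-Csv -Delimiter ';' -NoTypeInformation
```

With the real file, the same ForEach-Object block could be placed after ConvertFrom-JsonLines in the pipeline shown above.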
Regarding "json - How to optimize this PowerShell script that converts JSON to CSV?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61612772/