xml - "Out of memory" while parsing a large (100 MB) XML file with Perl

Tags: xml perl xml-twig

I get an "Out of memory" error while parsing a large (100 MB) XML file:

use strict;
use warnings;
use XML::Twig;

my $twig=XML::Twig->new();
my $data = XML::Twig->new
             ->parsefile("divisionhouserooms-v3.xml")
               ->simplify( keyattr => []);

my @good_division_numbers = qw( 30 31 32 35 38 );

foreach my $property ( @{ $data->{DivisionHouseRoom}}) {

    my $house_code = $property->{HouseCode};
    print $house_code, "\n";

    my $amount_of_bedrooms = 0;

    foreach my $division ( @{ $property->{Divisions}->{Division} } ) {

        next unless grep { $_ eq $division->{DivisionNumber} } @good_division_numbers;
        $amount_of_bedrooms += $division->{DivisionQuantity};
    }

    open my $fh, ">>", "Result.csv" or die $!;
    print $fh join("\t", $house_code, $amount_of_bedrooms), "\n";
    close $fh;
}

What can I do to fix this error?

Best answer

Handling large XML files that do not fit in memory is exactly what XML::Twig advertises:

One of the strengths of XML::Twig is that it let you work with files that do not fit in memory (BTW storing an XML document in memory as a tree is quite memory-expensive, the expansion factor being often around 10).

To do this you can define handlers, that will be called once a specific element has been completely parsed. In these handlers you can access the element and process it as you see fit (...)


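The code posted in the question makes no use of XML::Twig's strengths (calling the simplify method makes it no better than XML::Simple: the whole document is still loaded into memory as a tree).

What the code is missing are "twig_handlers" or "twig_roots", which essentially make the parser concentrate on the relevant parts of the XML document in a memory-efficient way.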
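To illustrate the two options, here is a minimal sketch that runs the same callback once under twig_handlers and once under twig_roots. The sample document, the `Houses` and `Meta` element names, and the `house_codes` helper are assumptions made up for the illustration; the question does not show the layout of the real file.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample; the layout of the real file is not shown in the question.
my $sample = <<'XML';
<Houses>
  <Meta>not needed</Meta>
  <DivisionHouseRoom><HouseCode>A1</HouseCode></DivisionHouseRoom>
  <DivisionHouseRoom><HouseCode>B2</HouseCode></DivisionHouseRoom>
</Houses>
XML

# Hypothetical helper: returns the house codes found in $xml, using either
# 'twig_handlers' or 'twig_roots' (passed as $option) to trigger the callback.
sub house_codes {
    my ( $option, $xml ) = @_;
    my @codes;
    my $twig = XML::Twig->new(
        $option => {
            DivisionHouseRoom => sub {
                my ( $t, $elt ) = @_;
                push @codes, $elt->first_child_text('HouseCode');
                $t->purge;    # release everything parsed so far
            },
        },
    );
    $twig->parse($xml);
    return @codes;
}

# twig_handlers: the whole document goes through the tree, chunk by chunk.
print join( ',', house_codes( twig_handlers => $sample ) ), "\n";

# twig_roots: only the DivisionHouseRoom subtrees are built at all.
print join( ',', house_codes( twig_roots => $sample ) ), "\n";
```

Both calls return the same codes here. The difference is in what gets built: with twig_roots the `Meta` element never becomes part of the tree, while with twig_handlers it does until the next purge.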

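Without seeing the XML it is hard to say whether processing the document chunk-by-chunk or processing just selected parts is the way to go, but either one should solve the problem.

So the code should look something like this (chunk-by-chunk demo):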

use strict;
use warnings;
use XML::Twig;
use List::Util 'sum';   # To make life easier
use Data::Dump 'dump';  # To see what's going on

my %bedrooms;           # Data structure to store the wanted info

my $xml = XML::Twig->new (
                          twig_roots => {
                                          DivisionHouseRoom => \&count_bedrooms,
                                        }
                         );

$xml->parsefile( 'divisionhouserooms-v3.xml');

sub count_bedrooms {

    my ( $twig, $element ) = @_;

    my @divParents = $element->children( 'Divisions' );
    my $id = $element->first_child_text( 'HouseCode' );

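    # Division numbers that count as bedrooms (taken from the question)
    my %wanted = map { $_ => 1 } qw( 30 31 32 35 38 );

    for my $divParent ( @divParents ) {
        # Assumes DivisionNumber and DivisionQuantity are child elements
        my @divisions = grep { $wanted{ $_->first_child_text( 'DivisionNumber' ) } }
                        $divParent->children( 'Division' );
        # Leading 0 keeps the sum defined when no division matches
        $bedrooms{$id} += sum 0, map { $_->first_child_text( 'DivisionQuantity' ) } @divisions;
    }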

    $element->purge;   # Free up memory
}

dump \%bedrooms;
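
The question writes one tab-separated line per property to Result.csv, so the handler can also print each row as soon as it is parsed instead of collecting everything in a hash. The following is only a sketch under assumptions: the sample document, the `Houses` root element, and the `write_bedroom_counts` helper are hypothetical, and DivisionNumber and DivisionQuantity are assumed to be child elements of Division.

```perl
use strict;
use warnings;
use XML::Twig;

# Division numbers that count as bedrooms (taken from the question)
my %wanted = map { $_ => 1 } qw( 30 31 32 35 38 );

# Hypothetical sample; the real file's layout is not shown in the question.
my $sample = <<'XML';
<Houses>
  <DivisionHouseRoom>
    <HouseCode>A1</HouseCode>
    <Divisions>
      <Division><DivisionNumber>30</DivisionNumber><DivisionQuantity>2</DivisionQuantity></Division>
      <Division><DivisionNumber>99</DivisionNumber><DivisionQuantity>5</DivisionQuantity></Division>
      <Division><DivisionNumber>35</DivisionNumber><DivisionQuantity>1</DivisionQuantity></Division>
    </Divisions>
  </DivisionHouseRoom>
  <DivisionHouseRoom>
    <HouseCode>B2</HouseCode>
    <Divisions/>
  </DivisionHouseRoom>
</Houses>
XML

# Hypothetical helper: parses the XML string $source and prints one
# tab-separated "HouseCode, bedrooms" line per property to $out.
sub write_bedroom_counts {
    my ( $source, $out ) = @_;

    XML::Twig->new(
        twig_roots => {
            DivisionHouseRoom => sub {
                my ( $twig, $house ) = @_;
                my $bedrooms = 0;
                for my $division ( $house->descendants('Division') ) {
                    next unless $wanted{ $division->first_child_text('DivisionNumber') };
                    $bedrooms += $division->first_child_text('DivisionQuantity') || 0;
                }
                print {$out} join( "\t", $house->first_child_text('HouseCode'), $bedrooms ), "\n";
                $twig->purge;    # the row is written, so the subtree can go
            },
        },
    )->parse($source);
}

write_bedroom_counts( $sample, \*STDOUT );
```

For the real file, replace `parse($source)` with `parsefile('divisionhouserooms-v3.xml')` and pass a handle opened once on Result.csv. Opening the output a single time also avoids reopening the file for every property, as the code in the question does.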

Source: the Stack Overflow question "Out of memory" while parsing a large (100 MB) XML file with Perl: https://stackoverflow.com/questions/7293687/
