mysql - Parsing PubMed XML to submit to mySQL database (XML::Twig)

Tags: mysql perl xml-parsing xml-twig pubmed

I'm new to XML::Twig, and I'm trying to parse a PubMed XML 2.0 final summary into a mySQL database. I've gotten this far:

#!/bin/perl -w
use strict;
use DBI;
use XML::Twig;

my $uid = "";
my $title = "";
my $sortpubdate = "";
my $sortfirstauthor = "";
my $dbh = DBI->connect ("DBI:mysql:medline:localhost:80",
                           "root", "mysql");
my $t= new XML::Twig(   twig_roots => { 'DocumentSummary' => $uid => \&submit },
                        twig_handlers => { 'DocumentSummary/Title' => $title, 'DocumentSummary/SortPubDate' => $sortpubdate, 'DocumentSummary/SortFirstAuthor' => $sortfirstauthor});
$t->parsefile('20112.xml'); 
$dbh->disconnect();
exit;

sub submit
    {   my $insert= $dbh->prepare( "INSERT INTO medline_citation (uid, title, sortpubdate, sortfirstauthor) VALUES (?, ?, ?, ?);");
        $insert->bind_param( 1, $uid);
        $insert->bind_param( 2, $title);
        $insert->bind_param( 3, $sortpubdate);
        $insert->bind_param( 4, $sortfirstauthor);
        $insert->execute();
        $t->purge;
    }

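But Perl seems to stall for some reason. Am I doing this right? I'm trying to use twig_roots to reduce the amount of parsing, since I'm only interested in a few fields (these are large files).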

Here is a sample of the XML:

<DocumentSummary uid="22641317">
    <PubDate>2012 Jun 1</PubDate>
    <EPubDate></EPubDate>
    <Source>Clin J Oncol Nurs</Source>
    <Authors>
        <Author>
            <Name>Park SH</Name>
            <AuthType>
                Author
            </AuthType>
            <ClusterID>0</ClusterID>
        </Author>
        <Author>
            <Name>Knobf MT</Name>
            <AuthType>
                Author
            </AuthType>
            <ClusterID>0</ClusterID>
        </Author>
        <Author>
            <Name>Sutton KM</Name>
            <AuthType>
                Author
            </AuthType>
            <ClusterID>0</ClusterID>
        </Author>
    </Authors>
    <LastAuthor>Sutton KM</LastAuthor>
    <Title>Etiology, assessment, and management of aromatase inhibitor-related musculoskeletal symptoms.</Title>
    <SortTitle>etiology assessment and management of aromatase inhibitor related musculoskeletal symptoms </SortTitle>
    <Volume>16</Volume>
    <Issue>3</Issue>
    <Pages>260-6</Pages>
    <Lang>
        <string>eng</string>
    </Lang>
    <NlmUniqueID>9705336</NlmUniqueID>
    <ISSN>1092-1095</ISSN>
    <ESSN>1538-067X</ESSN>
    <PubType>
        <flag>Journal Article</flag>
    </PubType>
    <RecordStatus>
        PubMed - in process
    </RecordStatus>
    <PubStatus>4</PubStatus>
    <ArticleIds>
        <ArticleId>
            <IdType>pii</IdType>
            <IdTypeN>4</IdTypeN>
            <Value>N1750TW804546361</Value>
        </ArticleId>
        <ArticleId>
            <IdType>doi</IdType>
            <IdTypeN>3</IdTypeN>
            <Value>10.1188/12.CJON.260-266</Value>
        </ArticleId>
        <ArticleId>
            <IdType>pubmed</IdType>
            <IdTypeN>1</IdTypeN>
            <Value>22641317</Value>
        </ArticleId>
        <ArticleId>
            <IdType>rid</IdType>
            <IdTypeN>8</IdTypeN>
            <Value>22641317</Value>
        </ArticleId>
        <ArticleId>
            <IdType>eid</IdType>
            <IdTypeN>8</IdTypeN>
            <Value>22641317</Value>
        </ArticleId>
    </ArticleIds>
    <History>
        <PubMedPubDate>
            <PubStatus>entrez</PubStatus>
            <Date>2012/05/30 06:00</Date>
        </PubMedPubDate>
        <PubMedPubDate>
            <PubStatus>pubmed</PubStatus>
            <Date>2012/05/30 06:00</Date>
        </PubMedPubDate>
        <PubMedPubDate>
            <PubStatus>medline</PubStatus>
            <Date>2012/05/30 06:00</Date>
        </PubMedPubDate>
    </History>
    <References>
    </References>
    <Attributes>
        <flag>Has Abstract</flag>
    </Attributes>
    <PmcRefCount>0</PmcRefCount>
    <FullJournalName>Clinical journal of oncology nursing</FullJournalName>
    <ELocationID></ELocationID>
    <ViewCount>0</ViewCount>
    <DocType>citation</DocType>
    <SrcContribList>
    </SrcContribList>
    <BookTitle></BookTitle>
    <Medium></Medium>
    <Edition></Edition>
    <PublisherLocation></PublisherLocation>
    <PublisherName></PublisherName>
    <SrcDate></SrcDate>
    <ReportNumber></ReportNumber>
    <AvailableFromURL></AvailableFromURL>
    <LocationLabel></LocationLabel>
    <DocContribList>
    </DocContribList>
    <DocDate></DocDate>
    <BookName></BookName>
    <Chapter></Chapter>
    <SortPubDate>2012/06/01 00:00</SortPubDate>
    <SortFirstAuthor>Park SH</SortFirstAuthor>
</DocumentSummary>

Thanks!

Best answer

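The way I would do it is to have a single handler for DocumentSummary, which feeds the database and then purges the record. There is no need to get any fancier than that.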
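For readers who want to stay with plain DBI as in the question, that single-handler approach could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database and an inline two-element sample document (the DocumentSummarySet wrapper is assumed) so that it runs standalone, and the MySQL DSN in the comment assumes the server listens on MySQL's default port 3306 rather than port 80 as in the question's DSN. Note that XML::Twig expects handlers to be code references, which it calls with the twig and the current element, whereas the question's twig_handlers map paths to plain scalars.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use DBI;
use XML::Twig;

# Assumption: in-memory SQLite so the sketch runs on its own. For MySQL, a DSN
# such as "DBI:mysql:database=medline;host=localhost;port=3306" would be used.
my $dbh = DBI->connect( "dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 } );
$dbh->do( "CREATE TABLE medline_citation
             (uid TEXT, title TEXT, sortpubdate TEXT, sortfirstauthor TEXT)" );

# Prepare the statement once, outside the handler, and reuse it for every record.
my $insert = $dbh->prepare(
    "INSERT INTO medline_citation (uid, title, sortpubdate, sortfirstauthor)
     VALUES (?, ?, ?, ?)" );

# Hypothetical cut-down sample standing in for 20112.xml.
my $xml = <<'END_XML';
<DocumentSummarySet>
  <DocumentSummary uid="22641317">
    <Title>Etiology, assessment, and management of aromatase inhibitor-related musculoskeletal symptoms.</Title>
    <SortPubDate>2012/06/01 00:00</SortPubDate>
    <SortFirstAuthor>Park SH</SortFirstAuthor>
  </DocumentSummary>
</DocumentSummarySet>
END_XML

# Only DocumentSummary elements are built; submit() runs once per element.
XML::Twig->new( twig_roots => { DocumentSummary => \&submit } )->parse($xml);

my ($count) = $dbh->selectrow_array("SELECT COUNT(*) FROM medline_citation");
my @row     = $dbh->selectrow_array("SELECT uid, sortfirstauthor FROM medline_citation");
print "inserted $count row(s); first row: @row\n";
$dbh->disconnect;

sub submit {
    my ( $t, $summary ) = @_;    # the twig and the element that was just parsed
    $insert->execute( $summary->att('uid'),
                      map { $summary->field($_) } qw( Title SortPubDate SortFirstAuthor ) );
    $t->purge;                   # release the memory held by the processed record
}
```

For a real file, parse($xml) would be replaced by parsefile('20112.xml'), and the CREATE TABLE statement dropped if the table already exists.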

Also, I find DBIx::Simple, well, simpler to use than raw DBI; it takes care of preparing and caching statements for me:

#!/bin/perl

use strict;
use warnings;

use DBIx::Simple;
use XML::Twig;

my $db = DBIx::Simple->connect ("dbi:SQLite:dbname=t.db"); # replace by your DSN

# build only the DocumentSummary elements and call submit() on each of them
my $t= XML::Twig->new(   twig_roots => { DocumentSummary => \&submit },)
                ->parsefile('20112.xml'); 

$db->disconnect();
exit;

# handler: called with the twig and the DocumentSummary element just parsed
sub submit
    { my( $t, $summary)= @_;
      my $insert= $db->query( "INSERT INTO medline_citation (uid, title, sortpubdate, sortfirstauthor) VALUES (?, ?, ?, ?);",
                              $summary->att( 'uid'),
                              map { $summary->field( $_) } ( qw( Title SortPubDate SortFirstAuthor))
                            );
      $t->purge; # free the memory used by the record once it is stored
    }

In case you're wondering about map { $summary->field( $_) } ( qw( Title SortPubDate SortFirstAuthor)), it's just a fancier (and IMHO more maintainable) way of writing $summary->field('Title'), $summary->field('SortPubDate'), $summary->field('SortFirstAuthor').
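To see that the two spellings really produce the same list, here is a small standalone sketch. The inline XML string and its "Short title" value are made up for illustration; only the element names come from the sample in the question.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical one-record document, parsed from a string instead of a file.
my $t = XML::Twig->new->parse(
    '<DocumentSummary uid="22641317">'
  . '<Title>Short title</Title>'
  . '<SortPubDate>2012/06/01 00:00</SortPubDate>'
  . '<SortFirstAuthor>Park SH</SortFirstAuthor>'
  . '</DocumentSummary>' );
my $summary = $t->root;

# map calls field() once per name and returns the results as a list.
my @via_map = map { $summary->field($_) } qw( Title SortPubDate SortFirstAuthor );

# The same list, spelled out call by call.
my @spelled_out = ( $summary->field('Title'),
                    $summary->field('SortPubDate'),
                    $summary->field('SortFirstAuthor') );

print join( " | ", @via_map ), "\n";
print "identical lists\n" if "@via_map" eq "@spelled_out";
```

With the map form, adding another column only means adding one more name to the qw( ) list (plus the matching column and placeholder in the SQL).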

Regarding "mysql - Parsing PubMed XML to submit to mySQL database (XML::Twig)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/10837564/
