perl - Convert .sgm to .txt

Tags: perl text speech sgml

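I have some files in .sgm format that I need to evaluate (apply a language model and obtain the complexity of the text).

The main problem is that I need these files in plain format, i.e. as .txt. I have searched the internet for an online converter or a script that does this, but could not find one.

Besides that, one of my teachers sent me this Perl command: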

perl -n 'print $1."\n" if /<seg[^>]+>\s*(.*\S)\s*<.seg>/i;’ < file.sgm > file

I have never used Perl and, honestly, know nothing about it. I think I have Perl installed:

$ perl -v

This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)

Copyright 1987-2013, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl".  If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.

By the way, I am using Mac OS X.

Example .sgm file:

<srcset setid="newsdiscusstest2015" srclang="any">
<doc sysid="ref" docid="39-Guardian" genre="newsdiscuss" origlang="en">
<p>
<seg id="1">This is perfectly illustrated by the UKIP numbties banning people with HIV.</seg>
<seg id="2">You mean Nigel Farage saying the NHS should not be used to pay for people coming to the UK as health tourists, and saying yes when the interviewer specifically asked if, with the aforementioned in mind, people with HIV were included in not being welcome.</seg>
<seg id="3">You raise a straw man and then knock it down with thinly veiled homophobia.</seg>

Desired output .txt file:

This is perfectly illustrated by the UKIP numbties banning people with HIV. You mean Nigel Farage saying the NHS should not be used to pay for people coming to the UK as health tourists, and saying yes when the interviewer specifically asked if, with the aforementioned in mind, people with HIV were included in not being welcome. You raise a straw man and then knock it down with thinly veiled homophobia.

Best answer

You can try this script to strip the SGML tags from the file:

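#!/usr/bin/env perl
use strict;
use warnings;

use HTML::Parser;

# Path of the SGML file to convert, taken from the command line.
my $file = $ARGV[0] or die "Usage: $0 file.sgm\n";

# Ignore every parser event by default (tags, comments, declarations)
# and print only the text found between the tags.
HTML::Parser->new(
    default_h => [""],
    text_h    => [ sub { print shift }, 'text' ],
)->parse_file($file) or die "Failed to parse $file: $!";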

Save it as strip_sgml.pl, make it executable (chmod +x strip_sgml.pl), and use it as follows:

./strip_sgml.pl file.sgm > file.txt
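
If HTML::Parser is not installed, a regex-based approach in the spirit of the one-liner from the question may be enough for files shaped like the sample. Note that the one-liner as quoted cannot run as is: it passes inline code to perl without the -e switch, and it ends with a typographic quote (’) instead of a plain single quote. The following is only a minimal sketch using core Perl. The file name seg_to_txt.pl and the sub name extract_segments are made up for illustration, and it assumes every segment is wrapped in <seg ...>...</seg> with no nested markup that needs special handling.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): return the text of
# every <seg ...>...</seg> element in $sgml, in document order, with leading
# and trailing whitespace trimmed.
sub extract_segments {
    my ($sgml) = @_;
    my @segments;
    # /g: find all matches, /i: case-insensitive tag names,
    # /s: let "." match newlines so a segment may span several lines.
    while ($sgml =~ m{<seg\b[^>]*>\s*(.*?)\s*</seg>}gis) {
        push @segments, $1;
    }
    return @segments;
}

# Process the files named on the command line, if any.
if (@ARGV) {
    local $/;    # slurp mode: each read returns one whole file
    my @segments;
    while (my $sgml = <>) {
        push @segments, extract_segments($sgml);
    }
    # Join the segments with single spaces, as in the desired output above.
    print join(' ', @segments), "\n";
}
```

Assuming it is saved as seg_to_txt.pl, it would be run the same way: perl seg_to_txt.pl file.sgm > file.txt. Unlike the HTML::Parser script, this sketch drops all line breaks between segments and ignores any text outside <seg> elements.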

Regarding perl - Convert .sgm to .txt, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36827515/
