r - 如何使用 R 将 rtf 字符串转换为纯文本?

标签 r string rtf plaintext

我有很多 rtf字符串 ( Base64 ),我想使用 R 获取纯文本。可能吗?下面有一个例子。

有很多使用其他语言的方法,但如果我找到一种“R 方法”来完成这项工作,它将非常有用。

rtfString <- "e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcZGVmZjBcZGVmbGFuZzEwNDZcZGVmbGFuZ2ZlMTA0NlxkZWZ0YWI3MDl7XGZvbnR0Ymx7XGYwXGZzd2lzc1xmcHJxMlxmY2hhcnNldDAgQXJpYWw7fX0NClx2aWV3a2luZDRcdWMxXHBhcmRcc2w0ODBcc2xtdWx0MVxxalxmMFxmczI0XHRhYlxiIE8gU1IuIERVRElNQVIgUEFYSVVCQSBcYjAgKFBTREItUEEuIFNlbSByZXZpc1wnZTNvIGRvIG9yYWRvci4pIC0gU3IuIFByZXNpZGVudGUsIFNyYXMuIGUgU3JzLiBQYXJsYW1lbnRhcmVzLCBvY3VwbyBlc3RhIHRyaWJ1bmEgcGFyYSBwYXJhYmVuaXphciBhIHRvcmNpZGEgcGFyYWVuc2UuIE8gZnV0ZWJvbCBwYXJhZW5zZSBkZXUgdW0gXGkgc2hvd1xpMCAgZGUgY2l2aWxpZGFkZSBuZXN0ZSBmaW5hbCBkZSBzZW1hbmEsIGUgYSB0b3JjaWRhIGJpY29sb3IgZG8gUGFwXCdlM28gZGEgQ3VydXp1LCBvIFBheXNhbmR1LCBlc3RcJ2UxIGRlIHBhcmFiXCdlOW5zLCBwb2lzIHNhZ3JvdS1zZSBjYW1wZVwnZTNvIGRvIHByaW1laXJvIHR1cm5vIGVtIGNpbWEgZG8gc2V1IG1haW9yIHJpdmFsLCB2ZW5jZW5kbyBvIHZhbG9yb3NvIENsdWJlIGRvIFJlbW8uDQpccGFyIFx0YWIgRXN0XCdlM28gZGUgcGFyYWJcJ2U5bnMgbyBQYXlzYW5kdSwgbyBHb3Zlcm5vIGRvIEVzdGFkbywgcXVlIGRldSB1bSBcaSBzaG93IFxpMCBkZSBvcmdhbml6YVwnZTdcJ2UzbywgYSBKdXN0aVwnZTdhIHBhcmFlbnNlLCBhIHBvbFwnZWRjaWEsIG9zIFwnZjNyZ1wnZTNvcyBkZSBzZWd1cmFuXCdlN2EgZG8gRXN0YWRvLiBFbmZpbSwgbWFpcyB1bWEgdmV6LCBwYXJhYlwnZTlucyBcJ2UwIHRvcmNpZGEgYmljb2xvci4NClxwYXIgXHRhYiBQYXlzYW5kdSwgbXVpdGFzIGUgbXVpdGFzIGdsXCdmM3JpYXMgdm9jXCdlYSBhaW5kYSBkYXJcJ2UxIHBhcmEgZXNzYSBzdWEgYnJpbGhhbnRlIHRvcmNpZGEsIHF1ZSBcJ2U5IGEgdG9yY2lkYSBiaWNvbG9yIGRlIEJlbFwnZTltIGRvIFBhclwnZTEuDQpccGFyIFx0YWIgTXVpdG8gb2JyaWdhZG8sIFNyLiBQcmVzaWRlbnRlLg0KXHBhciANClxwYXIgXHBhcmRcc2EyMDBcc2wyNzZcc2xtdWx0MSANClxwYXIgDQpccGFyIFxwYXJkXHNsNDgwXHNsbXVsdDFccWogDQpccGFyIH0NCgA="

plainText <- function(rtfString)

# The result will be something similar to this:

plainText

[1] "Sr. Presidente, Sras. e Srs. Parlamentares, ocupo esta tribuna para parabenizar a torcida paraense. O futebol paraense deu um show de civilidade neste final de semana, e a torcida bicolor do Papão da Curuzu, o Paysandu, está de parabéns, pois sagrou-se campeão do primeiro turno em cima do seu maior rival, vencendo o valoroso Clube do Remo.\nEstão de parabéns o Paysandu, o Governo do Estado, que deu um show de organização, a Justiça paraense, a polícia, os órgãos de segurança do Estado. Enfim, mais uma vez, parabéns à torcida bicolor.\nPaysandu, muitas e muitas glórias você ainda dará para essa sua brilhante torcida, que é a torcida bicolor de Belém do Pará.\nMuito obrigado, Sr. Presidente."

最佳答案

一些包和一些正则表达式的组合可以完成这个:

library(RCurl)
library(stringr)
library(magrittr)

decode_rtf <- function(txt) {

  txt %>%
    base64Decode %>%
    str_replace_all("\\\\'e3", "ã") %>%
    str_replace_all("\\\\'e1", "á") %>%
    str_replace_all("\\\\'e9", "é") %>%
    str_replace_all("\\\\'e7", "ç") %>%
    str_replace_all("\\\\'ed", "í") %>%
    str_replace_all("\\\\'f3", "ó") %>%
    str_replace_all("\\\\'ea", "ê") %>%
    str_replace_all("\\\\'e0", "à") %>%
    str_replace_all("(\\\\[[:alnum:]']+|[\\r\\n]|^\\{|\\}$)", "") %>%
    str_replace_all("\\{\\{[[:alnum:]; ]+\\}\\}", "") %>%
    str_trim

}

rtfString <- "e1xydGYxXGFuc2lcYW5zaWNwZzEyNTJcZGVmZjBcZGVmbGFuZzEwNDZcZGVmbGFuZ2ZlMTA0NlxkZWZ0YWI3MDl7XGZvbnR0Ymx7XGYwXGZzd2lzc1xmcHJxMlxmY2hhcnNldDAgQXJpYWw7fX0NClx2aWV3a2luZDRcdWMxXHBhcmRcc2w0ODBcc2xtdWx0MVxxalxmMFxmczI0XHRhYlxiIE8gU1IuIERVRElNQVIgUEFYSVVCQSBcYjAgKFBTREItUEEuIFNlbSByZXZpc1wnZTNvIGRvIG9yYWRvci4pIC0gU3IuIFByZXNpZGVudGUsIFNyYXMuIGUgU3JzLiBQYXJsYW1lbnRhcmVzLCBvY3VwbyBlc3RhIHRyaWJ1bmEgcGFyYSBwYXJhYmVuaXphciBhIHRvcmNpZGEgcGFyYWVuc2UuIE8gZnV0ZWJvbCBwYXJhZW5zZSBkZXUgdW0gXGkgc2hvd1xpMCAgZGUgY2l2aWxpZGFkZSBuZXN0ZSBmaW5hbCBkZSBzZW1hbmEsIGUgYSB0b3JjaWRhIGJpY29sb3IgZG8gUGFwXCdlM28gZGEgQ3VydXp1LCBvIFBheXNhbmR1LCBlc3RcJ2UxIGRlIHBhcmFiXCdlOW5zLCBwb2lzIHNhZ3JvdS1zZSBjYW1wZVwnZTNvIGRvIHByaW1laXJvIHR1cm5vIGVtIGNpbWEgZG8gc2V1IG1haW9yIHJpdmFsLCB2ZW5jZW5kbyBvIHZhbG9yb3NvIENsdWJlIGRvIFJlbW8uDQpccGFyIFx0YWIgRXN0XCdlM28gZGUgcGFyYWJcJ2U5bnMgbyBQYXlzYW5kdSwgbyBHb3Zlcm5vIGRvIEVzdGFkbywgcXVlIGRldSB1bSBcaSBzaG93IFxpMCBkZSBvcmdhbml6YVwnZTdcJ2UzbywgYSBKdXN0aVwnZTdhIHBhcmFlbnNlLCBhIHBvbFwnZWRjaWEsIG9zIFwnZjNyZ1wnZTNvcyBkZSBzZWd1cmFuXCdlN2EgZG8gRXN0YWRvLiBFbmZpbSwgbWFpcyB1bWEgdmV6LCBwYXJhYlwnZTlucyBcJ2UwIHRvcmNpZGEgYmljb2xvci4NClxwYXIgXHRhYiBQYXlzYW5kdSwgbXVpdGFzIGUgbXVpdGFzIGdsXCdmM3JpYXMgdm9jXCdlYSBhaW5kYSBkYXJcJ2UxIHBhcmEgZXNzYSBzdWEgYnJpbGhhbnRlIHRvcmNpZGEsIHF1ZSBcJ2U5IGEgdG9yY2lkYSBiaWNvbG9yIGRlIEJlbFwnZTltIGRvIFBhclwnZTEuDQpccGFyIFx0YWIgTXVpdG8gb2JyaWdhZG8sIFNyLiBQcmVzaWRlbnRlLg0KXHBhciANClxwYXIgXHBhcmRcc2EyMDBcc2wyNzZcc2xtdWx0MSANClxwYXIgDQpccGFyIFxwYXJkXHNsNDgwXHNsbXVsdDFccWogDQpccGFyIH0NCgA="

decode_rtf(rtfString)

## [1] "O SR. DUDIMAR PAXIUBA  (PSDB-PA. Sem revisão do orador.) - Sr. Presidente, Sras. e Srs. Parlamentares, ocupo esta tribuna para parabenizar a torcida paraense. O futebol paraense deu um  show  de civilidade neste final de semana, e a torcida bicolor do Papão da Curuzu, o Paysandu, está de parabéns, pois sagrou-se campeão do primeiro turno em cima do seu maior rival, vencendo o valoroso Clube do Remo.  Estão de parabéns o Paysandu, o Governo do Estado, que deu um  show  de organização, a Justiça paraense, a polícia, os órgãos de segurança do Estado. Enfim, mais uma vez, parabéns à torcida bicolor.  Paysandu, muitas e muitas glórias você ainda dará para essa sua brilhante torcida, que é a torcida bicolor de Belém do Pará.  Muito obrigado, Sr. Presidente."

我敢肯定这可能会遇到一些边缘情况,但这对您来说绝对是一个开始。

关于r - 如何使用 R 将 rtf 字符串转换为纯文本?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/31101433/

相关文章:

'[<-.data.frame' 中的 R 错误...替换有 # 项,需要 #

r - 将 R 中的高度从英尺 (6-1) 转换为英寸 (73)

r - ddply + 汇总函数列名输入

C:可以反转字符数组,但不能反转字符串

根据定界符反转/翻转两个字符串的两侧?

html - Tridion - 在 RTF 中插入 HTML(文字和部分元素)

r - 更灵活的引文格式

java - 在 Java 中从给定的字母表生成所有长度为 N 的单词

python - 使用 Python 编辑 RTF 文件

.net - .Net 1.1 Windows 的 RTF 控件