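Regex to extract double quotes and the string inside the quotes in R

Tags: r regex stringi

I have a data frame with a "text" column. Each row of this column holds the text of a media article.

I am trying to extract a string that appears like this: "term" (including the double quotes around the term). I tried the following regex to capture instances where a word is sandwiched between two double quotes: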

stri_extract_all_regex(df$text, '"(.+?)"')

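This seems to capture some of the instances I am looking for, but it misses others that I know meet the criteria. It also captures what appear to be quotations of longer text rather than standalone quoted terms. Here is one result from the call above: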

[[19]]
[1] "\"play a constructive and positive role\""                                                                                                                                                           
[2] "\"active and hectic reception\""  
[3] "\"[term]\""

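I only want "term" as the output (including the double quotes). I am trying to find instances where the term is used on its own inside quotation marks.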

Example in R: 
test <- c(df$text[12], df$text[18])
res <- stri_extract_all_regex(test, '"\\S+"') 
unlist(res)
[1] "\"Rohingya\"" "\"Bengali\""  NA   

print(test)
[1] "Former UN general secretary Kofi Annan will advise Myanmar's government on resolving conflicts in Rakhine State, the office of the state counsellor announced today.Former UN Secretary General Kofi Annan speaks at the opening of the Consciouness Summit on climate change in Paris, France on July 21, 2015. Photo: EPARakhine State, one of the poorest in the Union, was wracked by sectarian violence in 2012 that forced more than 100,000 – mostly Muslims who ethnically identify as Rohingya – into squalid displacement camps where they face severe restrictions on movement as well as access to health care, education, and other other basic services.Addressing the ongoing crises has posed one of the most troubling challenges to Daw Aung San Suu Kyi's National League for Democracy-led government.Earlier today, the government announced the formation of an advisory panel that will be chaired by former UN chief, and focus on \"finding lasting solutions to the complex and delicate issues in the Rakhine State\".The board will submit recommendations to the government on \"conflict prevention, humanitarian assistance, rights and reconciliation, institution-building and promotion of development of Rakhine State,\" a statement from the state counsellor's office said.The statement did not use the word \"Rohingya\". Daw Aung San Suu Kyi has come under fire both at home and from international rights groups for failing prioritise to address the group's plight and seeking to placate hardline Buddhist nationalists by avoiding the politically-charged term. 
The government has already requested that the US Embassy and other diplomatic groups avoid the term Rohingya, and in June, she proposed \"Muslim community of Rakhine State\".The proposed neutral terminology, which the state counsellor ordered government officials to adopt, sparked mass protests in Rakhine State and in Yangon by hardline nationalists, who insist on use of the term \"Bengali\" that was also preferred by the previous government's to suggest the group's origins in neighbouring Bangladesh.In July UN special rapporteur for human rights Yanghee Lee urged the government to make ending \"institutionalised discrimination\" against the Rohingya and other Muslims in Rakhine an urgent priority.Myanmar also announced this week that current UN Secretary General Ban Ki-moon will attend the highly-anticipated 21st Century Panglong conference at the end of the month.The five-day talks, aimed at ending a host of complicated border ethnic conflicts that have lasted for decades, will begin on August 31."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    

[2] "Thousands of Kaman Muslims from the Rakhine State capital Sittwe obtained identity cards this week, some two years after they applied for the documents.“Now the problem is solved. The Kaman got national ID cards. We had proposed to the government and immigration office to work on the process of giving ID cards to Kaman,” said U Tin Hlaing Win, general secretary of the Kaman National Development Party (KNDP).The Kaman, as one of 135 officially recognised ethnic groups in Myanmar entitled to full citizenship, had struggled to get authorities to grant them the IDs due to the complex ethnic demographics and fraught identity politics of Rakhine State.Complicating the process for the Kaman applicants has been the population of more than 1 million Muslims in Rakhine State who self-identify as Rohingya, most of whom are stateless. “Citizenship scrutiny” programs to issue some form of identification to the minority group by the previous and current governments have been met with resistance by Rakhine nationalists.The shared Muslim faith of the Kaman and Rohingya became just one aspect of a contentious debate over terminology earlier this year, when State Counsellor Daw Aung San Suu Kyi put forward the phrase “the Muslim community in Rakhine State” to refer to self-identifying Rohingya in an attempt to chart a middle course on the issue of lexicon. 
“Rohingya” stirs passions among Buddhist nationalists, who insist that they be called “Bengalis” to imply that they are illegal immigrants from neighbouring Bangladesh, despite many tracing familial lineage in Rakhine State back generations.Some Kaman have since viewed the state counsellor’s edict warily, concerned that the Kaman identity might be conflated with that of the Rohingya – who are not entitled to citizenship – and could jeopardise Kaman prospects for ID cards and full rights under the law.Violence in 2012 between Buddhists and Muslims in Rakhine State affected Rakhine Buddhists, Kaman and Rohingya, but the latter suffered the brunt of casualties and displacement.U Tin Hlaing Win told The Myanmar Times last week that some of the Kaman Muslims displaced by the conflict in the island town of Rambre had also recently received national ID cards.“The Kaman are ethnics belonging to Myanmar,” said U Than Htun Aung, a senior immigration officer for Rakhine State. “Township immigration officers will examine [legitimate Kaman claims to citizenship] according to the process and they will ensure they get their rights.”Around 2000 Kaman applied for national ID cards in 2014, but only 38 people were issued the documents.The others were told they had not received IDs because of the purported existence of “fake Kaman”.More than 100,000 people are thought to hold government-issued national ID cards identifying them as Kaman, but KNDP research in 2013 estimated the actual ethnic Kaman population to be about 50,000.U Tin Hlaing Win said sorting out the “fake Kaman” issue was not solely the responsibility of Kaman people, adding that immigration officers through the years, and generations of ethnically mixed marriages and the offspring they produced, were also to blame for the confusion.“According to our research and knowledge tracing family trees, some Kaman identity-card holders were Rakhine plus Bengali or Rakhine plus Indian, not Kaman. 
It [identity problems] should be solved by three groups – we Kaman, the Rakhine and immigration authorities,” U Tin Hlaing Win told The Myanmar Times last week.What most seem to agree on is that “real Kaman” deserve the documentation they need to enjoy the full rights of citizenship.Ethnic Rakhine youth leader Ko Khine Lamin said, “The Rakhine objected to national ID cards for Kaman because of the controversy over fake Kaman. But there are real Kaman who have lived in Rakhine State since a long, long time ago. They should get their ethnic rights through careful examination by immigration officers.”Kaman politicians are not satisfied with their victory this week and are trying to meet with Rakhine State Chief Minister U Nyi Pu to raise other difficulties Kaman people face, such as transportation barriers. They also intend to ask the chief minister for rehabilitation programs for Kaman internally displaced people, as well as education and health support for the broader Kaman community."

The code above only returns matches from item [1].

Best answer

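The "(.+?)" pattern matches a ", then one or more characters other than line break characters, as few as possible, up to the nearest (leftmost) following ". This means it can also match whitespace, so it matches "play a constructive and positive role" and "active and hectic reception".

To match a run of non-whitespace characters between double quotes, use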
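
As a minimal sketch of that behaviour, the snippet below runs the lazy pattern on a made-up sentence (the string `x` is hypothetical and not taken from the question's data). It shows that the multi-word quotation is returned alongside the single quoted term.

```r
library(stringi)

# Hypothetical sample sentence, not taken from the question's data
x <- 'He called it "active and hectic reception" and used the word "Rohingya".'

# Lazy dot pattern: any characters (spaces included) up to the next quote
lazy <- stri_extract_all_regex(x, '"(.+?)"')[[1]]
print(lazy)
# [1] "\"active and hectic reception\"" "\"Rohingya\""
```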

stri_extract_all_regex(df$text, '"\\S+"')

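The "\S+" pattern matches a ", then one or more non-whitespace characters, and then the closing ".

See the regex demo.

If you only want to match word characters (letters, digits, _) between the double quotes, use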
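
The following sketch applies the non-whitespace pattern to two made-up strings (`x` and `y` are hypothetical). It also illustrates one caveat: because `\S+` is greedy and a straight quote is itself a non-whitespace character, adjacent quoted items with no space between them are returned as one match. The stricter pattern `"[^"\\s]+"` is an assumption added here for illustration and is not part of the original answer.

```r
library(stringi)

# Hypothetical sample strings, not taken from the question's data
x <- 'He called it "active and hectic reception" and used the word "Rohingya".'
y <- '"a","b"'

# Quoted phrases containing spaces are skipped; only the single token is kept
single <- stri_extract_all_regex(x, '"\\S+"')[[1]]
print(single)
# [1] "\"Rohingya\""

# Greedy \S+ can run across inner quotes when no whitespace separates them
greedy <- stri_extract_all_regex(y, '"\\S+"')[[1]]
print(greedy)
# [1] "\"a\",\"b\""

# Assumed stricter variant: also forbid a straight quote inside the term
strict <- stri_extract_all_regex(y, '"[^"\\s]+"')[[1]]
print(strict)
# [1] "\"a\"" "\"b\""
```

Whether the stricter form is needed depends on the data; in running article text, quoted terms are usually separated by spaces.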

'"\\w+"'

See another regex demo.

To match curly quotes as well, use the '["“]\\S+["”]' regex:
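
To make the difference between the two patterns concrete, here is a small sketch on a made-up string (`z` is hypothetical). With `\w+`, terms containing brackets or hyphens are not matched, whereas `\S+` accepts them.

```r
library(stringi)

# Hypothetical sample string with three quoted tokens
z <- 'Compare "[term]" with "term" and "co-operate".'

# Word characters only: brackets and hyphens break the match
word_only <- stri_extract_all_regex(z, '"\\w+"')[[1]]
print(word_only)
# [1] "\"term\""

# Any non-whitespace characters: all three quoted tokens are returned
non_space <- stri_extract_all_regex(z, '"\\S+"')[[1]]
print(non_space)
# [1] "\"[term]\""     "\"term\""       "\"co-operate\""
```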

> res <- stri_extract_all_regex(test, '["“]\\S+["”]')
> unlist(res)
[1] "\"Rohingya\""         "\"Bengali\""          "\u0093Rohingya\u0094"
[4] "\u0093Bengalis\u0094"

If you need to "normalize" the double quotes, use

> gsub("[“”]", '"', unlist(res))
[1] "\"Rohingya\"" "\"Bengali\""  "\"Rohingya\"" "\"Bengalis\""
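
Putting the pieces together, the sketch below runs the extraction and the normalization over a small made-up data frame (the `df` here is hypothetical and only stands in for the real one). It writes the curly quotes as `\u201c` and `\u201d` escapes so the script does not depend on the file encoding, passes `omit_no_match = TRUE` so that rows without a match contribute nothing instead of `NA`, and uses `stri_replace_all_regex` in place of `gsub`; these are illustrative choices, not part of the original answer.

```r
library(stringi)

# Hypothetical three-row data frame standing in for the real df
df <- data.frame(
  text = c('The statement did not use the word "Rohingya".',
           'They insist on \u201cBengalis\u201d instead.',
           'No quoted single word here.'),
  stringsAsFactors = FALSE
)

# Straight or left curly quote, non-whitespace run, straight or right curly quote
pattern <- '["\u201c]\\S+["\u201d]'

# omit_no_match = TRUE yields character(0) rather than NA for rows with no hit
hits <- stri_extract_all_regex(df$text, pattern, omit_no_match = TRUE)

# Replace curly quotes with straight ones so all terms look alike
terms <- stri_replace_all_regex(unlist(hits), '[\u201c\u201d]', '"')
print(terms)
# [1] "\"Rohingya\"" "\"Bengalis\""
```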

Source (Stack Overflow): https://stackoverflow.com/questions/45728022/
