r - Arabic text mining using R

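I am a new user looking for help with my work in R. I am doing Arabic text mining and would appreciate help from anyone with experience in this area. So far I have struggled to normalize Arabic text, and R will not even print Arabic characters in the console. I am stuck, and I do not know whether switching tools, for example doing the mining in Weka or some other way, is the right move. Has anyone had any success mining Arabic text with R?
By the way, I am working on an analysis of an Arabic tweets dataset. It took me a month to collect the data, and I do not know how long preprocessing the text will take.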

Best answer

I don't have much experience in this area, but when I tried this I had no problems with Arabic characters:

require(tm)
require(tm.plugin.webmining)
require(SnowballC)

corpus <- WebCorpus(GoogleNewsSource("سلام"))
corpus
inspect(corpus)

tdm <- TermDocumentMatrix(corpus)

Make sure the correct fonts are installed for your operating system and IDE.
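
The `TermDocumentMatrix` call above needs network access to fetch the news corpus. As a rough offline illustration of what a term-document count matrix holds, here is a minimal base R sketch. It is not how tm builds its matrix; the two sample documents and the names `docs`, `tokens`, `terms`, and `tdm_sketch` are assumptions made for this example, and tokenization is a plain whitespace split.

```{r}
# Two short in-memory documents (hypothetical sample text).
# \u escapes keep the script ASCII and always yield UTF-8 strings.
docs <- c(
  d1 = "\u0633\u0644\u0627\u0645 \u0639\u0644\u064A\u0643\u0645",
  d2 = "\u0633\u0644\u0627\u0645"
)

# Naive tokenization: split on runs of whitespace.
tokens <- strsplit(docs, "\\s+", perl = TRUE)

# Vocabulary: every distinct token across all documents.
terms <- sort(unique(unlist(tokens)))

# One column per document, one row per term, cells hold raw counts.
tdm_sketch <- vapply(
  tokens,
  function(tok) tabulate(match(tok, terms), nbins = length(terms)),
  integer(length(terms))
)
rownames(tdm_sketch) <- terms

print(tdm_sketch)
```

If the Arabic row names print as `<U+0633>...` escapes, the data is still intact; it usually means the console locale or font cannot display them.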

```{r}
library(tm)
y <- dget("file") # load the data extracted from MongoDB with the rmongodb package
a <- y$tweet_text # keep only the text of the tweets in the dataset
text_df <- data.frame(a, stringsAsFactors = FALSE) # store it as a data frame
# Note: newer tm releases expect DataframeSource() input to have doc_id and text columns
myCorpus_df <- Corpus(DataframeSource(text_df)) # build a corpus from the data frame
```

On OS X, the Arabic characters are displayed correctly:

```{r}
str(myCorpus_df[1:2])
```

List of 2
 $ 1:List of 2
  ..$ content: chr "The CHRONICLE EYE  Ahrar al#Sham is clearly fighting #ISIS where its men storm some #Manbij buildings #Aleppo "
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2014-07-03 22:42:18"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "1"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"


 $ 2:List of 2
  ..$ content: chr "RT @######## جبهة النصرة مهاجرينها وأنصارها  مقراتها مكان آمن لكل من يخشى على نفسه الآذى "
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2014-07-03 22:42:18"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "2"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
 - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"

When I check the encoding of an Arabic word on both operating systems (OS X and Win 7), it appears to be encoded correctly:

```{r}
Encoding("لمياه_و_الإصحا")
```

[1] "UTF-8"
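
To go one step beyond `Encoding()`, the following base R sketch checks that a string is valid UTF-8 and that it survives a round trip through a file. The sample word and the names `word`, `path`, and `back` are assumptions for illustration. Writing with `useBytes = TRUE` and reading with `encoding = "UTF-8"` is one way to avoid re-encoding through the native locale, which is where Arabic text tends to get damaged on Windows.

```{r}
# Hypothetical sample word, written with \u escapes so the script stays ASCII.
word <- "\u0633\u0644\u0627\u0645"

Encoding(word)               # declared encoding of the string
validUTF8(word)              # are the underlying bytes valid UTF-8?
nchar(word, type = "chars")  # number of characters
nchar(word, type = "bytes")  # number of bytes

# Round trip through a file without converting to the native encoding.
path <- tempfile(fileext = ".txt")
writeLines(word, path, useBytes = TRUE)      # write the UTF-8 bytes as they are
back <- readLines(path, encoding = "UTF-8")  # mark what was read as UTF-8

identical(back, word)
```

If `identical()` returns `TRUE` but the console still shows escapes or question marks, the problem is most likely display (locale or font) rather than the data.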

This may also help:
Reading arabic data text in R and plot()
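
The question also mentions normalizing Arabic text. Below is a minimal base R sketch of one lightweight approach for tweets, not a definitive pipeline. The helper name `normalize_arabic` is hypothetical, and the specific rules (dropping URLs and @mentions, stripping diacritics and tatweel, folding alef variants to bare alef, mapping alef maqsura to ya) are common but optional choices that should be adapted to the analysis.

```{r}
# Hypothetical helper: light normalization of Arabic tweet text (base R only).
normalize_arabic <- function(x) {
  x <- enc2utf8(x)
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)              # drop URLs
  x <- gsub("@\\S+", " ", x, perl = TRUE)                      # drop @mentions
  x <- gsub("[\u064B-\u0652\u0640]", "", x, perl = TRUE)       # strip diacritics and tatweel
  x <- gsub("[\u0622\u0623\u0625]", "\u0627", x, perl = TRUE)  # alef variants -> bare alef
  x <- gsub("\u0649", "\u064A", x, perl = TRUE)                # alef maqsura -> ya
  x <- gsub("\\s+", " ", x, perl = TRUE)                       # collapse whitespace
  trimws(x)
}

# Usage on a made-up tweet containing a mention, a diacritic, and a URL.
tweet <- "RT @user \u0623\u062D\u0645\u064E\u062F http://t.co/abc"
normalize_arabic(tweet)
```

The function is vectorized over `x`, so it can be applied to a whole column such as `y$tweet_text` before building the corpus.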

Regarding "r - Arabic text mining using R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25654921/
