4

I created a script for the frequency of words in a document using the object and a dictionary documentTermMatrix in R. The script works on individual words and not on the compound word es. "foo" "bar" "foo bar"

This is the code

require(tm)
my.docs <- c("foo bar word1 word2")
myCorpus <- Corpus(VectorSource(my.docs))
inspect(DocumentTermMatrix(myCorpus,list(dictionary = c("foo","bar","foo bar"))))

But the result is

Terms

Docs bar foo  foo bar

   1   1   1        0

I would have to find one "foo bar" = 1

How can I fix this?

1 Answer 1

4

The problem is that DocummentTermMatrix(...) is tokenizing at word breaks be default. You need at least bigrams.

Credit to this post for the basic approach.

library(tm)
library(RWeka)
my.docs <- c("foo bar word1 word2")
myCorpus <- Corpus(VectorSource(my.docs))
myDict   <- c("foo","bar","foo bar")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
inspect(DocumentTermMatrix(myCorpus, control=list(tokenize=BigramTokenizer,
                                                  dictionary=myDict)))
# <<DocumentTermMatrix (documents: 1, terms: 3)>>
# ...
#     Terms
# Docs bar foo foo bar
#    1   1   1       1

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.