All Questions Tagged with computational-linguistics · n-grams
13 questions
1 vote · 1 answer · 211 views
How to interpret probabilities of sequences given by n-gram language modelling?
A question about n-gram models (it might be a stupid question):
With n-gram models, the probability of a sequence is the product of the conditional probabilities of the n-grams into which the sequence can be ...
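For concreteness, here is a minimal sketch of the decomposition that excerpt describes, using a bigram model over a made-up toy corpus (all names and counts below are illustrative):

import math
from collections import Counter

# Toy corpus with sentence-boundary markers (made up for illustration).
corpus = [["<s>", "the", "dog", "barks", "</s>"],
          ["<s>", "the", "cat", "meows", "</s>"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(b for sent in corpus for b in zip(sent, sent[1:]))

def sequence_logprob(sent):
    # log P(sent) = sum_i log P(w_i | w_{i-1}): the chain-rule product, in log space.
    lp = 0.0
    for prev, cur in zip(sent, sent[1:]):
        # Unsmoothed MLE estimate P(cur | prev) = count(prev, cur) / count(prev).
        # An unseen bigram would make this zero and the log undefined, which is
        # exactly why smoothing is needed in practice.
        lp += math.log(bigram_counts[(prev, cur)] / unigram_counts[prev])
    return lp

print(sequence_logprob(["<s>", "the", "dog", "barks", "</s>"]))  # log(0.5) ≈ -0.693

Because each per-step probability is at most 1, the product shrinks as the sequence gets longer, which is why such scores are usually compared in log space or normalized per word.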
1 vote · 0 answers · 33 views
Language model created with SRILM does not sum to 1
I created an n-gram language model on the Penn Treebank using the following command:
ngram-count -text $trainfile -order 5 -lm $temp/templm.ptb -gt3min 1 -gt4min 1 -kndiscount -interpolate -unk
This ...
0 votes · 1 answer · 4k views
Add-1 (Laplace) smoothing for bigram implementation
I am doing an exercise where I am determining the most likely corpus from a number of corpora when given a test sentence. I am trying to test an add-1 (Laplace) smoothing model for this exercise. I ...
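For reference, a minimal sketch of the add-1 (Laplace) estimate being tested, where V is the vocabulary size (the toy counts below are made up):

from collections import Counter

def laplace_bigram_prob(bigram_counts, unigram_counts, prev, cur):
    # Add-1 (Laplace) smoothing: P(cur | prev) = (count(prev, cur) + 1) / (count(prev) + V),
    # where V is the vocabulary size; every unseen bigram gets a small non-zero probability.
    V = len(unigram_counts)
    return (bigram_counts[(prev, cur)] + 1) / (unigram_counts[prev] + V)

# Made-up counts: "the dog" seen 3 times, "the" seen 10 times, vocabulary of 5 word types.
unigram_counts = Counter({"the": 10, "dog": 3, "cat": 3, "barks": 2, "meows": 2})
bigram_counts = Counter({("the", "dog"): 3})
print(laplace_bigram_prob(bigram_counts, unigram_counts, "the", "dog"))  # (3+1)/(10+5) ≈ 0.267
print(laplace_bigram_prob(bigram_counts, unigram_counts, "the", "cat"))  # (0+1)/(10+5) ≈ 0.067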
2 votes · 0 answers · 415 views
Simple bigram letter model
I am working through an exercise where, given a set of corpora, I will implement a simple model on a test corpus to determine the most likely corpus.
Say the corpora with which I want to learn are ...
1 vote · 1 answer · 263 views
Train a language model with Google N-grams [closed]
I want to find the conditional probability of a word given its preceding words. I plan to use the Google N-grams for this. However, given what a huge resource it is, I don't think it is ...
9 votes · 3 answers · 4k views
What's the real need for an end-symbol in n-gram models?
There's a footnote in Jurafsky & Martin (2008, p. 89) pointing out that, without an end-symbol, an n-gram model would not be "a true probability distribution". Even after seeking out the paper they've ...
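The footnote's point can be checked numerically: without an end symbol, the model puts probability mass 1 on the strings of each fixed length, so summing over all lengths exceeds 1. A tiny demonstration with a made-up two-word unigram model:

import math
from itertools import product

P = {"a": 0.6, "b": 0.4}  # made-up unigram model, no end symbol

for n in (1, 2, 3):
    # Total probability assigned to all strings of exactly length n.
    mass = sum(math.prod(P[w] for w in s) for s in product(P, repeat=n))
    print(n, mass)  # ≈ 1.0 every time, so the sum over all lengths diverges
                    # instead of forming a single distribution over sentences.

Reserving some mass for an end symbol at each step is what lets the probabilities of all finite sentences, across all lengths, sum to 1.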
4 votes · 2 answers · 434 views
Probabilities for 2-grams are higher than for 1-grams in the ARPA file produced by KenLM
I'm using the 1-billion-word language corpus to build a model with 1-grams and 2-grams. When using the lmplz program that comes with KenLM, I noticed that the ARPA file seems to have higher probabilities ...
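This is expected rather than a bug: an ARPA file stores conditional log probabilities, and a conditional probability P(w | h) can legitimately exceed the unigram probability P(w). As a made-up illustration, P(York) averaged over all contexts might be around 10^-4, while P(York | New) could be close to 10^-1, so the 2-gram entry's log probability sits much nearer to zero than the 1-gram entry's.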
4 votes · 2 answers · 714 views
I am looking for an Arabic n-gram corpus
I am working on a project where I need to use an n-gram model, so I want to know whether an Arabic n-gram corpus exists. I have tried to find a corpus, but all my searches have failed. I know that for languages ...
3 votes · 2 answers · 431 views
What is the most efficient way to store n-grams in a database / data structure?
Let's assume we have Google's 1T n-grams. I want to be able to:
Search for n-grams containing all of a set of words (such as finding all n-grams containing the words "dog" and "bone" in any position)
...
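One plausible structure for the any-position lookup described above (a sketch, not a definitive answer; all names and data are illustrative) is an inverted index from each word to the ids of the n-grams containing it, so a multi-word query becomes a set intersection:

from collections import defaultdict

ngrams = ["the dog buried the bone", "dog chases cat", "a bone for the dog"]  # stand-in data

# Inverted index: word -> set of ids of n-grams containing that word.
index = defaultdict(set)
for i, gram in enumerate(ngrams):
    for w in gram.split():
        index[w].add(i)

def containing_all(*words):
    # N-grams containing every query word, in any position.
    ids = set.intersection(*(index[w] for w in words))
    return [ngrams[i] for i in sorted(ids)]

print(containing_all("dog", "bone"))  # ['the dog buried the bone', 'a bone for the dog']

At the scale of the Google 1T data the same idea would live in a database (a postings table keyed by word) rather than in memory.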
2 votes · 1 answer · 1k views
Is perplexity in SRILM normalized for sentence length?
If I generate a language model with SRILM's ngram-count and then use ngram -unk -ppl text -lm model to get log probabilities and perplexity values, are the perplexities normalized for sentence length?
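If I recall the SRILM FAQ correctly, both reported figures are already length-normalized: ppl = 10^(-logprob / (words + sentences)), which counts the end-of-sentence tokens, and ppl1 = 10^(-logprob / words), which excludes them, with logprob being the total base-10 log probability of the text.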
5 votes · 1 answer · 5k views
Common English bigrams / trigrams - recognising that a jumble of letters contains only valid English words
I have a database of one million strings which I want to rank against one another so that I can tell which contain meaningful English words / sentences.
These strings contain no spaces or punctuation....
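A minimal sketch of one standard approach to this kind of ranking, assuming a character-bigram model trained on known-good English (the reference text below is a stand-in for a real training corpus):

import math
from collections import Counter

reference = "the quick brown fox jumps over the lazy dog " * 100  # stand-in for real English text
bigram_counts = Counter(zip(reference, reference[1:]))
total = sum(bigram_counts.values())

def englishness(s):
    # Average character-bigram log probability; higher means more English-like.
    # Unseen bigrams get an add-one floor (rough smoothing, adequate for ranking).
    pairs = list(zip(s, s[1:]))
    score = sum(math.log((bigram_counts[p] + 1) / (total + len(bigram_counts)))
                for p in pairs)
    return score / max(len(pairs), 1)

print(englishness("thedogatethebone") > englishness("xqzvkjwpfmd"))  # typically True

Sorting the million strings by this score tends to put the ones built from real English words near the top; character trigrams sharpen the separation at the cost of a larger table.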
11 votes · 1 answer · 715 views
The power of trigram language models (2nd order Markov models)
Many people in computational linguistics seem to mention the unexpected power of trigram (or 2nd order Markov) models for language modeling. For instance, it has been stated (verbally) to me on ...
9 votes · 2 answers · 2k views
Are there any statistics or web services for n-grams of frequent English words?
I found this for six common subjects, but it doesn't contain complete statistics for all common English words.