
All Questions

1 vote · 1 answer · 211 views

How to interpret probabilities of sequences given by n-gram language modelling?

Question about n-gram models, might be a stupid question: with n-gram models, the probability of a sequence is the product of the conditional probabilities of the n-grams into which the sequence can be ...
asked by Gog
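
For reference, a minimal sketch of the computation the question describes, with made-up bigram probabilities; in practice the log-probabilities are summed to avoid underflow:

    import math

    # Toy bigram conditionals P(w_i | w_{i-1}); values are illustrative only.
    bigram_prob = {
        ("<s>", "the"): 0.3,
        ("the", "cat"): 0.1,
        ("cat", "sat"): 0.2,
        ("sat", "</s>"): 0.4,
    }

    def sequence_logprob(tokens):
        """Sum of log P(w_i | w_{i-1}) over the sentence, boundary symbols included."""
        padded = ["<s>"] + tokens + ["</s>"]
        return sum(math.log(bigram_prob[pair]) for pair in zip(padded, padded[1:]))

    logp = sequence_logprob(["the", "cat", "sat"])
    print(logp, math.exp(logp))  # exp(logp) == 0.3 * 0.1 * 0.2 * 0.4 = 0.0024
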
1 vote · 0 answers · 33 views

Language model created with SRILM does not sum to 1

I created an n-gram language model on the Penn Treebank using the following command:

    ngram-count -text $trainfile -order 5 -lm $temp/templm.ptb -gt3min 1 -gt4min 1 -kndiscount -interpolate -unk

This ...
asked by Kim Yung
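
A rough way to check this by hand, assuming the standard ARPA layout (log10 probabilities in the \1-grams: section; SRILM conventionally writes -99 for <s>, which carries no probability mass):

    def unigram_mass(arpa_path):
        """Sum 10**log10prob over the unigram section of an ARPA file."""
        total, in_unigrams = 0.0, False
        with open(arpa_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line == "\\1-grams:":
                    in_unigrams = True
                    continue
                if in_unigrams:
                    if not line or line.startswith("\\"):
                        break  # end of the unigram section
                    logprob = float(line.split()[0])
                    if logprob > -98:  # skip the -99 placeholder used for <s>
                        total += 10 ** logprob
        return total

    print(unigram_mass("templm.ptb"))  # should be close to 1.0
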
0 votes · 1 answer · 4k views

Add-1 (Laplace) smoothing for bigram implementation

I am doing an exercise where I am determining the most likely corpus from a number of corpora when given a test sentence. I am trying to test an add-1 (Laplace) smoothing model for this exercise. I ...
asked by Héctor
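
For reference, a minimal sketch of add-1 smoothing as usually defined, P(w2 | w1) = (C(w1, w2) + 1) / (C(w1) + V), on a toy corpus:

    from collections import Counter

    def laplace_bigram_prob(bigram_counts, unigram_counts, vocab_size, w1, w2):
        # Add-1 (Laplace): P(w2 | w1) = (C(w1, w2) + 1) / (C(w1) + V)
        return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)

    tokens = "the cat sat on the mat".split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(unigrams)
    print(laplace_bigram_prob(bigrams, unigrams, V, "the", "cat"))  # (1 + 1) / (2 + 5)
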
2 votes · 0 answers · 415 views

Simple bigram letter model

I am working through an exercise where, given a set of corpora, I will implement a simple model on a test corpus to determine the most likely corpus. Say the corpora with which I want to learn are ...
asked by Héctor
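
A minimal sketch of one way to do this (not necessarily the asker's setup): train add-1-smoothed letter bigrams on each candidate corpus and pick the corpus under which the test string is most likely:

    import math
    from collections import Counter

    def char_bigram_logprob(text, corpus, alphabet_size=27):
        """Log-likelihood of text under add-1-smoothed letter bigrams from corpus.
        alphabet_size=27 assumes lowercase letters plus space."""
        bigrams = Counter(zip(corpus, corpus[1:]))
        unigrams = Counter(corpus)
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + alphabet_size))
                   for a, b in zip(text, text[1:]))

    corpora = {"A": "the quick brown fox", "B": "zzz zzz zzz zzz zzz"}
    test = "the fox"
    best = max(corpora, key=lambda name: char_bigram_logprob(test, corpora[name]))
    print(best)  # "A": the test string is far more likely under corpus A's bigrams
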
1 vote · 1 answer · 263 views

Train a language model with Google N-grams [closed]

I want to find the conditional probability of a word given the set of words that precede it. I plan to use Google N-grams for this. However, given what a huge resource it is, I don't think it is ...
asked by Riken Shah
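
The underlying estimate is just a ratio of counts; a sketch with hypothetical numbers standing in for actual Google N-gram lookups:

    def conditional_prob(ngram_count, context_count):
        # MLE estimate: P(w | context) = count(context + w) / count(context)
        return ngram_count / context_count if context_count else 0.0

    # Hypothetical counts standing in for actual Google N-gram lookups:
    count_new_york = 5_000_000    # count("new york")
    count_new = 40_000_000        # count("new")
    print(conditional_prob(count_new_york, count_new))  # 0.125
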
9 votes · 3 answers · 4k views

What's the real need for an end-symbol in n-gram models?

There's a footnote in Jurafsky & Martin (2008, p.89) pointing out that, without an end-symbol, an n-gram model would not be "a true probability distribution". Even after seeking the paper they've ...
asked by mcrisc
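
A toy illustration of the footnote's point: with an end symbol, probability mass is split between continuing and stopping, so the probabilities of all finite sentences sum to 1; without one, every sentence length gets a full unit of mass and the total diverges. A sketch with a one-word vocabulary:

    # Toy bigram model over the one-word vocabulary {a}, with P(a | <s>) = 1.
    p_continue, p_end = 0.5, 0.5       # P(a | a) and P(</s> | a)

    # With an end symbol: P("a" * n) = p_continue**(n-1) * p_end, summing to 1.
    with_end = sum(p_continue ** (n - 1) * p_end for n in range(1, 50))

    # Without one, all mass stays on continuing: every length-n string has
    # probability 1, so the total over all lengths grows without bound.
    without_end = sum(1.0 for n in range(1, 50))

    print(with_end)     # ~1.0: a proper distribution over sentences
    print(without_end)  # 49.0, and climbing as more lengths are added
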
4 votes · 2 answers · 434 views

Probabilities for 2-grams are higher than for 1-grams in an ARPA file produced by KenLM

I'm using the 1 billion word language corpus to build a model with 1- and 2-grams. When using the lmplz program that comes with KenLM, I noticed that the ARPA file seems to have higher probabilities ...
asked by kristianp
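
Worth noting when reading such files: ARPA entries are log10 conditional probabilities, so a 2-gram entry can legitimately exceed the corresponding 1-gram entry whenever the context is strongly predictive. With hypothetical numbers:

    # Hypothetical log10 values of the kind found in an ARPA file:
    log10_p_york = -4.5             # 1-gram: "york" is rare overall
    log10_p_york_given_new = -0.7   # 2-gram: "new" strongly predicts "york"

    print(10 ** log10_p_york)            # ~3.16e-05
    print(10 ** log10_p_york_given_new)  # ~0.1995, higher and legitimately so
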
4 votes · 2 answers · 714 views

I am looking for an Arabic n-gram corpus

I am working on a project where I need to use an n-gram model, so I want to know whether an Arabic n-gram corpus exists. I have tried to find one, but all my searches have failed. I know that for languages ...
asked by Riadh Belkebir
3 votes · 2 answers · 431 views

What is the most efficient way to store n-grams in a database / data structure?

Let's assume we have Google's 1T n-grams. I want to be able to: Search for n-grams containing all of a set of words (such as finding all n-grams containing the words "dog" and "bone" in any position) ...
asked by mtanti
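
One common answer is an inverted index from words to the n-grams containing them, so multi-word queries become set intersections; a toy sketch (at Google-1T scale the sets would live in a database, but the access pattern is the same):

    from collections import defaultdict

    # Inverted index: word -> set of n-gram ids.
    ngrams = ["the dog ate the bone", "a dog buried a bone", "the cat slept"]
    index = defaultdict(set)
    for i, gram in enumerate(ngrams):
        for word in gram.split():
            index[word].add(i)

    def containing_all(words):
        # "All n-grams containing every word in the set" = intersection of postings.
        ids = set.intersection(*(index[w] for w in words))
        return [ngrams[i] for i in sorted(ids)]

    print(containing_all(["dog", "bone"]))  # both of the first two n-grams
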
2 votes · 1 answer · 1k views

Is perplexity in SRILM normalized for sentence length?

If I generate a language model with SRILM's ngram-count and then use ngram -unk -ppl text -lm model to get log probabilities and perplexity values, are the perplexities normalized for sentence length?
asked by L3viathan
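
As far as SRILM's documented conventions go, ppl is normalized per token: the total log10 probability is divided by the number of words plus sentence ends (minus OOVs), while ppl1 excludes the sentence ends. A sketch of the normalization itself:

    # Perplexity as a per-token normalization of a total log10 probability;
    # the exact token count differs between SRILM's ppl and ppl1, but both
    # are length-normalized, so longer texts are not penalized per se.
    def perplexity(total_log10_prob, n_tokens):
        return 10 ** (-total_log10_prob / n_tokens)

    print(perplexity(-120.0, 50))  # 10^(120/50) ~= 251.19
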
5 votes · 1 answer · 5k views

Common English bigrams / trigrams - recognising that a jumble of letters contains only valid English words

I have a database of one million strings which I want to rank against one another so that I can tell which contain meaningful English words / sentences. These strings contain no spaces or punctuation....
asked by StuR
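
One simple approach: score each string by the average log-frequency of its letter trigrams against a table built from real English text; the tiny table below is a made-up stand-in:

    import math

    # Stand-in trigram frequency table; a real one would come from an English corpus.
    trigram_freq = {"the": 0.018, "ing": 0.011, "and": 0.009}
    FLOOR = 1e-9  # unseen trigrams get a small floor instead of zero

    def english_score(s):
        trigrams = [s[i:i + 3] for i in range(len(s) - 2)]
        return sum(math.log(trigram_freq.get(t, FLOOR))
                   for t in trigrams) / max(len(trigrams), 1)

    strings = ["theandingthe", "xqzjvkwpfh"]
    print(sorted(strings, key=english_score, reverse=True))  # English-like first
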
11 votes · 1 answer · 715 views

The power of trigram language models (2nd order Markov models)

Many people in computational linguistics seem to mention the unexpected power of trigram (or 2nd order Markov) models for language modeling. For instance, it has been stated (verbally) to me on ...
asked by Julie
9 votes · 2 answers · 2k views

Are there any statistics or web services for n-grams of frequent English words?

I found this for six common subjects, but it doesn't contain complete statistics for all common English words.
asked by ARZ