
As you know:

Almost all languages have the lexical categories noun and verb, but beyond these there are significant variations in different languages.

I want to POS-tag text in many different languages, but I need a tag set that is consistent across languages, i.e. if a POS exists in several languages, it should get the same tag string in each of them. Is there a tag set that covers multiple languages in this way?

I need a structure like the following:

TAG         A            OM 
        (adverb)    (object mark)  
------------------------------------
Arabic      +            -
Persian     +            +
English     +            -

Update: The article mentioned in Bjerva's answer actually describes a universal tag set restricted to the common tags. I need a tag set that is no less informative than the language-specific ones: it should retain every POS found in any of the languages, but use the same tag for a POS that is shared by a subset of languages, as illustrated in the table above.
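The table in the question can be sketched as a small data structure. This is only a minimal illustration, not an existing tag set: the two tags and three languages are taken from the table, while the names `TAG_NAMES`, `LANGUAGE_TAGS`, `languages_with`, and `shared_tags` are hypothetical.

```python
# Hypothetical sketch of the requested structure. The tags (A, OM) and the
# +/- values come from the table in the question; all names are assumptions.
TAG_NAMES = {"A": "adverb", "OM": "object mark"}

# For each language, the set of shared tag strings it uses ("+" in the table).
LANGUAGE_TAGS = {
    "Arabic": {"A"},
    "Persian": {"A", "OM"},
    "English": {"A"},
}


def languages_with(tag):
    """Return the languages that use the given tag, sorted by name."""
    return sorted(lang for lang, tags in LANGUAGE_TAGS.items() if tag in tags)


def shared_tags(*languages):
    """Return the tags that every one of the given languages uses."""
    return set.intersection(*(LANGUAGE_TAGS[lang] for lang in languages))
```

With this layout, a tag such as `OM` keeps one string everywhere it occurs, and a language that lacks the category simply does not list it.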


2 Answers


This article suggests a tag set consisting of 12 universal tags. The tag set can be downloaded from here.

Extract from the site:

Currently, mappings for 25 treebanks covering the following 22 languages are available:

Arabic, Basque, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, French, German, Greek, Hungarian, Italian, Japanese, Korean, Portuguese, Russian, Slovene, Spanish, Swedish, Turkish.

Hope this helps!
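To show what such a mapping does in practice, here is a minimal sketch that converts Penn Treebank tags for English into the twelve universal tags. The dictionary is a small illustrative subset written by hand, not the downloadable mapping files; the function name `to_universal` and the fallback to `X` for unlisted tags are assumptions made for this example.

```python
# The twelve universal tags named in the paper.
UNIVERSAL_TAGS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
}

# Illustrative, partial mapping from Penn Treebank tags to universal tags.
# The real mapping files cover every tag of each treebank.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB",
    "JJ": "ADJ", "RB": "ADV",
    "PRP": "PRON", "DT": "DET",
    "IN": "ADP", "CD": "NUM",
    "CC": "CONJ", "RP": "PRT",
    ".": ".",
}


def to_universal(tagged, mapping=PTB_TO_UNIVERSAL):
    """Map (word, fine_tag) pairs to (word, universal_tag) pairs.

    Tags missing from the mapping fall back to "X"; that fallback is an
    assumption of this sketch, not something the paper prescribes.
    """
    return [(word, mapping.get(tag, "X")) for word, tag in tagged]


sentence = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN"), (".", ".")]
converted = to_universal(sentence)
```

Here `converted` pairs each word with one of the twelve tags, for example `("cat", "NOUN")` and `("on", "ADP")`. The same function would work for another language given a mapping from that language's own tag set.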

    From the paper: "... the following twelve POS tags: NOUN (nouns), VERB (verbs), ADJ (adjectives), ADV (adverbs), PRON (pronouns), DET (determiners and articles), ADP (prepositions and postpositions), NUM (numerals), CONJ (conjunctions), PRT (particles), ‘.’ (punctuation marks) and X (a catch-all for other categories such as abbreviations or foreign words)."
    – jlawler
    Commented Nov 14, 2012 at 16:36

I'm sorry, but this cannot work without significantly stretching the boundaries of "categories" and making the task somewhat arbitrary.

While it's true that nouns and verbs seem to be distinguished in many (but not all) languages, they are not distinguished by the same means everywhere, and in some languages the distinction is much weaker than in, say, Latin. Everything beyond that is highly dependent on the specific language.

In particular, the suggestion that "determiners" are in any way universal is just incredibly short-sighted.
