https://universaldependencies.org/ Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages.
Morphology
The morphological specification of a (syntactic) word in the UD scheme consists of three levels of representation:
- A lemma representing the semantic content of the word.
- A part-of-speech tag representing the abstract lexical category associated with the word.
- A set of features representing lexical and grammatical properties that are associated with the particular word form.
The LEMMA field (a column in the CoNLL-U format) should contain the canonical or base form of the word.
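The three levels can be read directly off a CoNLL-U token line. The sketch below assumes the standard ten tab-separated CoNLL-U columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); the names `Morphology` and `read_morphology` and the sample line are hypothetical, made up for illustration.

```python
from typing import NamedTuple


class Morphology(NamedTuple):
    """The three morphological levels of one word, plus its surface form."""
    form: str
    lemma: str   # canonical or base form
    upos: str    # universal part-of-speech tag
    feats: str   # raw feature string, e.g. "Number=Plur", or "_" if empty


def read_morphology(token_line: str) -> Morphology:
    """Extract FORM, LEMMA, UPOS and FEATS from one CoNLL-U token line."""
    columns = token_line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return Morphology(
        form=columns[1],
        lemma=columns[2],
        upos=columns[3],
        feats=columns[5],
    )


# Hypothetical token line for the word "dogs".
sample_line = "2\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t3\tnsubj\t_\t_"
morph = read_morphology(sample_line)
print(morph.lemma, morph.upos, morph.feats)
```

Here the word form "dogs" carries the lemma "dog", the tag NOUN, and the single feature Number=Plur.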
Part-of-Speech tags
UD defines only 17 universal POS tags; finer-grained distinctions between word classes are expressed through universal features.
- ADJ: adjective
- ADP: adposition
- ADV: adverb
- AUX: auxiliary
- CCONJ: coordinating conjunction
- DET: determiner
- INTJ: interjection
- NOUN: noun
- NUM: numeral
- PART: particle
- PRON: pronoun
- PROPN: proper noun
- PUNCT: punctuation
- SCONJ: subordinating conjunction
- SYM: symbol
- VERB: verb
- X: other
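The CoNLL-U format defines an additional part-of-speech field, XPOS. The XPOS tagset differs from language to language.
Each word has exactly one universal POS tag.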
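Because the universal tagset is closed, checking a tag is a simple set lookup. This is a minimal sketch; `UPOS_TAGS` and `is_universal_pos` are hypothetical names, and the set just copies the 17 tags listed above.

```python
# The 17 universal POS tags listed above.
UPOS_TAGS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def is_universal_pos(tag: str) -> bool:
    """Return True if `tag` is one of the 17 universal POS tags."""
    return tag in UPOS_TAGS


print(len(UPOS_TAGS))
print(is_universal_pos("PROPN"), is_universal_pos("NNS"))
```

A language-specific tag such as NNS is not in the set; it would go in the XPOS field instead.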
Features
Features are additional pieces of information about the word, its part of speech and morphosyntactic properties.
Each feature has the form Name=Value. A word can have several features, separated by "|", for example: Gender=Masc|Number=Sing.
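A feature string in this form can be split into a mapping from names to values. The sketch below makes two assumptions beyond these notes: that an empty feature set is written as "_" (as in CoNLL-U), and that a feature with several values lists them comma-separated (for example Case=Acc,Dat). The function name `parse_feats` is hypothetical.

```python
from typing import Dict, List


def parse_feats(feats: str) -> Dict[str, List[str]]:
    """Parse 'Name=Value|Name=Value' into {name: [values]}."""
    # Assumption: "_" (or an empty string) means the word has no features.
    if feats in ("", "_"):
        return {}
    parsed: Dict[str, List[str]] = {}
    for pair in feats.split("|"):
        name, sep, value = pair.partition("=")
        if not sep or not name or not value:
            raise ValueError(f"malformed feature: {pair!r}")
        # Assumption: multiple values of one feature are comma-separated.
        parsed[name] = value.split(",")
    return parsed


print(parse_feats("Gender=Masc|Number=Sing"))
```

Values are kept as lists so that single-valued and multi-valued features are handled the same way.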
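UD's inventory of features defines which features a word can carry.
Features fall into the following categories:
- Lexical features: properties of the lexeme or lemma.
- Inflectional features: properties of individual word forms, expressed through inflection.
- Layered features: see https://universaldependencies.org/u/overview/feat-layers.html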
Syntax
Syntactic annotation in the UD scheme consists of typed dependency relations between words.
For the list of universal dependency relations, see https://universaldependencies.org/u/dep/index.html
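A typed dependency relation can be pictured as a (dependent, relation, head) triple. The sketch below assumes each token stores the 1-based position of its head, with 0 marking the root, as in the HEAD and DEPREL columns of CoNLL-U. The sentence, the `Dependency` type and the `typed_dependencies` function are hypothetical illustrations, not part of UD itself.

```python
from typing import List, NamedTuple, Tuple


class Dependency(NamedTuple):
    dependent: str
    relation: str
    head: str


def typed_dependencies(tokens: List[Tuple[str, int, str]]) -> List[Dependency]:
    """Turn (form, head_index, relation) tokens into dependency triples.

    head_index is 1-based; 0 means the token is the root of the sentence.
    """
    deps = []
    for form, head_index, relation in tokens:
        if not 0 <= head_index <= len(tokens):
            raise ValueError(f"head index out of range: {head_index}")
        head_form = "ROOT" if head_index == 0 else tokens[head_index - 1][0]
        deps.append(Dependency(form, relation, head_form))
    return deps


# Hypothetical annotation of "The dog barks ."
sentence = [
    ("The", 2, "det"),
    ("dog", 3, "nsubj"),
    ("barks", 0, "root"),
    (".", 3, "punct"),
]
for dep in typed_dependencies(sentence):
    print(f"{dep.relation}({dep.head}, {dep.dependent})")
```

Each word has exactly one head in this representation, so the relations together form a tree rooted at "barks".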