This post introduces dependency parsing, covering: the differences between dependency parsing and constituency parsing and the advantages of each; the two main approaches to dependency parsing (transition-based and graph-based/MST); and how to implement dependency parsing with RNNs.
Source reference:
[1] Jurafsky and Martin, Ch. 14 (3rd edition)
Intro
Constituents vs. Dependencies
- Syntax is often described in terms of constituency
- Traditional grammars model constituent structure:
- they capture the configurational patterns of sentences.
- In some languages (e.g., German), there is little evidence for the existence of a VP constituent.
- Constituent structures leave out much semantic information.
- A dependency structure consists of dependency relations, which are binary and asymmetric.
- Dependency syntax is closer to semantics
- Dependency syntax is still (usually) tree-like: connected, single-headed, acyclic.
- Dependency trees can be non-projective, i.e., they may contain crossing arcs (e.g., "A hearing is scheduled on the issue today", where "hearing" governs "on the issue" across "is scheduled"); a non-projective dependency grammar is not context-free.
Parsers
Dependency Parsing
Why treat dependency parsing as a distinct topic, rather than reusing constituency parsing algorithms:
- context-free parsing algorithms base their decisions on adjacency;
- Long-distance dependency (non-adjacency): in a dependency structure, a dependent need not be adjacent to its head (even if the structure is projective);
- we need new parsing algorithms to deal with non-adjacency (and with non-projectivity if present).
Two types of dependency parsers
- graph-based dependency parsing, based on maximum spanning trees (MST parser)
- transition-based dependency parsing, an extension of shift-reduce parsing (MALT parser)
Maximum Spanning Trees (MST)
- Goal: find the highest scoring dependency tree $\mathbf{y}$ in the space of all possible trees for a sentence $\mathbf{x}$.
Formulation:
- The score of a dependency edge $(i,j)$ is given by a function $s(i,j)$.
- The score of a dependency tree $\mathbf{y}$ for sentence $\mathbf{x}$ is then the sum of its edge scores:
$$s(\mathbf{x}, \mathbf{y}) = \sum_{(i,j) \in \mathbf{y}} s(i,j)$$
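For concreteness, here is a minimal Python sketch of this decomposition; the function and variable names are our own, and the dummy scorer stands in for any real edge-scoring model:

```python
# Minimal sketch: a tree's score is the sum of its edge scores.
# `tree` maps each dependent index to its head index (0 = ROOT);
# `s` is any edge-scoring function s(head, dependent).

def tree_score(tree, s):
    """Score a dependency tree given as {dependent: head}."""
    return sum(s(head, dep) for dep, head in tree.items())

# Example: "John saw Mary" (words indexed from 1), with heads
# ROOT -> saw, saw -> John, saw -> Mary, under a dummy uniform scorer.
example_tree = {2: 0, 1: 2, 3: 2}
print(tree_score(example_tree, lambda i, j: 1.0))  # 3.0
```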
Process:
- start with a fully connected directed graph $G$, i.e., assume a directed edge between every pair of words (plus edges from a ROOT node to every word);
- assume you have a scoring function that assigns a score $s(i,j)$ to every edge $(i,j)$;
- find the maximum spanning tree (MST) of $G$, i.e., the directed tree with the highest overall score that includes all nodes of $G$;
- this is possible in $O(n^2)$ time using an efficient implementation of the Chu-Liu-Edmonds algorithm:
  - the naive version runs in $O(mn)$, where $m$ is the number of edges and $n$ the number of nodes;
  - the resulting MST is not guaranteed to be projective;
- the highest-scoring parse is the MST of $G$.
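The contraction step of Chu-Liu-Edmonds is easiest to see in code. Below is a sketch of the naive $O(mn)$ recursion over a dense score matrix; all names are our own, and this is meant to illustrate the algorithm rather than serve as an optimized implementation:

```python
import numpy as np

def _find_cycle(heads):
    """Return a list of nodes forming a cycle under `heads`, or None."""
    n = len(heads)
    color = [0] * n                      # 0 = unvisited, 1 = on path, 2 = done
    for start in range(1, n):
        if color[start]:
            continue
        path, v = [], start
        while v > 0 and color[v] == 0:   # follow head pointers toward ROOT
            color[v] = 1
            path.append(v)
            v = int(heads[v])
        if v > 0 and color[v] == 1:      # re-entered the current path: cycle
            return path[path.index(v):]
        for u in path:
            color[u] = 2
    return None

def chu_liu_edmonds(scores):
    """Maximum spanning arborescence rooted at node 0 (naive recursion).
    scores[i, j] = score of edge i -> j (head -> dependent).
    Returns heads[j] = chosen head of node j, with heads[0] = -1."""
    n = scores.shape[0]
    scores = scores.copy()
    np.fill_diagonal(scores, -np.inf)    # forbid self-loops
    scores[:, 0] = -np.inf               # forbid edges into ROOT

    heads = scores.argmax(axis=0)        # greedy: best head for every word
    heads[0] = -1
    cycle = _find_cycle(heads)
    if cycle is None:                    # greedy choice is already a tree
        return heads

    # Contract the cycle into pseudo-node c and recurse on the smaller graph.
    in_cycle = set(cycle)
    rest = [v for v in range(n) if v not in in_cycle]   # rest[0] == 0 (ROOT)
    c = len(rest)
    cyc_score = sum(scores[heads[v], v] for v in cycle)
    new = np.full((c + 1, c + 1), -np.inf)
    enter, leave = {}, {}
    for a, i in enumerate(rest):
        for b, k in enumerate(rest):
            new[a, b] = scores[i, k]
        # Best edge from i into the cycle (it displaces one in-cycle edge)...
        v = max(cycle, key=lambda u: scores[i, u] - scores[heads[u], u])
        new[a, c] = cyc_score + scores[i, v] - scores[heads[v], v]
        enter[a] = v
        # ...and best edge from the cycle out to i.
        w = max(cycle, key=lambda u: scores[u, i])
        new[c, a] = scores[w, i]
        leave[a] = w

    sub = chu_liu_edmonds(new)

    # Expand the contracted solution back onto the original nodes.
    final = heads.copy()
    for b, k in enumerate(rest):
        final[k] = -1 if sub[b] == -1 else (leave[b] if sub[b] == c else rest[sub[b]])
    final[enter[sub[c]]] = rest[sub[c]]  # the entering edge breaks the cycle
    return final

# Tiny demo: ROOT (node 0) plus three words.
s = np.array([[-np.inf, 10.0,     5.0,     5.0],
              [-np.inf, -np.inf,  8.0,     2.0],
              [-np.inf,  7.0,    -np.inf,  9.0],
              [-np.inf,  4.0,     3.0,    -np.inf]])
print(chu_liu_edmonds(s))                # heads: [-1, 0, 1, 2]
```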
Scoring edges with a neural network
- An effective formulation scores the arc from head $w_j$ to dependent $w_i$ as $s(j,i) = g(\mathbf{a}_j, \mathbf{a}_i)$, where:
- we get $\mathbf{a}_i$ by concatenating the hidden states of a forward and backward RNN at position $i$.
- the function $g({\mathbf {a}_j,\mathbf{a}_i})$ computes an association score telling us how much word $w_i$ prefers word $w_j$ as its head.
- Association scores are a useful way to select from a dynamic group of candidates, and they underlie the idea of attention.
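As a concrete sketch of this setup, here is a minimal PyTorch module in which a bidirectional LSTM produces the $\mathbf{a}_i$ and an MLP plays the role of $g(\mathbf{a}_j, \mathbf{a}_i)$; all module and dimension names are our own, and published architectures (e.g., biaffine scorers) differ in the details:

```python
import torch
import torch.nn as nn

class EdgeScorer(nn.Module):
    """Score every candidate head j for every word i: s(j, i) = g(a_j, a_i)."""

    def __init__(self, emb_dim, hidden_dim):
        super().__init__()
        # Bidirectional encoder: a_i concatenates forward/backward states.
        self.encoder = nn.LSTM(emb_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        # g(a_j, a_i): an MLP over the concatenated head/dependent encodings.
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, word_embs):                      # (batch, n, emb_dim)
        a, _ = self.encoder(word_embs)                 # (batch, n, 2*hidden)
        n = a.size(1)
        head = a.unsqueeze(2).expand(-1, -1, n, -1)    # a_j, broadcast over i
        dep = a.unsqueeze(1).expand(-1, n, -1, -1)     # a_i, broadcast over j
        pair = torch.cat([head, dep], dim=-1)          # (batch, n, n, 4*hidden)
        return self.mlp(pair).squeeze(-1)              # scores[b, j, i]

scorer = EdgeScorer(emb_dim=100, hidden_dim=128)
scores = scorer(torch.randn(1, 5, 100))                # (1, 5, 5)
```

The resulting $n \times n$ score matrix can be handed directly to the MST decoder above; taking a softmax over the candidate heads of each word gives exactly the attention-style distribution alluded to in the last bullet.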
Transition-based dependency parsing
- For a given parser state, the transition system defines the set of actions the parser can take.
- If more than one action is applicable, a classifier (e.g., an SVM) is used to decide which action to take.
- This also requires a mechanism for computing scores over a set of (possibly dynamic) candidates; the paper uses an MLP to compute these scores.
- To understand the process of transition-based parsing, it helps to trace the transition system step by step, as in the sketch below.
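A minimal sketch of one common transition system (arc-standard) makes the process concrete; the hypothetical `choose_action` argument stands in for the trained classifier discussed above:

```python
def legal_actions(stack, buffer):
    """Actions permitted in the current state (ROOT may not be a dependent)."""
    actions = []
    if buffer:
        actions.append("SHIFT")
    if len(stack) > 2:
        actions.append("LEFT-ARC")
    if len(stack) > 1:
        actions.append("RIGHT-ARC")
    return actions

def parse(words, choose_action):
    """Greedy arc-standard parsing; returns arcs as (head, dependent) pairs."""
    stack = [0]                            # node 0 is ROOT
    buffer = list(range(1, len(words) + 1))
    arcs = []
    while buffer or len(stack) > 1:
        action = choose_action(stack, buffer, legal_actions(stack, buffer))
        if action == "SHIFT":              # move the next word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":         # top word heads the one below it
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif action == "RIGHT-ARC":        # word below the top heads the top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# Mechanics demo with a trivial policy (always take the first legal action);
# the result is a valid tree, though of course not the correct parse.
print(parse(["John", "saw", "Mary"], lambda st, buf, acts: acts[0]))
# [(3, 2), (3, 1), (0, 3)]
```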
Comparison
Comparing MST and transition-based parsers:
- the MST parser selects the globally optimal tree, given a set of edges with scores;
- it can naturally handle projective and non-projective trees;
- a transition-based parser makes a sequence of local decisions about the best parse action;
- it can be extended to non-projective dependency trees by changing the transition set (e.g., by adding a swap transition);
- accuracies are similar, but transition-based parsers are faster;
- both require scoring a (possibly dynamic) set of candidates, which can be implemented with neural networks conditioned on bidirectional RNN encodings of the sentence.
References
[1] Jurafsky, D., & Martin, J. H. Speech and Language Processing (3rd ed.). Pearson.