This post introduces dependency parsing, covering: the differences between dependency parsing and constituency parsing and the advantages of each; the two main approaches to dependency parsing (transition-based and graph-based/MST); and how to implement dependency parsing with RNNs.
Source reference:
[1] Jurafsky and Martin, Ch. 14 (3rd edition)
Intro
Constituents vs. Dependencies
- Syntax is often described in terms of constituency
- Traditional grammars model constituent structure:
- they capture the configurational patterns of sentences.
- In some languages (e.g., German), there is little evidence for the existence of a VP constituent.
- Constituent structures leave out much semantic information.
- A dependency structure consists of dependency relations, which are binary and asymmetric.
- Dependency syntax is closer to semantics
- Dependency syntax is still (usually) tree-like: connected, single-headed, acyclic.
- Dependency trees can be non-projective, i.e., they may contain crossing arcs (e.g., "A hearing is scheduled on the issue today", where "hearing" governs "on the issue" across "is scheduled"); a non-projective dependency grammar is not context-free.
Parsers
Dependency Parsing
Why treat dependency parsing as a distinct topic, rather than reusing constituency parsing algorithms:
- context-free parsing algorithms base their decisions on adjacency;
- Long-distance dependency (non-adjacency): in a dependency structure, a dependent need not be adjacent to its head (even if the structure is projective);
- we need new parsing algorithms to deal with non-adjacency (and with non-projectivity if present).
Two types of dependency parsers
- graph-based dependency parsing, based on maximum spanning trees (MST parser)
- transition-based dependency parsing, an extension of shift-reduce parsing (MALT parser)
Maximum Spanning Trees (MST)
- Goal: find the highest scoring dependency tree $\mathbf{y}$ in the space of all possible trees for a sentence $\mathbf{x}$.
Formulation:
- The score of a dependency edge $(i,j)$ is given by a function $s(i,j)$.
- The score of a dependency tree $\mathbf{y}$ for sentence $\mathbf{x}$ is then the sum of its edge scores:
$$s(\mathbf{x}, \mathbf{y}) = \sum_{(i,j) \in \mathbf{y}} s(i,j)$$
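For concreteness, here is a minimal Python sketch of this decomposition; the function and variable names are our own, and the dummy scorer stands in for any real edge-scoring model:

```python
# Minimal sketch: a tree's score is the sum of its edge scores.
# `tree` maps each dependent index to its head index (0 = ROOT);
# `s` is any edge-scoring function s(head, dependent).

def tree_score(tree, s):
    """Score a dependency tree given as {dependent: head}."""
    return sum(s(head, dep) for dep, head in tree.items())

# Example: "John saw Mary" (words indexed from 1), with heads
# ROOT -> saw, saw -> John, saw -> Mary, under a dummy uniform scorer.
example_tree = {2: 0, 1: 2, 3: 2}
print(tree_score(example_tree, lambda i, j: 1.0))  # 3.0
```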
Process:
- start with a fully connected directed graph $G$, i.e., assume a directed edge between every pair of words (plus edges from a ROOT node to every word);
- assume you have a scoring function that assigns a score $s(i,j)$ to every edge $(i,j)$;
- find the maximum spanning tree (MST) of $G$, i.e., the directed tree with the highest overall score that includes all nodes of $G$;
- this is possible in $O(n^2)$ time using an efficient implementation of the Chu-Liu-Edmonds algorithm:
  - the naive version runs in $O(mn)$, where $m$ is the number of edges and $n$ the number of nodes;
  - the resulting MST is not guaranteed to be projective;
- the highest-scoring parse is the MST of $G$.
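The contraction step of Chu-Liu-Edmonds is easiest to see in code. Below is a sketch of the naive $O(mn)$ recursion over a dense score matrix; all names are our own, and this is meant to illustrate the algorithm rather than serve as an optimized implementation:

```python
import numpy as np

def _find_cycle(heads):
    """Return a list of nodes forming a cycle under `heads`, or None."""
    n = len(heads)
    color = [0] * n                      # 0 = unvisited, 1 = on path, 2 = done
    for start in range(1, n):
        if color[start]:
            continue
        path, v = [], start
        while v > 0 and color[v] == 0:   # follow head pointers toward ROOT
            color[v] = 1
            path.append(v)
            v = int(heads[v])
        if v > 0 and color[v] == 1:      # re-entered the current path: cycle
            return path[path.index(v):]
        for u in path:
            color[u] = 2
    return None

def chu_liu_edmonds(scores):
    """Maximum spanning arborescence rooted at node 0 (naive recursion).
    scores[i, j] = score of edge i -> j (head -> dependent).
    Returns heads[j] = chosen head of node j, with heads[0] = -1."""
    n = scores.shape[0]
    scores = scores.copy()
    np.fill_diagonal(scores, -np.inf)    # forbid self-loops
    scores[:, 0] = -np.inf               # forbid edges into ROOT

    heads = scores.argmax(axis=0)        # greedy: best head for every word
    heads[0] = -1
    cycle = _find_cycle(heads)
    if cycle is None:                    # greedy choice is already a tree
        return heads

    # Contract the cycle into pseudo-node c and recurse on the smaller graph.
    in_cycle = set(cycle)
    rest = [v for v in range(n) if v not in in_cycle]   # rest[0] == 0 (ROOT)
    c = len(rest)
    cyc_score = sum(scores[heads[v], v] for v in cycle)
    new = np.full((c + 1, c + 1), -np.inf)
    enter, leave = {}, {}
    for a, i in enumerate(rest):
        for b, k in enumerate(rest):
            new[a, b] = scores[i, k]
        # Best edge from i into the cycle (it displaces one in-cycle edge)...
        v = max(cycle, key=lambda u: scores[i, u] - scores[heads[u], u])
        new[a, c] = cyc_score + scores[i, v] - scores[heads[v], v]
        enter[a] = v
        # ...and best edge from the cycle out to i.
        w = max(cycle, key=lambda u: scores[u, i])
        new[c, a] = scores[w, i]
        leave[a] = w

    sub = chu_liu_edmonds(new)

    # Expand the contracted solution back onto the original nodes.
    final = heads.copy()
    for b, k in enumerate(rest):
        final[k] = -1 if sub[b] == -1 else (leave[b] if sub[b] == c else rest[sub[b]])
    final[enter[sub[c]]] = rest[sub[c]]  # the entering edge breaks the cycle
    return final

# Tiny demo: ROOT (node 0) plus three words.
s = np.array([[-np.inf, 10.0,     5.0,     5.0],
              [-np.inf, -np.inf,  8.0,     2.0],
              [-np.inf,  7.0,    -np.inf,  9.0],
              [-np.inf,  4.0,     3.0,    -np.inf]])
print(chu_liu_edmonds(s))                # heads: [-1, 0, 1, 2]
```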
Scoring edges with a neural network
- An effective formulation scores the arc from head $w_j$ to dependent $w_i$ as $s(j,i) = g(\mathbf{a}_j, \mathbf{a}_i)$, where:
- we get $\mathbf{a}_i$ by concatenating the hidden states of a forward and backward RNN at position $i$.
- the function $g({\mathbf {a}_j,\mathbf{a}_i})$ computes an association score telling us how much word $w_i$ prefers word $w_j$ as its head.
- Association scores are a useful way to select from a dynamic group of candidates, and they underlie the idea of attention.
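As a concrete sketch of this setup, here is a minimal PyTorch module in which a bidirectional LSTM produces the $\mathbf{a}_i$ and an MLP plays the role of $g(\mathbf{a}_j, \mathbf{a}_i)$; all module and dimension names are our own, and published architectures (e.g., biaffine scorers) differ in the details:

```python
import torch
import torch.nn as nn

class EdgeScorer(nn.Module):
    """Score every candidate head j for every word i: s(j, i) = g(a_j, a_i)."""

    def __init__(self, emb_dim, hidden_dim):
        super().__init__()
        # Bidirectional encoder: a_i concatenates forward/backward states.
        self.encoder = nn.LSTM(emb_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        # g(a_j, a_i): an MLP over the concatenated head/dependent encodings.
        self.mlp = nn.Sequential(
            nn.Linear(4 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, word_embs):                      # (batch, n, emb_dim)
        a, _ = self.encoder(word_embs)                 # (batch, n, 2*hidden)
        n = a.size(1)
        head = a.unsqueeze(2).expand(-1, -1, n, -1)    # a_j, broadcast over i
        dep = a.unsqueeze(1).expand(-1, n, -1, -1)     # a_i, broadcast over j
        pair = torch.cat([head, dep], dim=-1)          # (batch, n, n, 4*hidden)
        return self.mlp(pair).squeeze(-1)              # scores[b, j, i]

scorer = EdgeScorer(emb_dim=100, hidden_dim=128)
scores = scorer(torch.randn(1, 5, 100))                # (1, 5, 5)
```

The resulting $n \times n$ score matrix can be handed directly to the MST decoder above; taking a softmax over the candidate heads of each word gives exactly the attention-style distribution alluded to in the last bullet.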
Transition-based dependency parsing
- For a given parser state, the transition system defines the set of actions the parser can take.
- If more than one action is applicable, a classifier (e.g., an SVM) is used to decide which action to take.
- This also requires a mechanism for computing scores over a set of (possibly dynamic) candidates; the paper uses an MLP to compute these scores.
- To understand the process of transition-based parsing, it helps to trace the transition system step by step, as in the sketch below.
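A minimal sketch of one common transition system (arc-standard) makes the process concrete; the hypothetical `choose_action` argument stands in for the trained classifier discussed above:

```python
def legal_actions(stack, buffer):
    """Actions permitted in the current state (ROOT may not be a dependent)."""
    actions = []
    if buffer:
        actions.append("SHIFT")
    if len(stack) > 2:
        actions.append("LEFT-ARC")
    if len(stack) > 1:
        actions.append("RIGHT-ARC")
    return actions

def parse(words, choose_action):
    """Greedy arc-standard parsing; returns arcs as (head, dependent) pairs."""
    stack = [0]                            # node 0 is ROOT
    buffer = list(range(1, len(words) + 1))
    arcs = []
    while buffer or len(stack) > 1:
        action = choose_action(stack, buffer, legal_actions(stack, buffer))
        if action == "SHIFT":              # move the next word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":         # top word heads the one below it
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif action == "RIGHT-ARC":        # word below the top heads the top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

# Mechanics demo with a trivial policy (always take the first legal action);
# the result is a valid tree, though of course not the correct parse.
print(parse(["John", "saw", "Mary"], lambda st, buf, acts: acts[0]))
# [(3, 2), (3, 1), (0, 3)]
```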
Comparison
Comparing MST and transition-based parsers:
- the MST parser selects the globally optimal tree, given a set of edges with scores;
- it can naturally handle projective and non-projective trees;
- a transition-based parser makes a sequence of local decisions about the best parse action;
- it can be extended to non-projective dependency trees by changing the transition set (e.g., by adding a swap transition);
- accuracies are similar, but transition-based parsers are faster;
- both require scoring a (possibly dynamic) set of candidates, which can be implemented with neural networks conditioned on bidirectional RNN encodings of the sentence.
References
[1] Jurafsky, D., & Martin, J. H. Speech and Language Processing (3rd ed.). Pearson.