
Unsupervised PoS Tagging

This post introduces unsupervised Bayesian PoS tagging, along with experimental results and analysis. It also briefly summarises maximum likelihood estimation (MLE) and hidden Markov models (HMMs).

The original paper is A Fully Bayesian Approach to Unsupervised Part-of-Speech Tagging (Sharon Goldwater and Thomas L. Griffiths, 2007).

Still being lazy, so the notes below stay in English for now.

Intro

Inference for HMMs

  • Notation: given that $t_{i-1} = t$, the value of $t_i$ is drawn from a multinomial distribution with parameters $\tau^{(t)}$; likewise, given that $t_i = t$, the word $w_i$ is drawn from a multinomial with parameters $\omega^{(t)}$.
  • Inference: decoding, i.e. applying the model at test time. Knowing $\mathbf \theta$, we can compute $P(\mathbf t, \mathbf w)$ (see the sketch after this list).
  • From it we can compute $P(\mathbf w)$, as in a language model,
  • as well as $P(\mathbf t \mid \mathbf w)$, as in a PoS tagger.
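
To make these quantities concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper; the array names `tau`, `omega` and `start` are assumed) that computes $\log P(\mathbf t, \mathbf w)$ directly and $\log P(\mathbf w)$ with the forward algorithm, with tags and words encoded as integer ids:

```python
import numpy as np

def joint_log_prob(tags, words, tau, omega, start):
    """log P(t, w) for an HMM with transition matrix tau[t, t'],
    emission matrix omega[t, w] and initial tag distribution start[t]."""
    lp = np.log(start[tags[0]]) + np.log(omega[tags[0], words[0]])
    for i in range(1, len(words)):
        lp += np.log(tau[tags[i - 1], tags[i]])   # transition t_{i-1} -> t_i
        lp += np.log(omega[tags[i], words[i]])    # emission t_i -> w_i
    return lp

def marginal_log_prob(words, tau, omega, start):
    """log P(w) via the forward algorithm (sums over all tag sequences)."""
    alpha = start * omega[:, words[0]]            # forward probabilities at position 0
    for w in words[1:]:
        alpha = (alpha @ tau) * omega[:, w]       # recurse over positions
    return np.log(alpha.sum())
```

Replacing the sum in the forward recursion with a max (Viterbi) would instead give the single best tag sequence, i.e. the decoding step of a PoS tagger.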

Parameter Estimation for HMMs

  • Estimation: training the model, i.e. determining its parameters: a procedure to set $\mathbf \theta$ based on data.
  • Bayes' rule: $P(\mathbf \theta \mid \text{data}) \propto P(\text{data} \mid \mathbf \theta)\, P(\mathbf \theta)$, i.e. likelihood times prior.
  • We could use MLE, or Bayesian estimation (which actually involves no parameter estimation at all).

Maximum Likelihood Estimation

  • Choose the $\mathbf \theta$ that makes the data most probable, ignoring the prior term. This is equivalent to assuming a uniform prior.
  • In supervised systems, the \textit{relative frequency estimate} is equivalent to the MLE.
  • In unsupervised systems, the expectation maximization (EM) algorithm is used to estimate $\mathbf \theta$.
  • Process:
    • E-step: use the current estimate of $\theta$ to compute expected counts of the hidden events $n$.
    • M-step: recompute $\theta$ from the expected counts.
  • Examples: the forward-backward algorithm for HMMs, the inside-outside algorithm for PCFGs, k-means clustering (a minimal sketch of one EM iteration for HMMs follows this list).
  • EM works well on: word alignment for machine translation, speech recognition, …
  • It often fails on:
    • probabilistic context-free grammars (PCFGs): highly sensitive to initialisation; reported F-scores are generally low.
    • HMMs: even very small amounts of annotated training data have been shown to work better than EM.
    • a similar picture holds for many other tasks.
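
As a rough illustration of the E- and M-steps for HMMs (my own sketch rather than any reference implementation; no rescaling or log-space arithmetic, so it is only suitable for short sentences), one iteration of forward-backward EM might look like this:

```python
import numpy as np

def em_step(sentences, tau, omega, start):
    """One EM iteration for an unsupervised HMM.
    tau: (T, T) transitions, omega: (T, V) emissions, start: (T,) initial
    distribution; each sentence is a list of integer word ids."""
    T, V = omega.shape
    trans_c, emit_c, start_c = np.zeros((T, T)), np.zeros((T, V)), np.zeros(T)

    for words in sentences:
        n = len(words)
        # E-step: forward and backward probabilities under the current theta
        fwd, bwd = np.zeros((n, T)), np.zeros((n, T))
        fwd[0] = start * omega[:, words[0]]
        for i in range(1, n):
            fwd[i] = (fwd[i - 1] @ tau) * omega[:, words[i]]
        bwd[-1] = 1.0
        for i in range(n - 2, -1, -1):
            bwd[i] = tau @ (omega[:, words[i + 1]] * bwd[i + 1])
        Z = fwd[-1].sum()                          # P(w) for this sentence

        # expected counts of the hidden events
        gamma = fwd * bwd / Z                      # P(t_i = t | w)
        start_c += gamma[0]
        for i in range(n):
            emit_c[:, words[i]] += gamma[i]
        for i in range(n - 1):
            xi = np.outer(fwd[i], omega[:, words[i + 1]] * bwd[i + 1]) * tau / Z
            trans_c += xi                          # P(t_i = t, t_{i+1} = t' | w)

    # M-step: relative frequencies of the expected counts
    return (trans_c / trans_c.sum(axis=1, keepdims=True),
            emit_c / emit_c.sum(axis=1, keepdims=True),
            start_c / start_c.sum())
```

Running this to convergence from a random initialisation is exactly the procedure that, as noted above, tends to give poor taggers even though it keeps increasing the likelihood.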

Main Model

Bayesian HMM

  • Parameter Estimation: we are not interested in the value of $\theta$
  • Bayesian integration: integrating over $\theta$ gives us an average over all possible parameter values.

    • accounts for uncertainty as to the exact value of $\theta$
    • models the shape of the distribution over $\theta$
    • increases robustness: there may be a range of good values of $\theta$
    • we can use priors favouring sparse solutions.
  • Model (following Goldwater & Griffiths, 2007):

    • $t_i \mid t_{i-1} = t,~ \tau^{(t)} \sim \text{Mult}(\tau^{(t)})$
    • $w_i \mid t_i = t,~ \omega^{(t)} \sim \text{Mult}(\omega^{(t)})$
    • $\tau^{(t)} \mid \alpha \sim \text{Dirichlet}(\alpha)$
    • $\omega^{(t)} \mid \beta \sim \text{Dirichlet}(\beta)$

  • Integrating out the parameters $\mathbf \theta = (\mathbf \tau, \mathbf \omega)$, we calculate the probability of each of the $T$ possible tags.

  • For inference, we estimate $P(\mathbf t \mid \mathbf w)$ using a method called \textit{Gibbs sampling} (a sketch of one resampling step follows this list).
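
The collapsed Gibbs sampler resamples one tag at a time from its conditional distribution given all other tags, with $\mathbf \tau$ and $\mathbf \omega$ integrated out so that only Dirichlet-multinomial predictive counts remain. The sketch below is my own simplification: it drops the small correction terms the exact sampler needs when $t_{i-1}$, $t_i$ and $t_{i+1}$ coincide, and uses the full vocabulary size in the emission denominator.

```python
import numpy as np

def gibbs_resample_tag(i, tags, words, trans_c, emit_c, tag_c,
                       T, V, alpha, beta, rng):
    """Resample t_i given all other tags, with tau and omega integrated out.
    trans_c[t, t'], emit_c[t, w] and tag_c[t] are counts over the current
    assignment with position i already removed; assumes i >= 1 and that
    position 0 holds a sentence-boundary tag."""
    prev_t = tags[i - 1]
    next_t = tags[i + 1] if i + 1 < len(tags) else None
    p = np.empty(T)
    for t in range(T):
        # predictive emission probability: (n(t, w_i) + beta) / (n_t + V*beta)
        p[t] = (emit_c[t, words[i]] + beta) / (tag_c[t] + V * beta)
        # predictive transition into t: (n(t_{i-1}, t) + alpha) / (n_{t_{i-1}} + T*alpha)
        p[t] *= (trans_c[prev_t, t] + alpha) / (tag_c[prev_t] + T * alpha)
        # predictive transition out of t (if not sentence-final)
        if next_t is not None:
            p[t] *= (trans_c[t, next_t] + alpha) / (tag_c[t] + T * alpha)
    p /= p.sum()
    return rng.choice(T, p=p)
```

A full sampler sweeps over all positions, decrementing the counts for the current tag at position $i$ before calling this and incrementing them again with the newly sampled tag.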

Dirichlet Distribution

  • A distribution over distributions, used as a prior.
  • Definition: $\mathbf \theta \sim \text{Dirichlet}(\alpha_1, \dots, \alpha_K)$ has density $P(\theta_1, \dots, \theta_K) \propto \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$ over the probability simplex.
  • $\alpha$ are the parameters of the Dirichlet distribution. We usually use symmetric Dirichlets, where $\alpha_1, \dots, \alpha_K$ are all equal to some value $\beta$; we write Dirichlet($\beta$) to mean Dirichlet($\beta, \dots, \beta$).
  • Properties (illustrated in the sketch after this list):
    • With $\beta > 1$, we \textbf{prefer} (near-)uniform distributions over $\theta$.
    • With $\beta = 1$, we have no preference over $\theta$ (which does not mean we choose the uniform distribution).
    • With $\beta < 1$, we prefer sparse (skewed) distributions.
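
A quick way to see the effect of $\beta$ (a toy sketch, not from the paper) is to draw a symmetric Dirichlet sample for a few values of $\beta$ and check how concentrated the resulting $\theta$ is:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # dimensionality, e.g. number of tags or word types

for beta in (10.0, 1.0, 0.1):
    theta = rng.dirichlet([beta] * K)   # one draw from Dirichlet(beta, ..., beta)
    print(f"beta={beta:>4}: max = {theta.max():.2f}, "
          f"mass in top 2 = {np.sort(theta)[-2:].sum():.2f}")
# Typically: beta > 1 gives a near-uniform theta, while beta < 1 piles
# almost all of the mass onto a few entries (a sparse, skewed distribution).
```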

Evaluation of BHMM

  • Compared against MLHMM and CRF/CE

    • Results

      • Integrating over parameters is useful in itself, even with an uninformative prior ($\alpha = \beta = 1$).
      • A better prior can help even more, though the results do not reach the state of the art.
    • The BHMM identifies a sequence of tags that has high probability over a range of parameter values, rather than choosing tags based on the single best set of parameters (MLHMM).

    • The smaller effect of $\beta$:

      although the true output distributions tend to be sparse as well, the level of sparseness depends on the tag (consider function words vs. content words in particular). Therefore, a value of $\beta$ that accurately reflects the most probable output distributions for some tags may be a poor choice for other tags.

    • The transition probability matrix is sparse: the optimal value of $\alpha$ is 0.003.
  • Syntactic clustering
    • With MLHMM: different tokens of the same word type are usually assigned to the same cluster, but types are assigned to clusters more or less at random, and all clusters have approximately the same number of types.
    • BHMM: the clusters found by the BHMM tend to be more coherent and more variable in size.
    • The BHMM transition matrix is sparse; the MLHMM's is not.

Summary

  • Unsupervised PoS tagging is useful for building lexica and taggers for new languages or domains;
  • a maximum likelihood HMM trained with EM performs poorly (see the MLE section above);
  • Bayesian HMM with Gibbs sampling can be used instead;
    • we are not interested in the parameters themselves; instead of parameter estimation, we integrate them out.
    • Gibbs sampling is used for inference (see below).
  • the Bayesian HMM improves performance by averaging out uncertainty;
    • MLE assumes a uniform prior over $\mathbf \theta$ (with $100\%$ certainty) and then picks the single set of $\theta$ that maximises the likelihood.
    • The Bayesian HMM does not make that assumption; by choosing the hyperparameters $\alpha$ and $\beta$ we express different degrees of belief about $\mathbf \theta$:
      • With $\alpha,~\beta > 1$, we \textbf{prefer} (near-)uniform values of $\mathbf \theta$.
      • With $\alpha,~\beta = 1$, we have no preference over $\mathbf \theta$ (which does not mean we choose the uniform distribution).
      • With $\alpha,~\beta < 1$, we prefer sparse values of $\mathbf \theta$.
  • it also allows us to use priors that favour sparse solutions as they occur in language data.
  • Other types of discrete latent variable models (e.g. for syntax or semantics) use similar methods.

References

Goldwater, S., & Griffiths, T. L. (2007). A fully Bayesian approach to unsupervised part-of-speech tagging. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 744–751).

Modified 18th-May-2018 12:00