11751 Week3 Digest

This digest contains two components: the concepts I failed to make sense of in class and important sections.

Out-of-Vocabulary (OOV)

  • Definition: In plain language, an OOV word is one that appears in the test data but does not appear in the training data. More technically, out-of-vocabulary (OOV) terms are terms that are not part of the lexicon available to a natural language processing system.
  • How to handle OOV:
    • <unk> token
    • Spell check: only works for misspelled words; it cannot handle genuinely new words.
    • Subwords: use Facebook's fastText library, or a vectorizer from sklearn.feature_extraction.text with analyzer set to char or char_wb.
    • BPE: the more recommended option.
      How does it work?

      BPE ensures that the most common words are represented in the vocabulary as single tokens, while rare words are broken down into two or more subword tokens. This is exactly what a subword-based tokenization algorithm is meant to do.

      Another property of BPE is that its granularity lies between words (vocabulary too large, $|\mathcal{V}|$ can be 100k) and characters (vocabulary too small, only 26 letters). BPE's vocabulary size is a good middle point: it is a tunable parameter, and the algorithm generates a subword lexicon of the requested size.
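To make the merge procedure concrete, here is a minimal sketch of BPE learning and application in plain Python. The toy word counts, the `</w>` end-of-word marker, and the function names `learn_bpe` and `apply_bpe` are assumptions made for illustration; real toolkits differ in details such as tie-breaking and pre-tokenization.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_symbols(symbols, pair):
    """Merge every left-to-right occurrence of `pair` into one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency dict."""
    # Start from characters plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:  # every word is already a single token
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = {tuple(merge_symbols(list(s), best)): f for s, f in vocab.items()}
        merges.append(best)
    return merges


def apply_bpe(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return symbols


# Toy corpus (made-up counts): frequent words end up as single tokens.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=10)
print(apply_bpe("newest", merges))  # a word seen in training
print(apply_bpe("lowest", merges))  # an OOV word, split into known subwords
```

The number of merges plays the role of the vocabulary-size knob: more merges push the segmentation toward whole words, fewer merges push it toward characters.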


  • Soft Alignment: The assignment of frames to phonemes is a probability distribution: each frame belongs to every phoneme in the sequence with some probability. Attention-based ASR is based on soft alignment.

    Soft Alignment Example

  • Hard Alignment: No probability distributions. Each frame belongs to exactly one phoneme.

    Hard Alignment Example
    We can use a trellis to align phonemes and frames. Below is an example where $N=3$ and $T=5$.

    Trellis Example
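The difference between the two alignment types can be sketched on a small trellis with $N=3$ and $T=5$. The code below is a rough illustration, not the course's implementation: the per-frame phoneme likelihoods are made-up numbers, transitions are assumed to be left-to-right (stay or advance by one) with equal weight, and the function names are hypothetical. The hard alignment is the single best path (Viterbi); the soft alignment is the per-frame posterior from the forward-backward recursion.

```python
import math

NEG_INF = float("-inf")


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def hard_alignment(log_lik):
    """Viterbi path through a left-to-right trellis; log_lik[j][t] scores phoneme j at frame t."""
    n, t_len = len(log_lik), len(log_lik[0])
    score = [[NEG_INF] * t_len for _ in range(n)]
    back = [[0] * t_len for _ in range(n)]
    score[0][0] = log_lik[0][0]  # the path must start in the first phoneme
    for t in range(1, t_len):
        for j in range(n):
            stay = score[j][t - 1]
            move = score[j - 1][t - 1] if j > 0 else NEG_INF
            best, back[j][t] = (move, j - 1) if move > stay else (stay, j)
            score[j][t] = best + log_lik[j][t]
    path = [n - 1]  # the path must end in the last phoneme
    for t in range(t_len - 1, 0, -1):
        path.append(back[path[-1]][t])
    return path[::-1]


def soft_alignment(log_lik):
    """Forward-backward posteriors: gamma[t][j] = p(frame t belongs to phoneme j)."""
    n, t_len = len(log_lik), len(log_lik[0])
    alpha = [[NEG_INF] * t_len for _ in range(n)]
    beta = [[NEG_INF] * t_len for _ in range(n)]
    alpha[0][0] = log_lik[0][0]
    for t in range(1, t_len):
        for j in range(n):
            prev = [alpha[j][t - 1]] + ([alpha[j - 1][t - 1]] if j > 0 else [])
            alpha[j][t] = logsumexp(prev) + log_lik[j][t]
    beta[n - 1][t_len - 1] = 0.0
    for t in range(t_len - 2, -1, -1):
        for j in range(n):
            nxt = [beta[j][t + 1] + log_lik[j][t + 1]]
            if j + 1 < n:
                nxt.append(beta[j + 1][t + 1] + log_lik[j + 1][t + 1])
            beta[j][t] = logsumexp(nxt)
    total = alpha[n - 1][t_len - 1]
    return [[math.exp(alpha[j][t] + beta[j][t] - total) for j in range(n)]
            for t in range(t_len)]


# Made-up likelihoods p(frame t | phoneme j) for N=3 phonemes and T=5 frames.
lik = [
    [0.90, 0.80, 0.20, 0.10, 0.10],
    [0.05, 0.10, 0.70, 0.60, 0.10],
    [0.05, 0.10, 0.10, 0.30, 0.80],
]
log_lik = [[math.log(p) for p in row] for row in lik]

print(hard_alignment(log_lik))  # one phoneme index per frame: [0, 0, 1, 1, 2]
for t, row in enumerate(soft_alignment(log_lik)):
    print(t, [round(p, 3) for p in row])  # one distribution per frame
```

In the hard case each frame gets exactly one phoneme index; in the soft case each frame gets a row of probabilities that sums to one.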

Acoustic Model

Unlike the newer attention-based end-to-end ASR, traditional ASR is HMM-based, and understanding it helps with the basics of ASR. A traditional HMM-based ASR system is composed of four components, as shown below.

Acoustic Model in HMM-based ASR Pipeline

We have already talked about the first component, feature extraction, and will now try to factorize the acoustic model. The features and the phonemes in the lexicon can be represented by $O$ and $L$ respectively.
$$O=(O_t\in \mathbb{R}^D \mid t=1,\cdots,T)$$
$$L=(l_j \mid j=1,\cdots,J)$$

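Assuming the alignment information is given, so that phoneme $l_j$ covers frames $T_{j-1}+1$ through $T_j$ (with $T_0=0$ and $T_J=T$), the acoustic model can be written as

$$\begin{aligned}
p(O|L) & = p(O_{1:T_1},O_{T_1+1:T_2},\cdots|l_1,l_2,\cdots)\\
& = p(O_{1:T_1}|O_{T_1+1:T_2},\cdots,l_1,l_2,\cdots)p(O_{T_1+1:T_2},\cdots|l_1, l_2,\cdots)\\
& = p(O_{1:T_1}|l_1)p(O_{T_1+1:T_2},\cdots|l_1, l_2,\cdots)\\
& \vdots\\
& = p(O_{1:T_1}|l_1)p(O_{T_1+1:T_2}|l_2)\cdots\\
& = \prod_{j=1}^{J}p(O_{T_{j-1}+1:T_j}|l_j)
\end{aligned}$$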

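Two rules have been applied: the product rule of probability, and the conditional independence assumption that each segment of frames depends only on its own phoneme.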
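As a rough numerical illustration of the factorized form, the sketch below scores a feature sequence segment by segment, given the alignment boundaries. Everything specific here is an assumption for illustration: each phoneme is modeled by a single diagonal Gaussian, frames inside a segment are treated as independent, and the phoneme labels, model parameters, and feature values are made up.

```python
import math


def log_gaussian(frame, mean, var):
    """Log density of a diagonal Gaussian at one D-dimensional frame."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )


def segment_log_lik(frames, model):
    """log p(O_segment | l_j), assuming frames are independent given the phoneme."""
    mean, var = model
    return sum(log_gaussian(f, mean, var) for f in frames)


def acoustic_log_lik(O, L, boundaries, models):
    """log p(O | L) = sum_j log p(O_{T_{j-1}+1:T_j} | l_j).

    boundaries[j-1] is T_j, the 1-based index of the last frame of phoneme j;
    T_0 = 0 is implicit and the last boundary must equal T = len(O).
    """
    assert len(L) == len(boundaries) and boundaries[-1] == len(O)
    total, start = 0.0, 0
    for phoneme, end in zip(L, boundaries):
        total += segment_log_lik(O[start:end], models[phoneme])
        start = end
    return total


# Made-up example: D=1, T=5 frames, J=3 phonemes.
O = [[0.1], [0.0], [1.1], [0.9], [2.0]]
L = ["a", "b", "c"]
models = {  # hypothetical (mean, variance) per phoneme
    "a": ([0.0], [1.0]),
    "b": ([1.0], [1.0]),
    "c": ([2.0], [1.0]),
}

good = acoustic_log_lik(O, L, [2, 4, 5], models)  # T_1=2, T_2=4, T_3=5
bad = acoustic_log_lik(O, L, [1, 2, 5], models)   # a poorer alignment
print(good, bad)
```

Because the product becomes a sum in the log domain, each segment can be scored independently once the boundaries are fixed, and a better alignment yields a higher total.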



Ziang Zhou
