11751 Week 3 Digest
This digest contains two kinds of material: concepts I struggled to make sense of in class, and sections I consider important.
Out-of-Vocabulary (OOV)
- Definition: In plain language, an OOV word is one that appears in the test lexicon but does not appear in the training data. More formally, out-of-vocabulary (OOV) terms are terms that are not part of the known lexicon in a natural language processing system.
- How to handle OOV:
  - `<unk>` token: map every OOV word to a special unknown token.
  - Spell check: only works for misspelled words; it can't handle genuinely new words.
  - Subwords: use Facebook's `fasttext` library, or `sklearn.feature_extraction.text` with `analyzer` set to `char` or `char_wb` (see the sketch after this list). More details are in the respective library docs.
  - BPE: the most recommended option.
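As a concrete illustration of the character n-gram option above, here is a minimal sketch using scikit-learn's `CountVectorizer`; the training strings and the OOV word are made up for illustration:

```python
# Character n-gram features with scikit-learn.
# analyzer="char_wb" builds n-grams only from characters inside word
# boundaries, so an unseen word still shares subword features with
# words seen in training.
from sklearn.feature_extraction.text import CountVectorizer

train = ["the cat sat", "a dog barked"]        # toy corpus
vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vec.fit_transform(train)

# An OOV word like "catdog" still maps onto known character n-grams.
X_test = vec.transform(["catdog"])
print(X_test.nnz > 0)  # True: some subword features overlap with training
```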
How does it work? BPE ensures that the most common words are represented in the vocabulary as single tokens, while rare words are broken down into two or more subword tokens, which is exactly what a subword-based tokenization algorithm should do.
Another thing about BPE is that its granularity sits somewhere between words (too large, $|\mathcal{V}|$ can be 100k) and characters (too few, only 26 letters in English). BPE's vocabulary size is a good middle ground: you can choose the vocabulary size yourself, and the algorithm will generate the lexicon with a subword-based tokenization.
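To make the merge procedure concrete, here is a toy BPE training loop in the style of Sennrich et al.'s original algorithm; the corpus and the number of merges are made up, and in practice you would use a library such as `subword-nmt` or `sentencepiece`:

```python
# Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge the pair everywhere it appears as two whole symbols."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Words are pre-split into characters, with an end-of-word marker </w>.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):  # the number of merges controls the vocabulary size
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    print(best)                     # e.g. ('e', 's') is merged first
    vocab = merge_pair(best, vocab)
```

After enough merges, frequent words like "low" collapse into single tokens while rare words remain split into subword pieces.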
Alignment
Soft Alignment: For each phoneme sequence, which frames belong to which phonemes is described by probability distributions. Attention-based ASR is based on soft alignment.
Hard Alignment: No probability distributions; each frame belongs to exactly one phoneme.
We can use a trellis to align phonemes and frames. Below is an example where $N=3$ and $T=5$.
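As a sketch of what the trellis encodes, the following toy script enumerates every monotonic hard alignment for $N=3$ phonemes and $T=5$ frames; the phoneme labels are arbitrary ARPAbet placeholders:

```python
# Enumerate hard alignments on a trellis: N=3 phonemes, T=5 frames.
# Each frame is assigned to exactly one phoneme, the assignment is
# monotonic, and every phoneme covers at least one frame.
from itertools import combinations

N, T = 3, 5
phonemes = ["/AA/", "/AE/", "/IY/"]  # placeholder labels

# A monotonic alignment is fixed by choosing N-1 boundaries among T-1 gaps.
for cuts in combinations(range(1, T), N - 1):
    bounds = (0,) + cuts + (T,)
    path = [phonemes[j]
            for j in range(N)
            for _ in range(bounds[j], bounds[j + 1])]
    print(path)
```

This prints $\binom{T-1}{N-1}=6$ paths. A soft alignment would place a probability distribution over these paths (or over the frame-phoneme cells of the trellis) instead of committing to a single one.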
Acoustic Model
Unlike the novel attention-based end-to-end ASR, traditional ASR is HMM-based, and understanding it helps with the basics of ASR. Traditional HMM-based ASR is composed of four components, as shown below.
We have talked about the first component, feature extraction, and will now try to factorize the acoustic model. The features and the phonemes in the lexicon can be represented as $O$ and $L$, respectively:
$$O=(O_t\in\mathbb{R}^D\mid t=1,\cdots,T)$$
$$L=(l_j\in\{/AA/,/AE/,\cdots\}\mid j=1,\cdots,J)$$
Assuming that the alignment information is given, the acoustic model can be written as
$$\begin{split}
p(O|L)&=p(O_{1:T_1},O_{T_1+1:T_2},\cdots|l_1,l_2,\cdots)\\
&=p(O_{1:T_1}|O_{T_1+1:T_2},\cdots,l_1,l_2,\cdots)\,p(O_{T_1+1:T_2},\cdots|l_1,l_2,\cdots)\\
&=p(O_{1:T_1}|l_1)\,p(O_{T_1+1:T_2},\cdots|l_1,l_2,\cdots)\\
&\ \ \vdots\\
&=p(O_{1:T_1}|l_1)\,p(O_{T_1+1:T_2}|l_2)\cdots\\
&=\prod_{j=1}^{J}p(O_{T_{j-1}+1:T_j}|l_j)
\end{split}$$
Two rules have been applied: the probabilistic chain rule (second line) and the conditional independence assumption that each feature segment depends only on its own phoneme (third line). Here $T_0=0$ and $T_J=T$.
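As a toy numeric check of this factorization, the sketch below assumes a given hard alignment and, purely for illustration, one unit-covariance Gaussian emission model per phoneme; real HMM systems use GMM or neural-network emission models instead. All parameters and boundaries here are made up:

```python
# With a given hard alignment, log p(O|L) is the sum of
# per-segment log-likelihoods, one segment per phoneme.
import numpy as np
from scipy.stats import multivariate_normal

D, T, J = 2, 5, 3
rng = np.random.default_rng(0)
O = rng.normal(size=(T, D))          # features O_1, ..., O_T
bounds = [0, 2, 3, 5]                # T_0=0, T_1=2, T_2=3, T_3=T=5
means = rng.normal(size=(J, D))      # one toy Gaussian per phoneme l_j

log_p = 0.0
for j in range(J):
    seg = O[bounds[j]:bounds[j + 1]]  # segment O_{T_{j-1}+1 : T_j}
    # frames in a segment are conditionally independent given l_j
    log_p += multivariate_normal.logpdf(seg, mean=means[j]).sum()
print(log_p)
```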
Feature Extraction
- MFCC: Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between is the best explanation I have found of MFCC and its extraction process; a minimal extraction sketch follows this list.
- Pitch: In speech, the relative highness or lowness of a tone as perceived by the ear, which depends on the number of vibrations per second produced by the vocal cords. Pitch is the main acoustic correlate of tone and intonation.
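As a minimal sketch of MFCC extraction in practice, assuming `librosa` is installed; the input file is a placeholder, and the parameter choices (13 coefficients, 25 ms window, 10 ms hop) are common conventions rather than anything prescribed above:

```python
# MFCC extraction with librosa.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)  # hypothetical input file
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr),      # 25 ms analysis window
    hop_length=int(0.010 * sr), # 10 ms frame shift
)
print(mfcc.shape)  # (13, T): one 13-dim feature vector per frame
```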