11751 Week5 Digest
Previous Homework Question
There was a question in hw3 asking which of the four models is not an end-to-end model. I remember searching online for a while, because concepts such as CTC and RNN-Transducer were rather new to me. Fortunately, this week's lectures helped solve that question.
Connectionist Temporal Classification (CTC)
I previously mentioned some resources for learning the big picture of CTC here. Now we will look into some important details.
Task Description
The CTC algorithm allows an RNN to learn sequence data directly, without requiring the correspondence between input and output sequences to be labeled in the training data in advance. CTC also achieves strong results in sequence learning tasks.
We can tell from the pipeline that the CTC model maps acoustic feature frames to phonemes in an end-to-end fashion.
Alignment
Take the word see as an example.
- First, we form the label sequence $W$ for see: $W$ = {s, e, e}
- Next, we insert the blank token <b> between each pair of characters and wrap the sequence with it: $W'$ = {<b>, s, <b>, e, <b>, e, <b>}
- Next, align $W'$ with the feature frames! The frame array has length $T$; since $T$ can vary, we could be shrinking or expanding the $W'$ sequence. No matter which way, there are two shared rules (a sketch enumerating valid alignments follows this walkthrough):
- All characters can be repeated in the process of alignment.
- Skipping rules for <b>:
  - {…, s, <b>, e, …}: we can skip the <b>, since the surrounding characters differ.
  - {…, e, <b>, e, …}: we cannot skip the <b>, since skipping it would merge the two e's into one.
- Finally, the aligned sequence $Z$ has length $T$, and may look like one of the following:
  - $Z$ = {<b>, s, e, <b>, e}
  - $Z$ = {s, <b>, e, <b>, e}
  - $Z$ = {s, s, e, <b>, e}
  - …

No matter which possible alignment $Z$ we take, it is clear that we followed the two rules.
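Here is the sketch referenced above: a minimal Python enumeration (my own illustrative code; `valid_alignments` is my own name, not from the lecture) of every length-$T$ alignment of $W'$ that obeys the two rules:

```python
def valid_alignments(w_prime, T, blank="<b>"):
    """Enumerate all length-T alignments of the blank-padded sequence w_prime.

    From position i we may stay at i (repeat the token), move to i+1, or
    skip a blank at i+1 and jump to i+2 -- only when the characters on
    both sides of that blank differ (the two rules above).
    """
    paths = []

    def step(i, path):
        if len(path) == T:
            # A full alignment must end on the last label or the final blank.
            if i >= len(w_prime) - 2:
                paths.append(path)
            return
        candidates = [i, i + 1]                     # stay, or advance by one
        if (i + 2 < len(w_prime)
                and w_prime[i + 1] == blank
                and w_prime[i] != w_prime[i + 2]):  # skippable blank
            candidates.append(i + 2)
        for j in candidates:
            if j < len(w_prime):
                step(j, path + [w_prime[j]])

    # An alignment may start on the leading blank or on the first label.
    for start in (0, 1):
        step(start, [w_prime[start]])
    return paths


w_prime = ["<b>", "s", "<b>", "e", "<b>", "e", "<b>"]
for z in valid_alignments(w_prime, T=5):
    print(z)  # e.g. ['<b>', 's', 'e', '<b>', 'e'], ['s', 's', 'e', '<b>', 'e'], ...
```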
Note
- $f:Z\rightarrow W$: many-to-one mapping
  - remove repeated tokens
  - remove blank tokens <b>
- $f^{-1}:W\rightarrow Z$: one-to-many mapping
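As a sanity check, the mapping $f$ is easy to implement; a minimal Python sketch (my own, not lecture code) that merges repeated tokens and then removes blanks:

```python
from itertools import groupby

def collapse(z, blank="<b>"):
    """f: Z -> W. Merge repeated tokens first, then drop blank tokens."""
    deduped = [token for token, _ in groupby(z)]
    return [token for token in deduped if token != blank]

# All three example alignments collapse back to the same label sequence.
for z in (["<b>", "s", "e", "<b>", "e"],
          ["s", "<b>", "e", "<b>", "e"],
          ["s", "s", "e", "<b>", "e"]):
    assert collapse(z) == ["s", "e", "e"]
```

Note that the order matters: repeats must be merged before blanks are dropped, otherwise {e, <b>, e} and {e, e} would collapse to the same output and the blank would lose its purpose.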
An example: Trellis of Z
We will consider all possible paths $Z$ and use the EM algorithm to find the path with the largest probability. But first, we can express the posterior $p(W|O)$ as a sum over alignments.
$$\begin{split}
p(W|O)&=\sum_{Z\in f^{-1}(W)}p(Z|O)\\
&=\sum_{Z\in f^{-1}(W)}\prod_{t=1}^{T}p(z_t|z_{1:t-1},O)\\
&=\sum_{Z\in f^{-1}(W)}\prod_{t=1}^{T}p(z_t|O)\\
\end{split}$$
- Rules applied:
  - Sum rule (marginalizing over all alignments $Z$)
  - Chain rule
  - Conditional Independence Assumption (C.I.A.)
Note:
- For each alignment step $z_t$, $O$ carries no subscript: we choose to condition on the entire observation sequence.
- $p(z_t|O)$ is modeled by a neural network (e.g., a bidirectional LSTM or self-attention encoder)
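To connect the formula to code, here is a brute-force sketch of the sum above (my own illustration; the posteriors are made-up numbers, and it reuses `valid_alignments` from the earlier sketch). Real systems use the forward algorithm over the trellis instead of enumerating paths, which is only feasible for tiny $T$:

```python
import numpy as np

vocab = ["<b>", "s", "e"]
# Made-up per-frame posteriors p(z_t|O), rows t = 1..5, columns follow vocab.
posteriors = np.array([
    [0.6, 0.3, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.1, 0.7],
    [0.7, 0.1, 0.2],
    [0.1, 0.1, 0.8],
])

def ctc_probability(w_prime, posteriors):
    """p(W|O) = sum over Z in f^{-1}(W) of prod_t p(z_t|O)."""
    total = 0.0
    for z in valid_alignments(w_prime, T=len(posteriors)):
        p = 1.0
        for t, token in enumerate(z):   # C.I.A.: each frame scored independently
            p *= posteriors[t, vocab.index(token)]
        total += p
    return total

print(ctc_probability(["<b>", "s", "<b>", "e", "<b>", "e", "<b>"], posteriors))
```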
RNN-Transducer
Task Description
A transducer, in general, transforms one modality into another; in this case, it transforms input features into output text. The RNN-Transducer is also an end-to-end solution.
Alignment
- Each frame in the temporal dimension is consumed by exactly one alignment step.
- The start token <s> consumes the first time step.
- $|Z| = T + J$: the length of an alignment $Z$ is the number of edges in the trellis (a small sketch follows this list).
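One way to see why $|Z| = T + J$ (my own illustration, not lecture code): view an alignment as a monotonic path in a grid with $T$ horizontal steps that each consume a frame and $J$ vertical steps that each emit a label, so every path has exactly $T + J$ edges:

```python
from math import comb

def rnnt_alignments(T, J):
    """Enumerate RNN-T alignments as move strings:
    'F' consumes one speech frame, 'L' emits one output label."""
    def walk(t, j, path):
        if t == T and j == J:
            yield path
            return
        if t < T:
            yield from walk(t + 1, j, path + "F")
        if j < J:
            yield from walk(t, j + 1, path + "L")
    return list(walk(0, 0, ""))

paths = rnnt_alignments(T=3, J=2)
assert all(len(p) == 3 + 2 for p in paths)  # every alignment has T + J edges
assert len(paths) == comb(3 + 2, 2)         # = choosing where the L's go
print(paths)  # ['FFFLL', 'FFLFL', ...]
```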
Similar to CTC, we can express the posterior $p(W|O)$:
$$\begin{split}
p(W|O)&=\sum_{Z\in f^{-1}(W)}p(Z|O)\\
&=\sum_{Z\in f^{-1}(W)}\prod_{k=1}^{T+J}p(z_k|z_{1:k-1},O)\\
&=\sum_{Z\in f^{-1}(W)}\prod_{k=1}^{T+J}p(z_k|f(z_{1:k-1}),O)\\
\end{split}$$
But in the third step, the RNN-Transducer does not use the C.I.A.: each step is still conditioned on the label history $f(z_{1:k-1})$; only the raw alignment history is compressed through $f$.
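To make the contrast concrete, here is a deliberately simplified brute-force sketch (all names are hypothetical, including the stand-in `uniform_joint`; end-of-sequence handling is simplified) in which each step's probability may depend on the label history $f(z_{1:k-1})$, unlike CTC's history-free $p(z_t|O)$:

```python
def rnnt_probability(W, step_probs, T):
    """Brute-force p(W|O) = sum over Z of prod_k p(z_k | f(z_{1:k-1}), O).

    step_probs(t, emitted) stands in for the joint network: given the
    number of frames consumed and the label history f(z_{1:k-1}) = emitted,
    it returns a dict of probabilities over {"<b>"} and the labels.
    """
    J = len(W)

    def walk(t, j, prob):
        if t == T and j == J:   # all T frames consumed, all J labels emitted
            return prob
        p = step_probs(t, W[:j])
        total = 0.0
        if t < T:               # blank step: consume one frame
            total += walk(t + 1, j, prob * p["<b>"])
        if j < J:               # label step: emit W[j], no frame consumed
            total += walk(t, j + 1, prob * p[W[j]])
        return total

    return walk(0, 0, 1.0)

def uniform_joint(t, emitted):  # hypothetical stand-in for the joint network
    return {"<b>": 0.5, "s": 0.25, "e": 0.25}

print(rnnt_probability(["s", "e", "e"], uniform_joint, T=4))
```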
Comparison with CTC
Two differences:
- The length of the alignment $Z$ differs: $|Z_{CTC}| = T$, while $|Z_{RNN\text{-}T}| = T + J$.
- In CTC, each transition/arrow must consume one speech frame; in RNN-T, label-emitting transitions consume no frame.