11751 Week 5 Digest

Previous Homework Question

There was a question in hw3 asking which one of the four options is not an end-to-end model. I remember searching online for a while, because concepts such as CTC and RNN-Transducer were rather new to me. Fortunately, this week's lectures helped answer that question.

Previous Homework Problem

Connectionist Temporal Classification (CTC)

I previously mentioned some resources for learning the big picture of CTC here. Now we will look into some important details.

Task Description

The CTC algorithm allows an RNN to learn sequence data directly, without requiring the correspondence (alignment) between the input and output sequences to be labeled in the training data in advance. CTC also achieves strong results in sequence-learning tasks.

CTC End-to-End Pipeline
We can tell from the pipeline that the CTC model maps acoustic feature frames to phonemes in an end-to-end fashion.

Alignment

Take the word see as an example.

  • First, we form the character sequence $W$ for see: $W$ = {s, e, e}.
  • Next, we insert the blank token <b> between characters and wrap the sequence with it: $W’$ = {<b>, s, <b>, e, <b>, e, <b>}.
  • Next, we align with the feature frames! The frame sequence has length $T$; since $T$ can vary, we may have to shrink or expand the $W’$ sequence. Either way, two rules always hold:
    • Any character can be repeated in the process of alignment.
    • Skipping rules for <b>:
      • {…, s, <b>, e, …}: we can skip the blank, because the neighboring characters differ.
      • {…, e, <b>, e, …}: we cannot skip the blank; otherwise the two e’s would collapse into one.
  • Finally, the aligned sequence $Z$ has length $T$ (in these examples $T = 5$) and may look like one of the following:
    • Z = {<b>, s, e, <b>, e}
    • Z = {s, <b>, e, <b>, e}
    • Z = {s, s, e, <b>, e}

      Whichever possible alignment $Z$ we pick, it clearly follows the two rules above.

Note

  • $f:Z\rightarrow W$: a many-to-one mapping (sketched in code below)
    • first merge repeated tokens
    • then remove the <b> blank tokens
  • $f^{-1}:W\rightarrow Z$: a one-to-many mapping (the set of all alignments of $W$)
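
To make the mapping concrete, here is a minimal Python sketch of $f$ (the function name `collapse` and the `<b>` string are my own choices, not from the lecture):

```python
BLANK = "<b>"

def collapse(z):
    """f: Z -> W. First merge repeated tokens, then remove blanks."""
    merged = [tok for i, tok in enumerate(z) if i == 0 or tok != z[i - 1]]
    return [tok for tok in merged if tok != BLANK]

# The three example alignments above all collapse back to W = {s, e, e}.
for z in ([BLANK, "s", "e", BLANK, "e"],
          ["s", BLANK, "e", BLANK, "e"],
          ["s", "s", "e", BLANK, "e"]):
    assert collapse(z) == ["s", "e", "e"]
```

Note that the order matters: merging repeats before removing blanks is what lets <b> keep the two e’s of see apart.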

An example: Trellis of Z

Trellis Example
We will consider all possible paths $Z$, and will use the EM algorithm to find the path with the largest probability. But first, we can write $P(W|O)$ by marginalizing over the alignments:
$$\begin{split}
p(W|O)&=\sum_{Z\in f^{-1}(W)}p(Z|O)\\
&=\sum_{Z\in f^{-1}(W)}\prod_{t=1}^{T}p(z_t|z_{1:t-1},O)\\
&=\sum_{Z\in f^{-1}(W)}\prod_{t=1}^{T}p(z_t|O)\\
\end{split}$$

Note:

  • For each alignment token $z_t$, $O$ carries no subscript: we choose to condition on the entire input sequence.
  • $p(z_t|O)$ is produced by a neural network (a bidirectional LSTM or self-attention encoder); see the brute-force sketch below.
  • The last step above drops $z_{1:t-1}$ from the conditioning; this is the conditional independence assumption (C.I.A.), which we will revisit with the RNN-Transducer.
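
As a sanity check on the formula, here is a brute-force sketch of the marginalization, reusing `collapse` and `BLANK` from above. A real CTC implementation uses the forward-backward dynamic program over the trellis; this exponential enumeration is only for tiny examples. The per-frame posteriors `probs[t]` stand in for $p(z_t|O)$ and are assumed to come from the encoder network.

```python
from itertools import product

def ctc_prob(W, probs, vocab):
    """p(W|O) = sum over Z in f^{-1}(W) of prod_t p(z_t|O).

    probs: list of T dicts, probs[t][token] = p(z_t = token | O).
    Enumerates every length-T sequence Z over vocab + blank and
    keeps those that collapse to W. Exponential in T; toy use only.
    """
    T = len(probs)
    total = 0.0
    for Z in product(vocab + [BLANK], repeat=T):
        if collapse(list(Z)) == list(W):
            p = 1.0
            for t, z_t in enumerate(Z):
                p *= probs[t][z_t]
            total += p
    return total

# Tiny check: T = 2, W = {s}; the valid alignments are
# {s, s}, {s, <b>}, {<b>, s}.
uniform = [{"s": 1/3, "e": 1/3, BLANK: 1/3}] * 2
print(ctc_prob(["s"], uniform, ["s", "e"]))  # 3 paths * 1/9 = 1/3
```

Each valid alignment contributes the product of its per-frame posteriors, exactly matching the equation above.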

RNN-Transducer

Task Description

A transducer, in general, transforms one modality into another; in this case, it transforms input acoustic features into output text. The RNN-Transducer is also an end-to-end solution.

RNN-Transducer Pipeline

Alignment

RNN-Transducer Alignment

  • Each frame in the temporal dimension is consumed by exactly one alignment step.
  • The start token <s> consumes the first time step.
  • $|Z| = T + J$: the length of the alignment $Z$ equals the number of edges, i.e. $T$ frame steps plus the $J$ output tokens of $W$ (see the enumeration sketch below).
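
A toy enumeration can verify the length claim. This sketch (reusing `BLANK` from above) assumes, as a simplification, that any interleaving of the $T$ frame-consuming blank steps with the $J$ token emissions is a valid monotonic path; the function name is mine:

```python
from itertools import combinations

def rnnt_alignments(T, W):
    """All monotonic RNN-T alignments for T frames and output W:
    interleave T blank (frame-consuming) steps with the J = len(W)
    token emissions, so every alignment has length T + J."""
    J = len(W)
    paths = []
    for blank_slots in combinations(range(T + J), T):
        tokens = iter(W)
        paths.append([BLANK if i in blank_slots else next(tokens)
                      for i in range(T + J)])
    return paths

# Tiny example: T = 3 frames and W = {s, e} give C(5, 3) = 10 paths,
# and every one of them has length T + J = 5.
paths = rnnt_alignments(3, ["s", "e"])
assert len(paths) == 10
assert all(len(z) == 5 for z in paths)
```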

Similar to CTC, we can decompose the random variable $P(W|O)$:
$$\begin{split}
p(W|O)&=\sum_{Z\in f^{-1}(W)}p(Z|O)\\
&=\sum_{Z\in f^{-1}(W)}\prod_{k=1}^{T+J}p(z_k|z_{1:k-1},O)\\
&=\sum_{Z\in f^{-1}(W)}\prod_{k=1}^{T+J}p(z_k|f(z_{1:k-1}),O)\\
\end{split}$$

But in the third step, the RNN-Transducer does not use the conditional independence assumption (C.I.A.): instead of dropping the history, it conditions each prediction on $f(z_{1:k-1})$, the sequence of previously emitted non-blank tokens.
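
A shape-level PyTorch sketch shows where that history enters. The encoder / prediction-network / joiner decomposition is standard RNN-T terminology, but the sizes and module choices here are illustrative assumptions, not the lecture's exact model:

```python
import torch
import torch.nn as nn

class TinyRNNT(nn.Module):
    def __init__(self, feat_dim, vocab_size, hidden=128):
        super().__init__()
        # Encoder: summarizes the acoustic sequence O.
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Prediction network: consumes f(z_{1:k-1}), the previously
        # emitted non-blank tokens -- this dependence is exactly what
        # the C.I.A. in CTC throws away.
        self.embed = nn.Embedding(vocab_size + 1, hidden)  # +1 for blank
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)
        # Joiner: combines both states into p(z_k | f(z_{1:k-1}), O).
        self.joiner = nn.Linear(hidden, vocab_size + 1)

    def forward(self, feats, prev_tokens):
        enc, _ = self.encoder(feats)                       # (B, T, H)
        pred, _ = self.predictor(self.embed(prev_tokens))  # (B, J, H)
        # Broadcast-add over the (T, J) lattice, then project to logits.
        joint = enc.unsqueeze(2) + pred.unsqueeze(1)       # (B, T, J, H)
        return self.joiner(torch.tanh(joint)).log_softmax(dim=-1)
```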

Comparison with CTC

CTC vs. RNN-Transducer
Two differences:

  • The length of the alignment $Z$ differs: $|Z_{CTC}| = T$, while $|Z_{RNN\text{-}T}| = T + J$.
  • In CTC, each transition/arrow must consume one speech frame; in the RNN-Transducer, a transition may emit an output token without consuming a frame.

Author: Ziang Zhou
Posted on: 2022-09-30
Updated on: 2022-10-02
