Input: An audio recording, which is then processed into a spectrogram
Output: The text decoded from the model's predictions
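As a minimal sketch of this input step (the waveform, sample rate, and STFT parameters below are assumptions for illustration, not from the source), a spectrogram can be computed with a short-time Fourier transform, e.g. via `scipy.signal.stft`:

```python
import numpy as np
from scipy.signal import stft

# Hypothetical: 1 second of 16 kHz audio; in practice this would be
# loaded from a recording (e.g. a WAV file).
sample_rate = 16_000
waveform = np.random.randn(sample_rate).astype(np.float32)

# Short-time Fourier transform -> complex spectrum per frame.
# nperseg=400 (25 ms window) and noverlap=240 (10 ms hop) are common
# speech settings, assumed here for illustration.
freqs, times, Z = stft(waveform, fs=sample_rate, nperseg=400, noverlap=240)

# Log-magnitude spectrogram: shape (n_freq_bins, n_frames).
spectrogram = np.log(np.abs(Z) + 1e-8)
print(spectrogram.shape)  # (201, n_frames)
```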
⇒ Idea of Connectionist Temporal Classification (CTC-Loss)
Properties that must be satisfied:
Allow repeated output: When the model is unsure at which moment it should emit a token, it is allowed to predict the same token at multiple consecutive time steps.
Merge output: Merge these repeated outputs back into a single token (see the sketch right after this list).
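A minimal Python sketch of these two rules (the function name is illustrative; the blank symbol "_", which CTC uses to separate genuine repeats, appears later in these notes): first merge consecutive repeats, then remove the blanks.

```python
def ctc_collapse(path, blank="_"):
    """Decode a CTC path: merge consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for token in path:
        if token != prev:          # merge repeated outputs
            out.append(token)
        prev = token
    return "".join(t for t in out if t != blank)  # drop the blank token

# All of these paths collapse to the same text "g":
assert ctc_collapse("ggg") == "g"
assert ctc_collapse("_gg") == "g"
assert ctc_collapse("g__") == "g"
# A blank between repeats preserves a genuinely doubled character:
assert ctc_collapse("g_g") == "gg"
```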
Example: input a piece of audio and predict the text 'g'. Suppose the model decodes 3 time steps; each step gives us the probability of every token, and we select the token with the highest probability.
But there are many combinations (a.k.a. paths to the desired target) that generate the same result, so we need to guide the model to produce any one of them; we can then decode the corresponding text.
⇒ Enumerate all the combinations and compute the loss over all of them. (Idea of CTC-Loss)
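A brute-force sketch of this idea for the 'g' example above (the probability table is made up for illustration; the collapse rule is the one sketched earlier): enumerate every length-3 path, keep the ones that decode to the target, and sum their probabilities.

```python
import itertools
import math

def collapse(path, blank="_"):
    """Merge consecutive repeats, then drop blanks (the rules above)."""
    merged = [t for i, t in enumerate(path) if i == 0 or t != path[i - 1]]
    return "".join(t for t in merged if t != blank)

# Assumed per-timestep probabilities over {blank, g} for the 3
# decoding steps; the numbers are illustrative only.
probs = [
    {"_": 0.4, "g": 0.6},  # T1
    {"_": 0.5, "g": 0.5},  # T2
    {"_": 0.7, "g": 0.3},  # T3
]

target = "g"
total = 0.0
for path in itertools.product("_g", repeat=3):
    if collapse(path) == target:       # keep only paths decoding to "g"
        p = 1.0
        for t, token in enumerate(path):
            p *= probs[t][token]
        print("".join(path), f"p = {p:.3f}")
        total += p

# CTC loss = negative log of the summed probability of all valid paths.
print(f"P('g') = {total:.3f}, loss = {-math.log(total):.3f}")
```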
Disadvantage: the number of valid paths grows combinatorially with the input length, so enumerating them all quickly becomes intractable.
⇒ To increase efficiency, we need to use Dynamic Programming. (quite similar to the Viterbi algorithm)
Forward-backward Algorithm:
At the first time step $T_1$, we assign each element its probability from the left. Remember that we have two blank "_" tokens here (one for the start and one for the end); we use a circle and a triangle to distinguish them. The triangle "_" probability at $T_1$ is 0, because at time step $T_1$ (the start point) we cannot already be at the ending "_".
Then we calculate $T_2$ based on the result of $T_1$:
($T_2$, "_" circle) can only come from ($T_1$, "_" circle).
($T_2$, g) can come from ($T_1$, "_" circle) or from ($T_1$, g).
($T_2$, "_" triangle) comes from ($T_1$, g) (in general it could also come from ($T_1$, "_" triangle), but that probability is 0 at $T_1$).
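Putting this recursion together, here is a minimal sketch of the forward pass for the 'g' example, over the extended sequence ["_" circle, g, "_" triangle]. The probability numbers reuse the illustrative table above; in full CTC training a backward pass is also run to get per-timestep gradients.

```python
# Extended label sequence for target "g": leading blank (circle),
# "g", trailing blank (triangle).
ext = ["_", "g", "_"]

# Illustrative per-timestep probabilities over {_, g} (assumed numbers).
probs = [
    {"_": 0.4, "g": 0.6},  # T1
    {"_": 0.5, "g": 0.5},  # T2
    {"_": 0.7, "g": 0.3},  # T3
]
T, S = len(probs), len(ext)

# alpha[t][s] = total probability of all paths that sit at state s
# of the extended sequence after time step t+1.
alpha = [[0.0] * S for _ in range(T)]

# Initialization at T1: only the leading blank or the first label;
# the trailing (triangle) blank gets 0, as noted above.
alpha[0][0] = probs[0][ext[0]]
alpha[0][1] = probs[0][ext[1]]

for t in range(1, T):
    for s in range(S):
        total = alpha[t - 1][s]           # stay on the same state
        if s > 0:
            total += alpha[t - 1][s - 1]  # advance from the previous state
        # For longer targets, paths may also skip a blank between two
        # different labels (not possible here, since ext[0] == ext[2]).
        if s > 1 and ext[s] != ext[s - 2]:
            total += alpha[t - 1][s - 2]
        alpha[t][s] = total * probs[t][ext[s]]

# P(target) = paths ending in the last label or the trailing blank.
p_target = alpha[T - 1][S - 1] + alpha[T - 1][S - 2]
print(p_target)  # 0.77, matching the brute-force enumeration above
```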
The input length must be no shorter than the output length, and the longer the input sequence, the harder it is to train.
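In practice this loss is available off the shelf. As a hedged usage sketch, PyTorch's `torch.nn.CTCLoss` implements the same forward-backward computation and takes the input/target lengths that encode exactly this constraint (the shapes, vocabulary size, and target index below are assumptions for illustration):

```python
import torch
import torch.nn as nn

T, N, C = 50, 1, 28  # input timesteps, batch size, vocab size (incl. blank)

# Stand-in for model outputs: CTCLoss expects log-probabilities of
# shape (T, N, C).
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)

targets = torch.tensor([[7]])        # e.g. the single character "g"
input_lengths = torch.tensor([T])    # must be >= target_lengths
target_lengths = torch.tensor([1])

ctc = nn.CTCLoss(blank=0)            # index 0 reserved for the blank
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(loss.item())
```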