<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Decoding</title>
<style type="text/css">
pre { font-size: medium; background: #f0f8ff; padding: 2mm; border-style: ridge ; color: teal}
code {font-size: medium; color: teal}
</style>
</head>
<body>
<a name="top">INDEX</a>
<p>(This is under construction.)
<!======================================================================>
<ol>
<li><a href="#0"><font color="red">Miscellaneous</font></a>
<ul>
<li><a href="#00">Generating lattices and N-best lists, and some facts about them</a>
<li><a href="#01">Explanation of various fields in an ARPA format LM</a>
<li><a href="#02">Generating matchseg files</a>
<li><a href="#03">Explanation of some SPHINX-II decoder flags</a>
</ul>
</ol>
<!======================================================================>
<a name="0"></a>
<a name="00"></a>
<!------------------------------------------------------------------------->
<center><h4><font color="red">MISCELLANEOUS</font></h4></center>
<TABLE width="100%" bgcolor="#ffffff">
<td>GENERATING LATTICES AND N-BEST LISTS, AND SOME FACTS</td>
</table>
<!------------------------------------------------------------------------->
<p>
<b><u>Generating lattices:</u></b>
<p>
Lattices can be generated by including the flag
<pre>
-outlatdir [directory in which you want to write lattices]
</pre>
in the arguments to the s3decode binary. For each
utterance, the decoder will then write a lattice in the directory
you have specified. The lattice will be named utteranceid.lat.gz,
and the contents of the file can be viewed with the command
"zcat filename" on a Unix command line.
<p>
If the utterance names in the ctl file include directory names too, then
you can choose whether those directories appear in the lattice
filenames: append the string ,CTL to the argument you give to -outlatdir
to keep them, or leave it off to drop them. The string is appended directly
after the argument, without a space. Thus if the argument you give is
<pre>
-outlatdir current
</pre>
the extended argument would be
<pre>
-outlatdir current,CTL
</pre>
<p>
<b><u>Generating N-best lists from lattices:</u></b>
<p>
N-best lists can be generated from the lattices by using
the binary s3astar. It works much like the decoder: it takes
the same control file as the decoder, and its inlatdir is the
same as the outlatdir that the decoder used. You additionally
need to provide an nbestdir where the N-best files are
written. The number of hypotheses in any N-best list can be
specified with the -nbest argument (the default value is 200, but
note that asking for 200 hypotheses does not guarantee that
you will get 200; if the lattice holds fewer than 200
possible hypotheses, you will get fewer). The N-best files
look like matchseg outputs.
<p>
<p>
<b><u>Example of a lattice and explanation of format:</u></b>
<p>
The lattice has three
distinct sections. In the first, all the nodes in the graph are listed
with their associated words and begin and end times. In the second section,
the acoustic scores associated with each of the nodes are listed. In the
final section, the scores associated with the edges between words are
listed. The lattice also has additional lines of information giving the
total number of nodes in the graph, the ids of the first and last nodes,
and text describing the format of the lines in the lattice. In addition,
the lattice may contain lines that begin with a "#"; these are comments.
<p>
Here are examples from each of the components of the lattice. Explanations
are interspersed.
<pre>
.....
Nodes 1949 (NODEID WORD STARTFRAME FIRST-ENDFRAME LAST-ENDFRAME)
0 ++GARBAGE++ 254 256 256
1 ++LAUGH++ 254 256 256
2 ++N++ 254 256 256
3 ++GARBAGE++ 253 255 255
4 ++LAUGH++ 253 255 255
5 ++N++ 253 255 255
6 ++GARBAGE++ 252 254 254
7 ++LAUGH++ 252 254 254
8 ++N++ 252 254 254
9 ++GARBAGE++ 251 253 253
10 ++N++ 251 253 253
11 A 251 253 253
12 ++GARBAGE++ 250 252 252
13 ++N++ 250 252 252
14 A 250 252 252
15 ++N++ 249 251 251
16 HAVE 245 250 253
17 ARE(2) 245 250 250
18 HAVE 244 249 249
19 GO 244 249 251
.....
</pre>
Node no. 16 is the word HAVE and begins on the 245th frame and can end
anywhere between the 250th and 253rd frames.
<pre>
.....
#
Initial 1948
Final 82
.....
</pre>
Nodes are written out in *reverse* order in the lattice. As a
result, the node that is written out last is actually the *first*
node in the lattice. Nodes are also not written in strictly reverse
sequential order since, due to the "stretch" in the ending frames
of different nodes, it is difficult to determine a precise sequence
for all but the first node. In this lattice, the
first node was node number 1948 (the one written out last), but
the last node was actually node 82.
<pre>
.....
#
BestSegAscr 13865 (NODEID ENDFRAME ASCORE)
1948 2 -172014
1948 3 -207858
1948 4 -220188
1947 5 -351673
.....
</pre>
A node can end at many different frames, and the acoustic score
associated with the node differs depending on the frame at which it
ends. This portion of the lattice shows that information. In this example,
when node number 1948 ends at frame 2 it has an acoustic score of -172014;
when it ends at frame 3, the acoustic score is -207858; and so on. Note, however,
that this acoustic score is only the best score and is not really useful,
since the true score for the node depends on the path being considered,
due to the existence of cross-word triphones.
<pre>
.....
#
Edges (FROM-NODEID TO-NODEID ASCORE)
33 23 -243293
33 20 -297751
35 23 -1599007
37 23 -1923161
.....
</pre>
The true acoustic score for any word is dependent on the word following
it in the path. We therefore associate this score with the *edge* leading
from that word to the following word. There can be many edges leading
out of a node even at a given frame. Each of these edges is likely to
have a different score than the other edges.
In the above portion of the lattice we are given the information that
the edge from node 33 to node 23 has the score -243293, the edge from
node 33 to node 20 has score -297751 and so on. Keep in mind that there
can be only one edge between any two nodes, even though a node can
end at many different frames. This is because only one of these possible
ending frames will permit a proper edge to the unique starting frame
of the next word.
<pre>
.....
1948 1440 -2083713
1948 1399 -220188
End
.....
</pre>
The lattice ends here.
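<p>
For concreteness, here is a minimal Python sketch that reads a gzipped lattice
file and collects the three sections described above. It relies only on the
fields documented here; the function name and the data layout it returns are
illustrative assumptions, not part of sphinx3.
<pre>
import gzip

def read_lattice(path):
    # A sketch: collect the Nodes, BestSegAscr and Edges sections of a lattice (DAG) file.
    nodes = {}          # nodeid -> (word, startframe, first_endframe, last_endframe)
    best_ascr = []      # (nodeid, endframe, ascore)
    edges = []          # (from_nodeid, to_nodeid, ascore)
    initial = final = None
    section = None
    with gzip.open(path, "rt") as f:
        for line in f:
            parts = line.split()
            if not parts or parts[0].startswith("#"):
                continue                        # skip blank lines and "#" comments
            if parts[0] == "Nodes":
                section = "nodes"
            elif parts[0] == "Initial":
                initial = int(parts[1])         # id of the first node
            elif parts[0] == "Final":
                final = int(parts[1])           # id of the last node
            elif parts[0] == "BestSegAscr":
                section = "ascr"
            elif parts[0] == "Edges":
                section = "edges"
            elif parts[0] == "End":
                break                           # "End" closes the lattice
            elif section == "nodes":
                nodes[int(parts[0])] = (parts[1],) + tuple(int(x) for x in parts[2:5])
            elif section == "ascr":
                best_ascr.append(tuple(int(x) for x in parts))
            elif section == "edges":
                edges.append(tuple(int(x) for x in parts))
    return nodes, initial, final, best_ascr, edges
</pre>
Any header lines other than those listed above are simply ignored by this sketch.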
<p>
Note also that a lattice is actually a *tree*, and so the left
context of any node is fixed. So, the variations in acoustic scores
of words are only due to the right contexts, since any node in
a tree can have only one predecessor. However, what sphinx3 writes
out is not a lattice but actually a DAG, or directed acyclic
graph. What is done here is that nodes representing the same word
in the lattice are merged if they have identical time stamps.
What you see in the "lattice" file is actually this DAG and not a
tree-structured lattice at all.
<p>
<u>An important consideration in combining lattices from different
sources:</u><br>
if you had two parallel paths of this kind:
<p>
<pre>
......> WORD1 ------> WORD2
......> WORD1 ------> WORD2
</pre>
(say, WORD1 = "and" and WORD2 = "the")
<p>
You CAN merge them to
<pre>
.....> WORD1 ------> WORD2
</pre>
<p>
*If* you are using CI models! Then the two parallel edges (the dashed
edges) would have had close to identical scores, so you could just take
the highest score.
But if you are using CD models, here is what will happen: the edges from
path1 and path2 will have *different* scores in the lattice, *even* if
both WORD1 and WORD2 begin and end at exactly the same time instants in
both cases.
This is because the word preceding WORD1 in the two cases would have been different,
so the cross-word triphone score of the first phone in the word would have
been different. For example:
<pre>
OK......> WORD1(and) ------> WORD2(the)
BIT......> WORD1(and) ------> WORD2(the)
</pre>
The word preceding "and" in the first path is "ok", so the cross-word triphone
at the beginning of "and" in the first path is A(EY,N).
The preceding word in path 2 is "bit", so the cross-word triphone at the
beginning of "and" is A(T,N). The score of the edge between
"and" and "the" would reflect this in the two paths and be different.
All this happens even when you are only working with a *single*
lattice (e.g. the MFC lattice). Any heuristic, like using the highest
score for the merged path (nodes/edges), is likely to backfire for this
reason (but would have to be experimentally tested).
If path1 is (say) from an MFC-based lattice and path2 is (say) from a
PLP-based one, this problem is compounded by the additional problem of
how to come up with correct scaling factors for the scores.
<p>
<a name="0"></a>
<a name="01"></a>
<!------------------------------------------------------------------------->
<center><h4><font color="red">MISCELLANEOUS</font></h4></center>
<TABLE width="100%" bgcolor="#ffffff">
<td>EXPLANATION OF VARIOUS FIELDS IN AN ARPA FORMAT LM</td>
</table>
<!------------------------------------------------------------------------->
At the top of the LM are the lines:
<pre>
\data\
ngram 1=NUM1
ngram 2=NUM2
ngram 3=NUM3
</pre>
This means that there are NUM1 unigrams, NUM2 bigrams
and NUM3 trigrams in the LM.
Then you have a line
<pre>
\1-grams:
</pre>
This means that all following lines are unigrams until you encounter
a line "\2-grams:" or a "\end\" marker.
The \end\ marker marks the end of the ARPA LM.
All unigrams have the form
<pre>
NUMa WORD NUMb
</pre>
NUMa is the log probability of the unigram for the word WORD.
NUMb is the back-off weight associated with that word.
For bigrams, entries may be
<pre>
NUMa WORD1 WORD2 NUMb
</pre>
or
<pre>
NUMa WORD1 WORD2
</pre>
The first form of entry is used when the LM also has trigrams.
If it is only a bigram LM, the entries will be of the second
form.
Here, NUMa is the log prob of the bigram P(WORD2 | WORD1) and NUMb is
the back-off weight for the word pair (WORD1 WORD2).
The general N-gram entry is of the form
<pre>
NUMa WORD1 WORD2 ... WORDN NUMb
</pre>
or, if N is the highest order in the LM (so that there are no (N+1)-grams to back off from and no back-off weight is stored),
<pre>
NUMa WORD1 WORD2 ... WORDN
</pre>
All logarithms are base 10.
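<p>
As an illustration of these fields, here is a minimal Python sketch that reads
an ARPA-format LM into per-order tables of log10 probabilities and back-off
weights. It is based only on the layout described above; the function name and
the returned data structure are illustrative assumptions.
<pre>
def read_arpa(path):
    # A sketch: ngrams[N] maps an N-word tuple to (log10 probability, back-off weight or None).
    ngrams, order = {}, 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line == "\\data\\" or line.startswith("ngram "):
                continue                            # header and "ngram N=count" lines
            if line == "\\end\\":
                break                               # end of the ARPA LM
            if line.startswith("\\") and line.endswith("-grams:"):
                order = int(line[1:line.index("-")])
                ngrams[order] = {}                  # start of the N-gram section
                continue
            parts = line.split()
            logprob = float(parts[0])               # NUMa: log (base 10) probability
            words = tuple(parts[1:1 + order])       # WORD1 ... WORDN
            backoff = float(parts[1 + order]) if len(parts) > 1 + order else None
            ngrams[order][words] = (logprob, backoff)   # NUMb: back-off weight, when present
    return ngrams
</pre>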
<p>
To prune the LM you can delete all N-gram entries where the difference
between the probability entry for that N-gram and the predicted probability
for the N-gram obtained by backing off is very small. The predicted
probability is, of course,
for trigrams:
P(C|A,B) ~ P(C|B) * backoffwt(A,B)
and for bigrams:
P(C|B) ~ P(C) * backoffwt(B).
Pruning is most easily done only on the highest-order N-grams, since deleting
a lower-order N-gram also deletes the back-off weight stored with that N-gram,
which affects our prediction for the higher-order N-grams. For example,
if we pruned P(B|A) out of the LM, then backoffwt(A,B) would
also get pruned out. This affects the estimate of P(C|A,B), and the
pruning heuristic would have to take this into account.
<p>
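To make the pruning criterion concrete, here is a hedged Python sketch that
builds on the read_arpa() example above. Since the entries are log (base 10)
values, the products in the formulas above become sums of logs. The threshold
value and the function name are illustrative assumptions, not recommendations
from this document.
<pre>
def prune_trigrams(ngrams, threshold=0.05):
    # A sketch: drop trigram entries whose explicit log prob the back-off path predicts almost exactly.
    unigrams, bigrams, trigrams = ngrams[1], ngrams[2], ngrams[3]
    kept = {}
    for (a, b, c), (logp, _) in trigrams.items():
        bo_ab = (bigrams.get((a, b), (0.0, None))[1]) or 0.0    # log backoffwt(A,B), 0 if absent
        if (b, c) in bigrams:
            predicted = bo_ab + bigrams[(b, c)][0]              # log P(C|B) + log backoffwt(A,B)
        else:
            bo_b = (unigrams.get((b,), (0.0, None))[1]) or 0.0
            predicted = bo_ab + bo_b + unigrams[(c,)][0]        # back off once more to the unigram P(C)
        if abs(logp - predicted) >= threshold:
            kept[(a, b, c)] = (logp, None)          # highest-order entries carry no back-off weight
    return kept
</pre>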
<a name="0"></a>
<a name="02"></a>
<!------------------------------------------------------------------------->
<center><h4><font color="red">MISCELLANEOUS</font></h4></center>
<TABLE width="100%" bgcolor="#ffffff">
<td>GENERATING MATCHSEG FILES, FORMAT OF MATCHSEG OUTPUT</td>
</table>
<!------------------------------------------------------------------------->
Matchseg files can be generated by using the flag -matchsegfn [filename]
in the argument of the decoder binary.
<pre>
SB01
S 36683016
T -27154407
A -21711975
L -858796
0 -603268 0 <s>
17 -814154 -49595 SHOW
40 -594450 -11806 THE(3)
51 -463637 -23023 <sil>
65 -2880525 -88849 ITS(2)
133 -1203392 -72333 DATES
171 -806587 -5753 FOR
185 -898344 -26773 ALL
210 -2459017 -71603 DEPLOYED
260 -1765774 -94176 CEP
302 -1791387 -76622 EVERETT(2)
338 -843218 -62384 THEIR
355 -848925 -17748 HOME
376 -1482528 -2325 PORT
411 -1058809 -79875 SPEED
435 -786647 -37608 BY
453 -1771960 -32704 FIVE
482 -363323 -23023 <sil>
489 -276030 -82596 </s>
514
</pre>
This is a hypothesis in the "matchseg format". This output usually comes on a single line, but it was rearranged above to make it easier to read. The first word (or "field")
is the filename. S is a scaling factor (to prevent integers from wrapping
around due to underflow; it can be thought of as a normalization factor
for likelihoods), T is the total likelihood of the utterance, A is the
acoustic likelihood, and L is the LM likelihood (these are all log likelihoods,
hence the large numbers). From then onwards the format is beginning_frame_number
acoustic_score lm_score WORD, beginning_frame_number acoustic_score lm_score
WORD, and so on. At the end, the LAST frame number of the
utterance is written (514 in this case).
<br>
The hypothesis itself is the string of all words between <s> and </s>.
(The hypothesis combination program needs two or more
such matchseg files [in the same order] and outputs a matchseg file
which is the best path hypothesis in the graph constructed from
the input matchseg files.)
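<p>
As a sketch of how this format might be picked apart, the following Python
fragment splits one matchseg line into the header scores and the per-word
segments. The function name is an illustrative assumption, not part of any
SPHINX tool; it only follows the field layout described above.
<pre>
def parse_matchseg(line):
    # A sketch: split one matchseg hypothesis line into its parts.
    toks = line.split()
    uttid, scores, i = toks[0], {}, 1
    while toks[i] in ("S", "T", "A", "L"):
        scores[toks[i]] = int(toks[i + 1])      # scaling, total, acoustic and LM scores
        i += 2
    segments = []
    while i + 3 < len(toks):                    # frame  acoustic_score  lm_score  WORD
        segments.append((int(toks[i]), int(toks[i + 1]), int(toks[i + 2]), toks[i + 3]))
        i += 4
    last_frame = int(toks[i])                   # the LAST frame number of the utterance
    return uttid, scores, segments, last_frame
</pre>
The hypothesis string itself is then the sequence of words in the returned segments.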
<p>
<a name="0"></a>
<a name="03"></a>
<!------------------------------------------------------------------------->
<center><h4><font color="red">MISCELLANEOUS</font></h4></center>
<TABLE width="100%" bgcolor="#ffffff">
<td>EXPLANATION OF SOME SPHINX-II DECODER FLAGS</td>
</table>
<!------------------------------------------------------------------------->
<p>
<b> -compress </b>: <i>compress excess background frames</i><br>
<b> -compressprior </b>: <i>compress excess background frames based on prior utt</i>
<br>Typical silence compression code is as follows:
<pre>
if (silcomp == COMPRESS_PRIOR) {
j = 0;
for (i = 0; i < nfr; i++) {
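      /* histo_add_c0() (not shown here) uses a running histogram of C0 values to
         decide whether this frame is real speech rather than background; only
         such frames are copied down and kept */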
if (histo_add_c0 (mfc[i][0])) {
if (i != j)
memcpy (mfc[j], mfc[i], sizeof(float)*CEP_SIZE);
comp2rawfr[j++] = i;
}
/* Else skip the frame, don't copy across */
}
nfr = j;
}
</pre>
<p>The "silence" frames are actually deleted.</p>
Note: this is not good when you are using models trained in the
standard manner with SPHINX-III.
Deleting silence frames completely during decoding (regardless of whether
they are put back in the seg file later) is bad. We train cross-word
triphones with silence as context explicitly. There are usually hundreds of
such triphones in the model set. If there are no silence frames at all in
the sequence of frames being decoded, the cross-word triphones with silence
never get a chance to be used. Note that in most model sets, the silence
and breath models are usually the best-trained models.
<p>
<hr>
<em> last modified: 22 Nov. 2000 </em>
</body>
</html>
|