<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Decoding</title>
<style type="text/css">
pre { font-size: medium; background: #f0f8ff; padding: 2mm; border-style: ridge ; color: teal}
code {font-size: medium; color: teal}
</style>
</head>
<body>
<a name="top">INDEX</a>
<p>(This is under construction.)
<!======================================================================>
<ol>
<li><a href="#0"><font color="red">Miscellaneous</font></a>
<ul>
<li><a href="#00">Generating lattices and N-best lists, and some facts about them</a>
<li><a href="#01">Explanation of various fields in an ARPA format LM</a>
<li><a href="#02">Generating matchseg files</a>
<li><a href="#03">Explanation of some SPHINX-II decoder flags</a>
</ul>
</ol>
<!======================================================================>
<a name="0"></a>
<a name="00"></a>
<!------------------------------------------------------------------------->
<center><h4><font color="red">MISCELLANEOUS</font></h4></center>
<TABLE width="100%" bgcolor="#ffffff">
<td>GENERATING LATTICES AND N-BEST LISTS, AND SOME FACTS</td>
</table>
<!------------------------------------------------------------------------->
<p>
<b><u>Generating lattices:</u></b>
<p>
Lattices can be generated by including the flag
<pre>
-outlatdir [directory in which you want to write lattices]
</pre>
in the arguments to the s3decode binary. For each
utterance, the decoder will then write a lattice in the directory
you have specified. The lattice will be named utteranceid.lat.gz,
and the contents of the file can be viewed with the command
"zcat filename" on a Unix command line.
<p>
If the utterance names in the ctl file include directory names too, then
you can choose whether those directories appear in the lattice
filenames: append the string ,CTL to the argument you give to -outlatdir
to keep them, or leave it off to drop them. The string is appended directly
after the argument, without a space. Thus if the argument you give is
<pre>
-outlatdir current
</pre>
the extended argument would be
<pre>
-outlatdir current,CTL
</pre>
<p>
<b><u>Generating N-best lists from lattices:</u></b>
<p>
N-best lists can be generated from the lattices by using
the binary s3astar. It works much like the decoder: it takes
the same control file as the decoder, and its inlatdir is the
same as the outlatdir that the decoder used. You additionally
need to provide an nbestdir where the N-best files are
written. The number of hypotheses in any N-best list can be
specified with the -nbest argument (the default value is 200, but
note that asking for 200 hypotheses does not guarantee that
you will get 200; if the lattice holds fewer than 200
possible hypotheses, you will get fewer). The N-best files
look like matchseg outputs.
<p>
<p>
<b><u>Example of a lattice and explanation of format:</u></b>
<p>
The lattice has three
distinct sections. In the first, all the nodes in the graph are listed
with their associated words and begin and end times. In the second section,
the acoustic scores associated with each of the nodes are listed. In the
final section, the scores associated with the edges between words are
listed. The lattice also has additional lines of information giving the
total number of nodes in the graph, the ids of the first and last nodes,
and text describing the format of the lines in the lattice. In addition,
the lattice may contain lines that begin with a "#"; these are comments.
<p>
Here are examples from each of the components of the lattice. Explanations
are interspersed.
<pre>
.....
Nodes 1949 (NODEID WORD STARTFRAME FIRST-ENDFRAME LAST-ENDFRAME)
0 ++GARBAGE++ 254 256 256
1 ++LAUGH++ 254 256 256
2 ++N++ 254 256 256
3 ++GARBAGE++ 253 255 255
4 ++LAUGH++ 253 255 255
5 ++N++ 253 255 255
6 ++GARBAGE++ 252 254 254
7 ++LAUGH++ 252 254 254
8 ++N++ 252 254 254
9 ++GARBAGE++ 251 253 253
10 ++N++ 251 253 253
11 A 251 253 253
12 ++GARBAGE++ 250 252 252
13 ++N++ 250 252 252
14 A 250 252 252
15 ++N++ 249 251 251
16 HAVE 245 250 253
17 ARE(2) 245 250 250
18 HAVE 244 249 249
19 GO 244 249 251
.....
</pre>
Node no. 16 is the word HAVE and begins on the 245th frame and can end
anywhere between the 250th and 253rd frames.
<pre>
.....
#
Initial 1948
Final 82
.....
</pre>
Nodes are written out in *reverse* order in the lattice. As a
result, the node that is written out last is actually the *first*
node in the lattice. Nodes are also not written in strictly reverse
sequential order since, due to the "stretch" in the ending frames
of different nodes, it is difficult to determine a precise sequence
for all but the first node. In this lattice, the
first node was node number 1948 (the one written out last), but
the last node was actually node 82.
<pre>
.....
#
BestSegAscr 13865 (NODEID ENDFRAME ASCORE)
1948 2 -172014
1948 3 -207858
1948 4 -220188
1947 5 -351673
.....
</pre>
A node can end at many different frames, and the acoustic score
associated with the node differs depending on the frame at which it
ends. This portion of the lattice shows that information. In this example,
when node number 1948 ends at frame 2 it has an acoustic score of -172014;
when it ends at frame 3, the acoustic score is -207858; and so on. Note, however,
that this acoustic score is only the best score and is not really useful,
since the true score for the node depends on the path being considered,
due to the existence of cross-word triphones.
<pre>
.....
#
Edges (FROM-NODEID TO-NODEID ASCORE)
33 23 -243293
33 20 -297751
35 23 -1599007
37 23 -1923161
.....
</pre>
The true acoustic score for any word is dependent on the word following
it in the path. We therefore associate this score with the *edge* leading
from that word to the following word. There can be many edges leading
out of a node even at a given frame. Each of these edges is likely to
have a different score than the other edges.
In the above portion of the lattice we are given the information that
the edge from node 33 to node 23 has the score -243293, the edge from
node 33 to node 20 has score -297751 and so on. Keep in mind that there
can be only one edge between any two nodes, even though a node can
end at many different frames. This is because only one of these possible
ending frames will permit a proper edge to the unique starting frame
of the next word.
<pre>
.....
1948 1440 -2083713
1948 1399 -220188
End
.....
</pre>
The lattice ends here.
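<p>
For concreteness, here is a minimal Python sketch that reads a gzipped lattice
file and collects the three sections described above. It relies only on the
fields documented here; the function name and the data layout it returns are
illustrative assumptions, not part of sphinx3.
<pre>
import gzip

def read_lattice(path):
    # A sketch: collect the Nodes, BestSegAscr and Edges sections of a lattice (DAG) file.
    nodes = {}          # nodeid -> (word, startframe, first_endframe, last_endframe)
    best_ascr = []      # (nodeid, endframe, ascore)
    edges = []          # (from_nodeid, to_nodeid, ascore)
    initial = final = None
    section = None
    with gzip.open(path, "rt") as f:
        for line in f:
            parts = line.split()
            if not parts or parts[0].startswith("#"):
                continue                        # skip blank lines and "#" comments
            if parts[0] == "Nodes":
                section = "nodes"
            elif parts[0] == "Initial":
                initial = int(parts[1])         # id of the first node
            elif parts[0] == "Final":
                final = int(parts[1])           # id of the last node
            elif parts[0] == "BestSegAscr":
                section = "ascr"
            elif parts[0] == "Edges":
                section = "edges"
            elif parts[0] == "End":
                break                           # "End" closes the lattice
            elif section == "nodes":
                nodes[int(parts[0])] = (parts[1],) + tuple(int(x) for x in parts[2:5])
            elif section == "ascr":
                best_ascr.append(tuple(int(x) for x in parts))
            elif section == "edges":
                edges.append(tuple(int(x) for x in parts))
    return nodes, initial, final, best_ascr, edges
</pre>
Any header lines other than those listed above are simply ignored by this sketch.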
<p>
Note also that a lattice is actually a *tree*, and so the left
context of any node is fixed. So, the variations in acoustic scores
of words are only due to the right contexts, since any node in
a tree can have only one predecessor. However, what sphinx3 writes
out is not a lattice but actually a DAG, or directed acyclic
graph. What is done here is that nodes representing the same word
in the lattice are merged if they have identical time stamps.
What you see in the "lattice" file is actually this DAG and not a
tree-structured lattice at all.
<p>
<u>An important consideration in combining lattices from different
sources:</u><br>
if you had two parallel paths of this kind:
<p>
<pre>
......> WORD1 ------> WORD2
......> WORD1 ------> WORD2
</pre>
(say, WORD1 = "and" and WORD2 = "the")
<p>
You CAN merge them to
<pre>
.....> WORD1 ------> WORD2
</pre>
<p>
*If* you are using CI models! Then the two parallel edges (the dashed
edges) would have had close to identical scores, so you could just take
the highest score.
But if you are using CD models, here is what will happen: the edges from
path1 and path2 will have *different* scores in the lattice, *even* if
both WORD1 and WORD2 begin and end at exactly the same time instants in
both cases.
This is because the word preceding WORD1 in the two cases would have been different,
so the cross-word triphone score of the first phone in the word would have
been different. For example:
<pre>
OK......> WORD1(and) ------> WORD2(the)
BIT......> WORD1(and) ------> WORD2(the)
</pre>
The word preceding "and" in the first path is "ok", so the cross-word triphone
at the beginning of "and" in the first path is A(EY,N).
The preceding word in path 2 is "bit", so the cross-word triphone at the
beginning of "and" is A(T,N). The score of the edge between
"and" and "the" would reflect this in the two paths and be different.
All this happens even when you are only working with a *single*
lattice (e.g. the MFC lattice). Any heuristic, like using the highest
score for the merged path (nodes/edges), is likely to backfire for this
reason (but would have to be experimentally tested).
If path1 is (say) from an MFC-based lattice and path2 is (say) from a
PLP-based one, this problem is compounded by the additional problem of
how to come up with correct scaling factors for the scores.
<p>
<a name="0"></a>
<a name="01"></a>
<!------------------------------------------------------------------------->
<center><h4><font color="red">MISCELLANEOUS</font></h4></center>
<TABLE width="100%" bgcolor="#ffffff">
<td>EXPLANATION OF VARIOUS FIELDS IN AN ARPA FORMAT LM</td>
</table>
<!------------------------------------------------------------------------->
At the top of the LM are the lines:
<pre>
\data\
ngram 1=NUM1
ngram 2=NUM2
ngram 3=NUM3
</pre>
This means that there are NUM1 unigrams, NUM2 bigrams
and NUM3 trigrams in the LM.
Then you have a line
<pre>
\1-grams:
</pre>
This means that all following lines are unigrams until you encounter
a line "\2-grams:" or a "\end\" marker.
The \end\ marker marks the end of the ARPA LM.
All unigrams have the form
<pre>
NUMa WORD NUMb
</pre>
NUMa is the log probability of the unigram for the word WORD.
NUMb is the back-off weight associated with that word.
For bigrams, entries may be
<pre>
NUMa WORD1 WORD2 NUMb
</pre>
or
<pre>
NUMa WORD1 WORD2
</pre>
The first form of entry is used when the LM also has trigrams.
If it is only a bigram LM, the entries will be of the second
form.
Here, NUMa is the log prob of the bigram P(WORD2 | WORD1) and NUMb is
the back-off weight for the word pair (WORD1 WORD2).
The general N-gram entry is of the form
<pre>
NUMa WORD1 WORD2 ... WORDN NUMb
</pre>
or, if N is the highest order in the LM (so that there are no (N+1)-grams to back off from and no back-off weight is stored),
<pre>
NUMa WORD1 WORD2 ... WORDN
</pre>
All logarithms are base 10.
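<p>
As an illustration of these fields, here is a minimal Python sketch that reads
an ARPA-format LM into per-order tables of log10 probabilities and back-off
weights. It is based only on the layout described above; the function name and
the returned data structure are illustrative assumptions.
<pre>
def read_arpa(path):
    # A sketch: ngrams[N] maps an N-word tuple to (log10 probability, back-off weight or None).
    ngrams, order = {}, 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line == "\\data\\" or line.startswith("ngram "):
                continue                            # header and "ngram N=count" lines
            if line == "\\end\\":
                break                               # end of the ARPA LM
            if line.startswith("\\") and line.endswith("-grams:"):
                order = int(line[1:line.index("-")])
                ngrams[order] = {}                  # start of the N-gram section
                continue
            parts = line.split()
            logprob = float(parts[0])               # NUMa: log (base 10) probability
            words = tuple(parts[1:1 + order])       # WORD1 ... WORDN
            backoff = float(parts[1 + order]) if len(parts) > 1 + order else None
            ngrams[order][words] = (logprob, backoff)   # NUMb: back-off weight, when present
    return ngrams
</pre>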
<p>
To prune the LM you can delete all N-gram entries where the difference
between the probability entry for that N-gram and the predicted probability
for the N-gram obtained by backing off is very small. The predicted
probability is, of course,
for trigrams:
P(C|A,B) ~ P(C|B) * backoffwt(A,B)
and for bigrams:
P(C|B) ~ P(C) * backoffwt(B).
Pruning is most easily done only on the highest-order N-grams, since deleting
a lower-order N-gram also deletes the back-off weight stored with that N-gram,
which affects our prediction for the higher-order N-grams. For example,
if we pruned P(B|A) out of the LM, then backoffwt(A,B) would
also get pruned out. This affects the estimate of P(C|A,B), and the
pruning heuristic would have to take this into account.
<p>
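To make the pruning criterion concrete, here is a hedged Python sketch that
builds on the read_arpa() example above. Since the entries are log (base 10)
values, the products in the formulas above become sums of logs. The threshold
value and the function name are illustrative assumptions, not recommendations
from this document.
<pre>
def prune_trigrams(ngrams, threshold=0.05):
    # A sketch: drop trigram entries whose explicit log prob the back-off path predicts almost exactly.
    unigrams, bigrams, trigrams = ngrams[1], ngrams[2], ngrams[3]
    kept = {}
    for (a, b, c), (logp, _) in trigrams.items():
        bo_ab = (bigrams.get((a, b), (0.0, None))[1]) or 0.0    # log backoffwt(A,B), 0 if absent
        if (b, c) in bigrams:
            predicted = bo_ab + bigrams[(b, c)][0]              # log P(C|B) + log backoffwt(A,B)
        else:
            bo_b = (unigrams.get((b,), (0.0, None))[1]) or 0.0
            predicted = bo_ab + bo_b + unigrams[(c,)][0]        # back off once more to the unigram P(C)
        if abs(logp - predicted) >= threshold:
            kept[(a, b, c)] = (logp, None)          # highest-order entries carry no back-off weight
    return kept
</pre>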
<a name="0"></a>
<a name="02"></a>
<!------------------------------------------------------------------------->
<center><h4><font color="red">MISCELLANEOUS</font></h4></center>
<TABLE width="100%" bgcolor="#ffffff">
<td>GENERATING MATCHSEG FILES, FORMAT OF MATCHSEG OUTPUT</td>
</table>
<!------------------------------------------------------------------------->
Matchseg files can be generated by using the flag -matchsegfn [filename]
in the argument of the decoder binary.
<pre>
SB01
S 36683016
T -27154407
A -21711975
L -858796
0 -603268 0 <s>
17 -814154 -49595 SHOW
40 -594450 -11806 THE(3)
51 -463637 -23023 <sil>
65 -2880525 -88849 ITS(2)
133 -1203392 -72333 DATES
171 -806587 -5753 FOR
185 -898344 -26773 ALL
210 -2459017 -71603 DEPLOYED
260 -1765774 -94176 CEP
302 -1791387 -76622 EVERETT(2)
338 -843218 -62384 THEIR
355 -848925 -17748 HOME
376 -1482528 -2325 PORT
411 -1058809 -79875 SPEED
435 -786647 -37608 BY
453 -1771960 -32704 FIVE
482 -363323 -23023 <sil>
489 -276030 -82596 </s>
514
</pre>
This is a hypothesis in the "matchseg format". This output usually comes on a single line, but it was rearranged above to make it easier to read. The first word (or "field")
is the filename. S is a scaling factor (to prevent integers from wrapping
around due to underflow; it can be thought of as a normalization factor
for likelihoods), T is the total likelihood of the utterance, A is the
acoustic likelihood, and L is the LM likelihood (these are all log likelihoods,
hence the large numbers). From then onwards the format is beginning_frame_number
acoustic_score lm_score WORD, beginning_frame_number acoustic_score lm_score
WORD, and so on. At the end, the LAST frame number of the
utterance is written (514 in this case).
<br>
The hypothesis itself is the string of all words between <s> and </s>.
(The hypothesis combination program needs two or more
such matchseg files [in the same order] and outputs a matchseg file
which is the best path hypothesis in the graph constructed from
the input matchseg files.)
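<p>
As a sketch of how this format might be picked apart, the following Python
fragment splits one matchseg line into the header scores and the per-word
segments. The function name is an illustrative assumption, not part of any
SPHINX tool; it only follows the field layout described above.
<pre>
def parse_matchseg(line):
    # A sketch: split one matchseg hypothesis line into its parts.
    toks = line.split()
    uttid, scores, i = toks[0], {}, 1
    while toks[i] in ("S", "T", "A", "L"):
        scores[toks[i]] = int(toks[i + 1])      # scaling, total, acoustic and LM scores
        i += 2
    segments = []
    while i + 3 < len(toks):                    # frame  acoustic_score  lm_score  WORD
        segments.append((int(toks[i]), int(toks[i + 1]), int(toks[i + 2]), toks[i + 3]))
        i += 4
    last_frame = int(toks[i])                   # the LAST frame number of the utterance
    return uttid, scores, segments, last_frame
</pre>
The hypothesis string itself is then the sequence of words in the returned segments.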
<p>
<a name="0"></a>
<a name="03"></a>
<!------------------------------------------------------------------------->
<center><h4><font color="red">MISCELLANEOUS</font></h4></center>
<TABLE width="100%" bgcolor="#ffffff">
<td>EXPLANATION OF SOME SPHINX-II DECODER FLAGS</td>
</table>
<!------------------------------------------------------------------------->
<p>
<b> -compress </b>: <i>compress excess background frames</i><br>
<b> -compressprior </b>: <i>compress excess background frames based on prior utt</i>
<br>Typical silence compression code is as follows:
<pre>
if (silcomp == COMPRESS_PRIOR) {
j = 0;
for (i = 0; i < nfr; i++) {
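      /* histo_add_c0() (not shown here) uses a running histogram of C0 values to
         decide whether this frame is real speech rather than background; only
         such frames are copied down and kept */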
if (histo_add_c0 (mfc[i][0])) {
if (i != j)
memcpy (mfc[j], mfc[i], sizeof(float)*CEP_SIZE);
comp2rawfr[j++] = i;
}
/* Else skip the frame, don't copy across */
}
nfr = j;
}
</pre>
<p>The "silence" frames are actually deleted.</p>
Note: this is not good when you are using models trained in the
standard manner with SPHINX-III.
Deleting silence frames completely during decoding (regardless of whether
they are put back in the seg file later) is bad. We train cross-word
triphones with silence as context explicitly. There are usually hundreds of
such triphones in the model set. If there are no silence frames at all in
the sequence of frames being decoded, the cross-word triphones with silence
never get a chance to be used. Note that in most model sets, the silence
and breath models are usually the best-trained models.
<p>
<hr>
<em> last modified: 22 Nov. 2000 </em>
</body>
</html>
|