This file is indexed.

/usr/lib/R/library/rpart/doc/usercode.Rnw is in r-cran-rpart 4.1-5-1.

This file is owned by root:root, with mode 0o644.

The actual contents of the file can be viewed below.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
\documentclass{article}
\usepackage[pdftex]{graphicx}
\usepackage{Sweave}
\usepackage{amsmath}
\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}

\SweaveOpts{keep.source=TRUE, fig=FALSE}
%\VignetteIndexEntry{User Written Split Functions}
%\VignetteDepends{rpart}
% Ross Ihaka suggestions
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}

\newcommand{\myfig}[1]{\resizebox{\textwidth}{!}
                        {\includegraphics{#1.pdf}}}
\title{User written splitting functions for RPART}
\author{Terry Therneau \\Mayo Clinic}

\begin{document}
\maketitle
<<echo=FALSE>>=
options(continue="  ", width=60)
options(SweaveHooks=list(fig=function() par(mar=c(5.1, 4.1, .3, 1.1))))
@
\section{Splitting functions}
The rpart code was written in a modular fashion with the idea that the C
code would be extended to include more splitting functions.
As a consequence all of the splitting functions are coordinated through a
single structure in the file \texttt{func\_table.h}, shown below:
\begin{verbatim}
static struct {
    int  (*init_split)();
    void (*choose_split)();
    void (*eval)();
    double (*error)();
} func_table [] = {
    {anovainit,   anova,   anovass,    anovapred},
    {poissoninit, poisson, poissondev, poissonpred},
    {giniinit, gini, ginidev, ginipred},
    {usersplit_init, usersplit, usersplit_eval, usersplit_pred}
};

#define NUM_METHODS 4  /*size of the above structure */
\end{verbatim}
Adding a new splitting method requires the definition of four functions which
initialize, choose a split, compute the predicted value or values for a node,
and
compute a prediction error for a new observation assigned to the node.
The last of these is used only in cross-validation.

To add new splitting rules to the main routine
four new C functions need to be defined, the
\texttt{func\_table.h} file expanded, the R function \texttt{rpart} updated to
recognize the new
option (it calls the C routine with the row number of func\_table),
and finally an initialization function written.
See the \texttt{rpart.class}, \texttt{rpart.anova}, and \texttt{rpart.exp}
functions for examples of the latter.
The lion's share of the code, which deals with all the necessary bookkeeping,
remains unchanged.

An easier route is to add a user defined splitting rules as a set of R functions.
The resulting code will be considerably slower than the built-in functions;
nevertheless this allows an easy way to extend \texttt{rpart}, and to
test out new
ideas with less effort than additions to the C base.
For user defined splits only the first three functions are defined,
and cross-validation does not happen automatically.
(User defined methods are many times slower than the built in ones, so
it may be preferable to skip cross-validation during the initial evaluation
of a model.)

\section{Anova function}
As the first illustration we show how to add anova splitting as a user-written
addition.
This method is also one of those built into the C code, so we can also
check the correctness of the results.

\subsection{Initialization function}
<<>>=
itemp <- function(y, offset, parms, wt) {
    if (is.matrix(y) && ncol(y) > 1)
       stop("Matrix response not allowed")
    if (!missing(parms) && length(parms) > 0)
        warning("parameter argument ignored")
    if (length(offset)) y <- y - offset
    sfun <- function(yval, dev, wt, ylevel, digits ) {
		  paste("  mean=", format(signif(yval, digits)),
			", MSE=" , format(signif(dev/wt, digits)),
			sep = '')
    }
    environment(sfun) <- .GlobalEnv
    list(y = c(y), parms = NULL, numresp = 1, numy = 1, summary = sfun)
}
@

On input the function will be called with
\begin{description}
  \item[y] the response value as found in the formula.
    Note that rpart will normally
    have removed any observations with a missing response.
  \item[offset] the offset term, if any, found on the right hand side of the formula
  \item[parms] the vector or list (if any) supplied by the user as a
    \texttt{parms} argument to the call.
  \item[wt] the weight vector from the call, if any
  \end{description}

The last two arguments are optional.
The initialization function needs to perform any data checks,
deal with the offset
vector if present, and return a list containing
\begin{description}
  \item[y] the value of the response, possibly updated
  \item[numy] the number of columns of y
  \item[numresp] the length of the prediction vector for each node
  \item[summary] optional: a function which will produce a 1--3 line summary for the
    node, to be used by \texttt{summary.rpart}.
  \item[print] optional: a function which will produce a one line summary for
    the \texttt{print} function.
  \item[text] optional: a function which will produce a short label for the node,
    used by the \texttt{plot} function.
  \end{description}

The parameters vector can contain whatever might be appropriate for
the method, e.g. the variance of a prior distribution.
The vector of parameters passed forward need not be he same as the set
passed into the routine.
In anova splitting there are no extra parameters.

  The summary function will be called with arguments which are
  \begin{itemize}
    \item yval= the response value for the node
    \item dev = the deviance for the node
    \item wt = the sum of weights at the node (number of observations)
    \item ylevel = the levels of y, if y is categorical
    \item digits = the current setting for digits
  \end{itemize}
It should return a character vector.
The text function will be called with these arguments plus two more
\begin{itemize}
  \item n= the number of observations in the node
  \item use.n = TRUE/FALSE, see the help page for \texttt{text.rpart}
\end{itemize}
The print function is called only with yval, ylevels, and digits.

The only puzzling line may be \texttt{environment(sfun) <- .GlobalEnv}.
R ensures that any function can access variables that are \emph{not}
passed in as arguments via a mechanism called environments.
As a trivial example
<<>>=
temp <- 4
fun1 <- function(x) {
    q <- 15
    z <- 10
    fun2 <- function(y) y + z + temp
    fun2(x^2)
}
fun1(5)
@
The definition of fun2 essentially contains a copy of \texttt{z} (and
\texttt{q} as well) to ensure that this works.
The exception to this is objects at the top level such as
\texttt{temp} above; the user is responsible
for the retention of those via their answer to the ``save'' question at
the end of an R session.

The consequence of this is that the summary function created by a
call to itemp will by default have copies of all of the external variables
that it \emph{might} make use of, in this case all of the input arguments.
The summary function is in fact self-contained and makes reference only to
its input arguments (doing otherwise is bad programming, in my
opinion) and so needs none of these.
Setting the environment of the function to \texttt{.GlobalEnv} in essence
tells R to treat the function as though it had been defined at top level,
i.e., the input prompt, and thus not attach these extraneous copies.
I first became aware of this issue when using a huge data set, and found that
the saved output of the fit took up an inordinate amount of disk space due
to data copies attached to the summary function.

Do not take this as a critique of R environments in general.
The rules governing them need to be responsive to a large ensemble of
conditions; in this particular case the results are not ideal but in the main
they are the best compromise possible.

\subsection{Evaluation function}
The evaluation function is called once per node.
It needs to produce a deviance for the node along with a vector to be
used as a label for the node.
<<>>=
etemp <- function(y, wt, parms) {
    wmean <- sum(y*wt)/sum(wt)
    rss <- sum(wt*(y-wmean)^2)
    list(label = wmean, deviance = rss)
}
@

As an example of a longer label, the gini splitting method for
categorical outcomes returns both the chosen group and the full vector
of estimated group probabilities.
The deviance value that is returned does not need to be an actual
deviance associated with a log-likelihood, any impurity measure for
the node will work.
However, the pruning algorithm used during tree construction will be most
efficient if the value 0 corresponds to a perfectly pure node.
(The pruning code decides when a branch would be futile and further
computations on it can thus be avoided.)

\subsection{Splitting function}
The splitting function is where the work lies.
It will be called once for every covariate at each potential split.
The input arguments are
\begin{description}
  \item[y] vector or matrix of response values
  \item[wt] vector of weights
  \item[x] vector of x values
  \item[parms] vector of user parameters, passed forward
  \item[continuous] if TRUE the x variable should be treated as continuous
  \end{description}

The data will have been subset to include only the non-missing subset
of x and y at the node, and if x is continuous all values will be in the
sorted order of the x variable.
If x is continuous the routine should return two vectors each of
length $n-1$, where $n$ is the length of x:
\begin{description}
  \item[goodness] the utility of the split, where larger numbers are
    better.  A value of 0 signifies that no worthwhile split could be
    found (for instance if y were a constant).  The $i$th value of
    goodness compares a split of observations 1 to $i$ versus $i+1$ to $n$.
    \item[direction]  A vector of the same length with values of -1
      and +1, where -1 suggests that values with y $<$ cutpoint be
      sent to the left side of the tree, and a value of +1 that values
      with y $<$ cutpoint be sent to the right.
      This is not critical, but sending larger values of y to the
      right, as is done in the code below, seems to make the final tree
      easier to read.
\end{description}

The reason for returning an entire vector of goodness values is that
the parent code is responsible for handling the minimum node size
constraint, and also for dealing with ties.  When x is continuous the
split routine actually has no reason to look at x.

<<>>=
stemp <- function(y, wt, x, parms, continuous)
{
    # Center y
    n <- length(y)
    y <- y- sum(y*wt)/sum(wt)

    if (continuous) {
        # continuous x variable
        temp <- cumsum(y*wt)[-n]
        left.wt  <- cumsum(wt)[-n]
        right.wt <- sum(wt) - left.wt
        lmean <- temp/left.wt
        rmean <- -temp/right.wt
        goodness <- (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2)
        list(goodness = goodness, direction = sign(lmean))
    } else {
        # Categorical X variable
        ux <- sort(unique(x))
        wtsum <- tapply(wt, x, sum)
        ysum  <- tapply(y*wt, x, sum)
        means <- ysum/wtsum

        # For anova splits, we can order the categories by their means
        #  then use the same code as for a non-categorical
        ord <- order(means)
        n <- length(ord)
        temp <- cumsum(ysum[ord])[-n]
        left.wt  <- cumsum(wtsum[ord])[-n]
        right.wt <- sum(wt) - left.wt
        lmean <- temp/left.wt
        rmean <- -temp/right.wt
        list(goodness= (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2),
             direction = ux[ord])
    }
}
@

The code above does the computations for all the split points at once
 by making use of two tricks. The first is to center the y values
at zero (so the grand mean is zero), and the second takes advantage of the
many ways to write the ``effect'' sum of squares for a simple two
group anova.
The key identity is
\begin{equation}
  \sum (y_i - c)^2 = \sum (y_i - \overline y)^2 +  n(c - \overline y)^2
\end{equation}
If you have an old enough statistics book, this is used to show that
for a 1-way anova the between groups sum of squares $SS_B$ is
\begin{equation}
  SS_B = n_l(\overline y_l - \overline y)^2 +
  n_r(\overline y_r - \overline y)^2 \label{ssb}
\end{equation}
where $n_l$ is the number of observations in the left group,
$\overline y_l$ the
mean of $y$ in the left group, $\overline y$ the overall mean,
and $n_r$ $\overline y_r$ the corresponding terms for the right hand group.

Centering at zero makes $\overline y$ zero in \eqref{ssb}, and the terms
then can be computed for all splits at once using the cumsum function.
Extension of the formulas to case weights is left as an exercise for
the reader.

If the predictor variable x is categorical with $k$ classes then there are
potentially $2^{k-1} -1$ different ways to split the node.
However, for most splitting rules the optimal rule can be found by first
ordering the groups by their average y value and then using the
usual splitting rule on this ordered variable.
For user mode rules this is assumed to be the case.
The variable x is supplied as integer values 1, 2, \ldots, k.
On return the direction vector should have $k$ values giving the ordering
of the groups, and the goodness vector $k-1$ values giving the utility of
the splits.


\subsection{Test}
We can test the above code to make sure that it gives the same answer as
the built-in anova splitting method.
<<>>=
library(rpart)
mystate <- data.frame(state.x77, region=state.region)
names(mystate) <- casefold(names(mystate)) #remove mixed case
ulist <- list(eval = etemp, split = stemp, init = itemp)
fit1 <- rpart(murder ~ population + illiteracy + income + life.exp +
              hs.grad + frost + region, data = mystate,
              method = ulist, minsplit = 10)
fit2 <- rpart(murder ~ population + illiteracy + income + life.exp +
              hs.grad + frost + region, data = mystate,
              method = 'anova', minsplit = 10, xval = 0)
all.equal(fit1$frame, fit2$frame)
all.equal(fit1$splits, fit2$splits)
all.equal(fit1$csplit, fit2$csplit)
all.equal(fit1$where, fit2$where)
all.equal(fit1$cptable, fit2$cptable)
@
The all.equal test can't be done on fit1 vs fit2 as a whole since their
\texttt{call} component will differ.

\section{Cross validation}
To do cross-validation on user written rules one needs to use the
\texttt{xpred.rpart} routine.
This routine returns the predicted value(s) for each observation,
predicted from a fit that did not include that observation.
The result is a matrix with one row per subject and one column for
each complexity parameter value.
As an example we will replicate the cross-validation results
for the anova model above.
In order to get the same groupings we fix the xval group membership
in advance.
<<>>=
xgroup <- rep(1:10, length = nrow(mystate))
xfit <- xpred.rpart(fit1, xgroup)
xerror <- colMeans((xfit - mystate$murder)^2)

fit2b <-  rpart(murder ~ population + illiteracy + income + life.exp +
                hs.grad + frost + region, data = mystate,
                method = 'anova', minsplit = 10, xval = xgroup)
topnode.error <- (fit2b$frame$dev/fit2b$frame$wt)[1]

xerror.relative <- xerror/topnode.error
all.equal(xerror.relative, fit2b$cptable[, 4], check.attributes = FALSE)
@





\begin{figure}
  \myfig{usercode-fig1}
  \caption{Goodness of split for predicting income from illiteracy, using
    the state data.}
  \label{fig:state}
\end{figure}

<<fig1, echo=FALSE, fig=TRUE, include=FALSE>>=
tdata <- mystate[order(mystate$illiteracy), ]
n <- nrow(tdata)
temp <- stemp(tdata$income, wt = rep(1, n), tdata$illiteracy,
              parms = NULL, continuous = TRUE)
xx <- (tdata$illiteracy[-1] + tdata$illiteracy[-n])/2
plot(xx, temp$goodness, xlab = "Illiteracy cutpoint",
     ylab = "Goodness of split")
lines(smooth.spline(xx, temp$goodness, df = 4), lwd = 2, lty = 2)
@

\subsection{Smoothed anova}
For any particular covariate consider a plot of
x= split point vs y= goodness of split;
figure \ref{fig:state} shows an example for the state data
set, along with a smoothed curve.
The plot points are very erratic.
At one time we entertained the idea that a more stable split
point would be found by finding the maximum value for a
smoothed version of the plot.
Exactly how to smooth is of course a major issue in such an
endeavor.

Modification of the anova splitting routine to test this
is quite easy --- simply add a few lines to the stemp routine:
\begin{verbatim}
 ...
    	goodness= (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2)
	rx <- rank(x[-1])  #use only the ranks of x, to preserve invariance
	fit <- smooth.spline(rx, goodness, df=4)
	list(goodness= predict(fit, rx)$y, direction=sign(lmean))
        }
   else {
...
\end{verbatim}

\section{Alternating logistic regression}
Chen \cite{Chen07} used a mixed logistic regression model to
fit clinical and genetic marker data.
\begin{equation}
  E(y) = \frac{e^{\eta_1 + \eta_2}}{1+e^{\eta_1 + \eta_2}}
  \end{equation}
where $\eta_1 = X\beta$ is a standard linear predictor based
on clinical variables $X$ and $\eta_2$ is based on a tree model
using a set of genetic markers $Z$.
The solution was computed using alternate steps of an ordinary
logistic regression on $X$ treating $\eta_2$ as an offset,
and an rpart fit for $Z$ treating $\eta_1$ as an offset.

For many splitting rules the initialization function can take
care of offset terms once and for all by modifying the y
vector, but this is not the case for logistic regression.
In order to make the offset visible to the splitting function
we create a 2 column ``response'' consisting of the original
y vector in the first column and the offset in the second.
<<>>=
loginit <- function(y, offset, parms, wt)
{
    if (is.null(offset)) offset <- 0
    if (any(y != 0 & y != 1)) stop ('response must be 0/1')

    sfun <- function(yval, dev, wt, ylevel, digits ) {
		  paste("events=",  round(yval[,1]),
			", coef= ", format(signif(yval[,2], digits)),
			", deviance=" , format(signif(dev, digits)),
			sep = '')}
    environment(sfun) <- .GlobalEnv
    list(y = cbind(y, offset), parms = 0, numresp = 2, numy = 2,
         summary = sfun)
    }

logeval <- function(y, wt, parms)
{
    tfit <- glm(y[,1] ~ offset(y[,2]), binomial, weight = wt)
    list(label= c(sum(y[,1]), tfit$coef), deviance = tfit$deviance)
}
@
The evaluation function returns the number of positive responses
along with the fitted coefficient.

The splitting function attempts every possible logistic regression.
A much faster version could almost certainly be written based on
a score test, however.
<<>>=
logsplit <- function(y, wt, x, parms, continuous)
{
    if (continuous) {
	# continuous x variable: do all the logistic regressions
	n <- nrow(y)
	goodness <- double(n-1)
	direction <- goodness
	temp <- rep(0, n)
	for (i in 1:(n-1)) {
	    temp[i] <- 1
            if (x[i] != x[i+1]) {
                tfit <- glm(y[,1] ~ temp + offset(y[,2]), binomial, weight = wt)
                goodness[i] <- tfit$null.deviance - tfit$deviance
                direction[i] <- sign(tfit$coef[2])
            }
        }
    } else {
	# Categorical X variable
	# First, find out what order to put the categories in, which
	#  will be the order of the coefficients in this model
	tfit <- glm(y[,1] ~ factor(x) + offset(y[,2]) - 1, binomial, weight = wt)
	ngrp <- length(tfit$coef)
	direction <- rank(rank(tfit$coef) + runif(ngrp, 0, 0.1)) #break ties
        # breaking ties -- if 2 groups have exactly the same p-hat, it
        #  does not matter which order I consider them in.  And the calling
        #  routine wants an ordering vector.
        #
	xx <- direction[match(x, sort(unique(x)))] #relabel from small to large
	goodness <- double(length(direction) - 1)
	for (i in 1:length(goodness)) {
	    tfit <- glm(y[,1] ~ I(xx > i) + offset(y[,2]), binomial, weight = wt)
	    goodness[i] <- tfit$null.deviance - tfit$deviance
        }
    }
    list(goodness=goodness, direction=direction)
}
@
\begin{thebibliography}{9}
\bibitem{Chen07} Chen J, Yu K, Hsing A and Therneau TM.
  \emph{A partially linear tree-based regression model for assessing complex
        joint gene-gene and gene-environment effects},
Genet Epidemiol, 2007(3), 238-51.
\end{thebibliography}
\end{document}