\documentstyle[jair,twoside,11pt,theapa]{article}
\input{psfig}
% \input epsf %% at top of file
\jairheading{2}{1994}{1-32}{4/94}{8/94}
\ShortHeadings{Induction of Oblique Decision Trees}%
{Murthy, Kasif \& Salzberg}
\firstpageno{1}
\title {A System for Induction of Oblique Decision Trees}
\author {\name Sreerama K. Murthy \email [email protected] \\
\name Simon Kasif \email [email protected] \\
\name Steven Salzberg \email [email protected] \\
\addr Department of Computer Science \\
Johns Hopkins University,
Baltimore, MD 21218 USA}
\begin{document}
\maketitle
\begin{abstract}
This article describes a new system for induction of oblique decision
trees. This system, OC1, combines deterministic hill-climbing with
two forms of randomization to find a good oblique split (in the form
of a hyperplane) at each node of a decision tree. Oblique decision
tree methods are tuned especially for domains in which the attributes
are numeric, although they can be adapted to symbolic or mixed
symbolic/numeric attributes. We present extensive empirical studies,
using both real and artificial data, that analyze OC1's ability to
construct oblique trees that are smaller and more accurate than their
axis-parallel counterparts. We also examine the benefits of
randomization for the construction of oblique decision trees.
\end{abstract}
\section{Introduction}
\label{section:intro}
Current data collection technology provides a unique challenge and
opportunity for automated machine learning techniques. The advent of
major scientific projects such as the Human Genome Project, the Hubble
Space Telescope, and the human brain mapping initiative are generating
enormous amounts of data on a daily basis. These streams of data
require automated methods to analyze, filter, and classify them before
presenting them in digested form to a domain scientist. Decision
trees are a particularly useful tool in this context because they
perform classification by a sequence of simple, easy-to-understand
tests whose semantics is intuitively clear to domain experts.
Decision trees have been used for classification and other tasks
since the 1960s \cite{moret/82,safavin/landgrebe/91}. In the 1980s,
Breiman et al.'s book on classification and regression trees (CART)
\cite{breiman/etal/84} and Quinlan's work on ID3 \cite{quinlan/83,quinlan/86} provided the
foundations for what has become a large body of research on one of the
central techniques of experimental machine learning.
Many variants of decision tree (DT) algorithms have been introduced in
the last decade. Much of this work has concentrated on decision trees
in which each node checks the value of a single attribute
\cite{breiman/etal/84,quinlan/86,quinlan/93}. Quinlan initially
proposed decision trees for classification in domains with
symbolic-valued attributes
\citeyear{quinlan/86}, and later extended them to numeric domains
\citeyear{quinlan/87b}. When the attributes are numeric, the tests
have the form $x_i > k$, where $x_i$ is one of the attributes of an
example and $k$ is a constant. This class of decision trees may be
called {\it axis-parallel}, because the tests at each node are
equivalent to axis-parallel hyperplanes in the attribute space. An
example of such a decision tree is given in
Figure~\ref{figure:aptree}, which shows both a tree and the
partitioning it creates in a 2-D attribute space.
\begin{figure}
\vspace{2.0in}
\special{psfile="aptree.ps" hoffset=20 voffset=-330}
\caption{The left side of the figure shows a simple axis-parallel tree
that uses two attributes. The right side shows the partitioning
that this tree creates in the attribute space.}
\label{figure:aptree}
\vspace*{-.2in}
\end{figure}
Researchers have also studied decision trees in which the test at a
node uses boolean combinations of attributes
\cite{pagallo/90,pagallo/haussler/90,sahami/93} and linear combinations
of attributes (see Section \ref{section:oblique}). Different methods
for measuring the goodness of decision tree nodes, as well as
techniques for pruning a tree to reduce {\it overfitting} and increase
accuracy have also been explored, and will be discussed in later
sections.
In this paper, we examine decision trees that test a linear
combination of the attributes at each internal node. More precisely,
let an example take the form $X = x_1,x_2,\ldots,x_d,C_j$ where $C_j$
is a class label and the $x_i$'s are real-valued
attributes.\footnote{The constraint that $x_1,\ldots,x_d$ are
real-valued does not necessarily restrict oblique decision trees to
numeric domains. Several researchers have studied the problem of
converting symbolic (unordered) domains to numeric (ordered) domains
and vice versa; e.g., \cite{breiman/etal/84,hampson/volper/86,%
utgoff/brodley/90,deMerckt/92,deMerckt/93}. To keep the discussion
simple, however, we will assume that all attributes have numeric
values.} The test at each node will then have the form:
\begin{equation}
\label{equation:hyperplane}
\sum_{i=1}^{d}{a_ix_i} + a_{d+1} > 0
\end{equation}
where $a_1,\ldots,a_{d+1}$ are real-valued coefficients. Because
these tests are equivalent to hyperplanes at an oblique orientation to
the axes, we call this class of decision trees {\em oblique} decision
trees. (Trees of this form have also been called ``multivariate''
\cite{brodley/utgoff/94}. We prefer the term ``oblique'' because
``multivariate'' includes non-linear combinations of the variables,
i.e., curved surfaces. Our trees contain only linear tests.) It is
clear that these are simply a more general form of axis-parallel
trees, since by setting $a_i = 0$ for all coefficients but one, the
test in Eq.~\ref{equation:hyperplane} becomes the familiar univariate
test. Note that oblique decision trees produce polygonal (polyhedral)
partitionings of the attribute space, while axis-parallel trees
produce partitionings in the form of hyper-rectangles that are
parallel to the feature axes.
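
To make the form of the test concrete, the following Python fragment (an
illustration of ours, not tied to any particular implementation) evaluates the
oblique test of Eq.~\ref{equation:hyperplane} and shows how an axis-parallel
test arises as a special case.
{\small
\begin{verbatim}
def oblique_test(a, x):
    """Oblique test of Eq. 1: sum_i a[i]*x[i] + a[d] > 0, where a holds
    the d+1 coefficients a_1..a_{d+1} and x holds d attribute values."""
    d = len(x)
    return sum(a[i] * x[i] for i in range(d)) + a[d] > 0

def axis_parallel_coefficients(i, k, d):
    """The axis-parallel test x_i > k written as the special case of Eq. 1
    in which every coefficient except a_i and the constant term is zero."""
    a = [0.0] * (d + 1)
    a[i], a[d] = 1.0, -k
    return a
\end{verbatim}
}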
It should be intuitively clear that when the underlying concept is
defined by a polygonal space partitioning, it is preferable to use
oblique decision trees for classification. For example, there exist
many domains in which one or two oblique hyperplanes will be the best
model to use for classification. In such domains, axis-parallel
methods will have to approximate the correct model with a
staircase-like structure, while an oblique tree-building method could
capture it with a tree that was both smaller and more
accurate.\footnote{Note that though a given oblique tree may have fewer
leaf nodes than an axis-parallel tree---which is what we mean by
``smaller''---the oblique tree may in some cases be larger in terms of
information content, because of the increased complexity of the tests
at each node.} Figure~\ref{figure:stairs} gives an illustration.
\begin{figure}
\vspace{2.0in}
\special{psfile="stairs.ps" hoffset=10 voffset=-305}
\caption{The left side shows a simple 2-D domain in which two oblique
hyperplanes define the classes. The right side shows an approximation of
the sort that an axis-parallel decision tree would have to create to model
this domain.}
\label{figure:stairs}
\vspace*{-.2in}
\end{figure}
Breiman et al.\ first suggested a method for inducing oblique decision
trees in 1984. However, there was very little further research
on such trees until relatively recently \cite{utgoff/brodley/90,%
heath/etal/93,murthy/etal/93,brodley/utgoff/94}. A comparison of
existing approaches is given in more detail in Section
\ref{section:oblique}. The purpose of this study is to review the
strengths and weaknesses of existing methods, to design a system that
combines some of the strengths and overcomes the weaknesses, and to
evaluate that system empirically and analytically. The main
contributions and conclusions of our study are as follows:
\begin{itemize}
\item We have developed a new, randomized algorithm for inducing
oblique decision trees from examples. This algorithm extends the
original 1984 work of Breiman et al. Randomization helps
significantly in learning many concepts.
\item Our algorithm is fully implemented as an oblique decision tree
induction system and is available over the Internet. The code can be
retrieved from Online Appendix 1 of this paper (or by anonymous ftp
from ftp://ftp.cs.jhu.edu/pub/oc1/oc1.tar.Z).
\item The randomized hill-climbing algorithm used in OC1 is more
efficient than other existing randomized oblique decision tree methods
(described below). In fact, the current implementation of OC1
guarantees a worst-case running time that is only $O(\log n)$ times
greater than the worst-case time for inducing axis-parallel trees
(i.e., $O(dn^2 \log n)$ vs.\ $O(dn^2)$).
\item The ability to generate oblique splits often produces trees that are much
smaller than those built by axis-parallel methods. When the underlying
problem requires an oblique split, oblique trees are also more
accurate than axis-parallel trees. Allowing a tree-building system to
use both oblique and axis-parallel splits broadens the range of
domains for which the system should be useful.
\end{itemize}
The rest of the paper is organized as follows: the remainder
of this section briefly describes the general paradigm of decision tree
induction and discusses the complexity issues involved in inducing
oblique decision trees. Section~\ref{section:oblique} briefly reviews
some existing techniques for oblique DT induction, outlines some
limitations of each approach, and introduces the OC1 system.
Section~\ref{section:oc1} describes the OC1 system in detail.
Section~\ref{section:exp} describes experiments that (1)~compare the
performance of OC1 to that of several other axis-parallel and oblique
decision tree induction methods on a range of real-world datasets and
(2)~demonstrate empirically that OC1 significantly benefits from its
randomization methods. In Section~\ref{section:conclude}, we conclude
with some discussion of open problems and directions for further
research.
\subsection{Top-Down Induction of Decision Trees}
Algorithms for inducing decision trees follow an approach
described by Quinlan as top-down induction of decision trees
\citeyear{quinlan/86}. This can also be called a greedy
divide-and-conquer method. The basic outline is as follows:
\begin{enumerate}
\itemsep -0.0in
\item Begin with a set of examples called the training set, $T$.
If all examples in $T$ belong to one class, then halt.
\item Consider all tests that divide $T$ into two or more subsets.
Score each test according to how well it splits up the examples.
\item Choose (``greedily'') the test that scores the highest.
\item Divide the examples into subsets and run this procedure
recursively on each subset.
\end{enumerate}
Quinlan's original model considered only attributes with symbolic
values; in that model, a test at a node splits on a single attribute,
creating one branch for each of its values. Thus a test on an attribute
with three values will
have at most three child nodes, one corresponding to each value. The
algorithm considers {\it all} possible tests and chooses
the one that optimizes a pre-defined goodness measure. (One could
also split symbolic values into two or more subsets of values, which
gives many more choices for how to split the examples.) As we explain
next, oblique decision tree methods cannot consider all tests due to
complexity considerations.
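
For concreteness, the following Python fragment is a minimal sketch of this
greedy scheme; the \verb|candidate_splits| and \verb|impurity| arguments are
hypothetical, caller-supplied routines, and the fragment illustrates the
paradigm only, not any particular system.
{\small
\begin{verbatim}
def build_tree(T, candidate_splits, impurity):
    """Greedy top-down induction.  T is a list of (x, label) pairs;
    candidate_splits(T) yields callables mapping an example x to an
    outcome; impurity(split, T) scores a candidate (lower is better)."""
    labels = set(label for _, label in T)
    if len(labels) == 1:                        # step 1: T is pure, halt
        return ("leaf", labels.pop())
    # Steps 2-3: score every candidate test and greedily keep the best.
    best = min(candidate_splits(T), key=lambda s: impurity(s, T))
    subsets = {}
    for x, label in T:
        subsets.setdefault(best(x), []).append((x, label))
    if len(subsets) == 1:                       # no useful split: majority leaf
        majority = max(labels, key=lambda c: sum(1 for _, l in T if l == c))
        return ("leaf", majority)
    # Step 4: recurse on each subset produced by the chosen test.
    return ("node", best,
            {o: build_tree(S, candidate_splits, impurity)
             for o, S in subsets.items()})
\end{verbatim}
}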
\subsection{Complexity of Induction of Oblique Decision Trees}
One reason that relatively few papers have addressed the problem of inducing
oblique decision trees is the increased computational complexity of
the problem when compared to the axis-parallel case. Two
important issues must be addressed. The first, in the context of top-down
decision tree algorithms, is the complexity of finding an
optimal separating hyperplane (decision surface) for a given node of
a decision tree. An optimal hyperplane is one that minimizes the impurity
measure used; e.g., impurity might be measured by the total number of
examples misclassified. The second issue is the lower bound on the
complexity of finding optimal (e.g., smallest size) trees.
Let us first consider the issue of the complexity of selecting an
optimal oblique hyperplane for a single node of a tree. In a domain
with $n$ training instances, each described using $d$ real-valued
attributes, there are at most $2^d \cdot {n \choose d}$ distinct
$d$-dimensional oblique splits; i.e., hyperplanes\footnote{Throughout
the paper, we use the terms ``split'' and ``hyperplane''
interchangeably to refer to the test at a node of a decision tree.
The first usage is standard~\cite{moret/82}, and refers to the fact
that the test splits the data into two partitions. The second usage
refers to the geometric form of the test.} that divide the training
instances uniquely into two nonoverlapping subsets. This upper bound
derives from the observation that every subset of size $d$ from the
$n$ points can define a $d$-dimensional hyperplane, and each such
hyperplane can be rotated slightly in $2^d$ directions to divide the
set of $d$ points in all possible ways. Figure \ref{figure:upperlimit}
illustrates these upper limits for two points in two dimensions.
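To make the scale of the search space concrete, note that for $n = 100$
examples in $d = 2$ dimensions this bound already allows up to
$2^{2} \cdot {100 \choose 2} = 19{,}800$ distinct oblique splits, compared
with at most $n \cdot d = 200$ distinct axis-parallel splits.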
\begin{figure}
\vspace{3.1in}
\special{psfile="upperlimit.ps" hoffset=160 voffset=-75 }
\caption{For $n$ points in $d$ dimensions ($n \geq d$),
there are $n \cdot d$ distinct axis-parallel splits, while
there are $2^d \cdot {n \choose d}$ distinct
$d$-dimensional oblique splits. This shows all distinct oblique
and axis-parallel splits for two specific points in 2-D.}
\label{figure:upperlimit}
\vspace*{-.2in}
\end{figure}
For axis-parallel splits, there are only $n \cdot d$ distinct
possibilities, and axis-parallel methods such as C4.5
\cite{quinlan/93} and CART \cite{breiman/etal/84} can exhaustively
search for the best split at each node. The problem of searching for
the best oblique split is therefore much more difficult than that of
searching for the best axis-parallel split. In fact, the problem is
NP-hard.
More precisely, Heath \citeyear{heath/92} proved that the following
problem is NP-hard: given a set of labelled examples, find the
hyperplane that minimizes the number of misclassified examples both
above and below the hyperplane. This result implies that any method
for finding the optimal oblique split is likely to have exponential
cost (assuming $P \neq NP$). Intuitively, the problem is that it is
impractical to enumerate all $2^d \cdot {n \choose d}$ distinct
hyperplanes and choose the best, as is done in axis-parallel decision
trees. However, any non-exhaustive deterministic search over this
space of hyperplanes is prone to getting stuck in
local minima.
On the other hand, it is possible to define impurity measures for
which the problem of finding optimal hyperplanes can be solved in
polynomial time. For example, if one minimizes the sum of distances
of mis-classified examples, then the optimal solution can be found
using linear programming methods (if distance is measured along one
dimension only). However, classifiers are usually judged by how many
points they classify correctly, regardless of how close to the
decision boundary a point may lie. Thus most of the standard measures
for computing impurity base their calculation on the discrete number
of examples of each category on either side of the hyperplane.
Section~\ref{section:details} discusses several commonly used impurity
measures.
Now let us address the second issue, that of the complexity of
building a small tree. It is easy to show that the problem of
inducing the smallest axis-parallel decision tree is NP-hard. This
observation follows directly from the work of Hyafil and Rivest
\citeyear{hyafil/rivest/76}. Note that one can generate the smallest
axis-parallel tree that is consistent with the training set in
polynomial time {\it if} the number of attributes is a constant. This
can be done by using dynamic programming or branch and bound
techniques (see Moret \citeyear{moret/82} for several pointers). But
when the tree uses oblique splits, it is not clear, even for a fixed
number of attributes, how to generate an optimal (e.g., smallest)
decision tree in polynomial time. This suggests that the complexity
of constructing good oblique trees is greater than that for
axis-parallel trees.
It is also easy to see that the problem of constructing an optimal
(e.g., smallest) oblique decision tree is NP-hard. This conclusion
follows from the work of Blum and Rivest \citeyear{blum/rivest/88}.
Their result implies that in $d$ dimensions (i.e., with $d$
attributes) the problem of producing a 3-node oblique decision tree
that is consistent with the training set is NP-complete. More
specifically, they show that the following decision problem is
NP-complete: given a training set $T$ with $n$ examples and $d$
Boolean attributes, does there exist a 3-node neural network
consistent with $T$? From this it is easy to show that the following
question is NP-complete: given a training set $T$, does there exist a
3-leaf-node oblique decision tree consistent with $T$?
As a result of these complexity considerations, we took the pragmatic
approach of trying to generate small trees, but not looking for the
smallest tree. The greedy approach used by OC1 and virtually all
other decision tree algorithms implicitly tries to generate small
trees. In addition, it is easy to construct example problems for
which the optimal split at a node will not lead to the best tree; thus
our philosophy as embodied in OC1 is to find locally good splits, but
not to spend excessive computational effort on improving the quality
of these splits.
\section{Previous Work on Oblique Decision Tree Induction}
\label{section:oblique}
Before describing the OC1 algorithm, we will briefly discuss some
existing oblique DT induction methods, including CART with linear
combinations, Linear Machine Decision Trees, and Simulated Annealing
of Decision Trees. There are also methods that induce tree-like
classifiers with linear discriminants at each node, most notably
methods using linear programming
\cite{mangasarian/etal/90,bennett/mangasarian/92,%
bennett/mangasarian/94a,bennett/mangasarian/94b}. Though these
methods can find the optimal linear discriminants for specific
goodness measures, the size of the linear program grows very fast with
the number of instances and the number of attributes. There is also
some less closely related work on algorithms to train artificial
neural networks to build decision tree-like classifiers
\cite{brent/91,cios/liu/92,herman/yeung/92}.
The first oblique decision tree algorithm to be proposed was CART with
linear combinations \cite[chapter 5]{breiman/etal/84}. This
algorithm, referred to henceforth as CART-LC, is an important basis
for OC1. Figure~\ref{figure:cart-lc} summarizes (using Breiman et al.'s
notation) what the CART-LC algorithm does at each node in the decision
tree.
\begin{figure}
{\small
{\bf
\begin{tabbing}
To \= induce a split at node $T$ of the decision tree:\\
\> Normalize values for all $d$ attributes.\\
\> $L = 0$\\
\> While \= (TRUE)\\
\> \> $L = L+1$ \\
\> \> Let the current split $s_L$ be $v \leq c$, where \( v = \sum_{i=1}^{d} {a_ix_i} \).\\
\> \> For \= $i = 1,\dots,d$\\
\> \> \> For \= $\gamma$ = -0.25,0,0.25 \\
\> \> \> \> Search for the $\delta$ that maximizes the goodness of the split $v - \delta(a_i + \gamma) \leq c$. \\
\> \> \> Let $\delta^*$,$\gamma^*$ be the settings that result in highest goodness in these 3 searches.\\
\> \> \> $a_i = a_i - \delta^*$, $c = c - \delta^*\gamma^*$.\\
\> \> Perturb $c$ to maximize the goodness of $s_L$, keeping $a_1,\ldots,a_d$ constant.\\
\> \> If $|$goodness($s_L$) - goodness$(s_{L-1})| \leq \epsilon$ exit while loop.\\
\> Eliminate irrelevant attributes in \{$a_1,\ldots,a_d$\} using backward elimination.\\
\> Convert $s_L$ to a split on the un-normalized attributes. \\
\> Return the better of $s_L$ and the best axis-parallel split as
the split for $T$.
\end{tabbing}
}
}
\caption{The procedure used by CART with linear combinations (CART-LC) at
each node of a decision tree.}
\label{figure:cart-lc}
\end{figure}
At the core of the CART-LC algorithm is its method for finding the value of
$\delta$ that maximizes the goodness of a split. This idea is also used
in OC1, and is explained in detail in Section~\ref{section:perturb}.
After describing CART-LC, Breiman et al.\ point out that there is
still much room for further development of the algorithm. OC1
represents an extension of CART-LC that includes some significant
additions. It addresses the following limitations of CART-LC:
\begin{itemize}
\item CART-LC is fully deterministic. There is no built-in mechanism for
escaping local minima, although such minima may be very common for
some domains. Figure~\ref{figure:local-minimum} shows a simple
example for which CART-LC gets stuck.
\begin{figure}
\vspace{2.0in}
\special{psfile="small.ps" hscale=80 vscale=25 hoffset=-30 voffset=-10}
\caption{The deterministic perturbation algorithm of CART-LC fails to find
the correct split for this data, even when it starts from the location of
the best axis-parallel split. OC1 finds the correct split using one random
jump.}
\label{figure:local-minimum}
\end{figure}
\item CART-LC produces only a single tree for a given data set.
\item CART-LC sometimes makes adjustments that increase the impurity
of a split. This feature was probably included to allow it to escape
some local minima.
\item There is no upper bound on the time spent at any node in the
decision tree. The algorithm halts when no perturbation changes the impurity
by more than $\epsilon$, but because the impurity may both increase and
decrease, it can spend an arbitrarily long time at a node.
\end{itemize}
Another oblique decision tree algorithm, one that uses a very
different approach from CART-LC, is the Linear Machine Decision Trees
(LMDT) system \cite{utgoff/brodley/91,brodley/utgoff/92}, which is a
successor to the Perceptron Tree method
\cite{utgoff/89b,utgoff/brodley/90}. Each internal node in an LMDT
tree is a Linear Machine \cite{nilsson/90}. The training algorithm
presents examples repeatedly at each node until the linear machine
converges. Because convergence cannot be guaranteed, LMDT uses
heuristics to determine when the node has stabilized. To make the
training stable even when the set of training instances is not
linearly separable, a ``thermal training'' method \cite{frean/90} is
used, similar to simulated annealing.
A third system that creates oblique trees is Simulated Annealing of
Decision Trees (SADT) \cite{heath/etal/93} which, like OC1, uses
randomization. SADT uses simulated annealing
\cite{kirkpatrick/etal/83} to find good values for the coefficients of
the hyperplane at each node of a tree. SADT first places a hyperplane
in a canonical location, and then iteratively perturbs all the
coefficients by small random amounts. Initially, when the temperature
parameter is high, SADT accepts almost any perturbation of the
hyperplane, regardless of how it changes the goodness score. However,
as the system ``cools down,'' only changes that improve the goodness
of the split are likely to be accepted. Though SADT's use of
randomization allows it to avoid some local minima effectively, it
compromises on efficiency: it runs much slower than CART-LC,
LMDT, or OC1, sometimes considering tens of thousands of hyperplanes
at a single node before it finishes annealing.
Our experiments in Section \ref{section:exp2} include some results
showing how all of these methods perform on three artificial domains.
We next describe a way to combine some of the strengths of the methods
just mentioned, while avoiding some of the problems. Our algorithm,
OC1, uses deterministic hill climbing most of the time, ensuring
computational efficiency. In addition, it uses two kinds of
randomization to avoid local minima. By limiting the number of random
choices, the algorithm is guaranteed to spend only polynomial time at
each node in the tree. In addition, randomization itself has produced
several benefits: for example, it means that the algorithm can produce
many different trees for the same data set. This offers the
possibility of a new family of classifiers: $k$-decision-tree
algorithms, in which an example is classified by the majority vote of
$k$ trees. Heath et al.~\citeyear{heath/etal/93b} have shown that
$k$-decision tree methods (which they call $k$-DT) will consistently
outperform single tree methods if classification accuracy is the main
criterion. Finally, our experiments indicate that OC1 efficiently
finds small, accurate decision trees for many different types of
classification problems.
\section{Oblique Classifier 1 (OC1)}
\label{section:oc1}
In this section we discuss details of the oblique decision tree
induction system OC1. As part of this description, we include:
\begin{itemize}
\itemsep -0.0in
\item the method for finding coefficients of a hyperplane at each
tree node,
\item methods for computing the impurity or goodness of a hyperplane,
\item a tree pruning strategy, and
\item methods for coping with missing and irrelevant attributes.
\end{itemize}
Section~\ref{section:perturb} focuses on the most complicated of
these algorithmic details, namely the question of how to find a
hyperplane that splits a given set of instances into two reasonably
``pure'' non-overlapping subsets. This randomized perturbation
algorithm is the main novel contribution of OC1\@.
Figure~\ref{figure:oc1-alg} summarizes the basic OC1 algorithm, used
at each node of a decision tree.
\begin{figure}
{\small
{\bf
\begin{tabbing}
abc \= abc \= abc \= \kill
To find a split of a set of examples $T$: \\
\> Find the best axis-parallel split of $T$. Let $I$ be the
impurity of this split. \\
\> Repeat $R$ times: \\
\> \> Choose a random hyperplane $H$. \\
\> \> (For the first iteration, initialize $H$ to be the best
axis-parallel split.) \\
\> \> Step 1: Until the impurity measure does not improve, do: \\
\> \> \> Perturb each of the coefficients of $H$ in sequence. \\
\> \> Step 2: Repeat at most $J$ times: \\
\> \> \> Choose a random direction and attempt to perturb $H$ in
that direction. \\
\> \> \> If this reduces the impurity of $H$, go to Step 1. \\
\> \> Let $I_1$ = the impurity of $H$. If $I_1 < I$, then set $I=I_1$. \\
\> Output the split corresponding to $I$.
\end{tabbing}
}
}
\vspace*{-.2in}
\caption{Overview of the OC1 algorithm for a single node of a decision tree.}
\label{figure:oc1-alg}
\end{figure}
This figure will be explained further in the following sections.
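
For readers who find code easier to follow than pseudocode, the fragment below
is our paraphrase of Figure~\ref{figure:oc1-alg} in Python. Every helper
routine is passed in by the caller (the axis-parallel search, the
deterministic coefficient perturbation of Section~\ref{section:perturb}, the
random jump of Section~\ref{section:rand}, and the impurity measure), so this
is an illustrative sketch rather than the distributed OC1 code.
{\small
\begin{verbatim}
def find_split(T, impurity, best_axis_parallel, random_hyperplane,
               perturb_coefficients, random_jump, R=20, J=5):
    """Per-node split search (a sketch of the figure above).  All helpers
    are caller-supplied; R is the number of restarts and J the number of
    random jumps attempted at each local minimum."""
    ap = best_axis_parallel(T)
    best_H, best_I = ap, impurity(ap, T)
    for r in range(R):
        # The first iteration starts from the best axis-parallel split;
        # every later restart begins at a random hyperplane.
        H = ap if r == 0 else random_hyperplane(T)
        while True:
            H = perturb_coefficients(H, T)  # Step 1: hill-climb to a local minimum
            H2 = random_jump(H, T, J)       # Step 2: up to J random-direction jumps
            if impurity(H2, T) < impurity(H, T):
                H = H2                      # escaped the local minimum; repeat Step 1
            else:
                break
        if impurity(H, T) < best_I:
            best_H, best_I = H, impurity(H, T)
    return best_H
\end{verbatim}
}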
\subsection{Perturbation algorithm}
\label{section:perturb}
OC1 imposes no restrictions on the orientation of the hyperplanes.
However, in order to be at least as powerful as standard DT methods,
it first finds the best axis-parallel (univariate) split at a node
before looking for an oblique split. OC1 uses an oblique split only
when it improves over the best axis-parallel split.\footnote{As
pointed out in \cite[Chapter 5]{breiman/etal/84}, it does not make
sense to use an oblique split when the number of examples at a node
$n$ is less than or almost equal to the number of features $d$,
because the data {\it underfits} the concept. By default, OC1 uses
only axis-parallel splits at tree nodes at which $n < 2d$. The user
can vary this threshold.}
The search strategy for the space of possible hyperplanes is defined
by the procedure that perturbs the current hyperplane $H$ to a new
location. Because there are an exponential number of distinct ways to
partition the examples with a hyperplane, any procedure that simply
enumerates all of them will be unreasonably costly. The two main
alternatives considered in the past have been simulated annealing,
used in the SADT system \cite{heath/etal/93}, and deterministic
heuristic search, as in CART-LC \cite{breiman/etal/84}. OC1 combines
these two ideas, using heuristic search until it finds a local
minimum, and then using a non-deterministic search step to get out of
the local minimum. (The non-deterministic step in OC1 is {\it not}
simulated annealing, however.)
We will start by explaining how we perturb a hyperplane to split the
training set $T$ at a node of the decision tree. Let $n$ be the
number of examples in $T$, $d$ be the number of attributes (or
dimensions) for each example, and $k$ be the number of categories.
Then we can write $T_j = (x_{j1},x_{j2},\ldots,x_{jd},C_j)$ for the
$j$th example from the training set $T$, where $x_{ji}$ is the value
of attribute $i$ and $C_j$ is the category label. As defined
in Eq.~\ref{equation:hyperplane}, the
equation of the current hyperplane $H$ at a node of the decision tree
is written as \( \sum_{i=1}^{d} (a_i x_i) + a_{d+1} = 0 \).
If we substitute a point (an example) $T_j$ into the equation for $H$,
we get \( \sum_{i=1}^{d} (a_i x_{ji}) + a_{d+1} = V_j\),
where the sign of $V_j$ tells us whether the point $T_j$ is above or
below the hyperplane $H$; i.e., if $V_j > 0$, then $T_j$ is above $H$.
If $H$ splits the training set $T$ perfectly, then all points
belonging to the same category will have the same sign for
$V_j$; i.e., sign($V_i$) = sign($V_j$) iff category($T_i$) =
category($T_j$).
OC1 adjusts the coefficients of $H$ individually, finding a locally
optimal value for one coefficient at a time. This key idea was
introduced by Breiman et al. It works as follows. Treat the
coefficient $a_m$ as a variable, and treat all other coefficients as
constants. Then $V_j$ can be viewed as a function of $a_m$.
In particular, the condition that $T_j$ is above $H$ is equivalent
to
\[ V_j > 0, \]
which, solving for $a_m$, can be rewritten as
\begin{equation}
\label{equation:uj}
a_m > \frac{a_m x_{jm} - V_j}{x_{jm}} \stackrel{\rm def}{=} U_j
\end{equation}
assuming that $x_{jm} > 0$, which we ensure by normalization. Using
this definition of $U_j$, the point $T_j$ is above $H$ if $a_m > U_j$,
and below otherwise. By plugging all the points from $T$ into this
equation, we will obtain $n$ constraints on the value of $a_m$.
The problem then is to find a value for $a_m$ that satisfies as many of
these constraints as possible. (If all the constraints are satisfied,
then we have a perfect split.) This problem is easy to solve optimally:
simply sort all the values $U_j$, and consider setting $a_m$ to
the midpoint between each pair of different values. This is illustrated in
Figure~\ref{figure:1Dsplit}.
\begin{figure}
\vspace{1.0in}
\special{psfile="1Dsplit.ps" hoffset=-20 voffset=-320}
\caption{Finding the optimal value for a single coefficient $a_m$. Large
U's correspond to examples in one category and small u's to another.}
\label{figure:1Dsplit}
\vspace*{-.2in}
\end{figure}
In the figure, the categories are indicated by font size; the larger
$U_i$'s belong to one category, and the smaller to another. For each
distinct placement of the coefficient $a_m$, OC1 computes the impurity
of the resulting split; e.g., for the location between $U_6$ and $U_7$
illustrated here, two examples on the left and one example on the
right would be misclassified (see Section~\ref{section:impmeasures}
for different ways of computing impurity). As the figure illustrates,
the problem is simply to find the best one-dimensional split of the
$U$s, which requires considering just $n-1$ values for $a_m$. The
value $a_m'$ obtained by solving this one-dimensional problem is
then considered as a replacement for $a_m$. Let $H_1$ be the
hyperplane obtained by ``perturbing'' $a_m$ to $a_m'$. If $H$ has
better (lower) impurity than $H_1$, then $H_1$ is discarded. If $H_1$
has lower impurity, $H_1$ becomes the new location of the hyperplane.
If $H$ and $H_1$ have identical impurities, then $H_1$ replaces $H$
with probability $P_{stag}$.\footnote{The parameter $P_{stag}$,
denoting ``stagnation probability'', is the probability that a
hyperplane is perturbed to a location that does not change the
impurity measure. To prevent the impurity from remaining stagnant for
a long time, $P_{stag}$ decreases by a constant amount each time OC1
makes a ``stagnant'' perturbation; thus only a constant number of
such perturbations will occur at each node. This constant can be set
by the user. $P_{stag}$ is reset to 1 every time the global impurity
measure is improved.} Figure~\ref{figure:perturb} contains pseudocode
for our perturbation procedure.
\begin{figure}
{\small
{\bf
\begin{tabbing}
\noindent Perturb(H,m) \\
\{ \= \kill
\> For \= $j = 1,\ldots,n$\\
\> \> Compute $U_j$ (Eq.~\ref{equation:uj})\\
\> Sort $U_1,\ldots,U_n$ in non-decreasing order.\\
\> $a_m'$ = best univariate split of the sorted $U_j$s.\\
\> $H_1$ = result of substituting $a_m'$ for $a_m$ in $H$.\\
\> If \((impurity(H_1) < impurity(H)) \)\\
\> \> \{ $a_m = a_m'$ ; $P_{move} = P_{stag}$ \} \\
\> Else if \((impurity(H) = impurity(H_1)) \) \\
\> \> \{ \= $a_m = a_m'$ \=with probability $P_{move}$\\
\> \> \> $P_{move} = P_{move} - 0.1 * P_{stag}$ \} \\
\end{tabbing}
}
}
\vspace*{-.3in}
\caption{Perturbation algorithm for a single coefficient $a_m$.}
\label{figure:perturb}
\vspace*{-.1in}
\end{figure}
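
To make the pseudocode in Figure~\ref{figure:perturb} concrete, the following
Python sketch (ours, not the distributed OC1 code) implements one call to
Perturb for a coefficient $a_m$, treating the constant term $a_{d+1}$ as the
coefficient of an implicit attribute fixed at 1 and relying on a
caller-supplied impurity function.
{\small
\begin{verbatim}
import random

def perturb(H, m, X, y, impurity, p_move, p_stag):
    """Perturb coefficient H[m], following the pseudocode above.  H holds
    a_1..a_{d+1} (H[d] is the constant term), X is an n x d list of
    attribute values (assumed positive after normalization), y the class
    labels, and impurity(H, X, y) is minimized.  Returns the (possibly
    updated) hyperplane and the new value of P_move."""
    n, d = len(X), len(H) - 1
    xm = [(X[j][m] if m < d else 1.0) for j in range(n)]  # x_jm, or 1 for a_{d+1}
    U = []
    for j in range(n):
        V = sum(H[i] * X[j][i] for i in range(d)) + H[d]  # V_j
        U.append((H[m] * xm[j] - V) / xm[j])              # U_j as defined above
    U = sorted(set(U))
    # Candidate settings for a_m: midpoints between consecutive distinct U_j's.
    candidates = [(u1 + u2) / 2.0 for u1, u2 in zip(U, U[1:])] or [U[0]]
    def with_am(v):
        H1 = list(H)
        H1[m] = v
        return H1
    H1 = min((with_am(v) for v in candidates), key=lambda h: impurity(h, X, y))
    iH, iH1 = impurity(H, X, y), impurity(H1, X, y)
    if iH1 < iH:
        return H1, p_stag                       # better split: accept, reset P_move
    if iH1 == iH:
        accepted = random.random() < p_move     # "stagnant" move with prob. P_move
        return (H1 if accepted else H), p_move - 0.1 * p_stag
    return H, p_move
\end{verbatim}
}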
Now that we have a method for locally improving a coefficient of a
hyperplane, we need to decide which of the $d+1$ coefficients to pick
for perturbation. We experimented with three different methods for
choosing which coefficient to adjust, namely sequential, best-first,
and random:
\begin{tabbing}
{\bf Seq}: \hspace*{0.1in} \= Rep\=eat until none of the coefficient values is
modified in the {\bf For} loop:\\
\> \>For $i=1$ to $d$, Perturb($H,i$)\\
{\bf Best}: \> Repeat until coefficient $m$ remains unmodified: \\
\> \>$m$ = \=coefficient which when perturbed, results in the \\
\> \> \>maximum improvement of the impurity measure.\\
\> \>Perturb($H,m$)\\
{\bf R-50}: \> Repeat a fixed number of times (50 in our experiments): \\
\> \>$m$ = random integer between 1 and $d+1$\\
\> \>Perturb($H,m$)\\
\end{tabbing}
Our previous experiments \cite{murthy/etal/93} indicated that the
order of perturbation of the coefficients does not affect the
classification accuracy as much as other parameters, especially the
randomization parameters (see below). Since none of these orders was
uniformly better than any other, we used sequential (Seq) perturbation
for all the experiments reported in Section~\ref{section:exp}.
\subsection{Randomization}
\label{section:rand}
The perturbation algorithm halts when the split reaches a local
minimum of the impurity measure. For OC1's search space, a local
minimum occurs when no perturbation of any single coefficient of the
current hyperplane will decrease the impurity measure. (Of course, a
local minimum may also be a global minimum.) We have implemented two
ways of attempting to escape local minima: perturbing the hyperplane
with a random vector, and re-starting the perturbation algorithm with
a different random initial hyperplane.
The technique of perturbing the hyperplane with a random vector works
as follows. When the system reaches a local minimum, it chooses a
random vector to add to the coefficients of the current hyperplane.
It then computes the optimal amount by which the hyperplane should be
perturbed along this random direction. To be more precise, when a
hyperplane \(H = \sum_{i=1}^{d} a_i x_i + a_{d+1} \) cannot be
improved by deterministic perturbation, OC1 repeats the following loop
$J$ times (where $J$ is a user-specified parameter, set to 5 by
default).
\begin{itemize}
\itemsep 0.0in
\item Choose a random vector $R = (r_1,r_2,\ldots,r_{d+1})$.
\item Let $\alpha$ be the amount by which we want to perturb $H$ in the
direction $R$. In other words, let
\(H_1 = \sum_{i=1}^{d} {(a_i + \alpha r_i) x_i} + (a_{d+1} +
\alpha r_{d+1}) \).
\item Find the optimal value for $\alpha$.
\item If the hyperplane $H_1$ thus obtained decreases the overall impurity,
replace $H$ with $H_1$, exit this loop and begin the deterministic
perturbation algorithm for the individual coefficients.
\end{itemize}
Note that we can treat $\alpha$ as the only variable in the equation
for $H_1$. Therefore each of the $n$ examples in $T$, if plugged into
the equation for $H_1$, imposes a constraint on the value of $\alpha$.
OC1 therefore can use its coefficient perturbation method (see
Section~\ref{section:perturb}) to compute the best value of $\alpha$.
If $J$ random jumps fail to improve the impurity, OC1 halts and uses
$H$ as the split for the current tree node.
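
Concretely, substituting example $T_j$ into the equation for $H_1$ gives
$V_j + \alpha W_j$, where $W_j = \sum_{i=1}^{d} r_i x_{ji} + r_{d+1}$, so each
example again contributes a single threshold on $\alpha$ and the same
one-dimensional search applies. The Python sketch below (ours, for
illustration only, with a caller-supplied impurity function) makes this
reduction explicit.
{\small
\begin{verbatim}
import random

def random_jump(H, X, y, impurity, J=5):
    """Try up to J random-direction perturbations of H (an illustrative
    sketch).  Returns a hyperplane of lower impurity if one is found,
    otherwise H unchanged."""
    n, d = len(X), len(H) - 1
    V = [sum(H[i] * X[j][i] for i in range(d)) + H[d] for j in range(n)]
    for _ in range(J):
        R = [random.uniform(-1.0, 1.0) for _ in range(d + 1)]   # random direction
        W = [sum(R[i] * X[j][i] for i in range(d)) + R[d] for j in range(n)]
        # Example j imposes V_j + alpha * W_j > 0, so each example with
        # W_j != 0 contributes the threshold alpha = -V_j / W_j.
        thresholds = sorted(set(-v / w for v, w in zip(V, W) if w != 0.0))
        if not thresholds:
            continue
        candidates = [(t1 + t2) / 2.0 for t1, t2 in zip(thresholds, thresholds[1:])]
        candidates += [thresholds[0] - 1.0, thresholds[-1] + 1.0]  # outside the range
        def shifted(alpha):
            return [H[i] + alpha * R[i] for i in range(d + 1)]
        best = min((shifted(a) for a in candidates), key=lambda h: impurity(h, X, y))
        if impurity(best, X, y) < impurity(H, X, y):
            return best              # success: caller resumes deterministic search
    return H
\end{verbatim}
}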
An intuitive way of understanding this random jump is to look at the
dual space in which the algorithm is actually searching. Note that
the equation \(H = \sum_{i=1}^{d} a_i x_i + a_{d+1} \) defines a space
in which the axes are the coefficients $a_i$ rather than the
attributes $x_i$. Every point in this space defines a distinct
hyperplane in the original formulation. The deterministic algorithm
used in OC1 picks a hyperplane and then adjusts coefficients one at a
time. Thus in the dual space, OC1 chooses a point and perturbs it by
moving it parallel to the axes. The random vector $R$ represents a
random {\it direction} in this space. By finding the best value for
$\alpha$, OC1 finds the best distance to adjust the hyperplane in the
direction of $R$.
Note that this additional perturbation in a random direction does not
significantly increase the time complexity of the algorithm (see
Appendix A). We found in our experiments that even a single random
jump, when used at a local minimum, proves to be very helpful.
Classification accuracy improved for every one of our data sets when
such perturbations were made. See Section \ref{section:exp2} for some
examples.
The second technique for avoiding local minima is a variation on the
idea of performing multiple local searches, a natural extension of
local search that has been
widely mentioned in the optimization literature (see Roth
\citeyear{roth/70} for an early example). Because most of the steps
of our perturbation algorithm are deterministic, the initial
hyperplane largely determines which local minimum will be encountered
first. Perturbing a single initial hyperplane is thus unlikely to
lead to the best split of a given data set. In cases where the random
perturbation method fails to escape from local minima, it may be
helpful to simply start afresh with a new initial hyperplane. We use
the word {\it restart} to denote one run of the perturbation
algorithms, at one node of the decision tree, using one random initial
hyperplane.\footnote{The first run through the algorithm at each node
always begins at the location of the best axis-parallel hyperplane;
all subsequent restarts begin at random locations.} That is, a
restart cycles through and perturbs the coefficients one at a time and
then tries to perturb the hyperplane in a random direction when the
algorithm reaches a local minimum. If this last perturbation reduces
the impurity, the algorithm goes back to perturbing the coefficients
one at a time. The restart ends when neither the deterministic local
search nor the random jump can find a better split. One of the
optional parameters to OC1 specifies how many restarts to use. If
more than one restart is used, then the best hyperplane found thus far
is always saved. In all our experiments, the classification
accuracies increased with more than one restart. Accuracy tended to
increase up to a point and then level off (after about 20--50
restarts, depending on the domain). Overall, the use of multiple
initial hyperplanes substantially improved the quality of the decision
trees found (see Section~\ref{section:exp2} for some examples).
By carefully combining hill-climbing and randomization, OC1 ensures a
worst case time of $O(dn^2 \log n)$ for inducing a decision tree. See
Appendix A for a derivation of this upper bound.
\paragraph{Best Axis-Parallel Split.}
It is clear that axis-parallel splits are more suitable for some data
distributions than oblique splits. To take into account such
distributions, OC1 computes the best axis-parallel split {\em and} an
oblique split at each node, and then picks the better of the
two.\footnote{Sometimes a simple axis-parallel split is preferable to
an oblique split, even if the oblique split has slightly lower
impurity. The user can specify such a bias as an input parameter to
OC1.} Calculating the best axis-parallel split takes an additional
$O(dn \log n)$ time, and so does not increase the asymptotic time
complexity of OC1. As a simple variant of the OC1 system, the user
can opt to ``switch off'' the oblique perturbations, thus building an
axis-parallel tree on the training data. Section~\ref{section:exp1}
empirically demonstrates that this axis-parallel variant of OC1
compares favorably with existing axis-parallel algorithms.
\subsection{Other Details}
\label{section:details}
\subsubsection{Impurity Measures}
\label{section:impmeasures}
OC1 attempts to divide the $d$-dimensional attribute space into
homogeneous regions; i.e., regions that contain examples from just one
category. The goal of adding new nodes to a tree is to split up the
sample space so as to minimize the ``impurity'' of the training set.
Some algorithms measure ``goodness'' instead of impurity, the
difference being that goodness values should be maximized while
impurity should be minimized. Many different measures of
impurity have been studied \cite{breiman/etal/84,quinlan/86,%
mingers/89a,buntine/niblett/92,fayyad/irani/92b,heath/etal/93}.
The OC1 system is designed to work with a large class of impurity
measures. Stated simply, if the impurity measure uses only the counts
of examples in each category on the two sides of a split, then
OC1 can use it. (See Murthy and Salzberg
\citeyear{murthy/salzberg/94} for ways of mapping other kinds of
impurity measures to this class of impurity measures.) The user can
plug in any impurity measure that fits this description. The OC1
implementation includes six impurity measures, namely:
\begin{enumerate}
\itemsep -0.1in
\item Information Gain
\item The Gini Index
\item The Twoing Rule
\item Max Minority
\item Sum Minority
\item Sum of Variances
\end{enumerate}
Though all six of the measures have been defined elsewhere in the
literature, in some cases we have made slight modifications that are
defined precisely in Appendix B\@. Our experiments
indicated that, on average, Information Gain, Gini Index and the
Twoing Rule perform better than the other three measures for both
axis-parallel and oblique trees. The Twoing Rule is the current
default impurity measure for OC1, and it was used in all of the
experiments reported in Section~\ref{section:exp}. There are,
however, artificial data sets for which Sum Minority and/or Max
Minority perform much better than the rest of the measures. For
instance, Sum Minority easily induces the exact tree for the POL data
set described in Section~\ref{section:artificialdata}, while all other
methods have difficulty finding the best tree.
\paragraph{Twoing Rule.}
The Twoing Rule was first proposed by Breiman et al.~\citeyear{breiman/etal/84}. The
value to be computed is defined as:
\[ \mbox{TwoingValue} = (|T_L|/n) * (|T_R|/n) *
(\sum_{i=1}^{k} {\left| L_i/|T_L| - R_i/|T_R| \right|})^2 \]
where $|T_L|$ ($|T_R|$) is the number of examples on the left (right)
of a split at node $T$, $n$ is the number of examples at node $T$, and
$L_i$ ($R_i$) is the number of examples in category $i$ on the left
(right) of the split. The TwoingValue is actually a goodness measure
rather than an impurity measure. Therefore OC1 attempts to minimize
the reciprocal of this value.
The remaining five impurity measures implemented in OC1 are defined in
Appendix B.
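
As an illustration, the Twoing Rule (and the reciprocal that OC1 minimizes)
can be computed directly from the per-category counts on the two sides of a
split; the short Python sketch below simply transcribes the formula above and
is not the OC1 implementation.
{\small
\begin{verbatim}
def twoing_value(L_counts, R_counts):
    """Twoing Rule goodness of a split.  L_counts[i] and R_counts[i] are the
    numbers of category-i examples on the left and right of the split."""
    TL, TR = sum(L_counts), sum(R_counts)
    n = TL + TR
    if TL == 0 or TR == 0:
        return 0.0
    s = sum(abs(Li / TL - Ri / TR) for Li, Ri in zip(L_counts, R_counts))
    return (TL / n) * (TR / n) * s * s

def twoing_impurity(L_counts, R_counts):
    """The quantity OC1 minimizes: the reciprocal of the Twoing value."""
    tv = twoing_value(L_counts, R_counts)
    return float('inf') if tv == 0.0 else 1.0 / tv
\end{verbatim}
}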
\subsubsection{Pruning}
\label{section:pruning}
Virtually all decision tree induction systems prune the trees they
create in order to avoid overfitting the data. Many studies have
found that judicious pruning results in both smaller and more accurate
classifiers, for decision trees as well as other types of machine
learning systems \cite{quinlan/87b,niblett/86,cestnik/etal/87,%
kodratoff/manago/87,cohen/93,hassibi/stork/93,wolpert/92,schaffer/93}.
For the OC1 system we implemented an existing pruning method, but note
that any tree pruning method can be used within OC1. Based on the
experimental evaluations of Mingers \citeyear{mingers/89b} and other
work cited above, we chose Breiman et al.'s Cost Complexity (CC)
pruning \citeyear{breiman/etal/84} as the default pruning method for OC1.
This method, which is also called Error Complexity or Weakest Link
pruning, requires a separate pruning set. The pruning set can be a
randomly chosen subset of the training set, or it can be approximated
using cross validation. OC1 randomly chooses 10\% (the default value)
of the training data to use for pruning. In the experiments reported
below, we only used this default value.
Briefly, the idea behind CC pruning is to create a set of trees of
decreasing size from the original, complete tree. All these trees are
used to classify the pruning set, and accuracy is estimated from that.
CC pruning then chooses the smallest tree whose accuracy is within $k$
standard errors of the best accuracy obtained. When the 0-SE
rule ($k=0$) is used, the tree with highest accuracy on the pruning
set is selected. When $k>0$, smaller tree size is preferred over
higher accuracy. For details of Cost Complexity pruning, see Breiman
et al.~\citeyear{breiman/etal/84} or Mingers~\citeyear{mingers/89b}.
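
As an illustration of the final selection step only, the following Python
sketch assumes the conventional reading of the $k$-SE rule, with the standard
error of the best accuracy estimated as that of a binomial proportion, and
assumes the nested sequence of pruned trees has already been produced and
evaluated on the pruning set; it is not the pruning code used in OC1.
{\small
\begin{verbatim}
import math

def select_pruned_tree(candidates, k=0.0):
    """candidates: list of (num_leaves, accuracy, n_pruning_examples) tuples,
    one per tree in the nested pruned sequence.  Returns the smallest tree
    whose pruning-set accuracy is within k standard errors of the best
    accuracy (with k = 0, the most accurate tree, smallest among ties)."""
    best_size, best_acc, n = max(candidates, key=lambda c: c[1])
    se = math.sqrt(best_acc * (1.0 - best_acc) / n) if n else 0.0
    eligible = [c for c in candidates if c[1] >= best_acc - k * se]
    return min(eligible, key=lambda c: c[0])     # smallest eligible tree
\end{verbatim}
}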
\subsubsection{Irrelevant attributes}
Irrelevant attributes pose a significant problem for most machine
learning methods
\cite{breiman/etal/84,aha/90,almuallin/dietterich/91,%
kira/rendell/92,salzberg/92,cardie/93,schlimmer/93,langley/sage/93,%
brodley/utgoff/94}. Decision tree algorithms, even axis-parallel
ones, can be confused by too many irrelevant attributes. Because
oblique decision trees learn the coefficients of each attribute at a
DT node, one might hope that the values chosen for each coefficient
would reflect the relative importance of the corresponding attributes.
Clearly, though, the process of searching for good coefficient values
will be much more efficient when there are fewer attributes; the
search space is much smaller. For this reason, oblique DT induction
methods can benefit substantially by using a feature selection method
(an algorithm that selects a subset of the original attribute set) in
conjunction with the coefficient learning algorithm
\cite{breiman/etal/84,brodley/utgoff/94}.
Currently, OC1 does not have a built-in mechanism to select relevant
attributes. However, it is easy to include any of several standard
methods (e.g., stepwise forward selection or stepwise backward
selection) or even an ad hoc method to select features before running
the tree-building process. For example, in separate experiments on
data from the Hubble Space Telescope \cite{salzberg/etal/94}, we used
feature selection methods as a preprocessing step to OC1, and reduced
the number of attributes from 20 to 2. The resulting decision trees
were both simpler and more accurate. Work is currently underway to
incorporate an efficient feature selection technique into the OC1
system.
Regarding missing values, if an example is missing a value for any
attribute, OC1 uses the mean value for that attribute. One can of
course use other techniques for handling missing values, but those
were not considered in this study.
\section{Experiments}
\label{section:exp}
In this section, we present two sets of experiments to support the
following two claims.
\begin{enumerate}
\item OC1 compares favorably with several existing axis-parallel and
oblique decision tree induction methods over a variety of real-world
domains.
\item Randomization, both in the form of multiple local searches and
random jumps, improves the quality of decision trees produced by OC1.
\end{enumerate}
The experimental method used for all the experiments is described
in Section \ref{section:expmethod}. Sections \ref{section:exp1} and
\ref{section:exp2} describe experiments corresponding to the
above two claims. Each experimental section begins with a description
of the data sets, and then presents the experimental results and
discussion.
\subsection{Experimental Method}
\label{section:expmethod}
We used five-fold cross validation (CV) in all our experiments to
estimate classification accuracy. A $k$-fold CV experiment consists
of the following steps (a short code sketch follows the list).
\begin{enumerate}
\itemsep -0.1in
\item Randomly divide the data into $k$ equal-sized disjoint partitions.
\item For each partition, build a decision tree using all data outside the
partition, and test the tree on the data in the partition.
\item Sum the number of correct classifications of the $k$ trees and divide
by the total number of instances to compute the classification accuracy.
Report this accuracy and the average size of the $k$ trees.
\end{enumerate}
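
The following Python sketch (ours, written only to make the protocol concrete)
carries out one such $k$-fold experiment, assuming caller-supplied
\verb|build_tree| and \verb|classify| routines and a hypothetical
\verb|num_leaves| attribute on the returned trees.
{\small
\begin{verbatim}
import random

def cross_validate(data, k, build_tree, classify, seed=0):
    """One k-fold CV experiment.  data is a list of (x, label) pairs;
    build_tree(train) returns a tree; classify(tree, x) returns a label.
    Returns (classification accuracy, average number of leaves)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]      # k disjoint partitions
    correct, leaves = 0, 0
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        tree = build_tree(train)                    # train outside the partition
        leaves += tree.num_leaves
        correct += sum(1 for x, label in test if classify(tree, x) == label)
    return correct / len(data), leaves / k
\end{verbatim}
}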
Each entry in Tables \ref{table:1} and \ref{table:2} is a result of
ten 5-fold CV experiments; i.e., the result of tests that used 50
decision trees. Each of the ten 5-fold cross validations used a
different random partitioning of the data. Each entry in the tables
reports the mean and standard deviation of the classification
accuracy, followed by the mean and standard deviation of the decision
tree size (measured as the number of leaf nodes). Good results should
have high values for accuracy, low values for tree size, and small
standard deviations.
In addition to OC1, we also included in the experiments an
axis-parallel version of OC1, which only considers axis-parallel
hyperplanes. We call this version, described in
Section~\ref{section:rand}, OC1-AP\@. In all our experiments, both
OC1 and OC1-AP used the Twoing Rule (Section \ref{section:impmeasures})
to measure impurity. Other parameters to OC1 took their default
values unless stated otherwise. (Defaults include the following:
number of restarts at each node: 20. Number of random jumps attempted
at each local minimum: 5. Order of coefficient perturbation:
Sequential. Pruning method: Cost Complexity with the 0-SE rule, using
10\% of the training set exclusively for pruning.)
In our comparison, we used the oblique version of the CART algorithm,
CART-LC\@. We implemented our own version of CART-LC, following the
description in Brei\-man et al.~\citeyear[Chapter 5]{breiman/etal/84};
however, there may be differences between our version and other
versions of this system (note that CART-LC is not freely available).
Our implementation of CART-LC measured impurity with the Twoing Rule
and used 0-SE Cost Complexity pruning with a separate test set, just as
OC1 does. We did not include any feature selection methods in CART-LC
or in OC1, and we did not implement normalization. Because the CART
coefficient perturbation algorithm may alternate indefinitely between
two locations of a hyperplane (see Section~\ref{section:oblique}), we
imposed an arbitrary limit of 100 such perturbations before forcing
the perturbation algorithm to halt.
We also included axis-parallel CART and C4.5 in our comparisons. We
used the implementations of these algorithms from the IND 2.1
package~\cite{buntine/92}. The default cart0 and c4.5 ``styles''
defined in the package were used, without altering any parameter
settings. The cart0 style uses the Twoing Rule and 0-SE cost
complexity pruning with 10-fold cross validation. The pruning method,
impurity measure and other defaults of the c4.5 style are the same as
those described in Quinlan \citeyear{quinlan/93}.
\subsection{OC1 vs. Other Decision Tree Induction Methods}
\label{section:exp1}
Table~\ref{table:1} compares the performance of OC1 to three
well-known decision tree induction methods plus OC1-AP on six
different real-world data sets. In the next section we will consider
artificial data, for which the concept definition can be precisely
characterized.
\subsubsection{Description of Data Sets}
\label{section:realdata}
\paragraph{Star/Galaxy Discrimination.}
Two of our data sets came from a large set of astronomical images
collected by Odewahn et al.~\citeyear{odewahn/etal/92}. In their study,
they used these images to train artificial neural networks running the
perceptron and back propagation algorithms. The goal was to
classify each example as either ``star'' or ``galaxy.'' Each image is