<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Smile - Classification</title>
<meta name="description" content="Statistical Machine Intelligence and Learning Engine">
<!-- prettify js and CSS -->
<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js?lang=scala&lang=kotlin&lang=clj"></script>
<style>
.prettyprint ol.linenums > li { list-style-type: decimal; }
</style>
<!-- Bootstrap core CSS -->
<link href="css/cerulean.min.css" rel="stylesheet">
<link href="css/custom.css" rel="stylesheet">
<script src="https://code.jquery.com/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js"></script>
<!-- slider -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/owl-carousel/1.3.3/owl.carousel.min.js"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/owl-carousel/1.3.3/owl.carousel.css" type="text/css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/owl-carousel/1.3.3/owl.transitions.css" type="text/css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/owl-carousel/1.3.3/owl.theme.min.css" type="text/css" />
<!-- table of contents auto generator -->
<script src="js/toc.js" type="text/javascript"></script>
<!-- styles for pager and table of contents -->
<link rel="stylesheet" href="css/pager.css" type="text/css" />
<link rel="stylesheet" href="css/toc.css" type="text/css" />
<!-- Vega-Lite Embed -->
<script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
<script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
<script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
<!-- Google tag (gtag.js) -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-57GD08QCML"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-57GD08QCML');
</script>
<!-- Sidebar and testimonial-slider -->
<script type="text/javascript">
$(document).ready(function(){
// scroll/follow sidebar
// #sidebar is defined in the content snippet
// This script has to be executed after the snippet loaded.
// $.getScript("js/follow-sidebar.js");
$("#testimonial-slider").owlCarousel({
items: 1,
singleItem: true,
pagination: true,
navigation: false,
loop: true,
autoPlay: 10000,
stopOnHover: true,
transitionStyle: "backSlide",
touchDrag: true
});
});
</script>
</head>
<body>
<div class="container" style="max-width: 1200px;">
<header>
<div class="masthead">
<p class="lead">
<a href="index.html">
<img src="images/smile.jpg" style="height:100px; width:auto; vertical-align: bottom; margin-top: 20px; margin-right: 20px;">
<span class="tagline">Smile — Statistical Machine Intelligence and Learning Engine</span>
</a>
</p>
</div>
<nav class="navbar navbar-default" role="navigation">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="navbar-collapse">
<ul class="nav navbar-nav">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Overview <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="quickstart.html">Quick Start</a></li>
<li><a href="overview.html">What's Machine Learning</a></li>
<li><a href="data.html">Data Processing</a></li>
<li><a href="visualization.html">Data Visualization</a></li>
<li><a href="vegalite.html">Declarative Visualization</a></li>
<li><a href="gallery.html">Gallery</a></li>
<li><a href="faq.html">FAQ</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Supervised Learning <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="classification.html">Classification</a></li>
<li><a href="regression.html">Regression</a></li>
<li><a href="deep-learning.html">Deep Learning</a></li>
<li><a href="feature.html">Feature Engineering</a></li>
<li><a href="validation.html">Model Validation</a></li>
<li><a href="missing-value-imputation.html">Missing Value Imputation</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Unsupervised Learning <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="clustering.html">Clustering</a></li>
<li><a href="vector-quantization.html">Vector Quantization</a></li>
<li><a href="association-rule.html">Association Rule Mining</a></li>
<li><a href="mds.html">Multi-Dimensional Scaling</a></li>
<li><a href="manifold.html">Manifold Learning</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">LLM & NLP <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="llm.html">Large Language Model (LLM)</a></li>
<li><a href="nlp.html">Natural Language Processing (NLP)</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Math <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="linear-algebra.html">Linear Algebra</a></li>
<li><a href="statistics.html">Statistics</a></li>
<li><a href="wavelet.html">Wavelet</a></li>
<li><a href="interpolation.html">Interpolation</a></li>
<li><a href="graph.html">Graph Data Structure</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">API <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="api/java/index.html" target="_blank">Java</a></li>
<li><a href="api/scala/index.html" target="_blank">Scala</a></li>
<li><a href="api/kotlin/index.html" target="_blank">Kotlin</a></li>
<li><a href="api/clojure/index.html" target="_blank">Clojure</a></li>
<li><a href="api/json/index.html" target="_blank">JSON</a></li>
</ul>
</li>
<li><a href="https://mybinder.org/v2/gh/haifengl/smile/notebook?urlpath=lab%2Ftree%2Fshell%2Fsrc%2Funiversal%2Fnotebooks%2Findex.ipynb" target="_blank">Try It Online</a></li>
</ul>
</div>
<!-- /.navbar-collapse -->
</nav>
</header>
<div id="content" class="row">
<div class="col-md-3 col-md-push-9 hidden-xs hidden-sm">
<div id="sidebar">
<div class="sidebar-toc" style="margin-bottom: 20px;">
<p class="toc-header">Contents</p>
<div id="toc"></div>
</div>
<div id="search">
<script>
(function() {
var cx = '010264411143030149390:ajvee_ckdzs';
var gcse = document.createElement('script');
gcse.type = 'text/javascript';
gcse.async = true;
gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
'//cse.google.com/cse.js?cx=' + cx;
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(gcse, s);
})();
</script>
<gcse:searchbox-only></gcse:searchbox-only>
</div>
</div>
</div>
<div class="col-md-9 col-md-pull-3">
<h1 id="classification-top" class="title">Classification</h1>
<p>Smile's classification algorithms are in the package
<a href="api/java/smile/classification/package-summary.html"><code>smile.classification</code></a>
and all algorithms implement the interface
<a href="api/java/smile/classification/Classifier.html"><code>Classifier</code></a>
that has a method <code>predict</code> to predict the class label of an instance.
An overloaded version in <a href="api/java/smile/classification/SoftClassifier.html"><code>SoftClassifier</code></a>
can also calculate the <i>a posteriori</i>
probabilities besides the class label.</p>
<p>Some algorithms with online learning capability also implement the interface
<a href="api/java/smile/classification/OnlineClassifier.html"><code>OnlineClassifier</code></a>.
Online learning is a model of induction that learns one instance at a time.
The method <code>update</code> updates the model with a new instance.</p>
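<p>As a minimal sketch of these interfaces in Java (assuming the Iris data used in the
examples below, and a k-NN model, which implements <code>SoftClassifier</code>):</p>
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
var iris = Read.arff("data/weka/iris.arff");
var x = iris.drop("class").toArray();
var y = iris.column("class").toIntArray();

// Classifier.predict returns the class label of an instance.
var model = KNN.fit(x, y, 3);
int label = model.predict(x[0]);

// The overloaded SoftClassifier.predict also fills in the posteriori probabilities.
double[] posteriori = new double[3];   // Iris has 3 classes
int soft = model.predict(x[0], posteriori);
</code></pre>
</div>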
<p>The high-level operators are defined in Scala package object of
<a href="api/scala/index.html#smile.classification.package"><code>smile.classification</code></a>.
In what follows, we will discuss each algorithm, their high-level
Scala API, and examples.</p>
<h2 id="introduction">Introduction</h2>
<p>In machine learning and pattern recognition,
classification refers to an algorithmic procedure for assigning a given
input object into one of a given number of categories. The input
object is formally termed an instance, and the categories are termed classes.</p>
<p>The instance is usually described by a vector of features, which together
constitute a description of all known characteristics of the instance.
Typically, features are either categorical (also known as nominal, i.e.
consisting of one of a set of unordered items, such as a gender of "male"
or "female", or a blood type of "A", "B", "AB" or "O"), ordinal (consisting
of one of a set of ordered items, e.g. "large", "medium" or "small"),
integer-valued (e.g. a count of the number of occurrences of a particular
word in an email) or real-valued (e.g. a measurement of blood pressure).</p>
<p>Classification normally refers to a supervised procedure, i.e. a procedure
that produces an inferred function to predict the output value of new
instances based on a training set of pairs consisting of an input object
and a desired output value. The inferred function is called a classifier
if the output is discrete or a regression function if the output is
continuous.</p>
<p>The inferred function should predict the correct output value for any valid
input object. This requires the learning algorithm to generalize from the
training data to unseen situations in a "reasonable" way.</p>
<p>A wide range of supervised learning algorithms is available, each with
its strengths and weaknesses. There is no single learning algorithm that
works best on all supervised learning problems. The most widely used
learning algorithms are AdaBoost and gradient boosting, support vector
machines, linear regression, linear discriminant analysis, logistic
regression, naive Bayes, decision trees, k-nearest neighbor algorithm,
and neural networks (multilayer perceptron).</p>
<p>If the feature vectors include features of many different kinds (discrete,
discrete ordered, counts, continuous values), some algorithms cannot be
easily applied. Many algorithms, including linear regression, logistic
regression, neural networks, and nearest neighbor methods, require that
the input features be numerical and scaled to similar ranges (e.g., to
the [-1,1] interval). Methods that employ a distance function, such as
nearest neighbor methods and support vector machines with Gaussian kernels,
are particularly sensitive to this. An advantage of decision trees (and
boosting algorithms based on decision trees) is that they easily handle
heterogeneous data.</p>
<p>If the input features contain redundant information (e.g., highly correlated
features), some learning algorithms (e.g., linear regression, logistic
regression, and distance based methods) will perform poorly because of
numerical instabilities. These problems can often be solved by imposing
some form of regularization.</p>
<p>If each of the features makes an independent contribution to the output,
then algorithms based on linear functions (e.g., linear regression,
logistic regression, linear support vector machines, naive Bayes) generally
perform well. However, if there are complex interactions among features,
then algorithms such as nonlinear support vector machines, decision trees
and neural networks work better. Linear methods can also be applied, but
the engineer must manually specify the interactions when using them.</p>
<p>There are several major issues to consider in supervised learning:</p>
<dl>
<dt>Features</dt>
<dd><p>The accuracy of the inferred function depends strongly on how the input
object is represented. Typically, the input object is transformed into
a feature vector, which contains a number of features that are descriptive
of the object. The number of features should not be too large, because of
the curse of dimensionality; but should contain enough information to
accurately predict the output.</p>
<p>There are many algorithms for feature selection that seek to identify
the relevant features and discard the irrelevant ones. More generally,
dimensionality reduction may seek to map the input data into a lower
dimensional space prior to running the supervised learning algorithm.</p>
</dd>
<dt>Overfitting</dt>
<dd><p>Overfitting occurs when a statistical model describes random error
or noise instead of the underlying relationship. Overfitting generally
occurs when a model is excessively complex, such as having too many
parameters relative to the number of observations. A model which has
been overfit will generally have poor predictive performance, as it can
exaggerate minor fluctuations in the data.</p>
<p>The potential for overfitting depends not only on the number of parameters
and data but also the conformability of the model structure with the data
shape, and the magnitude of model error compared to the expected level
of noise or error in the data.</p>
<p>In order to avoid overfitting, it is necessary to use additional techniques
(e.g. cross-validation, regularization, early stopping, pruning, Bayesian
priors on parameters or model comparison), that can indicate when further
training is not resulting in better generalization. The basis of some
techniques is either (1) to explicitly penalize overly complex models,
or (2) to test the model's ability to generalize by evaluating its
performance on a set of data not used for training, which is assumed to
approximate the typical unseen data that a model will encounter.</p>
</dd>
<dt>Regularization</dt>
<dd><p>Regularization involves introducing additional information in order
to solve an ill-posed problem or to prevent over-fitting. This information
is usually of the form of a penalty for complexity, such as restrictions
for smoothness or bounds on the vector space norm.
A theoretical justification for regularization is that it attempts to impose
Occam's razor on the solution. From a Bayesian point of view, many
regularization techniques correspond to imposing certain prior distributions
on model parameters.</p>
</dd>
<dt>Bias-variance tradeoff</dt>
<dd><p>Mean squared error (MSE) can be broken down into two components:
variance and squared bias, known as the bias-variance decomposition.
Thus, in order to minimize the MSE, we need to minimize both the bias and
the variance. However, this is not trivial. Therefore, there is a tradeoff
between bias and variance.</p>
</dd>
</dl>
<h2 id="knn">K-Nearest Neighbor</h2>
<p>The k-nearest neighbor algorithm (k-NN) is
a method for classifying an object by a majority vote of its neighbors,
with the object being assigned to the class most common amongst its k
nearest neighbors (k is typically small).
k-NN is a type of instance-based learning, or lazy learning where the
function is only approximated locally and all computation
is deferred until classification.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_1" data-toggle="tab">Java</a></li>
<li><a href="#scala_1" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_1" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_1">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def knn(x: Array[Array[Double]], y: Array[Int], k: Int): KNN[Array[Double]]
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_1">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
public class KNN {
public static KNN<double[]> fit(double[][] x, int[] y, int k);
}
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_1">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
fun knn(x: Array<DoubleArray>, y: IntArray, k: Int): KNN<DoubleArray>
</code></pre>
</div>
</div>
</div>
<p>The simplest k-NN method takes a data set of feature vectors and labels
with Euclidean distance as the similarity measure. When applied to the
Iris dataset, the accuracy of 10-fold cross validation is 96% for <code>k = 3</code>.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_2" data-toggle="tab">Java</a></li>
<li><a href="#scala_2" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_2" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_2">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
val iris = read.arff("data/weka/iris.arff")
val x = iris.drop("class").toArray()
val y = iris("class").toIntArray()
cv.classification(10, x, y) { case (x, y) => knn(x, y, 3) }
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_2">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
var iris = Read.arff("data/weka/iris.arff");
var x = iris.drop("class").toArray();
var y = iris.column("class").toIntArray();
CrossValidation.classification(10, x, y, (x, y) -> KNN.fit(x, y, 3));
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_2">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
val iris = read.arff("data/weka/iris.arff")
val x = iris.drop("class").toArray()
val y = iris.column("class").toIntArray()
CrossValidation.classification(10, x, y, {x, y -> KNN.fit(x, y, 3)})
</code></pre>
</div>
</div>
</div>
<p>The best choice of k depends upon the data; generally, larger values of
k reduce the effect of noise on the classification, but make boundaries
between classes less distinct. A good k can be selected by various
heuristic techniques, e.g. cross-validation. In binary problems, it is
helpful to choose k to be an odd number as this avoids tied votes.</p>
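<p>As a rough sketch, one can scan a few candidate values of <code>k</code> and
compare their 10-fold cross validation results (Java; assumes the Iris
<code>x</code> and <code>y</code> from the example above):</p>
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
for (int k : new int[] {1, 3, 5, 7, 9}) {
    // Print the validation summary for each k and pick the best performer.
    var result = CrossValidation.classification(10, x, y, (xi, yi) -> KNN.fit(xi, yi, k));
    System.out.println("k = " + k + ": " + result);
}
</code></pre>
</div>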
<p>The nearest neighbor algorithm has some strong consistency results. As
the amount of data approaches infinity, the algorithm is guaranteed to
yield an error rate no worse than twice the Bayes error rate (the minimum
achievable error rate given the distribution of the data). k-NN is
guaranteed to approach the Bayes error rate, for some value of k (where k
increases as a function of the number of data points).</p>
<p>The user can also provide a customized distance function.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_3" data-toggle="tab">Java</a></li>
<li><a href="#scala_3" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_3" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_3">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def knn[T](x: Array[T], y: Array[Int], k: Int, distance: Distance[T]): KNN[T]
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_3">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
public class KNN {
    public static <T> KNN<T> fit(T[] x, int[] y, int k, Distance<T> distance);
}
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_3">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
fun <T> knn(x: Array<T>, y: IntArray, k: Int, distance: Distance<T>): KNN<T>
</code></pre>
</div>
</div>
</div>
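<p>For instance, a sketch using the Manhattan (L<sub>1</sub>) distance from
<code>smile.math.distance</code> instead of the default Euclidean distance
(Java; assumes the same <code>x</code> and <code>y</code> as above):</p>
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
import smile.math.distance.ManhattanDistance;

var model = KNN.fit(x, y, 3, new ManhattanDistance());
int label = model.predict(x[0]);
</code></pre>
</div>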
<p>Often, the classification accuracy of k-NN can be improved significantly
if the distance metric is learned with specialized algorithms such as
Large Margin Nearest Neighbor or Neighborhood Components Analysis.</p>
<p>Alternatively, the user may provide a k-nearest neighbor search data
structure. Besides the simple linear search, Smile provides KD-Tree,
Cover Tree, and LSH (Locality-Sensitive Hashing) for efficient
k-nearest neighbor search.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_4" data-toggle="tab">Java</a></li>
<li><a href="#scala_4" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_4" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_4">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def knn[T](x: KNNSearch[T, T], y: Array[Int], k: Int): KNN[T]
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_4">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
public class KNN<T> {
public KNN(KNNSearch<T, T> knn, int[] y, int k);
}
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_4">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
fun <T> knn(x: KNNSearch<T, T>, y: IntArray, k: Int): KNN<T>
</code></pre>
</div>
</div>
</div>
<p>A KD-tree (short for k-dimensional tree) is a space-partitioning data
structure for organizing points in a k-dimensional space. Cover tree
is a data structure for generic nearest neighbor search (with a metric),
which is especially efficient in spaces with small intrinsic dimension.
The cover tree has a theoretical bound that is based on the dataset's
doubling constant. LSH is an efficient algorithm for approximate nearest
neighbor search in high dimensional spaces by performing probabilistic
dimension reduction of data.</p>
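<p>As a rough sketch, a KD-tree can be built over the training data and passed to
the constructor above (Java; the exact <code>KDTree</code> constructor or factory
may differ between Smile versions, so treat this as an assumption):</p>
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
import smile.neighbor.KDTree;

// Pair each point with itself as key and data, then do 3-NN classification
// against the tree instead of a linear scan.
var tree = new KDTree<>(x, x);
var model = new KNN<>(tree, y, 3);
</code></pre>
</div>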
<p>Nearest neighbor rules in effect compute the decision boundary in an
implicit manner. In the following example, we show the implicit decision
boundary of k-NN on a 2-dimensional toy dataset. Please try different <code>k</code>,
and you will observe how the decision boundary changes. In general, the larger
<code>k</code>, the smoother the boundary.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_5" data-toggle="tab">Java</a></li>
<li><a href="#scala_5" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_5" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_5">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
val toy = read.csv("data/classification/toy200.txt", delimiter="\t", header=false)
val x = toy.select(1, 2).toArray()
val y = toy.column(0).toIntArray()
knn(x, y, 3)
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_5">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
var toy = Read.csv("data/classification/toy200.txt", CSVFormat.DEFAULT.withDelimiter('\t'));
var x = toy.select(1, 2).toArray();
var y = toy.column(0).toIntArray();
KNN.fit(x, y, 3);
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_5">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
import smile.*;
import smile.classification.*;
val toy = read.csv("data/classification/toy200.txt", delimiter='\t', header=false)
val x = toy.select(1, 2).toArray()
val y = toy.column(0).toIntArray()
knn(x, y, 3)
</code></pre>
</div>
</div>
</div>
<div style="width: 100%; display: inline-block; text-align: center;">
<img src="images/knn.png" class="enlarge" style="width: 480px;" />
</div>
<p>In what follows, we will discuss the classification
methods that compute the decision boundary explicitly.
The computational complexity in the prediction phase
is therefore a function of the boundary complexity.
We will start with the simplest linear decision functions.</p>
<h2 id="lda">Linear Discriminant Analysis</h2>
<p>Linear discriminant analysis (LDA) is based on the Bayes decision theory
and assumes that the conditional probability density functions are normally
distributed. LDA also makes the simplifying homoscedastic assumption (i.e.
that the class covariances are identical) and that the covariances have full
rank. With these assumptions, the discriminant function of an input being
in a class is purely a function of a linear combination of the independent
variables.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_6" data-toggle="tab">Java</a></li>
<li><a href="#scala_6" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_6" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_6">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def lda(x: Array[Array[Double]], y: Array[Int], priori: Array[Double] = null, tol: Double = 0.0001): LDA
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_6">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
public class LDA {
public static LDA fit(double[][] x, int[] y, double[] priori, double tol);
}
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_6">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
fun lda(x: Array<DoubleArray>, y: IntArray, priori: DoubleArray? = null, tol: Double = 0.0001): LDA
</code></pre>
</div>
</div>
</div>
<p>The parameter <code>priori</code> is the a priori probability of each class. If null, it will be
estimated from the training data. The parameter <code>tol</code> is a tolerance used to decide
if a covariance matrix is singular; the function will reject variables whose variance is
less than tol<sup>2</sup>.</p>
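<p>For example, explicit uniform priors can be passed as follows (a sketch in Java,
assuming the three-class Iris data from earlier):</p>
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
// Equal priors for the three Iris classes and the default tolerance.
var model = LDA.fit(x, y, new double[] {1.0/3, 1.0/3, 1.0/3}, 1E-4);
</code></pre>
</div>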
<ul class="nav nav-tabs">
<li class="active"><a href="#java_7" data-toggle="tab">Java</a></li>
<li><a href="#scala_7" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_7" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_7">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
lda(x, y)
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_7">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
LDA.fit(x, y);
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_7">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
lda(x, y)
</code></pre>
</div>
</div>
</div>
<div style="width: 100%; display: inline-block; text-align: center;">
<img src="images/lda.png" class="enlarge" style="width: 480px;" />
</div>
<p>As shown in the plot of toy data, the decision boundary of LDA is linear. Note that
the decision boundary in the plot is the estimated contour on the testing grid (small dots)
rather than the exact decision function. Therefore, there are artifacts caused by the
discretized grid.</p>
<p>LDA is closely related to ANOVA (analysis of variance) and linear regression
analysis, which also attempt to express one dependent variable as a
linear combination of other features or measurements. In the other two
methods, however, the dependent variable is a numerical quantity, while
for LDA it is a categorical variable (i.e. the class label). Logistic
regression and probit regression are more similar to LDA, as they also
explain a categorical variable. These other methods are preferable in
applications where it is not reasonable to assume that the independent
variables are normally distributed, which is a fundamental assumption
of the LDA method.</p>
<p>One complication in applying LDA (and Fisher's discriminant) to real data
occurs when the number of variables/features exceeds
the number of samples. In this case, the covariance estimates do not have
full rank, and so cannot be inverted. This is known as the small sample size
problem.</p>
<h2 id="fld">Fisher's Linear Discriminant</h2>
<p>Fisher's linear discriminant (FLD) is another popular linear classifier.
Fisher defined the separation between two
distributions to be the ratio of the variance between the classes to
the variance within the classes, which is, in some sense, a measure
of the signal-to-noise ratio for the class labeling. FLD finds a linear
combination of features which maximizes the separation after the projection.
The resulting combination may be used for dimensionality reduction
before later classification.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_8" data-toggle="tab">Java</a></li>
<li><a href="#scala_8" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_8" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_8">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def fisher(x: Array[Array[Double]], y: Array[Int], L: Int = -1, tol: Double = 0.0001): FLD
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_8">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
public class FLD {
public static FLD fit(double[][] x, int[] y, int L, double tol);
}
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_8">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
fun fisher(x: Array<DoubleArray>, y: IntArray, L: Int = -1, tol: Double = 0.0001): FLD
</code></pre>
</div>
</div>
</div>
<p>The parameter <code>L</code> is the dimensionality of the mapped space.
The default value is the number of classes - 1.</p>
<p>In FLD, the features are mapped to the subspace spanned by the eigenvectors
corresponding to the <code>C − 1</code> largest eigenvalues of
<code>Σ<sup>-1</sup>Σ<sub>b</sub></code>,
where <code>Σ</code> is the covariance matrix,
<code>Σ<sub>b</sub></code> is the scatter matrix between class means,
and <code>C</code> is the number of classes.</p>
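<p>Because of this mapping, FLD is often used as a supervised dimensionality reduction
step. A brief sketch (Java; assumes the fitted model's <code>project</code> method and
the Iris data from earlier):</p>
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
var fld = FLD.fit(x, y, 2, 1E-4);
// Project the 4-dimensional Iris features onto the 2-dimensional discriminant subspace.
double[][] projected = fld.project(x);
</code></pre>
</div>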
<p>The terms Fisher's linear discriminant and LDA are often used
interchangeably, although FLD actually describes a slightly different
discriminant, which does not make some of the assumptions of LDA such
as normally distributed classes or equal class covariances.
When the assumptions of LDA are satisfied, FLD is equivalent to LDA.</p>
<ul class="nav nav-tabs">
<li><a href="#java_9" data-toggle="tab">Java</a></li>
<li><a href="#scala_9" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_9" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_9">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
fisher(x, y)
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_9">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
FLD.fit(x, y);
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_9">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
fisher(x, y)
</code></pre>
</div>
</div>
</div>
<div style="width: 100%; display: inline-block; text-align: center;">
<img src="images/fld.png" class="enlarge" style="width: 480px;" />
</div>
<p>FLD is also closely related to principal component analysis (PCA), which also
looks for linear combinations of variables which best explain the data.
As a supervised method, FLD explicitly attempts to model the
difference between the classes of data. On the other hand, PCA is an
unsupervised method and does not take into account any difference in class.</p>
<p>One complication in applying FLD (and LDA) to real data
occurs when the number of variables/features exceeds
the number of samples. In this case, the covariance estimates do not have
full rank, and so cannot be inverted. This is known as the small sample size
problem.</p>
<h2 id="qda">Quadratic Discriminant analysis</h2>
<p>Quadratic discriminant analysis (QDA) is closely related to LDA.
Like LDA, QDA models the conditional probability density
functions as a Gaussian distribution, then uses the posterior distributions
to estimate the class for a given test data.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_10" data-toggle="tab">Java</a></li>
<li><a href="#scala_10" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_10" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_10">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def qda(x: Array[Array[Double]], y: Array[Int], priori: Array[Double] = null, tol: Double = 0.0001): QDA
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_10">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
public class QDA {
public static QDA fit(double[][] x, int[] y, double[] priori, double tol);
}
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_10">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
fun qda(x: Array<DoubleArray>, y: IntArray, priori: DoubleArray? = null, tol: Double = 0.0001): QDA
</code></pre>
</div>
</div>
</div>
<p>Unlike LDA, however, in QDA there is no assumption that the covariance of each of the classes
is identical. Therefore, the resulting separating surface between
the classes is quadratic.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_11" data-toggle="tab">Java</a></li>
<li><a href="#scala_11" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_11" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_11">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
qda(x, y)
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_11">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
QDA.fit(x, y);
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_11">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
qda(x, y)
</code></pre>
</div>
</div>
</div>
<div style="width: 100%; display: inline-block; text-align: center;">
<img src="images/qda.png" class="enlarge" style="width: 480px;" />
</div>
<p>The Gaussian parameters for each class can be estimated from training data
with maximum likelihood (ML) estimation. However, when the number of
training instances is small compared to the dimension of input space,
the ML covariance estimation can be ill-posed. One approach to resolve
the ill-posed estimation is to regularize the covariance estimation
as in regularized discriminant analysis.</p>
<h2 id="rda">Regularized Discriminant Analysis</h2>
<p>In LDA and FLD, the eigenvectors corresponding to the smaller eigenvalues will
tend to be very sensitive to the exact choice of training data, and it is often
necessary to use regularization. In the regularized discriminant analysis (RDA),
the regularized covariance matrix of each class is
Σ<sub>k</sub>(α) = α Σ<sub>k</sub> + (1 - α) Σ.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_12" data-toggle="tab">Java</a></li>
<li><a href="#scala_12" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_12" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_12">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def rda(x: Array[Array[Double]], y: Array[Int], alpha: Double, priori: Array[Double] = null, tol: Double = 0.0001): RDA
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_12">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
public class RDA {
public static RDA fit(double[][] x, int[] y, double alpha, double[] priori, double tol);
}
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_12">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
fun rda(x: Array<DoubleArray>, y: IntArray, alpha: Double, priori: DoubleArray? = null, tol: Double = 0.0001): RDA
</code></pre>
</div>
</div>
</div>
<p>where the parameter <code>alpha</code> is the regularization factor in <code>[0, 1]</code>.</p>
<p>RDA is a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common
covariance as in LDA. This method is very similar in flavor to ridge regression.
The quadratic discriminant function is defined using the shrunken covariance
matrices Σ<sub>k</sub>(α). The parameter α in [0, 1]
controls the complexity of the model. When α is one, RDA becomes QDA.
When α is zero, RDA is equivalent to LDA. Therefore, the
regularization factor α allows a continuum of models between LDA and QDA.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_13" data-toggle="tab">Java</a></li>
<li><a href="#scala_13" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_13" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_13">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
rda(x, y, 0.1)
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_13">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
RDA.fit(x, y, 0.1);
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_13">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
rda(x, y, 0.1)
</code></pre>
</div>
</div>
</div>
<div style="width: 100%; display: inline-block; text-align: center;">
<img src="images/rda.png" class="enlarge" style="width: 480px;" />
</div>
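<p>Since the best α is data dependent, one reasonable approach is to scan a grid of
values with cross validation (a sketch in Java, assuming the Iris <code>x</code> and
<code>y</code> from earlier):</p>
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
for (double alpha : new double[] {0.0, 0.25, 0.5, 0.75, 1.0}) {
    // null priors are estimated from the data; 1E-4 is the default tolerance.
    var result = CrossValidation.classification(10, x, y,
        (xi, yi) -> RDA.fit(xi, yi, alpha, null, 1E-4));
    System.out.println("alpha = " + alpha + ": " + result);
}
</code></pre>
</div>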
<h2 id="logit">Logistic Regression</h2>
<p>Logistic regression (logit model) is a generalized
linear model used for binomial regression. Logistic regression applies
maximum likelihood estimation after transforming the dependent variable into
a logit variable. A logit is the natural log of the odds of the dependent variable
equaling a certain value or not (usually 1 in binary logistic models,
the highest value in multinomial models). In this way, logistic regression
estimates the odds of a certain event (value) occurring.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_14" data-toggle="tab">Java</a></li>
<li><a href="#scala_14" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_14" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_14">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
def logit(x: Array[Array[Double]],
y: Array[Int],
lambda: Double = 0.0,
tol: Double = 1E-5,
maxIter: Int = 500): LogisticRegression
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_14">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
public class LogisticRegression {
public static LogisticRegression fit(double[][] x, int[] y, Options options);
}
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_14">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
fun logit(x: Array<DoubleArray>,
y: IntArray,
lambda: Double = 0.0,
tol: Double = 1E-5,
maxIter: Int = 500): LogisticRegression
</code></pre>
</div>
</div>
</div>
<p>where the parameter <code>lambda (> 0)</code> gives a "regularized" estimate of linear
weights which often has superior generalization performance, especially
when the dimensionality is high.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_15" data-toggle="tab">Java</a></li>
<li><a href="#scala_15" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_15" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_15">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
logit(x, y)
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_15">
<div class="code" style="text-align: left;">