<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Smile - NLP</title>
<meta name="description" content="Statistical Machine Intelligence and Learning Engine">
<!-- prettify js and CSS -->
<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js?lang=scala&lang=kotlin&lang=clj"></script>
<style>
.prettyprint ol.linenums > li { list-style-type: decimal; }
</style>
<!-- Bootstrap core CSS -->
<link href="css/cerulean.min.css" rel="stylesheet">
<link href="css/custom.css" rel="stylesheet">
<script src="https://code.jquery.com/jquery.min.js"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.6/js/bootstrap.min.js"></script>
<!-- slider -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/owl-carousel/1.3.3/owl.carousel.min.js"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/owl-carousel/1.3.3/owl.carousel.css" type="text/css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/owl-carousel/1.3.3/owl.transitions.css" type="text/css" />
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/owl-carousel/1.3.3/owl.theme.min.css" type="text/css" />
<!-- table of contents auto generator -->
<script src="js/toc.js" type="text/javascript"></script>
<!-- styles for pager and table of contents -->
<link rel="stylesheet" href="css/pager.css" type="text/css" />
<link rel="stylesheet" href="css/toc.css" type="text/css" />
<!-- Vega-Lite Embed -->
<script src="https://cdn.jsdelivr.net/npm/vega@5"></script>
<script src="https://cdn.jsdelivr.net/npm/vega-lite@5"></script>
<script src="https://cdn.jsdelivr.net/npm/vega-embed@6"></script>
<!-- Google tag (gtag.js) -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-57GD08QCML"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-57GD08QCML');
</script>
<!-- Sidebar and testimonial-slider -->
<script type="text/javascript">
$(document).ready(function(){
// scroll/follow sidebar
// #sidebar is defined in the content snippet
// This script has to be executed after the snippet loaded.
// $.getScript("js/follow-sidebar.js");
$("#testimonial-slider").owlCarousel({
items: 1,
singleItem: true,
pagination: true,
navigation: false,
loop: true,
autoPlay: 10000,
stopOnHover: true,
transitionStyle: "backSlide",
touchDrag: true
});
});
</script>
</head>
<body>
<div class="container" style="max-width: 1200px;">
<header>
<div class="masthead">
<p class="lead">
<a href="index.html">
<img src="images/smile.jpg" style="height:100px; width:auto; vertical-align: bottom; margin-top: 20px; margin-right: 20px;">
<span class="tagline">Smile — Statistical Machine Intelligence and Learning Engine</span>
</a>
</p>
</div>
<nav class="navbar navbar-default" role="navigation">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#navbar-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="navbar-collapse">
<ul class="nav navbar-nav">
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Overview <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="quickstart.html">Quick Start</a></li>
<li><a href="overview.html">What's Machine Learning</a></li>
<li><a href="data.html">Data Processing</a></li>
<li><a href="visualization.html">Data Visualization</a></li>
<li><a href="vegalite.html">Declarative Visualization</a></li>
<li><a href="gallery.html">Gallery</a></li>
<li><a href="faq.html">FAQ</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Supervised Learning <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="classification.html">Classification</a></li>
<li><a href="regression.html">Regression</a></li>
<li><a href="deep-learning.html">Deep Learning</a></li>
<li><a href="feature.html">Feature Engineering</a></li>
<li><a href="validation.html">Model Validation</a></li>
<li><a href="missing-value-imputation.html">Missing Value Imputation</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Unsupervised Learning <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="clustering.html">Clustering</a></li>
<li><a href="vector-quantization.html">Vector Quantization</a></li>
<li><a href="association-rule.html">Association Rule Mining</a></li>
<li><a href="mds.html">Multi-Dimensional Scaling</a></li>
<li><a href="manifold.html">Manifold Learning</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">LLM & NLP <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="llm.html">Large Language Model (LLM)</a></li>
<li><a href="nlp.html">Natural Language Processing (NLP)</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Math <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="linear-algebra.html">Linear Algebra</a></li>
<li><a href="statistics.html">Statistics</a></li>
<li><a href="wavelet.html">Wavelet</a></li>
<li><a href="interpolation.html">Interpolation</a></li>
<li><a href="graph.html">Graph Data Structure</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">API <b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="api/java/index.html" target="_blank">Java</a></li>
<li><a href="api/scala/index.html" target="_blank">Scala</a></li>
<li><a href="api/kotlin/index.html" target="_blank">Kotlin</a></li>
<li><a href="api/clojure/index.html" target="_blank">Clojure</a></li>
<li><a href="api/json/index.html" target="_blank">JSON</a></li>
</ul>
</li>
<li><a href="https://mybinder.org/v2/gh/haifengl/smile/notebook?urlpath=lab%2Ftree%2Fshell%2Fsrc%2Funiversal%2Fnotebooks%2Findex.ipynb" target="_blank">Try It Online</a></li>
</ul>
</div>
<!-- /.navbar-collapse -->
</nav>
</header>
<div id="content" class="row">
<div class="col-md-3 col-md-push-9 hidden-xs hidden-sm">
<div id="sidebar">
<div class="sidebar-toc" style="margin-bottom: 20px;">
<p class="toc-header">Contents</p>
<div id="toc"></div>
</div>
<div id="search">
<script>
(function() {
var cx = '010264411143030149390:ajvee_ckdzs';
var gcse = document.createElement('script');
gcse.type = 'text/javascript';
gcse.async = true;
gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
'//cse.google.com/cse.js?cx=' + cx;
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(gcse, s);
})();
</script>
<gcse:searchbox-only></gcse:searchbox-only>
</div>
</div>
</div>
<div class="col-md-9 col-md-pull-3">
<h1 id="nlp-top" class="title">Natural Language Processing</h1>
<p>Natural language processing (NLP) is about developing applications
and services that are able to understand human languages. Advanced
high-level NLP tasks include speech recognition, machine translation,
natural language understanding, natural language generation,
dialog systems, etc. While the smile-deep module supports LLMs, the smile-nlp
module focuses on low- and intermediate-level NLP tasks such as
sentence breaking, stemming, n-grams, part-of-speech tagging,
keyword detection, named entity recognition, etc.</p>
<h2 id="normalization" class="title">Normalization</h2>
<p>Text often contains variations (for example, the various quote marks in
Unicode) that introduce annoying problems in many NLP tools. Normalization is
typically applied to text first to remove unwanted variations. Normalization may
range from light textual cleanup such as compressing whitespace to
more aggressive and knowledge-intensive forms like standardizing date
formats or expanding abbreviations. The nature and extent of normalization,
as well as whether it is most appropriate to apply at the document, sentence,
or token level, must be determined in the context of a specific application.</p>
<p>The function <code>normalize</code> is a simple normalizer for
processing Unicode text:</p>
<ul>
<li>Apply Unicode normalization form NFKC.</li>
<li>Strip, trim, normalize, and compress whitespace.</li>
<li>Remove control and formatting characters.</li>
<li>Normalize dashes, double quotes, and single quotes.</li>
</ul>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_1" data-toggle="tab">Java</a></li>
<li><a href="#scala_1" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_1" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_1">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code style="white-space: preserve nowrap;">
val unicode = """When airport foreman Scott Babcock went out onto the runway at Wiley Post-Will Rogers Memorial Airport in Utqiagvik, Alaska, on Monday to clear some snow, he was surprised to find a visitor waiting for him on the asphalt: a 450-pound bearded seal chilling in the milky sunshine.
“It was very strange to see the seal. I’ve seen a lot of things on runways, but never a seal,” Babcock told ABC News. His footage of the hefty mammal went viral after he posted it on Facebook.
According to local TV station KTVA, animal control was called in and eventually moved the seal with the help of a “sled.”
Normal air traffic soon resumed, the station said.
Poking fun at the seal’s surprise appearance, the Alaska Department of Transportation warned pilots on Tuesday of “low sealings” in the North Slope region — a pun on “low ceilings,” a term used to describe low clouds and poor visibility.
Though this was the first seal sighting on the runway at the airport, the department said other animals, including birds, caribou and polar bears, have been spotted there in the past.
“Wildlife strikes to aircraft pose a significant safety hazard and cost the aviation industry hundreds of millions of dollars each year,” department spokeswoman Meadow Bailey told the Associated Press. “Birds make up over 90 percent of strikes in the U.S., while mammal strikes are rare.”"""
val text = unicode.normalize
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_1">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
import smile.math.MathEx;
import smile.nlp.*;
import smile.nlp.tokenizer.*;
import smile.nlp.normalizer.*;
import smile.nlp.collocation.*;
import smile.nlp.dictionary.*;
import smile.nlp.keyword.*;
import smile.nlp.stemmer.*;
import smile.nlp.pos.*;
var unicode = "When airport foreman Scott Babcock went out onto the runway at Wiley Post-Will Rogers Memorial Airport in Utqiagvik, Alaska, on Monday to clear some snow, he was surprised to find a visitor waiting for him on the asphalt: a 450-pound bearded seal chilling in the milky sunshine.\n\n" +
"\"It was very strange to see the seal. I've seen a lot of things on runways, but never a seal,\" Babcock told ABC News. His footage of the hefty mammal went viral after he posted it on Facebook.\n\n" +
"According to local TV station KTVA, animal control was called in and eventually moved the seal with the help of a \"sled.\"\n\n" +
" Normal air traffic soon resumed, the station said.\n\n" +
"Poking fun at the seal's surprise appearance, the Alaska Department of Transportation warned pilots on Tuesday of \"low sealings\" in the North Slope region - a pun on \"low ceilings,\" a term used to describe low clouds and poor visibility.\n\n" +
"Though this was the first seal sighting on the runway at the airport, the department said other animals, including birds, caribou and polar bears, have been spotted there in the past.\n\n" +
"\"Wildlife strikes to aircraft pose a significant safety hazard and cost the aviation industry hundreds of millions of dollars each year,\" department spokeswoman Meadow Bailey told the Associated Press. \"Birds make up over 90 percent of strikes in the U.S., while mammal strikes are rare\"";
var text = SimpleNormalizer.getInstance().normalize(unicode);
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_1">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code style="white-space: preserve nowrap;">
import smile.nlp.*;
val unicode = """When airport foreman Scott Babcock went out onto the runway at Wiley Post-Will Rogers Memorial Airport in Utqiagvik, Alaska, on Monday to clear some snow, he was surprised to find a visitor waiting for him on the asphalt: a 450-pound bearded seal chilling in the milky sunshine.
“It was very strange to see the seal. I’ve seen a lot of things on runways, but never a seal,” Babcock told ABC News. His footage of the hefty mammal went viral after he posted it on Facebook.
According to local TV station KTVA, animal control was called in and eventually moved the seal with the help of a “sled.”
Normal air traffic soon resumed, the station said.
Poking fun at the seal’s surprise appearance, the Alaska Department of Transportation warned pilots on Tuesday of “low sealings” in the North Slope region — a pun on “low ceilings,” a term used to describe low clouds and poor visibility.
Though this was the first seal sighting on the runway at the airport, the department said other animals, including birds, caribou and polar bears, have been spotted there in the past.
“Wildlife strikes to aircraft pose a significant safety hazard and cost the aviation industry hundreds of millions of dollars each year,” department spokeswoman Meadow Bailey told the Associated Press. “Birds make up over 90 percent of strikes in the U.S., while mammal strikes are rare.”"""
val text = unicode.normalize()
</code></pre>
</div>
</div>
</div>
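<p>To make the four steps listed above concrete, here is a minimal sketch
that uses only the JDK. It is not Smile's implementation: the class name
<code>NormalizeSketch</code> is hypothetical, the character ranges are a small
illustrative selection, and every dash is mapped to a single ASCII hyphen,
which is an assumption of this sketch.</p>

```java
import java.text.Normalizer;

final class NormalizeSketch {
    /** Applies simplified versions of the four listed steps, in order. */
    static String normalize(String text) {
        // 1. Unicode normalization form NFKC (folds ligatures, no-break spaces, etc.).
        String s = Normalizer.normalize(text, Normalizer.Form.NFKC);
        // 2. Compress every run of whitespace to one space and trim both ends.
        s = s.replaceAll("\\s+", " ").trim();
        // 3. Remove the remaining control (Cc) and formatting (Cf) characters.
        s = s.replaceAll("[\\p{Cc}\\p{Cf}]", "");
        // 4. Map common Unicode dashes and quotes to ASCII (illustrative subset).
        s = s.replaceAll("[\u2010-\u2015]", "-");
        s = s.replaceAll("[\u2018\u2019\u201A\u201B]", "'");
        s = s.replaceAll("[\u201C\u201D\u201E\u201F]", "\"");
        return s;
    }

    public static void main(String[] args) {
        // Curly quotes, a tab, doubled spaces, and a long dash in one input.
        System.out.println(normalize("\u201CI\u2019ve seen  a\tseal\u201D \u2014 Babcock"));
    }
}
```

<p>Whitespace is compressed before control characters are removed because
tabs and line breaks are themselves control characters. Removing them first
would glue neighboring words together.</p>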
<h2 id="sentence" class="title">Sentence Breaking</h2>
<p>In many NLP tasks, the input text has to be divided into sentences.
However, sentence boundary identification is challenging because
punctuation marks are often ambiguous.
In English, punctuation marks that usually appear at the end of a sentence
may not indicate the end of a sentence. The period is the worst offender.
A period can end a sentence, but it can also be part of an abbreviation
or acronym, an ellipsis, a decimal number, or part of a bracket of periods
surrounding a Roman numeral. A period can even act both as the end of an
abbreviation and the end of a sentence at the same time. On the other
hand, some poems may not contain any sentence punctuation at all.</p>
<p>We implement an efficient rule-based sentence splitter for English.
In the Smile shell, simply call <code>sentences</code> on a string
to return an array of sentences. Any carriage returns
in the text will be replaced by whitespace.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_2" data-toggle="tab">Java</a></li>
<li><a href="#scala_2" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_2" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_2">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code style="white-space: preserve nowrap;">
smile> val sentences = text.sentences
sentences: Array[String] = Array(
"When airport foreman Scott Babcock went out onto the runway at Wiley Post-Will Rogers Memorial Airport in Utqiagvik, Alaska, on Monday to clear some snow, he was surprised to find a visitor waiting for him on the asphalt: a 450-pound bearded seal chilling in the milky sunshine.",
"\"It was very strange to see the seal.",
"I've seen a lot of things on runways, but never a seal,\" Babcock told ABC News.",
"His footage of the hefty mammal went viral after he posted it on Facebook.",
"According to local TV station KTVA, animal control was called in and eventually moved the seal with the help of a \"sled.\"",
"Normal air traffic soon resumed, the station said.",
"Poking fun at the seal's surprise appearance, the Alaska Department of Transportation warned pilots on Tuesday of \"low sealings\" in the North Slope region -- a pun on \"low ceilings,\" a term used to describe low clouds and poor visibility.",
"Though this was the first seal sighting on the runway at the airport, the department said other animals, including birds, caribou and polar bears, have been spotted there in the past.",
"\"Wildlife strikes to aircraft pose a significant safety hazard and cost the aviation industry hundreds of millions of dollars each year,\" department spokeswoman Meadow Bailey told the Associated Press.",
"\"Birds make up over 90 percent of strikes in the U.S., while mammal strikes are rare.\"",
""
)
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_2">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
smile> var sentences = SimpleSentenceSplitter.getInstance().split(text)
sentences ==> String[10] { "When airport foreman Scott Babcock ... mmal strikes are rare\"" }
smile> for (int i = 0; i < sentences.length; i++) System.out.println(sentences[i]+"\n")
When airport foreman Scott Babcock went out onto the runway at Wiley Post-Will Rogers Memorial Airport in Utqiagvik, Alaska, on Monday to clear some snow, he was surprised to find a visitor waiting for him on the asphalt: a 450-pound bearded seal chilling in the milky sunshine.
"It was very strange to see the seal.
I've seen a lot of things on runways, but never a seal," Babcock told ABC News.
His footage of the hefty mammal went viral after he posted it on Facebook.
According to local TV station KTVA, animal control was called in and eventually moved the seal with the help of a "sled."
Normal air traffic soon resumed, the station said.
Poking fun at the seal's surprise appearance, the Alaska Department of Transportation warned pilots on Tuesday of "low sealings" in the North Slope region - a pun on "low ceilings," a term used to describe low clouds and poor visibility.
Though this was the first seal sighting on the runway at the airport, the department said other animals, including birds, caribou and polar bears, have been spotted there in the past.
"Wildlife strikes to aircraft pose a significant safety hazard and cost the aviation industry hundreds of millions of dollars each year," department spokeswoman Meadow Bailey told the Associated Press.
"Birds make up over 90 percent of strikes in the U.S., while mammal strikes are rare"
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_2">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
val sentences = text.sentences()
</code></pre>
</div>
</div>
</div>
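<p>The ambiguity of the period described above can be illustrated with a
deliberately naive splitter. This is a sketch only, not the rules Smile
applies: <code>SentenceSketch</code> and its tiny abbreviation list are
hypothetical, and the rule set is far smaller than what a production splitter
needs.</p>

```java
import java.util.ArrayList;
import java.util.Set;

final class SentenceSketch {
    // Hypothetical, deliberately tiny abbreviation list.
    static final Set<String> ABBREVIATIONS =
            Set.of("Dr.", "Mr.", "Mrs.", "U.S.", "e.g.", "i.e.");

    /** Splits text into sentences at tokens that look sentence-final. */
    static String[] split(String text) {
        var sentences = new ArrayList<String>();
        var current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (current.length() > 0) current.append(' ');
            current.append(token);
            if (endsSentence(token)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep any trailing text that has no final punctuation.
        if (current.length() > 0) sentences.add(current.toString());
        return sentences.toArray(new String[0]);
    }

    /** A token ends a sentence if it ends in . ? or ! and is not a known abbreviation. */
    static boolean endsSentence(String token) {
        // Ignore closing quotes and brackets that follow the punctuation mark.
        String t = token.replaceAll("[\"')\\]]+$", "");
        if (t.isEmpty()) return false;
        char last = t.charAt(t.length() - 1);
        if (last == '?' || last == '!') return true;
        // A period inside a token (3.5) is never at the end, so decimals pass through.
        return last == '.' && !ABBREVIATIONS.contains(t);
    }

    public static void main(String[] args) {
        for (String s : split("Dr. Smith paid 3.5 dollars. He left at noon. Was it late?")) {
            System.out.println(s);
        }
    }
}
```

<p>The sketch never splits after a listed abbreviation, so it misses the case
where an abbreviation such as "U.S." also ends a sentence. Handling that case
needs extra context, for example the capitalization of the next word.</p>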
<h2 id="tokenizer" class="title">Word Segmentation</h2>
<p>For a language like English, it is fairly easy to
separate a chunk of continuous text into separate words,
since words are usually separated by spaces.
However, some written languages like Chinese, Japanese
and Thai do not mark word boundaries by spaces.
In those languages word segmentation is a significant
task requiring knowledge of the vocabulary and morphology
of words in the language.</p>
<p>The method <code>words(filter)</code> assumes that an English text
has already been segmented into sentences and splits a sentence
into tokens. Any periods – apart from those at the end
of a string or before newline – are assumed to be part
of the word they are attached to (e.g. for abbreviations, etc.),
and are not separately tokenized. Most punctuation is split
from adjoining words. Verb contractions and the Anglo-Saxon
genitive of nouns are split into their component morphemes,
and each morpheme is tagged separately. The example below
splits a set of sentences and flattens the results into one
array.</p>
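<p>As a side note before the Smile example, the tokenization rules just
described can be approximated with a few regular expressions. This is a rough
sketch under stated assumptions, not Smile's tokenizer: the class name
<code>TokenizeSketch</code> is hypothetical and the list of contractions is
only a sample.</p>

```java
final class TokenizeSketch {
    /** Splits one English sentence into tokens with a few regex rules. */
    static String[] tokenize(String sentence) {
        String s = sentence
                // Most punctuation is split from adjoining words.
                .replaceAll("([,;:!?()\"])", " $1 ")
                // Only a sentence-final period is split off; others stay attached.
                .replaceAll("\\.\\s*$", " .")
                // Verb contractions: didn't becomes did + n't.
                .replaceAll("(?i)n't\\b", " n't")
                // Genitive 's and other clitics (sample list).
                .replaceAll("(?i)'(s|re|ve|ll|d|m)\\b", " '$1");
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", tokenize("Babcock's seal didn't move, he said.")));
    }
}
```

<p>Unlike the Smile example, this sketch keeps stop words and punctuation
tokens. It also splits a final period from an abbreviation such as "U.S." at
the end of a sentence, which is one of the cases a real tokenizer has to
handle.</p>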
<ul class="nav nav-tabs">
<li class="active"><a href="#java_3" data-toggle="tab">Java</a></li>
<li><a href="#scala_3" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_3" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_3">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code style="white-space: preserve nowrap;">
smile> sentences.flatMap(_.words())
res6: Array[String] = Array(airport, foreman, Scott, Babcock, went, runway, Wiley, Post-Will, Rogers, Memorial, Airport, Utqiagvik, Alaska, Monday, clear, snow, surprised, visitor, waiting, asphalt, 450-pound, bearded, seal, chilling, milky, sunshine, strange, seal, seen, lot, things, runways, seal, Babcock, told, ABC, News, footage, hefty, mammal, went, viral, posted, Facebook, According, local, TV, station, KTVA, animal, control, called, eventually, moved, seal, help, sled., Normal, air, traffic, soon, resumed, station, said, Poking, fun, seal, surprise, appearance, Alaska, Department, Transportation, warned, pilots, Tuesday, low, sealings, North, Slope, region, pun, low, ceilings, term, used, low, clouds, poor, visibility, seal, sighting, runway, airport...
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_3">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
smile> var tokenizer = new SimpleTokenizer(true)
tokenizer ==> smile.nlp.tokenizer.SimpleTokenizer@6b53e23f
smile> var words = Arrays.stream(sentences).
flatMap(s -> Arrays.stream(tokenizer.split(s))).
filter(w -> !(EnglishStopWords.DEFAULT.contains(w.toLowerCase()) || EnglishPunctuations.getInstance().contains(w))).
toArray(String[]::new)
words ==> String[133] { "airport", "foreman", "Scott", "Babcock", "went", "runway", "Wiley", "Post-Will", "Rogers", "Memorial", "Airport", "Utqiagvik", "Alaska", "Monday", "clear", "snow", "surprised", "visitor", "waiting", "asphalt", "450-pound", "bearded", "seal", "chilling", "milky", "sunshine", "strange", "seal", "seen", "lot", "things", "runways", "seal", "Babcock", "told", "ABC", "News", "footage", "hefty", "mammal", "went", "viral", "posted", "Facebook", "According", "local", "TV", "station", "KTVA", "animal", "control", "called", "eventually", "moved", "seal", "help", "sled.", "Normal", "air", "traffic", "soon", "resumed", "station", "said", "Poking", "fun ... spotted", "past", "Wildlife", "strikes", "aircraft", "pose", "significant", "safety", "hazard", "cost", "aviation", "industry", "hundreds", "millions", "dollars", "year", "department", "spokeswoman", "Meadow", "Bailey", "told", "Associated", "Press", "Birds", "make", "90", "percent", "strikes", "U.S.", "mammal", "strikes", "rare" }
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_3">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code style="white-space: preserve nowrap;">
>>> sentences.flatMap{ it.words().asIterable() }
res6: kotlin.collections.List<kotlin.String> = [airport, foreman, Scott, Babcock, went, runway, Wiley, Post-Will, Rogers, Memorial, Airport, Utqiagvik, Alaska, Monday, clear, snow, surprised, visitor, waiting, asphalt, 450-pound, bearded, seal, chilling, milky, sunshine, strange, seal, seen, lot, things, runways, seal, Babcock, told, ABC, News, footage, hefty, mammal, went, viral, posted, Facebook, According, local, TV, station, KTVA, animal, control, called, eventually, moved, seal, help, sled., Normal, air, traffic, soon, resumed, station, said, Poking, fun, seal, surprise, appearance, Alaska, Department, Transportation, warned, pilots, Tuesday, low, sealings, North, Slope, region, pun, low, ceilings, term, used, low, clouds, poor, visibility, seal, sighting, runway, airport, department, said, animals, including, birds, caribou, polar, bears, spotted, past, Wildlife, strikes, aircraft, pose, significant, safety, hazard, cost, aviation, industry, hundreds, millions, dollars, year, department, spokeswoman, Meadow, Bailey, told, Associated, Press, Birds, make, 90, percent, strikes, U.S., mammal, strikes, rare.]
</code></pre>
</div>
</div>
</div>
<p>You may notice that some words like "the", "a", etc. are missing
from the result. This is because <code>words()</code> filters
out stop words and punctuation by default. A stop word is a
commonly used word that many NLP algorithms would like to ignore.
For example, a search engine ignores stop words both when indexing
entries and when retrieving them in order to save space and
time, as stop words are deemed irrelevant for searching
purposes. There is no definitive list of stop words that all tools
incorporate, so the parameter <code>filter</code> may take the
following values:</p>
<ul>
<li>"none": no filtering</li>
<li>"default": the default English stop word list</li>
<li>"comprehensive": a more comprehensive English stop word list</li>
<li>"google": the stop word list used by the Google search engine</li>
<li>"mysql": the stop word list used by the MySQL FullText feature</li>
<li>custom stop word list: a comma-separated list of stop words</li>
</ul>
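<p>The last option, a custom comma-separated list, can be pictured with the
following sketch. It illustrates the idea only. <code>StopWordSketch</code>,
<code>parse</code>, and <code>filter</code> are hypothetical names, and the
case-insensitive comparison is an assumption modeled on the Java example
above, which lowercases each word before the lookup.</p>

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

final class StopWordSketch {
    /** Parses a comma-separated stop word list such as "the, a, of" into a set. */
    static Set<String> parse(String list) {
        return Arrays.stream(list.split(","))
                .map(String::trim)
                .filter(w -> !w.isEmpty())
                .map(w -> w.toLowerCase(Locale.ROOT))
                .collect(Collectors.toSet());
    }

    /** Keeps the tokens that are not stop words, comparing case-insensitively. */
    static String[] filter(String[] tokens, Set<String> stopWords) {
        return Arrays.stream(tokens)
                .filter(t -> !stopWords.contains(t.toLowerCase(Locale.ROOT)))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "seal", "was", "on", "the", "runway"};
        System.out.println(String.join(" ", filter(tokens, parse("the, was, on"))));
    }
}
```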
<h2 id="stemmer" class="title">Stemming</h2>
<p>For grammatical reasons, we use different forms of a word,
such as go, goes, and went. Additionally, there are families
of derivationally related words with similar meanings,
such as democracy, democratic, and democratization. For many
machine learning algorithms, it is good to reduce inflectional
forms and sometimes derivationally related forms of a word
to a common base form to improve the signal-to-noise ratio.</p>
<p>Stemming is a crude heuristic process that chops off the
ends of words in the hope of achieving this goal correctly
most of the time, and often includes the removal of
derivational affixes. The most common algorithm for stemming
English is Porter's algorithm. Porter's algorithm is based on the idea that the
suffixes in the English language are mostly made up of a combination of
smaller and simpler suffixes. As a linear step stemmer,
Porter's algorithm consists of 5 phases of word reductions, applied sequentially.
Within each step, if a suffix rule matches a word, then the conditions
attached to that rule are tested on what would be the resulting stem
if that suffix were removed in the way defined by the rule. Once a rule
passes its conditions and is accepted, the rule fires, the suffix is
removed, and control moves to the next step. If the rule is not accepted,
then the next rule in the step is tested, until either a rule from that
step fires and control passes to the next step, or there are no more rules
in that step, whereupon control moves to the next step.</p>
<p>Another popular stemming algorithm is the Paice/Husk Lancaster
algorithm, which is a conflation-based iterative stemmer.
The stemmer, although remaining efficient and easily implemented,
is known to be very strong and aggressive. The stemmer
utilizes a single table of rules, each of which may specify
the removal or replacement of an ending. The implementation
<code>LancasterStemmer</code> allows the user to load customized rules.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_4" data-toggle="tab">Java</a></li>
<li><a href="#scala_4" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_4" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_4">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
smile> porter.stem("democratization")
res10: String = "democrat"
smile> lancaster.stem("democratization")
res11: String = "democr"
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_4">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code>
smile> var porter = new PorterStemmer()
porter ==> smile.nlp.stemmer.PorterStemmer@5ea434c8
smile> var lancaster = new LancasterStemmer()
lancaster ==> smile.nlp.stemmer.LancasterStemmer@2aa5fe93
smile> porter.stem("democratization")
$42 ==> "democrat"
smile> lancaster.stem("democratization")
$43 ==> "democr"
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_4">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
>>> porter("democratization")
res18: kotlin.String = democrat
>>> lancaster("democratization")
res19: kotlin.String = democr
</code></pre>
</div>
</div>
</div>
<p>Unlike stemming, which commonly collapses derivationally
related words, lemmatization aims to remove inflectional endings
only and to return the base or dictionary form of a word, which is
known as the lemma.</p>
<h2 id="bag-of-words" class="title">Bag of Words</h2>
<p>The bag-of-words model is a simple representation of text
as the bag of its words, disregarding grammar and word
order but keeping multiplicity.</p>
<p>The method <code>bag(stemmer)</code> returns a map
from word to frequency. By default, the parameter <code>stemmer</code>
uses Porter's algorithm. Pass <code>None</code> to disable stemming.
A similar function, <code>bag2(stemmer)</code>, returns
a binary bag of words (<code>Set[String]</code>). That is, presence/absence
is used instead of frequencies.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_5" data-toggle="tab">Java</a></li>
<li><a href="#scala_5" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_5" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_5">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code style="white-space: preserve nowrap;">
smile> text.bag()
res12: Map[String, Int] = Map(move -> 1, 90 -> 1, call -> 1, spokeswoman -> 1, snow -> 1, anim -> 2, post -> 1, footag -> 1, wait -> 1, carib -> 1, industri -> 1, sunshin -> 1, seen -> 1, abc -> 1, said -> 2, pun -> 1, polar -> 1, bear -> 1, soon -> 1, warn -> 1, sight -> 1, babcock -> 2, u.s. -> 1, clear -> 1, normal -> 1, appear -> 1, sled. -> 1, strike -> 3, foreman -> 1, milki -> 1, resum -> 1, meadow -> 1, bailei -> 1, bird -> 2, us -> 1, dollar -> 1, poke -> 1, mondai -> 1, local -> 1, term -> 1, thing -> 1, scott -> 1, year -> 1, associat -> 1, ktva -> 1, tv -> 1, told -> 2, visibl -> 1, eventu -> 1, seal -> 7, hundr -> 1, surpris -> 2, aircraft -> 1, runwai -> 3, ceil -> 1, includ -> 1, asphalt -> 1, visitor -> 1, help -> 1, hazard -> 1, transport -> ...
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_5">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
smile> var map = Arrays.stream(words).
map(porter::stem).
map(String::toLowerCase).
collect(Collectors.groupingBy(java.util.function.Function.identity(), Collectors.summingInt(e -> 1)))
map ==> {spokeswoman=1, u.s.=1, bailei=1, year=1, told=2, ... l=1, transport=1, sled.=1}
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_5">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code style="white-space: preserve nowrap;">
>>> text.bag()
res20: kotlin.collections.Map<kotlin.String, kotlin.Int> = {airport=3, foreman=1, scott=1, babcock=2, went=2, runwai=3, wilei=1, post-wil=1, roger=1, memori=1, utqiagvik=1, alaska=2, mondai=1, clear=1, snow=1, surpris=2, visitor=1, wait=1, asphalt=1, 450-pound=1, beard=1, seal=7, chill=1, milki=1, sunshin=1, strang=1, seen=1, lot=1, thing=1, told=2, abc=1, new=1, footag=1, hefti=1, mammal=2, viral=1, post=1, facebook=1, accord=1, local=1, tv=1, station=2, ktva=1, anim=2, control=1, call=1, eventu=1, move=1, help=1, sled.=1, normal=1, air=1, traffic=1, soon=1, resum=1, said=2, poke=1, fun=1, appear=1, depart=3, transport=1, warn=1, pilot=1, tuesdai=1, low=3, north=1, slope=1, region=1, pun=1, ceil=1, term=1, us=1, cloud=1, poor=1, visibl=1, sight=1, includ=1, bird=2, carib=1, polar=1, bear=1, spot=1, past=1, wildlif=1, strike=3, aircraft=1, pose=1, signific=1, safeti=1, hazard=1, cost=1, aviat=1, industri=1, hundr=1, million=1, dollar=1, year=1, spokeswoman=1, meadow=1, bailei=1, associat=1, press=1, make=1, 90=1, percent=1, u.s.=1, rare.=1}
</code></pre>
</div>
</div>
</div>
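<p>The difference between the frequency bag and the binary bag can be sketched
in plain Java. This is only an illustration on an already tokenized document:
the class <code>BagOfWordsSketch</code> and its methods are hypothetical, and no
stemming or stop-word filtering is applied.</p>

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

class BagOfWordsSketch {
    /** Frequency bag: lower-cased word -> number of occurrences. */
    static Map<String, Integer> bag(String[] words) {
        return Arrays.stream(words)
            .map(String::toLowerCase)
            .collect(Collectors.groupingBy(w -> w, Collectors.summingInt(w -> 1)));
    }

    /** Binary bag: only the presence of each lower-cased word is kept. */
    static Set<String> bag2(String[] words) {
        return Arrays.stream(words)
            .map(String::toLowerCase)
            .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        String[] words = {"Seal", "seal", "runway"};
        System.out.println(bag(words));
        System.out.println(bag2(words));
    }
}
```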
<p>The function <code>vectorize(features, bag)</code> converts
a bag of words to a feature vector. The parameter
<code>features</code> is the list of tokens used as features
in machine learning models. Generally, it is not
good practice to use all tokens in the corpus as features.
Therefore, we require the user to provide a list of selected
tokens as the features in <code>vectorize()</code>.
An overloaded version of <code>vectorize()</code> converts
a binary bag of words (<code>Set[String]</code>) to a sparse
integer vector, whose elements are the indices of the feature
tokens present in the document, in ascending order. As most documents
use only a very small subset of the words in the corpus,
this representation is very memory efficient and is often used with
the maximum entropy classifier (<code>Maxent</code>).</p>
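<p>The two conversions described above can be sketched in plain Java as follows.
The class <code>VectorizeSketch</code> is hypothetical and only mirrors the
behavior described in the text; it is not Smile's implementation.</p>

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.IntStream;

class VectorizeSketch {
    /** Dense vector: x[i] is the count of features[i] in the bag (0 if absent). */
    static double[] vectorize(String[] features, Map<String, Integer> bag) {
        double[] x = new double[features.length];
        for (int i = 0; i < features.length; i++) {
            x[i] = bag.getOrDefault(features[i], 0);
        }
        return x;
    }

    /** Sparse vector: ascending indices of the features present in the binary bag. */
    static int[] vectorize(String[] features, Set<String> bag) {
        return IntStream.range(0, features.length)
            .filter(i -> bag.contains(features[i]))
            .toArray();
    }

    public static void main(String[] args) {
        String[] features = {"like", "good", "bad"};
        System.out.println(java.util.Arrays.toString(vectorize(features, Map.of("good", 2, "movi", 1))));
        System.out.println(java.util.Arrays.toString(vectorize(features, Set.of("bad", "like"))));
    }
}
```

<p>Tokens that are not in <code>features</code> (such as <code>movi</code> here)
are simply ignored, which is why the feature list should be chosen with care.</p>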
<p>In practice, the bag-of-words model is mainly used for
feature generation in document classification by calculating
various measures to characterize the text. The most common
feature is term frequency. However, a high raw term frequency
doesn't necessarily mean that the corresponding word is more
important. It is common to weight the term frequencies
by the inverse document frequency, i.e. tf-idf (term
frequency-inverse document frequency).</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_6" data-toggle="tab">Java</a></li>
<li><a href="#scala_6" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_6" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_6">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
val lines = scala.io.Source.fromFile("data/text/movie.txt").getLines().toSeq
val corpus = lines.map(_.bag())
val features = Array("like", "good", "perform", "littl", "love", "bad", "best")
val bags = corpus.map(vectorize(features, _))
val data = tfidf(bags)
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_6">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
// Each line of the file is one document.
var lines = Files.lines(java.nio.file.Paths.get("data/text/movie.txt"));
// Build a bag of words (stemmed, lower-cased token -> count) for every document.
var corpus = lines.map(line -> {
    var sentences = SimpleSentenceSplitter.getInstance().split(SimpleNormalizer.getInstance().normalize(line));
    // Tokenize and drop stop words and punctuation.
    var words = Arrays.stream(sentences).
        flatMap(s -> Arrays.stream(tokenizer.split(s))).
        filter(w -> !(EnglishStopWords.DEFAULT.contains(w.toLowerCase()) || EnglishPunctuations.getInstance().contains(w))).
        toArray(String[]::new);
    var bag = Arrays.stream(words).
        map(porter::stem).
        map(String::toLowerCase).
        collect(Collectors.groupingBy(java.util.function.Function.identity(), Collectors.summingInt(e -> 1)));
    return bag;
});
// Selected feature tokens (stemmed forms).
String[] features = {"like", "good", "perform", "littl", "love", "bad", "best"};
var zero = Integer.valueOf(0);
// Vectorize: one row of term counts per document.
var bags = corpus.map(bag -> {
    double[] x = new double[features.length];
    for (int i = 0; i < x.length; i++) x[i] = (Integer) bag.getOrDefault(features[i], zero);
    return x;
}).toArray(double[][]::new);
var n = bags.length;
// Document frequency: the number of documents containing each feature.
int[] df = new int[features.length];
for (double[] bag : bags) {
    for (int i = 0; i < df.length; i++) {
        if (bag[i] > 0) df[i]++;
    }
}
// tf-idf: term frequency scaled by the maximum tf, weighted by the smoothed idf, then unit-normalized.
var data = Arrays.stream(bags).map(bag -> {
    var maxtf = MathEx.max(bag);
    double[] x = new double[bag.length];
    for (int i = 0; i < x.length; i++) {
        x[i] = (bag[i] / maxtf) * Math.log((1.0 + n) / (1.0 + df[i]));
    }
    MathEx.unitize(x);
    return x;
}).toArray(double[][]::new);
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_6">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
import java.io.File
val lines = File("data/text/movie.txt").readLines()
val corpus = lines.map{ it.bag() }
val features = arrayOf("like", "good", "perform", "littl", "love", "bad", "best")
val bags = corpus.map{ vectorize(features, it) }
val data = tfidf(bags)
</code></pre>
</div>
</div>
</div>
<p>In the example, we load a file in which each line is a document.
We define a short list of terms as the features (for demonstration
purposes only). Then we apply the function <code>vectorize()</code>
to convert the bag-of-words counts to feature vectors.
Finally, we use the function <code>tfidf</code> to compute the
normalized feature vectors, which may be used in machine learning
algorithms.</p>
<p>For a new document in the prediction phase, an overloaded version <code>tfidf(bag, n, df)</code>
can be employed, where <code>bag</code> is the bag-of-words feature
vector of the document, <code>n</code> is the number of documents in
the training corpus, and <code>df</code> is an array whose i-th element is
the number of documents in the corpus that contain the i-th term.</p>
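<p>A minimal sketch of this per-document computation is shown below. It applies
the same weighting as the Java example above (term frequency scaled by the maximum
term frequency, times the smoothed inverse document frequency, followed by
normalization to unit length). The class <code>TfIdfSketch</code> is hypothetical,
and the guards against all-zero vectors are an assumption of this sketch rather
than documented library behavior.</p>

```java
class TfIdfSketch {
    /**
     * bag: term counts of the new document, n: number of training documents,
     * df[i]: number of training documents that contain the i-th term.
     */
    static double[] tfidf(double[] bag, int n, int[] df) {
        double maxtf = 0.0;
        for (double tf : bag) maxtf = Math.max(maxtf, tf);

        double[] x = new double[bag.length];
        if (maxtf == 0.0) return x; // none of the feature terms occurs (sketch-specific guard)

        double norm = 0.0;
        for (int i = 0; i < x.length; i++) {
            x[i] = (bag[i] / maxtf) * Math.log((1.0 + n) / (1.0 + df[i]));
            norm += x[i] * x[i];
        }
        norm = Math.sqrt(norm);
        if (norm > 0.0) {
            for (int i = 0; i < x.length; i++) x[i] /= norm; // scale to unit length
        }
        return x;
    }

    public static void main(String[] args) {
        double[] x = tfidf(new double[]{1, 1}, 3, new int[]{1, 1});
        System.out.println(java.util.Arrays.toString(x));
    }
}
```

<p>Note that <code>n</code> and <code>df</code> come from the training corpus,
so they must be stored alongside the model.</p>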
<h2 id="phrase" class="title">Phrase/Collocation Extraction</h2>
<p>So far, we have treated words as independent. But natural languages
include expressions consisting of two or more words that correspond
to some conventional way of saying things. A collocation is a sequence of
words or terms that co-occur more often than would be expected by chance.
There are about six main types of collocations: adjective+noun, noun+noun,
verb+noun, adverb+adjective, verb+prepositional phrase, and verb+adverb.
Collocation extraction employs various computational linguistics
techniques to find collocations in a document or corpus that are
statistically significant.</p>
<p>Finding collocations requires first calculating the frequencies of words
and their appearance in the context of other words. Often the collection
of words will then require filtering to only retain useful content terms.
Each n-gram of words may then be scored according to some association measure,
in order to determine the relative likelihood of each n-gram being a collocation.</p>
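<p>As a sketch of such scoring, the following example counts unigrams and bigrams
and ranks the bigrams by pointwise mutual information (PMI), one common association
measure. PMI is chosen here only for its simplicity; it is not necessarily the
statistic that Smile computes. The class <code>BigramPmiSketch</code> is hypothetical
and assumes tokens contain no spaces.</p>

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class BigramPmiSketch {
    /** Scores every bigram occurring at least minFreq times by PMI = log(P(xy) / (P(x) P(y))). */
    static Map<String, Double> score(String[] tokens, int minFreq) {
        Map<String, Integer> unigram = new HashMap<>();
        Map<String, Integer> bigram = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            unigram.merge(tokens[i], 1, Integer::sum);
            if (i + 1 < tokens.length) {
                bigram.merge(tokens[i] + " " + tokens[i + 1], 1, Integer::sum);
            }
        }

        double n = tokens.length;      // number of unigram positions
        double nb = tokens.length - 1; // number of bigram positions
        Map<String, Double> scores = new TreeMap<>();
        for (Map.Entry<String, Integer> e : bigram.entrySet()) {
            if (e.getValue() < minFreq) continue; // filter rare bigrams
            String[] w = e.getKey().split(" ");
            double pxy = e.getValue() / nb;
            double px = unigram.get(w[0]) / n;
            double py = unigram.get(w[1]) / n;
            scores.put(e.getKey(), Math.log(pxy / (px * py)));
        }
        return scores;
    }

    public static void main(String[] args) {
        String[] tokens = "new york is big and new york is old".split(" ");
        System.out.println(score(tokens, 2));
    }
}
```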
<p>The functions <code>bigram(k, minFreq, text*)</code> and <code>bigram(p, minFreq, text*)</code>
can find bigrams in a document/corpus. The integer parameter <code>k</code> specifies how many
top bigrams to find. Alternatively, you may provide a double parameter <code>p</code>
to specify the p-value threshold. The parameter <code>minFreq</code> is the minimum
frequency a collocation must have in the corpus.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_7" data-toggle="tab">Java</a></li>
<li><a href="#scala_7" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_7" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_7">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
smile> bigram(10, 5, lines: _*)
[main] INFO smile.util.package$ - runtime: 255287.678705 ms
res16: Array[collocation.BigramCollocation] = Array(
(special effects, 278, 3522.38),
(new york, 201, 2450.90),
(star wars, 153, 2016.87),
(high school, 132, 1432.43),
(science fiction, 105, 1408.87),
(phantom menace, 76, 1330.73),
(ve seen, 147, 1291.82),
(pulp fiction, 83, 1254.05),
(star trek, 100, 1209.88),
(hong kong, 61, 1088.61)
)
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_7">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
smile> var corpus = new SimpleCorpus()
corpus ==> smile.nlp.SimpleCorpus@59309333
smile> Files.lines(java.nio.file.Paths.get("data/text/movie.txt")).forEach(text -> corpus.add(new Text(text)));
smile> import smile.nlp.collocation.Bigram;
smile> Bigram.of(corpus, 10, 5);
$119 ==> Bigram[10] { (special effects, 381, 4849.22), (new york, 248, 3038.85), (= =, 226, 2560.62), (star wars, 165, 2184.18), (high school, 173, 1911.50), (science fiction, 126, 1733.94), (ve seen, 191, 1669.02), (hong kong, 90, 1592.32), (star trek, 119, 1463.29), (pulp fiction, 93, 1419.28) }
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_7">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
>>> bigram(10, 5, lines)
</code></pre>
</div>
</div>
</div>
<p>In the output, the second column is the frequency of bigrams and
the third is the statistical test score.</p>
<p>To find collocations of more than 2 words, the function
<code>ngram(maxNGramSize: Int, minFreq: Int, text: String*)</code>
can be used. This function uses an Apriori-like algorithm to
extract n-gram phrases. It takes a collection of texts and generates all n-grams of
length at most <code>maxNGramSize</code> that occur at least <code>minFreq</code> times in the
text.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_8" data-toggle="tab">Java</a></li>
<li><a href="#scala_8" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_8" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_8">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
smile> val turing = scala.io.Source.fromFile("data/text/turing.txt").getLines.mkString
smile> val phrase = ngram(4, 4, turing)
smile> phrase(2)
res19: Seq[NGram] = Buffer(
([digital, computer], 32),
([discrete-state, machine], 17),
([imitation, game], 14),
([storage, capacity], 11),
([human, computer], 10),
([machine, think], 9),
([child, machine], 7),
([scientific, induction], 6),
([analytical, engine], 5),
([lady, lovelace], 5),
([someth, like], 5),
([well-establish, fact], 5),
([subject, matter], 4),
([differential, analyser], 4),
([manchester, machine], 4),
([learn, machine], 4)
)
smile> phrase(3)
res20: Seq[NGram] = Buffer(
([rule, of, conduct], 5),
([winter, 's, day], 5),
([law, of, behaviour], 5),
([number, of, state], 5),
([point, of, view], 4),
([punishment, and, reward], 4),
([machine, in, question], 4)
)
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_8">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
smile> var turing = Files.lines(java.nio.file.Paths.get("data/text/turing.txt")).collect(Collectors.joining(" "))
turing ==> "COMPUTING MACHINERY AND INTELLIGENCE 1. The Imi ... e imitation game, the part
smile> var sentences = SimpleSentenceSplitter.getInstance().split(SimpleNormalizer.getInstance().normalize(turing))
sentences ==> String[571] { "COMPUTING MACHINERY AND INTELLIGEN ... wish we can make this sup
smile> var text = Arrays.stream(sentences).
map(s -> Arrays.stream(tokenizer.split(s)).
map(w -> porter.stripPluralParticiple(w).toLowerCase()).
toArray(String[]::new)).
collect(Collectors.toList())
text ==> [[Ljava.lang.String;@71238fc2, [Ljava.lang.String ... ava.lang.String;@28cda624]
smile> import smile.nlp.collocation.NGram
smile> var phrase = NGram.of(text, 4, 4)
phrase ==> NGram[5][] { NGram[0] { }, NGram[335] { ([machin ... ew], 4) }, NGram[0] { } }
smile> phrase[2]
$135 ==> NGram[16] { ([digital, computer], 34), ([discrete-state, machine], 17), ([imitation, game], 15), ([storage, capacity], 11), ([human, computer], 10), ([machine, think], 9), ([child, machine], 7), ([scientific, induction], 6), ([well-establish, fact], 5), ([learn, machine], 5), ([someth, like], 5), ([lady, lovelace], 5), ([analytical, engine], 5), ([manchester, machine], 4), ([differential, analyser], 4), ([subject, matter], 4) }
smile> phrase[3]
$136 ==> NGram[8] { ([number, of, state], 5), ([law, of, behaviour], 5), ([winter, 's, day], 5), ([rule, of, conduct], 5), ([argument, from, consciousness], 4), ([machine, in, question], 4), ([punishment, and, reward], 4), ([point, of, view], 4) }
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_8">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
>>> val turing = File("data/text/turing.txt").readLines()
>>> val phrase = ngram(4, 4, turing)
</code></pre>
</div>
</div>
</div>
<p>The result is indexed by n-gram size: the i-th entry is the set of i-grams.</p>
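<p>The Apriori idea behind n-gram extraction is that an n-gram can be frequent
only if the shorter n-grams it contains are frequent, so candidates can be pruned
level by level. The sketch below illustrates this on tokenized sentences. The class
<code>NGramAprioriSketch</code> is hypothetical and omits details such as stop-word
handling, so it should not be read as Smile's implementation.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NGramAprioriSketch {
    /** Returns a list whose n-th entry maps each frequent n-gram (space-joined) to its count. */
    static List<Map<String, Integer>> ngrams(List<String[]> sentences, int maxSize, int minFreq) {
        List<Map<String, Integer>> result = new ArrayList<>();
        result.add(new HashMap<>()); // entry 0 stays empty so that result.get(n) holds the n-grams
        for (int n = 1; n <= maxSize; n++) {
            Map<String, Integer> prev = result.get(n - 1);
            Map<String, Integer> counts = new HashMap<>();
            for (String[] s : sentences) {
                for (int i = 0; i + n <= s.length; i++) {
                    // Apriori pruning: both (n-1)-grams inside the candidate must be frequent.
                    if (n > 1 && !(prev.containsKey(join(s, i, n - 1))
                                && prev.containsKey(join(s, i + 1, n - 1)))) {
                        continue;
                    }
                    counts.merge(join(s, i, n), 1, Integer::sum);
                }
            }
            counts.values().removeIf(c -> c < minFreq); // keep only the frequent n-grams
            result.add(counts);
        }
        return result;
    }

    static String join(String[] s, int start, int len) {
        return String.join(" ", Arrays.copyOfRange(s, start, start + len));
    }

    public static void main(String[] args) {
        List<String[]> sentences = List.of(
            new String[]{"the", "imitation", "game"},
            new String[]{"the", "imitation", "game", "is", "played"},
            new String[]{"a", "game"});
        System.out.println(ngrams(sentences, 3, 2));
    }
}
```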
<h2 id="keyword" class="title">Keyword Extraction</h2>
<p>Beyond finding phrases, keyword extraction is tasked with the automatic
identification of terms that best describe the subject of a document.
Keywords are the terms that represent the most relevant information
contained in the document, i.e. a characterization of the topic discussed
in the document.</p>
<p>We provide a method <code>keywords(k: Int)</code> that returns the top-k
keywords in a single document using word co-occurrence statistical
information. Below are the keywords found in Turing's famous paper
"Computing Machinery and Intelligence". This seminal paper on artificial
intelligence introduces the concept of what is now known as the Turing test.
As shown in the results, the algorithm works well and captures
many important concepts in the paper such as "storage capacity", "machine",
"digital computer", "discrete-state machine", etc.</p>
<ul class="nav nav-tabs">
<li class="active"><a href="#java_9" data-toggle="tab">Java</a></li>
<li><a href="#scala_9" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_9" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_9">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code>
smile> turing.keywords(10)
res21: Seq[NGram] = Buffer(
([storage, capacity], 11),
([machine], 197),
([think], 45),
([digital, computer], 32),
([imitation, game], 14),
([discrete-state, machine], 17),
([teach], 11),
([view], 20),
([process], 17),
([play], 15)
)
</code></pre>
</div>
</div>
<div class="tab-pane active" id="java_9">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-java"><code style="white-space: preserve nowrap;">
smile> smile.nlp.keyword.CooccurrenceKeywords.of(turing, 10)
$137 ==> NGram[10] { ([storage, capacity], 11), ([machine], 198), ([think], 46), ([teach], 11), ([process], 17), ([digital, computer], 34), ([discrete-state, machine], 17), ([imitation, game], 15), ([child], 15), ([view], 20) }
</code></pre>
</div>
</div>
<div class="tab-pane" id="kotlin_9">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-kotlin"><code>
>>> turing.joinToString("\n").keywords(10)
</code></pre>
</div>
</div>
</div>
<p>This algorithm relies on co-occurrence probability and information theory.
Therefore, the article should be long enough to contain sufficient
statistical signals. In other words, it won't work on short text such as
tweets.</p>
<h2 id="pos" class="title">Part-of-Speech Tagging</h2>
<p>A part of speech (PoS) is a category of words which have similar
grammatical properties. Words that are assigned to the same part
of speech generally display similar behavior in terms of syntax –
they play similar roles within the grammatical structure of sentences
– and sometimes in terms of morphology, in that they undergo
inflection for similar properties. Commonly listed English parts of
speech are noun, verb, adjective, adverb, pronoun, preposition,
conjunction, interjection, etc. In Smile, we use the Penn Treebank
PoS tag set. The complete list can be found in the class
<a href="api/java/smile/nlp/pos/PennTreebankPOS.html"><code>smile.nlp.pos.PennTreebankPOS</code></a>.</p>
<p>PoS tagging is an important intermediate task to make sense of
some of the structure inherent in language without requiring
complete understanding. Smile implements a highly efficient
English PoS tagger based on a hidden Markov model (HMM). If
a string is a single sentence, simply call <code>postag</code>
on the string to get an array of (word, tag) pairs. Because
PoS tags are often used as features along with other attributes
in machine learning algorithms, the sentence is typically
already split into words. In this case, just call <code>postag(words)</code>
on an array of words.</p>
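<p>To give an idea of how an HMM assigns tags, the sketch below decodes the most
likely tag sequence with the Viterbi algorithm, the standard decoding method for
HMMs. Everything in it is an assumption for illustration: the class
<code>ViterbiSketch</code>, the two-tag model, and its probabilities are made up,
whereas a real tagger is trained on an annotated corpus.</p>

```java
class ViterbiSketch {
    /**
     * Returns the most likely state (tag) sequence for a non-empty observation sequence.
     * start[s]: initial probability, trans[p][s]: transition p -> s, emit[s][o]: emission of o in s.
     */
    static int[] decode(int[] obs, double[] start, double[][] trans, double[][] emit) {
        int T = obs.length, S = start.length;
        double[][] score = new double[T][S]; // best log-probability of a path ending in state s at time t
        int[][] back = new int[T][S];        // predecessor state on that best path

        for (int s = 0; s < S; s++) {
            score[0][s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
        }
        for (int t = 1; t < T; t++) {
            for (int s = 0; s < S; s++) {
                int best = 0;
                for (int p = 1; p < S; p++) {
                    if (score[t - 1][p] + Math.log(trans[p][s])
                            > score[t - 1][best] + Math.log(trans[best][s])) {
                        best = p;
                    }
                }
                back[t][s] = best;
                score[t][s] = score[t - 1][best] + Math.log(trans[best][s]) + Math.log(emit[s][obs[t]]);
            }
        }

        // Pick the best final state and follow the back pointers.
        int last = 0;
        for (int s = 1; s < S; s++) {
            if (score[T - 1][s] > score[T - 1][last]) last = s;
        }
        int[] path = new int[T];
        path[T - 1] = last;
        for (int t = T - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        // Toy model: states 0 = noun, 1 = verb; observations 0 = "dogs", 1 = "bark".
        double[] start = {0.8, 0.2};
        double[][] trans = {{0.3, 0.7}, {0.6, 0.4}};
        double[][] emit = {{0.9, 0.1}, {0.1, 0.9}};
        System.out.println(java.util.Arrays.toString(decode(new int[]{0, 1}, start, trans, emit)));
    }
}
```

<p>For the toy sentence "dogs bark", the decoded sequence is noun followed by verb.</p>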
<ul class="nav nav-tabs">
<li class="active"><a href="#java_10" data-toggle="tab">Java</a></li>
<li><a href="#scala_10" data-toggle="tab">Scala</a></li>
<li><a href="#kotlin_10" data-toggle="tab">Kotlin</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane" id="scala_10">
<div class="code" style="text-align: left;">
<pre class="prettyprint lang-scala"><code style="white-space: preserve nowrap;">
smile> val sentence = """When airport foreman Scott Babcock went out onto the runway at Wiley Post-Will Rogers Memorial Airport in Utqiagvik, Alaska, on Monday to clear some snow, he was surprised to find a visitor waiting for him on the asphalt: a 450-pound bearded seal chilling in the milky sunshine."""
sentence: String = "When airport foreman Scott Babcock went out onto the runway at Wiley Post-Will Rogers Memorial Airport in Utqiagvik, Alaska, on Monday to clear some snow, he was surprised to find a visitor waiting for him on the asphalt: a 450-pound bearded seal chilling in the milky sunshine."
smile> sentence.postag
res1: Array[(String, pos.PennTreebankPOS)] = Array(
("When", WRB),
("airport", NN),
("foreman", NN),
("Scott", NNP),
("Babcock", NNP),
("went", VBD),
("out", RP),
("onto", IN),
("the", DT),
("runway", NN),