Word Count


"... This book will be a great resource for
both readers looking to implement existing
algorithms in a scalable fashion and readers
who are developing new, custom algorithms
using Spark. ..."

Dr. Matei Zaharia
Original Creator of Apache Spark

FOREWORD by Dr. Matei Zaharia

Introduction to Word Count

  • Word Count is a simple, easy-to-understand algorithm that is straightforward to implement as a MapReduce or Spark application. Given a set of text documents, the program counts the number of occurrences of each word.

  • Word Count computes the frequency of each word in a set of documents or files. The goal is to build a dictionary of (key, value) pairs, where the key is a word (a String) and the value is an Integer giving that word's frequency.

  • A complete set of solutions is given for the Word Count problem using

  • BEFORE reduction filter: you may add filter() after tokenizing the records to remove undesired words.

  • AFTER reduction filter: to keep only the desired (word, frequency) pairs in the final result, you may add filter() after the reduction to remove elements where frequency < N, where N (an integer) is your threshold.
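
The steps above (tokenize, filter before the reduction, reduce, filter after the reduction) can be sketched in a few lines. This is a minimal illustration in plain Python, not one of the repository's solutions: it runs without a Spark cluster, and the names `word_count`, `STOP_WORDS`, and `N`, as well as the sample documents, are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical stop-word list and threshold, chosen only for illustration.
STOP_WORDS = {"a", "the", "of"}
N = 2

def word_count(documents, stop_words=STOP_WORDS, min_frequency=N):
    """Count words across documents, filtering before and after the reduction."""
    # Tokenize: lowercase each document and split it on whitespace.
    tokens = (word for doc in documents for word in doc.lower().split())
    # BEFORE-reduction filter: drop undesired words.
    kept = (word for word in tokens if word not in stop_words)
    # Reduction: sum the occurrences of each word.
    counts = Counter(kept)
    # AFTER-reduction filter: drop words whose frequency is below the threshold.
    return {word: freq for word, freq in counts.items() if freq >= min_frequency}

documents = ["the fox jumped", "the fox ran", "a dog ran home"]
result = word_count(documents)
print(result)
```

In a Spark RDD program the same four steps would typically map onto `flatMap()`, `filter()`, `reduceByKey()`, and a second `filter()`.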


Word Count in MapReduce
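
As a rough sketch of how the MapReduce formulation is usually described, the plain-Python example below simulates the three phases on a single machine: the mapper emits a (word, 1) pair per word, the shuffle groups the pairs by word, and the reducer sums each group. The function names `map_phase`, `shuffle_phase`, and `reduce_phase` and the sample input are hypothetical; a real MapReduce job would run these phases across many nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return grouped

def reduce_phase(word, values):
    """Reducer: sum the values collected for one word."""
    return word, sum(values)

documents = ["fox jumped", "fox ran", "dog ran"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(word, values) for word, values in grouped.items())
print(counts)
```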

Word Count in Picture



