SparkML - Sentiment Analysis with Movie Reviews

Overview

This project demonstrates sentiment analysis with PySpark and Spark MLlib on the Large Movie Review Dataset. The dataset contains 50,000 movie reviews labeled as positive or negative, which makes it a widely used benchmark for binary text classification.

Project Structure

.
├── Activity.ipynb                # Main notebook for the lab activities
├── Assignment.ipynb              # Additional assignment notebook
├── adult.csv                     # Example dataset for auxiliary tasks
├── Bài Lab 05_ PySpark - Spark MLlib.pdf  # Lab instructions
├── README.md                     # Project documentation
├── .devcontainer/                # Development container configuration
│   └── devcontainer.json
├── data/                         # Dataset directory
│   └── aclImdb/                  # Large Movie Review Dataset
│       ├── imdb.vocab            # Vocabulary file
│       ├── imdbEr.txt            # Expected ratings for tokens
│       ├── README                # Dataset documentation
│       ├── train/                # Training data
│       └── test/                 # Test data

Dataset Details

The Large Movie Review Dataset is organized as follows:

Train/Test Split: 25,000 reviews each for training and testing.
Labels: Reviews are labeled as pos (positive) or neg (negative).
File Naming Convention: [id]_[rating].txt where [rating] is the IMDb score (e.g., 200_8.txt).

For more details, refer to the dataset's README.
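
As a rough illustration of the naming convention above, the following sketch parses the id and rating out of a file name and collects the reviews of one split into plain Python rows. The helper names (parse_review_filename, load_reviews) are hypothetical, and the pos/neg sub-directory layout under each split is an assumption based on the labels described above; neither is taken from the notebooks.

```python
import re
from pathlib import Path

# Matches names such as "200_8.txt": a numeric id, an underscore, a numeric rating.
FILENAME_PATTERN = re.compile(r"^(\d+)_(\d+)\.txt$")


def parse_review_filename(name):
    """Return (review_id, rating) parsed from a name like '200_8.txt'."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected review file name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def load_reviews(split_dir):
    """Collect (review_id, rating, label, text) rows for one split.

    Assumes (not confirmed by the notebooks) that split_dir contains one
    sub-directory per label: 'pos' and 'neg'. The label is 1.0 for pos
    and 0.0 for neg.
    """
    rows = []
    for label_name, label in (("pos", 1.0), ("neg", 0.0)):
        for path in sorted(Path(split_dir, label_name).glob("*.txt")):
            review_id, rating = parse_review_filename(path.name)
            text = path.read_text(encoding="utf-8")
            rows.append((review_id, rating, label, text))
    return rows
```

For example, the rows returned by load_reviews("data/aclImdb/train") could be passed to spark.createDataFrame(rows, ["id", "rating", "label", "text"]) to obtain a Spark DataFrame.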

Requirements

Python 3
PySpark
Jupyter Notebook
Libraries: pandas, numpy, matplotlib
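
One way to confirm that the Python packages above are available before opening the notebooks is a small import check. This is only a sketch: the function name missing_modules is hypothetical, and Jupyter Notebook itself is left out of the list because it is normally started from the command line rather than imported.

```python
from importlib.util import find_spec

# Import names of the Python packages listed under Requirements.
REQUIRED_MODULES = ["pyspark", "pandas", "numpy", "matplotlib"]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages were found.")
```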

Setup

Clone the repository:

git clone https://github.com/SmallChicken2k5/Lab5_Machine-Learning-with-Spark-MLlib.git
cd Lab5_Machine-Learning-with-Spark-MLlib

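Install dependencies: the packages listed under Requirements (PySpark, Jupyter Notebook, pandas, numpy, matplotlib).

Open the development container (if using VS Code):
- Ensure Docker is running.
- Open the project in VS Code and reopen it in the dev container.

Launch Jupyter Notebook from the project directory.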

Usage

Open Activity.ipynb to follow the lab instructions.
Use the Assignment.ipynb notebook for additional exercises.
Modify and experiment with the PySpark code to explore the dataset and train models.

Goals

Learn how to preprocess text data using PySpark.
Train and evaluate machine learning models with Spark MLlib.
Perform sentiment analysis on movie reviews.

About the Author

This project was created and maintained by SmallChicken2k5, a passionate developer and data enthusiast. Feel free to reach out or contribute to the repository!

For inquiries, contact me via email: [email protected].

References

License

This project is for educational purposes only. Please refer to the dataset's license for usage restrictions.
