A text retrieval system implementing various information retrieval models with a Streamlit-based user interface.
- Multiple retrieval models:
  - Boolean Retrieval Model
  - Vector Space Model (VSM)
  - Latent Semantic Analysis (LSA)
  - Combined Model (Boolean + Vector)
- Interactive search interface
- Document statistics and visualizations
- Model evaluation metrics
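
The Boolean and Vector Space models listed above can be illustrated with a short, standard-library-only sketch. It is not taken from this repository's code: the toy corpus, the function names (`build_inverted_index`, `boolean_and`, `rank`), the whitespace tokenizer, and the way the combined model is approximated (a Boolean AND filter followed by TF-IDF cosine ranking) are all assumptions made for illustration. LSA is left out because it needs a matrix decomposition library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus, for illustration only.
DOCS = {
    0: "information retrieval with boolean queries",
    1: "vector space model ranks documents by similarity",
    2: "boolean model and vector space model combined",
}


def tokenize(text):
    """Lowercase and split on whitespace (assumed, very simple tokenizer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (raw term frequency, idf = log(N / df))."""
    n_docs = len(docs)
    tokenized = {d: tokenize(t) for d, t in docs.items()}
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {
        d: {t: tf * idf[t] for t, tf in Counter(toks).items()}
        for d, toks in tokenized.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(docs, query, candidates=None):
    """Rank documents by cosine similarity, optionally within a candidate set."""
    vectors, idf = tfidf_vectors(docs)
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    ids = vectors.keys() if candidates is None else candidates
    scored = [(d, cosine(q_vec, vectors[d])) for d in ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


index = build_inverted_index(DOCS)
matches = boolean_and(index, "boolean model")        # Boolean model
ranking = rank(DOCS, "vector space model")           # Vector Space Model
combined = rank(DOCS, "vector space", candidates=boolean_and(index, "vector space"))
```

Filtering with the Boolean step before ranking keeps the scored set small, which is one common reason to combine the two models; the application's Combined Model may weigh them differently.
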
- Clone this repository:

      git clone https://github.com/dangvonguyen/IR-CS419.P21.git
      cd IR-CS419.P21

- Set up the environment and install dependencies:

      # Using pip
      python -m venv .venv
      source .venv/bin/activate
      pip install -e .

      # Using uv (faster installation)
      uv sync
      source .venv/bin/activate

Run the application using Streamlit:

    streamlit run app.py

- Load Data: Use the sidebar to select a data source and its parameters
- Select Model: Choose between Boolean, VSM, LSA, or Combined retrieval models
- Search: Enter queries in the search tab to retrieve relevant documents
- Analyze: View document statistics and model performance in the Statistics tab
- Browse: View loaded documents in the Documents tab
- app.py: Main Streamlit application
- src/models/: Implementation of retrieval models
  - boolean_model.py: Boolean retrieval with inverted index
  - vsm_model.py: Vector Space Model with TF-IDF
  - lsa_model.py: Latent Semantic Analysis model
  - combined_model.py: Combined Boolean and Vector model
- src/utils.py: Utility functions for text processing
- src/evaluate.py: Evaluation metrics for retrieval models
- ui/: User interface components
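
As a rough idea of what evaluation metrics for a retrieval model look like, here is a minimal sketch of precision, recall, and average precision. Which metrics src/evaluate.py actually computes is not stated here, so the function names and signatures below are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant doc appears.

    Relevant documents that are never retrieved contribute zero.
    """
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = 0
    total = 0.0
    for k, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant_set:
            hits += 1
            total += hits / k
    return total / len(relevant_set)


# Hypothetical ranked result list and relevance judgments for one query.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 5])
ap = average_precision([2, 1, 4, 3], [2, 4, 5])
```

Averaging `average_precision` over a set of queries gives mean average precision (MAP), a common single-number summary for comparing ranked retrieval models.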