STAT 686: Market Models Final Project
Jackson Thetford, Ryker Dolese, Krish Kumar, Katharine Britt, Naomi Consiglio, Mehrdad Tamiji
To run the application, run:
streamlit run frontend/Home.py
The PairsFinder class is designed to streamline the process of identifying cointegrated pairs for pairs trading strategies. It automates several key steps: data loading, data splitting, return preprocessing, autoencoder training for dimensionality reduction, asset clustering, and cointegration analysis combined with a Hurst exponent test to select mean-reverting pairs.
- Data Loading: Automatically downloads historical price data using yfinance if a local CSV file does not exist.
- Data Splitting: Splits the loaded data into training and testing sets based on a customizable split ratio.
- Preprocessing: Computes percentage returns and applies standard scaling while filtering out assets with no variation.
- Autoencoder Training: Reduces the dimensionality of the asset returns using a customizable autoencoder, with options to adjust:
  - Encoding dimension
  - Number of epochs
  - Learning rate
  - Hidden layer size
- Clustering: Clusters assets using KMeans on the encoded returns. Note: If the number of clusters is not specified, it defaults to approximately one-fourth of the number of assets.
- Cointegration Analysis: Evaluates asset pairs for cointegration (using a p-value threshold) and tests for mean reversion by computing the Hurst exponent of the spread.
- Pipeline Execution: The run_pipeline() method ties all steps together for a seamless end-to-end analysis.
- Logging: Uses Python’s logging module to provide status updates and debug information throughout the process.
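The Hurst exponent test in the Cointegration Analysis step can be illustrated with a minimal sketch. This is not the PairsFinder implementation, which this README does not show: the function name hurst_exponent, the max_lag parameter, and the lagged-difference estimator are all assumptions chosen for illustration. The idea is that the standard deviation of lagged differences scales roughly like lag ** H, so H well below 0.5 suggests mean reversion.

```python
import numpy as np


def hurst_exponent(series, max_lag=20):
    """Estimate the Hurst exponent H from the scaling of lagged differences.

    Hypothetical helper for illustration; not taken from PairsFinder.
    """
    x = np.asarray(series, dtype=float)
    lags = np.arange(2, max_lag)
    # Standard deviation of x[t + lag] - x[t] for each lag
    tau = np.array([np.std(x[lag:] - x[:-lag]) for lag in lags])
    # The slope of log(tau) against log(lag) is the estimate of H
    slope, _intercept = np.polyfit(np.log(lags), np.log(tau), 1)
    return float(slope)


rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=2000))  # no mean reversion, H near 0.5
white_noise = rng.normal(size=2000)             # strongly mean-reverting, H near 0

h_walk = hurst_exponent(random_walk)
h_noise = hurst_exponent(white_noise)

hurst_threshold = 0.45  # value used in this README's usage example
print("random walk passes filter:", h_walk < hurst_threshold)
print("white noise passes filter:", h_noise < hurst_threshold)
```

Under this assumed estimator, a spread would be kept only when its estimate falls below hurst_threshold.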
import logging

import pandas as pd

from pairsfinder import PairsFinder  # Adjust import based on your module location

# Load tickers (assumes 'russel3000_stocks.csv' exists)
tickers = pd.read_csv('russel3000_stocks.csv')['Ticker'].tolist()

# Optional: Customize autoencoder and clustering parameters
autoencoder_params = {
    'encoding_dim': 10,   # 10-dimensional latent space
    'num_epochs': 500,    # Train for 500 epochs
    'lr': 0.005,          # Learning rate of 0.005
    'print_every': 25,    # Log every 25 epochs
    'hidden_dim': 128     # Hidden layer size of 128 neurons
}
clustering_params = {
    'num_clusters': 100,  # Set number of clusters to 100
    'random_state': 123   # Specific random state for reproducibility
}

# Other custom settings
cointegration_threshold = 0.05
hurst_threshold = 0.45
min_cluster_size = 4
split_ratio = 0.6

# Instantiate the PairsFinder class with custom parameters
pf = PairsFinder(
    tickers=tickers,
    start="2019-01-01",
    end="2025-01-01",
    autoencoder_params=autoencoder_params,
    clustering_params=clustering_params,
    cointegration_threshold=cointegration_threshold,
    hurst_threshold=hurst_threshold,
    min_cluster_size=min_cluster_size,
    split_ratio=split_ratio,
    log_level=logging.INFO  # Change to logging.DEBUG for more details
)

# Run the complete pipeline and save the resulting pairs to a CSV file
pairs_df = pf.run_pipeline()
pf.save_pairs("custom_pairs_df.csv")
This project implements multiple strategies for executing a pairs trading approach on the cointegrated pairs identified by PairsFinder. Each strategy takes a different approach to generating trading signals. The total return of a backtest is the weighted average of the per-pair returns, where each pair is weighted by its quality score, a measure of the strength of its cointegration.
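The quality-weighted total return described above can be sketched as follows. The function name weighted_total_return and the numbers are hypothetical, and the sketch assumes the quality scores are positive and are normalised to sum to one; the project's own scoring formula is not shown in this README.

```python
import numpy as np


def weighted_total_return(pair_returns, quality_scores):
    """Weighted average of per-pair returns, weighted by quality score.

    Hypothetical helper; assumes positive scores that are normalised to sum to 1.
    """
    returns = np.asarray(pair_returns, dtype=float)
    scores = np.asarray(quality_scores, dtype=float)
    weights = scores / scores.sum()
    return float(np.dot(weights, returns))


# Illustrative values only: three pairs, the first with the strongest score
pair_returns = [0.10, -0.02, 0.05]
quality_scores = [3.0, 1.0, 1.0]
total_return = weighted_total_return(pair_returns, quality_scores)
print(total_return)
```

With these inputs the weights are 0.6, 0.2, and 0.2, so the strongest pair dominates the total.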
The Z‑Score strategy bases its trading decisions on the statistics of the spread between a pair of assets. It calculates a z‑score from the spread (computed on log‑transformed prices using an OLS hedge ratio) and then enters a trade when the z‑score exceeds a specified upper or lower threshold. Positions are closed when the spread reverts toward its mean or when risk management conditions (stop loss or take profit) are met.
import pandas as pd

from ZScoreStrategy import ZScoreStrategy

# Load the pairs found by PairsFinder and the price data used for the backtest
pairs_df = pd.read_csv("data/cointegrated_pairs.csv")
test_data = pd.read_csv("data/russel_data_test.csv", index_col=0, parse_dates=True)
full_data = pd.read_csv("data/russel_data_full.csv", index_col=0, parse_dates=True)

# Run the backtest and save the per-pair results
zscore_strategy = ZScoreStrategy(pairs_df, test_data, full_data=full_data, capital=100)
zscore_strategy.run()
zscore_strategy.save_results("zscore_results.csv")
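The signal logic behind the Z‑Score strategy can be sketched in a few lines. This is an illustration under stated assumptions, not the code in ZScoreStrategy: the helper names spread_zscore and entry_signal, the entry threshold of 2.0, and the use of full-sample mean and standard deviation are all hypothetical. A real backtest would more likely estimate these on training data or a rolling window to avoid look-ahead.

```python
import numpy as np


def spread_zscore(prices_a, prices_b):
    """Return (hedge_ratio, zscore) for the log-price spread of two assets."""
    log_a = np.log(np.asarray(prices_a, dtype=float))
    log_b = np.log(np.asarray(prices_b, dtype=float))
    # OLS of log_a on log_b with an intercept; the slope is the hedge ratio
    hedge_ratio, _intercept = np.polyfit(log_b, log_a, 1)
    spread = log_a - hedge_ratio * log_b
    # Full-sample standardisation (an assumption made for brevity)
    zscore = (spread - spread.mean()) / spread.std()
    return float(hedge_ratio), zscore


def entry_signal(zscore, entry_threshold=2.0):
    """-1 = short the spread, +1 = long the spread, 0 = no entry."""
    z = np.asarray(zscore, dtype=float)
    return np.where(z > entry_threshold, -1, np.where(z < -entry_threshold, 1, 0))


# Synthetic pair: asset A tracks asset B with a true log-price slope of 0.8
rng = np.random.default_rng(1)
log_b = np.log(50.0) + np.cumsum(rng.normal(0.0, 0.01, 500))
log_a = 0.5 + 0.8 * log_b + rng.normal(0.0, 0.01, 500)

hedge_ratio, zscore = spread_zscore(np.exp(log_a), np.exp(log_b))
signals = entry_signal(zscore)
print("estimated hedge ratio:", round(hedge_ratio, 3))
print("number of entry signals:", int(np.count_nonzero(signals)))
```

Exits (mean reversion, stop loss, take profit) are left out of the sketch; they would be layered on top of the entry signal.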
The XGBoost strategy uses an ensemble of decision trees (via the XGBoost algorithm) trained on engineered features from the asset pair’s spread. It predicts the probability of a favorable future movement in the spread. Positions are entered based on a combination of z‑score thresholds and minimum prediction probability.
import pandas as pd

from XGBoostStrategy import XGBoostStrategy

# Load the pairs found by PairsFinder and the price data used for the backtest
pairs_df = pd.read_csv("data/cointegrated_pairs.csv")
test_data = pd.read_csv("data/russel_data_test.csv", index_col=0, parse_dates=True)
full_data = pd.read_csv("data/russel_data_full.csv", index_col=0, parse_dates=True)

# Require both a minimum z-score and a minimum predicted probability to enter
xgb_strategy = XGBoostStrategy(pairs_df, test_data, full_data=full_data, capital=100,
                               min_zscore_threshold=0.5, min_proba_threshold=0.6)
xgb_strategy.run()
xgb_strategy.save_results("xgboost_results.csv")
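How the two thresholds interact is not spelled out in this README. One plausible reading, sketched below, is that a trade is entered only when both filters pass. The helper combined_entry is hypothetical, and the use of the absolute z-score and of non-strict comparisons are assumptions.

```python
import numpy as np


def combined_entry(zscore, proba, min_zscore_threshold=0.5, min_proba_threshold=0.6):
    """Boolean mask: True where both the z-score and probability filters pass.

    Hypothetical helper; the strategy classes may combine the filters differently.
    """
    z = np.asarray(zscore, dtype=float)
    p = np.asarray(proba, dtype=float)
    return (np.abs(z) >= min_zscore_threshold) & (p >= min_proba_threshold)


# Illustrative values only: spread z-scores and model probabilities on four days
z = [0.2, 0.8, -1.5, 1.1]
p = [0.9, 0.7, 0.55, 0.6]
mask = combined_entry(z, p)
print(mask)
```

In this example the first day fails the z-score filter and the third fails the probability filter, so only the second and fourth days qualify.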
The Logistic Regression strategy uses a linear classifier to estimate the probability that the spread will move in a favorable direction. It uses engineered features similar to the XGBoost model. Trade signals are generated based on the predicted probabilities exceeding a predefined threshold.
import pandas as pd

from LogisticStrategy import LogisticStrategy

# Load the pairs found by PairsFinder and the price data used for the backtest
pairs_df = pd.read_csv("data/cointegrated_pairs.csv")
test_data = pd.read_csv("data/russel_data_test.csv", index_col=0, parse_dates=True)
full_data = pd.read_csv("data/russel_data_full.csv", index_col=0, parse_dates=True)

# Require both a minimum z-score and a minimum predicted probability to enter
logistic_strategy = LogisticStrategy(pairs_df, test_data, full_data=full_data, capital=100,
                                     min_zscore_threshold=0.5, min_proba_threshold=0.6)
logistic_strategy.run()
logistic_strategy.save_results("logistic_results.csv")
- Risk-Free Rate:
  - The portfolio performance metrics (e.g., Sharpe ratio) are calculated with the assumption that the risk‑free rate is zero.
- Execution and Transaction Costs:
  - The simulation does not incorporate transaction costs, slippage, or bid/ask spread.
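To make the zero risk-free-rate assumption concrete, here is a minimal sketch of an annualised Sharpe ratio. The function name sharpe_ratio, the 252 trading periods per year, and the sample standard deviation (ddof=1) are assumptions for illustration; the project's own metric code may differ.

```python
import numpy as np


def sharpe_ratio(returns, periods_per_year=252, risk_free_rate=0.0):
    """Annualised Sharpe ratio of periodic (e.g. daily) returns.

    Hypothetical helper; risk_free_rate is an annual rate and defaults to zero,
    matching the assumption stated above.
    """
    r = np.asarray(returns, dtype=float)
    excess = r - risk_free_rate / periods_per_year
    return float(np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1))


# Illustrative daily returns only
daily_returns = [0.01, -0.005, 0.02, 0.0]
sharpe_zero_rf = sharpe_ratio(daily_returns)
sharpe_with_rf = sharpe_ratio(daily_returns, risk_free_rate=0.04)
print(sharpe_zero_rf > sharpe_with_rf)
```

A positive risk-free rate lowers the excess return, so the zero-rate figure is an upper bound on the Sharpe ratio under any positive rate.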