|
1 |
| -# deepdb-public |
2 |
| -Implementation of DeepDB: Learn from Data, not from Queries! |
| 1 | +# DeepDB: Learn from Data, not from Queries! |
| 2 | + |
| 3 | +DeepDB is a data-driven learned database component achieving state-of-the-art performance in cardinality estimation and |
| 4 | +approximate query processing (AQP). This is the implementation described in |
| 5 | + |
| 6 | +Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, Carsten Binnig: |
| 7 | +"DeepDB: Learn from Data, not from Queries!", VLDB'2020. [[PDF]](https://arxiv.org/abs/1909.00607) |
| 8 | + |
| 9 | + |
| 10 | + |
| 11 | +# Setup |
| 12 | +``` |
| 13 | +git clone https://github.com/DataManagementLab/deepdb-public.git |
| 14 | +cd deepdb-public |
| 15 | +sudo apt install -y libpq-dev gcc python3-dev |
| 16 | +python3 -m venv venv |
| 17 | +source venv/bin/activate |
| 18 | +pip3 install -r requirements.txt |
| 19 | +``` |
| 20 | + |
| 21 | +# Reproduce Experiments |
| 22 | + |
| 23 | +## Cardinality Estimation |
| 24 | +Download the [JOB dataset](http://homepages.cwi.nl/~boncz/job/imdb.tgz). |
| 25 | +Generate hdf files from csvs. |
| 26 | +``` |
| 27 | +python3 maqp.py --generate_hdf |
| 28 | + --dataset imdb-light |
| 29 | + --csv_seperator , |
| 30 | + --csv_path ../imdb-benchmark |
| 31 | + --hdf_path ../imdb-benchmark/gen_single_light |
| 32 | + --max_rows_per_hdf_file 100000000 |
| 33 | +``` |
| 34 | + |
| 35 | +Generate sampled hdf files from csvs. |
| 36 | +``` |
| 37 | +python3 maqp.py --generate_sampled_hdfs |
| 38 | + --dataset imdb-light |
| 39 | + --hdf_path ../imdb-benchmark/gen_single_light |
| 40 | + --max_rows_per_hdf_file 100000000 |
| 41 | + --hdf_sample_size 10000 |
| 42 | +``` |
| 43 | + |
| 44 | +Learn the ensemble with the optimized RDC strategy (requires Postgres with the IMDb dataset loaded). |
| 45 | +``` |
| 46 | +python3 maqp.py --generate_ensemble |
| 47 | + --dataset imdb-light |
| 48 | + --samples_per_spn 10000000 10000000 1000000 1000000 1000000 |
| 49 | + --ensemble_strategy rdc_based |
| 50 | + --hdf_path ../imdb-benchmark/gen_single_light |
| 51 | + --max_rows_per_hdf_file 100000000 |
| 52 | + --samples_rdc_ensemble_tests 10000 |
| 53 | + --ensemble_path ../imdb-benchmark/spn_ensembles |
| 54 | + --database_name imdb |
| 55 | + --post_sampling_factor 10 10 5 1 1 |
| 56 | + --ensemble_budget_factor 5 |
| 57 | + --ensemble_max_no_joins 3 |
| 58 | + --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl |
| 59 | +``` |
| 60 | + |
| 61 | +Alternatively, learn a base ensemble over different tables with the naive strategy. |
| 62 | +(This does not require Postgres, but it does not work with other dataset sizes because the join sizes are hard-coded.) |
| 63 | +``` |
| 64 | +python3 maqp.py --generate_ensemble |
| 65 | + --dataset imdb-light |
| 66 | + --samples_per_spn 1000000 1000000 1000000 1000000 1000000 |
| 67 | + --ensemble_strategy relationship |
| 68 | + --hdf_path ../imdb-benchmark/gen_single_light |
| 69 | + --ensemble_path ../imdb-benchmark/spn_ensembles |
| 70 | + --max_rows_per_hdf_file 100000000 |
| 71 | + --post_sampling_factor 10 10 5 1 1 |
| 72 | +``` |
| 73 | + |
| 74 | +Evaluate performance for queries. |
| 75 | +``` |
| 76 | +python3 maqp.py --evaluate_cardinalities |
| 77 | + --rdc_spn_selection |
| 78 | + --max_variants 1 |
| 79 | + --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl |
| 80 | + --dataset imdb-light |
| 81 | + --target_path ./baselines/cardinality_estimation/results/deepDB/imdb_light_model_based_budget_5.csv |
| 82 | + --ensemble_location ../imdb-benchmark/spn_ensembles/ensemble_join_3_budget_5_10000000.pkl |
| 83 | + --query_file_location ./benchmarks/job-light/sql/job_light_queries.sql |
| 84 | + --ground_truth_file_location ./benchmarks/job-light/sql/job_light_true_cardinalities.csv |
| 85 | +``` |
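The evaluation step compares estimated cardinalities against the ground-truth file. A common metric for this comparison is the q-error; the sketch below illustrates it on plain numbers. It is an assumption that this matches what `maqp.py` writes to `--target_path`, and the function names and the clamping of values below 1 are choices made for this example only.

```python
from statistics import median
from typing import Iterable, Tuple


def q_error(estimate: float, truth: float) -> float:
    """Return max(estimate/truth, truth/estimate).

    Assumption for this sketch: both values are clamped to at least 1
    so that empty results do not cause a division by zero.
    """
    estimate = max(float(estimate), 1.0)
    truth = max(float(truth), 1.0)
    return max(estimate / truth, truth / estimate)


def median_q_error(pairs: Iterable[Tuple[float, float]]) -> float:
    """Median q-error over (estimate, truth) pairs."""
    return median(q_error(est, true) for est, true in pairs)
```

A q-error of 1.0 means the estimate is exact; over- and underestimation by the same factor yield the same value.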
| 86 | + |
| 87 | +## Updates |
| 88 | + |
| 89 | +Conditional incremental learning (i.e., learn initially on all films produced before 2013, then learn the newer films incrementally). |
| 90 | +``` |
| 91 | +python3 maqp.py --generate_ensemble |
| 92 | + --dataset imdb-light |
| 93 | + --samples_per_spn 10000000 10000000 1000000 1000000 1000000 |
| 94 | + --ensemble_strategy rdc_based |
| 95 | + --hdf_path ../imdb-benchmark/gen_single_light |
| 96 | + --max_rows_per_hdf_file 100000000 |
| 97 | + --samples_rdc_ensemble_tests 10000 |
| 98 | + --ensemble_path ../imdb-benchmark/spn_ensembles |
| 99 | + --database_name JOB-light |
| 100 | + --post_sampling_factor 10 10 5 1 1 |
| 101 | + --ensemble_budget_factor 0 |
| 102 | + --ensemble_max_no_joins 3 |
| 103 | + --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl |
| 104 | + --incremental_condition "title.production_year<2013" |
| 105 | +``` |
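The `--incremental_condition "title.production_year<2013"` flag describes a split of the data into an initial part and a part that is added later. The following sketch only illustrates that split on in-memory rows; it is not the code path used by `maqp.py`. The function name, the dict-based row format, and the handling of missing years (sent to the incremental part) are assumptions of this example.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[int]]


def split_for_incremental_learning(
    rows: List[Row], column: str, threshold: int
) -> Tuple[List[Row], List[Row]]:
    """Partition rows into (initial, incremental) by `row[column] < threshold`.

    Assumption: rows with a missing value do not satisfy the condition
    and therefore end up in the incremental part.
    """
    initial: List[Row] = []
    incremental: List[Row] = []
    for row in rows:
        value = row.get(column)
        if value is not None and value < threshold:
            initial.append(row)
        else:
            incremental.append(row)
    return initial, incremental


titles = [
    {"id": 1, "production_year": 2005},
    {"id": 2, "production_year": 2013},
    {"id": 3, "production_year": None},
    {"id": 4, "production_year": 2012},
]
initial_rows, incremental_rows = split_for_incremental_learning(
    titles, "production_year", 2013
)
```

With the sample rows above, the films from 2005 and 2012 form the initial training set and the remaining two rows are learned incrementally.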
| 106 | + |
| 107 | +## Optimized Inference |
| 108 | +Generate the C++ code (currently only works for cardinality estimation). |
| 109 | +``` |
| 110 | +python3 maqp.py --code_generation |
| 111 | + --ensemble_path ../imdb-benchmark/spn_ensembles/ensemble_join_3_budget_5_10000000.pkl |
| 112 | +``` |
| 113 | + |
| 114 | +Compile it in a venv with pybind11 installed. |
| 115 | +Installing it sometimes fails with `ModuleNotFoundError: No module named 'pip.req'`. |
| 116 | +One workaround is to downgrade pip with `pip3 install pip==9.0.3`, as described [here](https://stackoverflow.com/questions/25192794/no-module-named-pip-req). |
| 117 | + |
| 118 | +The command below works on Ubuntu 18.04. Make sure the generated .so file is in the root directory of the project. |
| 119 | +``` |
| 120 | +g++ -O3 -Wall -shared -std=c++11 -ftemplate-depth=2048 -ftime-report -fPIC `python3 -m pybind11 --includes` optimized_inference.cpp -o optimized_inference`python3-config --extension-suffix` |
| 121 | +``` |
| 122 | + |
| 123 | +To use the compiled module, pass `--use_generated_code` when evaluating cardinalities. |
| 124 | +``` |
| 125 | +python3 maqp.py --evaluate_cardinalities |
| 126 | + --rdc_spn_selection |
| 127 | + --max_variants 1 |
| 128 | + --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl |
| 129 | + --dataset imdb-light |
| 130 | + --target_path ./baselines/cardinality_estimation/results/deepDB/imdb_light_model_based_budget_5.csv |
| 131 | + --ensemble_location ../imdb-benchmark/spn_ensembles/ensemble_join_3_budget_5_10000000.pkl |
| 132 | + --query_file_location ./benchmarks/job-light/sql/job_light_queries.sql |
| 133 | + --ground_truth_file_location ./benchmarks/job-light/sql/job_light_true_cardinalities.csv |
| 134 | + --use_generated_code |
| 135 | +``` |
| 136 | + |
| 137 | +## AQP |
| 138 | +### SSB pipeline |
| 139 | + |
| 140 | +Generate the standard SSB dataset (scale factor 500) and use the correct separator. |
| 141 | +``` |
| 142 | +for i in `ls *.tbl`; do |
| 143 | + sed 's/|$//' $i > $TMP_DIR/${i/tbl/csv} & |
| 144 | + echo $i; |
| 145 | +done |
| 146 | +``` |
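The `sed 's/|$//'` call in the loop above removes the single trailing `|` from each line of the `.tbl` files. For readers who prefer to do the conversion in Python, here is a minimal sketch of the same transformation; the function names are hypothetical and not part of this repository.

```python
def strip_trailing_delimiter(line: str, delimiter: str = "|") -> str:
    """Remove one trailing delimiter from a line, keeping its newline.

    Mirrors sed 's/|$//': only a delimiter at the very end is removed.
    """
    body = line.rstrip("\n")
    newline = line[len(body):]
    if body.endswith(delimiter):
        body = body[: -len(delimiter)]
    return body + newline


def convert_tbl_to_csv(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, stripping the trailing '|' of each line."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(strip_trailing_delimiter(line))
```

The fields stay `|`-separated, which is why the later `--generate_hdf` call passes `\|` as the separator.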
| 147 | +Create a sample of the lineorder table. |
| 148 | +``` |
| 149 | +cat lineorder.csv | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .003333) print $0}' > lineorder_sampled.csv |
| 150 | +``` |
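The `awk` one-liner above keeps each non-empty line independently with a fixed probability (Bernoulli sampling), so the sample size is only approximately the rate times the number of rows. A minimal Python sketch of the same idea follows; the function name and the optional `seed` parameter are additions for this example and have no counterpart in the `awk` command.

```python
import random
from typing import Iterable, Iterator, Optional


def bernoulli_sample(
    lines: Iterable[str], rate: float, seed: Optional[int] = None
) -> Iterator[str]:
    """Yield each non-empty line independently with probability `rate`."""
    rng = random.Random(seed)
    for line in lines:
        if line.rstrip("\n") == "":
            # Like awk's !/^$/ filter: empty lines are never emitted.
            continue
        if rng.random() <= rate:
            yield line


def sample_file(src_path: str, dst_path: str, rate: float) -> None:
    """Stream src_path and write the sampled lines to dst_path."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        dst.writelines(bernoulli_sample(src, rate))
```

Because the input is streamed line by line, this works on files that do not fit into memory, just like the `awk` version.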
| 151 | + |
| 152 | +Generate hdf files from csvs. |
| 153 | +``` |
| 154 | +python3 maqp.py --generate_hdf |
| 155 | + --dataset ssb-500gb |
| 156 | + --csv_seperator \| |
| 157 | + --csv_path ../mqp-data/ssb-benchmark |
| 158 | + --hdf_path ../mqp-data/ssb-benchmark/gen_hdf |
| 159 | +``` |
| 160 | + |
| 161 | +Learn the ensemble with a naive strategy. |
| 162 | +``` |
| 163 | +python3 maqp.py --generate_ensemble |
| 164 | + --dataset ssb-500gb |
| 165 | + --samples_per_spn 1000000 |
| 166 | + --ensemble_strategy single |
| 167 | + --hdf_path ../mqp-data/ssb-benchmark/gen_hdf |
| 168 | + --ensemble_path ../mqp-data/ssb-benchmark/spn_ensembles |
| 169 | + --rdc_threshold 0.3 |
| 170 | + --post_sampling_factor 10 |
| 171 | +``` |
| 172 | + |
| 173 | +Optional: Compute ground truth for AQP queries (requires postgres with ssb schema). |
| 174 | +``` |
| 175 | +python3 maqp.py --aqp_ground_truth |
| 176 | + --query_file_location ./benchmarks/ssb/sql/aqp_queries.sql |
| 177 | + --target_path ./benchmarks/ssb/ground_truth_500GB.pkl |
| 178 | + --database_name ssb |
| 179 | +``` |
| 180 | + |
| 181 | +Evaluate the AQP queries. |
| 182 | +``` |
| 183 | +python3 maqp.py --evaluate_aqp_queries |
| 184 | + --dataset ssb-500gb |
| 185 | + --target_path ./baselines/aqp/results/deepDB/ssb_500gb_model_based.csv |
| 186 | + --ensemble_location ../mqp-data/ssb-benchmark/spn_ensembles/ensemble_single_ssb-500gb_1000000.pkl |
| 187 | + --query_file_location ./benchmarks/ssb/sql/aqp_queries.sql |
| 188 | + --ground_truth_file_location ./benchmarks/ssb/ground_truth_500GB.pkl |
| 189 | +``` |
| 190 | + |
| 191 | +Optional: Create the ground truth for the confidence intervals (with 10M samples, because we also use 10M samples for training). |
| 192 | +``` |
| 193 | +python3 maqp.py --aqp_ground_truth |
| 194 | + --query_file_location ./benchmarks/ssb/sql/confidence_queries.sql |
| 195 | + --target_path ./benchmarks/ssb/confidence_intervals/confidence_interval_10M.pkl |
| 196 | + --database_name ssb |
| 197 | +``` |
| 198 | + |
| 199 | +Evaluate the confidence intervals. |
| 200 | +``` |
| 201 | +python3 maqp.py --evaluate_confidence_intervals |
| 202 | + --dataset ssb-500gb |
| 203 | + --target_path ./baselines/aqp/results/deepDB/ssb500GB_confidence_intervals.csv |
| 204 | + --ensemble_location ../mqp-data/ssb-benchmark/spn_ensembles/ensemble_single_ssb-500gb_1000000.pkl |
| 205 | + --query_file_location ./benchmarks/ssb/sql/aqp_queries.sql |
| 206 | + --ground_truth_file_location ./benchmarks/ssb/confidence_intervals/confidence_interval_10M.pkl |
| 207 | + --confidence_upsampling_factor 300 |
| 208 | + --confidence_sample_size 10000000 |
| 209 | +``` |
| 210 | + |
| 211 | +### Flights pipeline |
| 212 | +Generate the flights dataset with scale factor 1 billion using [IDEBench](https://github.com/IDEBench/IDEBench-public) and create a sample with: |
| 213 | +``` |
| 214 | +cat dataset.csv | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' > dataset_sampled.csv |
| 215 | +``` |
| 216 | + |
| 217 | +Generate hdf files from csvs. |
| 218 | +``` |
| 219 | +python3 maqp.py --generate_hdf |
| 220 | + --dataset flights1B |
| 221 | + --csv_seperator , |
| 222 | + --csv_path ../mqp-data/flights-benchmark |
| 223 | + --hdf_path ../mqp-data/flights-benchmark/gen_hdf |
| 224 | +``` |
| 225 | + |
| 226 | +Learn the ensemble. |
| 227 | +``` |
| 228 | +python3 maqp.py --generate_ensemble |
| 229 | + --dataset flights1B |
| 230 | + --samples_per_spn 10000000 |
| 231 | + --ensemble_strategy single |
| 232 | + --hdf_path ../mqp-data/flights-benchmark/gen_hdf |
| 233 | + --ensemble_path ../mqp-data/flights-benchmark/spn_ensembles |
| 234 | + --rdc_threshold 0.3 |
| 235 | + --post_sampling_factor 10 |
| 236 | +``` |
| 237 | + |
| 238 | +Optional: Compute ground truth |
| 239 | +``` |
| 240 | +python3 maqp.py --aqp_ground_truth |
| 241 | + --dataset flights1B |
| 242 | + --query_file_location ./benchmarks/flights/sql/aqp_queries.sql |
| 243 | + --target_path ./benchmarks/flights/ground_truth_1B.pkl |
| 244 | + --database_name flights |
| 245 | +``` |
| 246 | + |
| 247 | +Evaluate the AQP queries. |
| 248 | +``` |
| 249 | +python3 maqp.py --evaluate_aqp_queries |
| 250 | + --dataset flights1B |
| 251 | + --target_path ./baselines/aqp/results/deepDB/flights1B_model_based.csv |
| 252 | + --ensemble_location ../mqp-data/flights-benchmark/spn_ensembles/ensemble_single_flights1B_10000000.pkl |
| 253 | + --query_file_location ./benchmarks/flights/sql/aqp_queries.sql |
| 254 | + --ground_truth_file_location ./benchmarks/flights/ground_truth_1B.pkl |
| 255 | +``` |
| 256 | + |
| 257 | +Optional: Create the ground truth for the confidence intervals (with 10M samples, because we also use 10M samples for training). |
| 258 | +``` |
| 259 | +python3 maqp.py --aqp_ground_truth |
| 260 | + --dataset flights1B |
| 261 | + --query_file_location ./benchmarks/flights/sql/confidence_queries.sql |
| 262 | + --target_path ./benchmarks/flights/confidence_intervals/confidence_interval_10M.pkl |
| 263 | + --database_name flights10M_origsample |
| 264 | +``` |
| 265 | + |
| 266 | +Evaluate the confidence intervals. |
| 267 | +``` |
| 268 | +python3 maqp.py --evaluate_confidence_intervals |
| 269 | + --dataset flights1B |
| 270 | + --target_path ./baselines/aqp/results/deepDB/flights1B_confidence_intervals.csv |
| 271 | + --ensemble_location ../mqp-data/flights-benchmark/spn_ensembles/ensemble_single_flights1B_10000000.pkl |
| 272 | + --query_file_location ./benchmarks/flights/sql/aqp_queries.sql |
| 273 | + --ground_truth_file_location ./benchmarks/flights/confidence_intervals/confidence_interval_10M.pkl |
| 274 | + --confidence_upsampling_factor 100 |
| 275 | + --confidence_sample_size 10000000 |
| 276 | +``` |