
Commit 6d09007

Benjamin Hilprecht committed: Added initial codebase
1 parent 9af7bb7 · commit 6d09007

1,895 files changed: 46,201 additions & 27 deletions


.gitignore

Lines changed: 32 additions & 25 deletions
@@ -6,6 +6,9 @@ __pycache__/
 # C extensions
 *.so

+# vs code
+.vscode/
+
 # Distribution / packaging
 .Python
 build/
@@ -20,8 +23,6 @@ parts/
 sdist/
 var/
 wheels/
-pip-wheel-metadata/
-share/python-wheels/
 *.egg-info/
 .installed.cfg
 *.egg
@@ -40,14 +41,12 @@ pip-delete-this-directory.txt
 # Unit test / coverage reports
 htmlcov/
 .tox/
-.nox/
 .coverage
 .coverage.*
 .cache
 nosetests.xml
 coverage.xml
 *.cover
-*.py,cover
 .hypothesis/
 .pytest_cache/

@@ -59,7 +58,6 @@ coverage.xml
 *.log
 local_settings.py
 db.sqlite3
-db.sqlite3-journal

 # Flask stuff:
 instance/
@@ -77,26 +75,11 @@ target/
 # Jupyter Notebook
 .ipynb_checkpoints

-# IPython
-profile_default/
-ipython_config.py
-
 # pyenv
 .python-version

-# pipenv
-# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
-# However, in case of collaboration, if having platform-specific dependencies or dependencies
-# having no cross-platform support, pipenv may install dependencies that don't work, or not
-# install all needed dependencies.
-#Pipfile.lock
-
-# PEP 582; used by e.g. github.com/David-OConnor/pyflow
-__pypackages__/
-
-# Celery stuff
+# celery beat schedule file
 celerybeat-schedule
-celerybeat.pid

 # SageMath parsed files
 *.sage.py
@@ -122,8 +105,32 @@ venv.bak/

 # mypy
 .mypy_cache/
-.dmypy.json
-dmypy.json

-# Pyre type checker
-.pyre/
+# Pycharm
+.idea
+.idea/
+
+# logs
+logs/
+
+# old stuff
+old/
+
+# test artifacts
+benchmarks/mini-imdb/gen_single/*.hdf
+benchmarks/mini-ssb/gen_single/*.hdf
+benchmarks/mini-flights/gen_single/*.hdf
+benchmarks/mini-imdb/gen_single/*.pkl
+benchmarks/mini-ssb/gen_single/*.pkl
+benchmarks/mini-flights/gen_single/*.pkl
+benchmarks/maqp_scripts/rsync_data.sh
+benchmarks/maqp_scripts/rsync_dm.sh
+
+# profiling
+profiling_results
+profiling.py
+*.lprof
+
+optimized_inference.cpp
+compiled/
+.DS_Store

README.md

File mode changed: 100644 → 100755
Lines changed: 276 additions & 2 deletions

The previous two-line stub (`# deepdb-public` — "Implementation of DeepDB: Learn from Data, not from Queries!") is replaced with the full README below.

# DeepDB: Learn from Data, not from Queries!

DeepDB is a data-driven learned database component that achieves state-of-the-art performance in cardinality estimation and approximate query processing (AQP). This is the implementation described in

Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, Carsten Binnig:
"DeepDB: Learn from Data, not from Queries!", VLDB 2020. [[PDF]](https://arxiv.org/abs/1909.00607)

![DeepDB Overview](baselines/plots/overview.png "DeepDB Overview")

# Setup
```
git clone https://github.com/DataManagementLab/deepdb-public.git
cd deepdb-public
sudo apt install -y libpq-dev gcc python3-dev
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
```

# Reproduce Experiments

## Cardinality Estimation
Download the [JOB dataset](http://homepages.cwi.nl/~boncz/job/imdb.tgz).
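A minimal download-and-extract sketch, assuming the CSVs should end up in `../imdb-benchmark` (the `--csv_path` used by the commands below):
```
mkdir -p ../imdb-benchmark
wget http://homepages.cwi.nl/~boncz/job/imdb.tgz
tar -xvzf imdb.tgz -C ../imdb-benchmark
```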
Generate HDF files from the CSVs.
```
python3 maqp.py --generate_hdf \
    --dataset imdb-light \
    --csv_seperator , \
    --csv_path ../imdb-benchmark \
    --hdf_path ../imdb-benchmark/gen_single_light \
    --max_rows_per_hdf_file 100000000
```

Generate sampled HDF files from the CSVs.
```
python3 maqp.py --generate_sampled_hdfs \
    --dataset imdb-light \
    --hdf_path ../imdb-benchmark/gen_single_light \
    --max_rows_per_hdf_file 100000000 \
    --hdf_sample_size 10000
```

Learn an ensemble with the optimized RDC strategy (requires Postgres with the IMDB dataset).
```
python3 maqp.py --generate_ensemble \
    --dataset imdb-light \
    --samples_per_spn 10000000 10000000 1000000 1000000 1000000 \
    --ensemble_strategy rdc_based \
    --hdf_path ../imdb-benchmark/gen_single_light \
    --max_rows_per_hdf_file 100000000 \
    --samples_rdc_ensemble_tests 10000 \
    --ensemble_path ../imdb-benchmark/spn_ensembles \
    --database_name imdb \
    --post_sampling_factor 10 10 5 1 1 \
    --ensemble_budget_factor 5 \
    --ensemble_max_no_joins 3 \
    --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl
```

Alternatively: learn a base ensemble over the different tables with the naive strategy.
(This does not require Postgres, but it does not work with different dataset sizes because the join sizes are hard-coded.)
```
python3 maqp.py --generate_ensemble \
    --dataset imdb-light \
    --samples_per_spn 1000000 1000000 1000000 1000000 1000000 \
    --ensemble_strategy relationship \
    --hdf_path ../imdb-benchmark/gen_single_light \
    --ensemble_path ../imdb-benchmark/spn_ensembles \
    --max_rows_per_hdf_file 100000000 \
    --post_sampling_factor 10 10 5 1 1
```

Evaluate the cardinality estimation performance for the queries.
```
python3 maqp.py --evaluate_cardinalities \
    --rdc_spn_selection \
    --max_variants 1 \
    --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl \
    --dataset imdb-light \
    --target_path ./baselines/cardinality_estimation/results/deepDB/imdb_light_model_based_budget_5.csv \
    --ensemble_location ../imdb-benchmark/spn_ensembles/ensemble_join_3_budget_5_10000000.pkl \
    --query_file_location ./benchmarks/job-light/sql/job_light_queries.sql \
    --ground_truth_file_location ./benchmarks/job-light/sql/job_light_true_cardinalities.csv
```

## Updates

Conditional incremental learning (i.e., learn all films before 2013 initially and the newer films incrementally).
```
python3 maqp.py --generate_ensemble \
    --dataset imdb-light \
    --samples_per_spn 10000000 10000000 1000000 1000000 1000000 \
    --ensemble_strategy rdc_based \
    --hdf_path ../imdb-benchmark/gen_single_light \
    --max_rows_per_hdf_file 100000000 \
    --samples_rdc_ensemble_tests 10000 \
    --ensemble_path ../imdb-benchmark/spn_ensembles \
    --database_name JOB-light \
    --post_sampling_factor 10 10 5 1 1 \
    --ensemble_budget_factor 0 \
    --ensemble_max_no_joins 3 \
    --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl \
    --incremental_condition "title.production_year<2013"
```

## Optimized Inference
Generate the C++ code (currently this only works for cardinality estimation).
```
python3 maqp.py --code_generation \
    --ensemble_path ../imdb-benchmark/spn_ensembles/ensemble_join_3_budget_5_10000000.pkl
```

Compile it in a venv with pybind11 installed.
Installing pybind11 sometimes yields `ModuleNotFoundError: No module named 'pip.req'`.
One workaround is to downgrade pip with `pip3 install pip==9.0.3`, as described [here](https://stackoverflow.com/questions/25192794/no-module-named-pip-req).
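A minimal sketch for getting pybind11 into the environment, assuming the default `venv` created during setup:
```
source venv/bin/activate
pip3 install pybind11
```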
The command below works for Ubuntu 18.04. Make sure the generated .so file ends up in the root directory of the project.
```
g++ -O3 -Wall -shared -std=c++11 -ftemplate-depth=2048 -ftime-report -fPIC `python3 -m pybind11 --includes` optimized_inference.cpp -o optimized_inference`python3-config --extension-suffix`
```
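A quick sanity check (not part of the original instructions) that the module was built and can be imported from the project root; the exact filename suffix depends on your Python version:
```
ls optimized_inference*.so
python3 -c "import optimized_inference"
```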
If you now want to leverage the generated module, you have to enable it when evaluating the cardinalities.
```
python3 maqp.py --evaluate_cardinalities \
    --rdc_spn_selection \
    --max_variants 1 \
    --pairwise_rdc_path ../imdb-benchmark/spn_ensembles/pairwise_rdc.pkl \
    --dataset imdb-light \
    --target_path ./baselines/cardinality_estimation/results/deepDB/imdb_light_model_based_budget_5.csv \
    --ensemble_location ../imdb-benchmark/spn_ensembles/ensemble_join_3_budget_5_10000000.pkl \
    --query_file_location ./benchmarks/job-light/sql/job_light_queries.sql \
    --ground_truth_file_location ./benchmarks/job-light/sql/job_light_true_cardinalities.csv \
    --use_generated_code
```

## AQP
### SSB pipeline

Generate the standard SSB dataset (scale factor 500) and convert the `.tbl` files to CSVs with the correct separator (see below).
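One way to produce the raw `.tbl` files is an SSB dbgen generator, which is not part of this repository. A minimal sketch, assuming a separately built `dbgen` binary that follows the usual dbgen flag conventions:
```
# assumes an ssb-dbgen build; the binary and its flags are not part of this repository
./dbgen -s 500 -T a    # scale factor 500, generate all tables as *.tbl files
```
The conversion loop below then strips the trailing `|` field terminator from each `.tbl` file and writes the CSVs to `$TMP_DIR` (set this variable to the desired output directory first):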
```
for i in `ls *.tbl`; do
    sed 's/|$//' $i > $TMP_DIR/${i/tbl/csv} &
    echo $i;
done
```
Create a lineorder sample.
```
cat lineorder.csv | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .003333) print $0}' > lineorder_sampled.csv
```

Generate HDF files from the CSVs.
```
python3 maqp.py --generate_hdf \
    --dataset ssb-500gb \
    --csv_seperator \| \
    --csv_path ../mqp-data/ssb-benchmark \
    --hdf_path ../mqp-data/ssb-benchmark/gen_hdf
```

Learn the ensemble with the naive strategy.
```
python3 maqp.py --generate_ensemble \
    --dataset ssb-500gb \
    --samples_per_spn 1000000 \
    --ensemble_strategy single \
    --hdf_path ../mqp-data/ssb-benchmark/gen_hdf \
    --ensemble_path ../mqp-data/ssb-benchmark/spn_ensembles \
    --rdc_threshold 0.3 \
    --post_sampling_factor 10
```

Optional: compute the ground truth for the AQP queries (requires Postgres with the SSB schema).
```
python3 maqp.py --aqp_ground_truth \
    --query_file_location ./benchmarks/ssb/sql/aqp_queries.sql \
    --target_path ./benchmarks/ssb/ground_truth_500GB.pkl \
    --database_name ssb
```

Evaluate the AQP queries.
```
python3 maqp.py --evaluate_aqp_queries \
    --dataset ssb-500gb \
    --target_path ./baselines/aqp/results/deepDB/ssb_500gb_model_based.csv \
    --ensemble_location ../mqp-data/ssb-benchmark/spn_ensembles/ensemble_single_ssb-500gb_1000000.pkl \
    --query_file_location ./benchmarks/ssb/sql/aqp_queries.sql \
    --ground_truth_file_location ./benchmarks/ssb/ground_truth_500GB.pkl
```

Optional: create the ground truth for the confidence intervals (with 10M tuples because we also use 10M samples for the training).
```
python3 maqp.py --aqp_ground_truth \
    --query_file_location ./benchmarks/ssb/sql/confidence_queries.sql \
    --target_path ./benchmarks/ssb/confidence_intervals/confidence_interval_10M.pkl \
    --database_name ssb
```

Evaluate the confidence intervals.
```
python3 maqp.py --evaluate_confidence_intervals \
    --dataset ssb-500gb \
    --target_path ./baselines/aqp/results/deepDB/ssb500GB_confidence_intervals.csv \
    --ensemble_location ../mqp-data/ssb-benchmark/spn_ensembles/ensemble_single_ssb-500gb_1000000.pkl \
    --query_file_location ./benchmarks/ssb/sql/aqp_queries.sql \
    --ground_truth_file_location ./benchmarks/ssb/confidence_intervals/confidence_interval_10M.pkl \
    --confidence_upsampling_factor 300 \
    --confidence_sample_size 10000000
```

### Flights pipeline
Generate the flights dataset with a scale of 1 billion records using [IDEBench](https://github.com/IDEBench/IDEBench-public) and create a sample using
```
cat dataset.csv | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' > dataset_sampled.csv
```

Generate HDF files from the CSVs.
```
python3 maqp.py --generate_hdf \
    --dataset flights1B \
    --csv_seperator , \
    --csv_path ../mqp-data/flights-benchmark \
    --hdf_path ../mqp-data/flights-benchmark/gen_hdf
```

Learn the ensemble.
```
python3 maqp.py --generate_ensemble \
    --dataset flights1B \
    --samples_per_spn 10000000 \
    --ensemble_strategy single \
    --hdf_path ../mqp-data/flights-benchmark/gen_hdf \
    --ensemble_path ../mqp-data/flights-benchmark/spn_ensembles \
    --rdc_threshold 0.3 \
    --post_sampling_factor 10
```

Optional: compute the ground truth.
```
python3 maqp.py --aqp_ground_truth \
    --dataset flights1B \
    --query_file_location ./benchmarks/flights/sql/aqp_queries.sql \
    --target_path ./benchmarks/flights/ground_truth_1B.pkl \
    --database_name flights
```

Evaluate the AQP queries.
```
python3 maqp.py --evaluate_aqp_queries \
    --dataset flights1B \
    --target_path ./baselines/aqp/results/deepDB/flights1B_model_based.csv \
    --ensemble_location ../mqp-data/flights-benchmark/spn_ensembles/ensemble_single_flights1B_10000000.pkl \
    --query_file_location ./benchmarks/flights/sql/aqp_queries.sql \
    --ground_truth_file_location ./benchmarks/flights/ground_truth_1B.pkl
```

Optional: create the ground truth for the confidence intervals (with 10M tuples because we also use 10M samples for the training).
```
python3 maqp.py --aqp_ground_truth \
    --dataset flights1B \
    --query_file_location ./benchmarks/flights/sql/confidence_queries.sql \
    --target_path ./benchmarks/flights/confidence_intervals/confidence_interval_10M.pkl \
    --database_name flights10M_origsample
```

Evaluate the confidence intervals.
```
python3 maqp.py --evaluate_confidence_intervals \
    --dataset flights1B \
    --target_path ./baselines/aqp/results/deepDB/flights1B_confidence_intervals.csv \
    --ensemble_location ../mqp-data/flights-benchmark/spn_ensembles/ensemble_single_flights1B_10000000.pkl \
    --query_file_location ./benchmarks/flights/sql/aqp_queries.sql \
    --ground_truth_file_location ./benchmarks/flights/confidence_intervals/confidence_interval_10M.pkl \
    --confidence_upsampling_factor 100 \
    --confidence_sample_size 10000000
```

aqp_spn/__init__.py

Whitespace-only changes.
