Description
This is a follow-on issue to #130.
I did some analysis of which columns are impacted by predicate push down in the various NDS queries, along with some rough estimates (SWAGs) of how many rows we might be able to skip if we could make the push down perfect.
For a few tables, `catalog_returns` and `inventory`, there was only one column that was ever impacted by predicate push down. I personally think that these should stay as just being sorted. We could do an experiment to see what happens if we just do range partitioning instead of sorting, but that would be a trade off between the time to do the ingest/transcode and the time it takes to run the actual queries.
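To make that trade off concrete, here is a minimal PySpark sketch of the two layouts. The paths, the partition count of 200, and the choice of `inventory` / `inv_quantity_on_hand` are illustrative assumptions, not the actual transcode code.

```python
# Sketch of the two ingest-time layouts being compared.
# Paths, the partition count, and the column choice are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-vs-range-partition").getOrCreate()
df = spark.read.parquet("/data/raw/inventory")

# Current layout: a full sort on the pushed-down column.  Spark range
# partitions and then sorts within each partition, so every output file
# covers a narrow, mostly non-overlapping value range.
df.sort("inv_quantity_on_hand") \
  .write.mode("overwrite") \
  .parquet("/data/transcoded/inventory_sorted")

# Experimental layout: range partitioning only.  Rows are shuffled into
# value-range buckets but are not sorted inside each output file, trading
# transcode time against how tight the row-group min/max stats end up.
df.repartitionByRange(200, "inv_quantity_on_hand") \
  .write.mode("overwrite") \
  .parquet("/data/transcoded/inventory_range")
```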
There are also a number of other small fact tables that I don't think we should look at, because they are small enough that they almost always fit in a single row group anyway, so there would be little to no savings.
For the others it would be nice to see what happens if we try to zorder the data. Unfortunately, out of the box we can only do this with deltalake, on deltalake 2.0 and above. If the numbers look good we might be able to do something similar with iceberg once we support zorder for it. We could also write our own utility that would let us do zorder however we wanted. Because this only works for deltalake, we need to make sure that the maintenance phase does not undo the ordering that we did before; it is known to do this in some cases. We might need to do the zorder optimizations as part of maintenance.
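For reference, this is roughly what the built-in deltalake route looks like with delta-spark 2.0+, in both the SQL and the Python builder forms. The table path and the columns are placeholders borrowed from the experiments below, not existing benchmark code.

```python
# Sketch of deltalake's built-in zorder (requires delta-spark >= 2.0).
# The table path and column names are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("zorder-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# SQL form
spark.sql("""
    OPTIMIZE delta.`/data/delta/web_returns`
    ZORDER BY (wr_return_amt, wr_returned_date)
""")

# Equivalent Python builder form
(DeltaTable.forPath(spark, "/data/delta/web_returns")
    .optimize()
    .executeZOrderBy("wr_return_amt", "wr_returned_date"))
```

Since `OPTIMIZE` just rewrites data files, a later maintenance pass that compacts or rewrites those files can scatter the clustering again, which is why the zorder step may need to run as part of, or after, maintenance.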
For `web_returns` there were three columns that were involved in a predicate push down, but only two of them really appeared to have a decent-sized impact. I would like to see a comparison for the following.
`web_sales`
a. zorder by `ws_net_profit` and `ws_sales_price`
b. just order by `ws_net_profit`

`web_returns`
a. zorder by `wr_return_amt` and `wr_returned_date`
b. zorder by `wr_return_amt`, `wr_returned_date`, `wr_returning_addr_sk`
c. just order by `wr_return_amt`

`catalog_sales`
a. zorder by `cs_ship_addr_sk` and `cs_net_profit`
b. zorder by `cs_ship_addr_sk`, `cs_net_profit` and `cs_sold_date_sk`
c. zorder by `cs_ship_addr_sk`, `cs_net_profit`, `cs_sold_date_sk` and `cs_bill_customer_sk`
d. just sort by `cs_ship_addr_sk`

`store_returns`
a. zorder by `sr_return_amt` and `sr_returned_date_sk`
b. zorder by `sr_return_amt`, `sr_returned_date_sk` and `sr_customer_sk`
c. zorder by `sr_return_amt`, `sr_returned_date_sk`, `sr_customer_sk` and `sr_store_sk`
d. just sort by `sr_return_amt`
`store_sales`
This one is more complicated; there are a number of different problems.
a. There are 14 different columns that have some impact on the queries, but 14 columns is far too many for zorder to work well with.
b. The column we care the most about, `ss_quantity`, has a low cardinality (100), which does not work well with the deltalake zorder implementation.
c. I am not 100% sure what happens when you optimize a partitioned deltalake table with zorder, unless you optimize each partition individually, which would be a real pain to deal with.
d. deltalake zorder only clusters the data into files. It does not actually sort the data, so unless there are multiple gigabytes of compressed data under each `ss_quantity` partition it is not going to show any benefit at all.
e. Because of all of this I would like to see just one experiment: partition by `ss_quantity` and zorder by `ss_wholesale_cost`, `ss_list_price` and `ss_coupon_amt` vs just partition by `ss_quantity` and order by `ss_wholesale_cost`, paying special attention to query 28, which is the one most likely to see a performance improvement here. A rough sketch of these two variants follows this list.
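For illustration only, here is one way the two `store_sales` variants from (e) could be wired up with PySpark and delta-spark. The paths are placeholders and this is a sketch under those assumptions, not the actual benchmark tooling.

```python
# Sketch of the store_sales experiment in (e).  Paths are placeholders; the
# partition and zorder column choices come straight from the list above.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("store-sales-layout-experiment")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

store_sales = spark.read.parquet("/data/raw/store_sales")

# Variant 1: partition by ss_quantity, then zorder the other columns with
# deltalake's OPTIMIZE after the initial write.
(store_sales.write.format("delta")
    .mode("overwrite")
    .partitionBy("ss_quantity")
    .save("/data/delta/store_sales_zorder"))

spark.sql("""
    OPTIMIZE delta.`/data/delta/store_sales_zorder`
    ZORDER BY (ss_wholesale_cost, ss_list_price, ss_coupon_amt)
""")

# Variant 2: partition by ss_quantity and simply sort at write time so each
# output file within a partition is ordered by ss_wholesale_cost; no OPTIMIZE
# pass afterwards.
(store_sales.repartition("ss_quantity")
    .sortWithinPartitions("ss_quantity", "ss_wholesale_cost")
    .write.format("delta")
    .mode("overwrite")
    .partitionBy("ss_quantity")
    .save("/data/delta/store_sales_sorted"))
```

Timing both the layout step and query 28 against each output table would give the comparison described in (e).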