Description
This is a follow-on issue to #130.
I did some analysis of which columns are impacted by predicate push down in the various NDS queries, along with some rough estimates (SWAGs) of how many rows we might be able to skip if we could make the push down perfect.
For a few tables, `catalog_returns` and `inventory`, there was only one column that was ever impacted by predicate push down. I personally think that these should stay as just being sorted. We could do an experiment to see what happens if we just do range partitioning instead of sorting, but that would be a trade off between the time to do the ingest/transcode and the time it takes to run the actual queries.
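To make that trade off concrete, here is a minimal PySpark sketch of the two layouts. The paths, the partition count of 200, and the choice of `inventory` / `inv_quantity_on_hand` are illustrative assumptions, not the actual transcode code.

```python
# Sketch of the two ingest-time layouts being compared.
# Paths, the partition count, and the column choice are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sort-vs-range-partition").getOrCreate()
df = spark.read.parquet("/data/raw/inventory")

# Current layout: a full sort on the pushed-down column.  Spark range
# partitions and then sorts within each partition, so every output file
# covers a narrow, mostly non-overlapping value range.
df.sort("inv_quantity_on_hand") \
  .write.mode("overwrite") \
  .parquet("/data/transcoded/inventory_sorted")

# Experimental layout: range partitioning only.  Rows are shuffled into
# value-range buckets but are not sorted inside each output file, trading
# transcode time against how tight the row-group min/max stats end up.
df.repartitionByRange(200, "inv_quantity_on_hand") \
  .write.mode("overwrite") \
  .parquet("/data/transcoded/inventory_range")
```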
There are also a number of other small fact tables that I don't think we should look at, because they are small enough that they almost always fit in a single row group anyway, so there would be little to no savings.
For the others it would be nice to see what happens if we try to zorder the data. Unfortunately, out of the box we can only do this with deltalake, on deltalake 2.0 and above. If the numbers look good we might be able to do something similar with iceberg once we support zorder for it. We could also write our own utility that would let us do zorder however we wanted. Because this only works for deltalake, we need to make sure that the maintenance phase does not undo the ordering that we did before; it is known to do this in some cases. We might need to do the zorder optimizations as part of maintenance.
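For reference, this is roughly what the built-in deltalake route looks like with delta-spark 2.0+, in both the SQL and the Python builder forms. The table path and the columns are placeholders borrowed from the experiments below, not existing benchmark code.

```python
# Sketch of deltalake's built-in zorder (requires delta-spark >= 2.0).
# The table path and column names are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("zorder-sketch")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# SQL form
spark.sql("""
    OPTIMIZE delta.`/data/delta/web_returns`
    ZORDER BY (wr_return_amt, wr_returned_date)
""")

# Equivalent Python builder form
(DeltaTable.forPath(spark, "/data/delta/web_returns")
    .optimize()
    .executeZOrderBy("wr_return_amt", "wr_returned_date"))
```

Since `OPTIMIZE` just rewrites data files, a later maintenance pass that compacts or rewrites those files can scatter the clustering again, which is why the zorder step may need to run as part of, or after, maintenance.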
For `web_returns` there were three columns that were involved in a predicate push down, but only two of them really appeared to have a decent-sized impact. I would like to see a comparison for the following.
`web_sales`
a. zorder by `ws_net_profit` and `ws_sales_price`
b. just order by `ws_net_profit`

`web_returns`
a. zorder by `wr_return_amt` and `wr_returned_date`
b. zorder by `wr_return_amt`, `wr_returned_date`, `wr_returning_addr_sk`
c. just order by `wr_return_amt`

`catalog_sales`
a. zorder by `cs_ship_addr_sk` and `cs_net_profit`
b. zorder by `cs_ship_addr_sk`, `cs_net_profit` and `cs_sold_date_sk`
c. zorder by `cs_ship_addr_sk`, `cs_net_profit`, `cs_sold_date_sk` and `cs_bill_customer_sk`
d. just sort by `cs_ship_addr_sk`

`store_returns`
a. zorder by `sr_return_amt` and `sr_returned_date_sk`
b. zorder by `sr_return_amt`, `sr_returned_date_sk` and `sr_customer_sk`
c. zorder by `sr_return_amt`, `sr_returned_date_sk`, `sr_customer_sk` and `sr_store_sk`
d. just sort by `sr_return_amt`
`store_sales`
This one is more complicated; there are a number of different problems.
a. There are 14 different columns that have some impact on the queries, but 14 columns is far too many for zorder to work well with.
b. The column we care the most about, `ss_quantity`, has a low cardinality (100), which does not work well with the deltalake zorder implementation.
c. I am not 100% sure what happens when you optimize a partitioned deltalake table with zorder, unless you optimize each partition individually, which would be a real pain to deal with.
d. deltalake zorder only clusters the data into files. It does not actually sort the data, so unless there are multiple gigabytes of compressed data under each `ss_quantity` partition it is not going to show any benefit at all.
e. Because of all of this I would like to see just one experiment: partition by `ss_quantity` and zorder by `ss_wholesale_cost`, `ss_list_price` and `ss_coupon_amt` vs just partition by `ss_quantity` and order by `ss_wholesale_cost`, paying special attention to query 28, which is the one most likely to see a performance improvement here. A rough sketch of these two variants follows this list.
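For illustration only, here is one way the two `store_sales` variants from (e) could be wired up with PySpark and delta-spark. The paths are placeholders and this is a sketch under those assumptions, not the actual benchmark tooling.

```python
# Sketch of the store_sales experiment in (e).  Paths are placeholders; the
# partition and zorder column choices come straight from the list above.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("store-sales-layout-experiment")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

store_sales = spark.read.parquet("/data/raw/store_sales")

# Variant 1: partition by ss_quantity, then zorder the other columns with
# deltalake's OPTIMIZE after the initial write.
(store_sales.write.format("delta")
    .mode("overwrite")
    .partitionBy("ss_quantity")
    .save("/data/delta/store_sales_zorder"))

spark.sql("""
    OPTIMIZE delta.`/data/delta/store_sales_zorder`
    ZORDER BY (ss_wholesale_cost, ss_list_price, ss_coupon_amt)
""")

# Variant 2: partition by ss_quantity and simply sort at write time so each
# output file within a partition is ordered by ss_wholesale_cost; no OPTIMIZE
# pass afterwards.
(store_sales.repartition("ss_quantity")
    .sortWithinPartitions("ss_quantity", "ss_wholesale_cost")
    .write.format("delta")
    .mode("overwrite")
    .partitionBy("ss_quantity")
    .save("/data/delta/store_sales_sorted"))
```

Timing both the layout step and query 28 against each output table would give the comparison described in (e).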