Skip to content

Conversation

findepi
Copy link
Member

@findepi findepi commented Jun 15, 2023

Locations like s3://bucket aren't consistently supported in Trino yet
and will cause compatibility issues with other software (e.g. Athena),
so forbid them. They may be allowed later.

Fixes #17921

@findepi
Copy link
Member Author

findepi commented Jun 15, 2023

@electrum @findinpath @alexjo2144 @losipiuk what is the best way to block locations like s3://bucket that are not interoperable and are partially working today?

@findepi findepi force-pushed the findepi/test-top-of-bucket-period branch from dd7ddd8 to c907053 Compare June 15, 2023 21:09
@findepi findepi changed the title Test Hive behavior when table location s3://bucket Forbid scheme://authority locations Jun 15, 2023
@findepi findepi marked this pull request as ready for review June 15, 2023 21:10
@findepi findepi force-pushed the findepi/test-top-of-bucket-period branch from c907053 to 804b353 Compare June 15, 2023 21:13
@findepi
Copy link
Member Author

findepi commented Jun 15, 2023

I added commit forbidding such locations and renamed the PR accordingly.

@findinpath
Copy link
Contributor

findinpath commented Jun 15, 2023

CREATE TABLE mytable WITH (external_location='s3://bucket');

It may not be straightforward for the end user to specify the table path as s3://bucket/ so that it can pass the validation.
What would be the cost to accept both s3://bucket and s3://bucket/ and treat them equivalently ?

When the metastore is HMS, the creation of the table / schema was supposed to be suffixed with /

trino> create table hive.tiny.foo1 (c integer) with (format='textfile', external_location='s3://foo1');
Query 20230615_202259_00001_mu7hf failed: java.lang.IllegalArgumentException: Can not create a Path from an empty string
	at io.trino.plugin.hive.metastore.thrift.ThriftHiveMetastore.createTable(ThriftHiveMetastore.java:1047)
...

I'm actually still rather confused regarding the HMS behavior:

trino> create table hive.default.foo1 (c integer) with (external_location = 's3://foo1');
Query 20230615_212929_00002_7xdr9 failed: java.lang.IllegalArgumentException: Can not create a Path from an empty string
io.trino.spi.TrinoException: java.lang.IllegalArgumentException: Can not create a Path from an empty string
	at io.trino.plugin.hive.metastore.thrift.ThriftHiveMetastore.createTable(ThriftHiveMetastore.java:1047)
trino> create table hive.default.foo1 (c integer) with (external_location = 's3://foo1/bar');
CREATE TABLE

From end user perspective, it is rather uneasy to understand the reasoning behind this functionality choice on HMS.
Both of the locations are not suffixed by /, but in case of using s3://foo1/bar as location the CREATE TABLE succeeds, but when using s3://foo1 the statement fails.

s3://foo1 and s3://foo1/ are equivalent - why not doing the necessary handling internally on Trino?

UPDATE 16.06.2023

I thought through my point about simplicity of use from end user perspective.
If the user is providing a location missing a path like s3://bucket , Trino should reject it with a meanigful message instead of accepting it.
It is better to avoid dealing with unwanted consequences of opening the door to almost correct schema/table locations.

Copy link
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@losipiuk
Copy link
Member

From end user perspective, it is rather uneasy to understand the reasoning behind this functionality choice on HMS.
s3://foo1 and s3://foo1/ are equivalent - why not doing the necessary handling internally on Trino?

Using top of the bucket as table location realistically will not be used anyway (that is my guess). It would not give us much to add behind the scenes normalization here, and it is a bit against principle of least supprise (if I specified location to X I would expect it to be set to X, not to X/).

Copy link
Member

@hashhar hashhar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will cause compatibility issues with other software (e.g. Athena), so forbid them

IIRC our compatibility target is Hive, not Athena - which is also based on Trino. And it seems Hive has no problem with such tables.

@hashhar
Copy link
Member

hashhar commented Jun 16, 2023

Using top of the bucket as table location realistically will not be used anyway (that is my guess).

In practice I've seen a table per bucket as well.

It would not give us much to add behind the scenes normalization here, and it is a bit against principle of least supprise
(if I specified location to X I would expect it to be set to X, not to X/).

This I partially agree with - s3://bucket should be disallowed but not s3://bucket/.

@hashhar hashhar dismissed their stale review June 16, 2023 06:58

stale, seems only s3://bucket stops working, not s3://bucket/

findepi added 5 commits June 16, 2023 12:12
In other places in `Location` class the location toString is not in
apostrophes, so let's change this one place not to use apostrophes
either.
Assert on full message, not a single word contained within.
Note: this documents current state, but does not imply the intent to
support such locations.
Test Hive connector behavior when table location is top-of-bucket, but
without slash at the end (`s3://bucket` and not `s3://bucket/`).

As seen, the behavior is not fully coherent, so maybe it would be better
to disallow such locations. FWIW, such locations aren't supported in
Athena.
@findepi findepi force-pushed the findepi/test-top-of-bucket-period branch from 927ea05 to 5508c66 Compare June 16, 2023 12:51
@findepi findepi changed the title Forbid scheme://authority locations Forbid scheme://authority locations (without trailing slash) Jun 16, 2023
Locations like `s3://bucket` aren't consistently supported in Trino yet
and will cause compatibility issues with other software (e.g. Athena),
so forbid them. They may be allowed later.
@findepi findepi force-pushed the findepi/test-top-of-bucket-period branch from 5508c66 to c1cd116 Compare June 16, 2023 13:29
@findepi findepi requested review from electrum, martint and losipiuk and removed request for electrum June 16, 2023 18:35
@mosabua
Copy link
Member

mosabua commented Jun 16, 2023

Question @findepi and @martint .. with that also apply to other S3 related paths and config, and specifically also the docs. For example we have

    exchange.base-directories=s3://exchange-spooling-bucket

Will that fail?

And also

    location = 's3://my-bucket/a/path',

Will that fail to? Or generically .. should we just make sure that all of these (also for others like gcs and abfs) always have the trailing slash in the docs everywhere?

@findepi
Copy link
Member Author

findepi commented Jun 17, 2023

For example we have

    exchange.base-directories=s3://exchange-spooling-bucket

Will that fail?

Might. I don't know.

And also

    location = 's3://my-bucket/a/path',

Will that fail to?

No. Only s3://my-bucket is impacted. s3://my-bucket/.. is not.

@findepi findepi merged commit 4be107e into master Jun 17, 2023
@findepi findepi deleted the findepi/test-top-of-bucket-period branch June 17, 2023 16:50
@github-actions github-actions bot added this to the 420 milestone Jun 17, 2023
@mosabua
Copy link
Member

mosabua commented Jun 19, 2023

For example we have

    exchange.base-directories=s3://exchange-spooling-bucket

Will that fail?

Might. I don't know.

@arhimondr @linzebing .. could you verify and maybe send a PR to update the docs?

@losipiuk
Copy link
Member

Might. I don't know.

@arhimondr @linzebing .. could you verify and maybe send a PR to update the docs?

It will not - actually top of the bucket is configuration we use in unit tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

Disallow tables with s3://bucket location (without trailing slash after bucket name)
7 participants