Downloading IMDB dataset for benchmarks gives 404 Not Found #13896

Closed
@alihan-synnada


Describe the bug

Attempting to download the IMDB dataset gives the following error:

tar: Error opening archive: Unrecognized archive format

An IMDB.tgz is created with the following content:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL was not found on this server.</p>
</body></html>

It seems the dataset has been removed from the server or is otherwise unavailable at this URL.

To Reproduce

Run benchmarks/bench.sh data imdb

Expected behavior

It should download the dataset, extract the CSV files, and convert them to Parquet.

Additional context

The related part in bench.sh

# Downloads the csv.gz files IMDB datasets from Peter Boncz's homepage(one of the JOB paper authors)
# http://homepages.cwi.nl/~boncz/job/imdb.tgz
data_imdb() {
    local imdb_dir="${DATA_DIR}/imdb"
    local imdb_temp_gz="${imdb_dir}/imdb.tgz"
    local imdb_url="https://homepages.cwi.nl/~boncz/job/imdb.tgz"
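
One possible way to surface this failure earlier is to make the download step fail on HTTP errors and to check that the saved file is really a gzip archive before handing it to tar. The following is only a sketch under assumptions: `is_gzip_archive` and `fetch_archive` are hypothetical helper names that do not exist in bench.sh, and it assumes `curl` is available (the excerpt above does not show which download tool the script actually uses).

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of bench.sh).
# Returns 0 if the file starts with the gzip magic bytes 1f 8b, non-zero otherwise.
is_gzip_archive() {
    local file="$1"
    local magic
    # Read the first two bytes and print them as hex without separators.
    magic="$(head -c 2 "$file" | od -An -tx1 | tr -d ' \n')"
    [ "$magic" = "1f8b" ]
}

# Hypothetical helper (not part of bench.sh).
# Downloads $1 to $2 and fails instead of keeping an HTML error page.
fetch_archive() {
    local url="$1"
    local dest="$2"
    # --fail makes curl exit non-zero on HTTP errors such as 404
    # instead of saving the server's error page as the output file.
    if ! curl --fail --silent --show-error --location --output "$dest" "$url"; then
        echo "Download failed: $url" >&2
        rm -f "$dest"
        return 1
    fi
    # Guard against servers that answer 200 with a non-archive body.
    if ! is_gzip_archive "$dest"; then
        echo "Downloaded file is not a gzip archive: $dest" >&2
        rm -f "$dest"
        return 1
    fi
}
```

With a guard like this, a call such as `fetch_archive "$imdb_url" "$imdb_temp_gz"` would stop with a download error rather than leaving a 404 page on disk for tar to reject.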

Labels: bug