Closed
Describe the bug
Attempting to download the IMDB dataset gives the following error:
tar: Error opening archive: Unrecognized archive format
An IMDB.tgz file is created with the following content:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL was not found on this server.</p>
</body></html>
The download URL now returns a 404 page, so it seems the dataset has been removed or is otherwise unavailable.
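The tar error follows directly from this: the saved file is an HTML error page rather than a gzip stream. As a minimal sketch of how to confirm that locally, the helper below (a hypothetical function, `is_gzip_archive`, which is not part of bench.sh) uses `gzip -t` to test the file without extracting it.

```shell
#!/usr/bin/env bash

# Hypothetical helper, not part of bench.sh.
# Returns 0 only if the given file is a readable gzip stream.
is_gzip_archive() {
    local file="$1"
    # gzip -t checks integrity without extracting anything; it fails on
    # non-gzip input such as a saved HTML error page.
    gzip -t "$file" 2>/dev/null
}
```

For example, `is_gzip_archive IMDB.tgz || echo "IMDB.tgz is not a gzip archive"` would report the problem before tar is ever invoked.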
To Reproduce
Run benchmarks/bench.sh data imdb
Expected behavior
The script should download the dataset, extract the CSV files, and convert them to Parquet.
Additional context
The related part of bench.sh is at datafusion/benchmarks/bench.sh, lines 458 to 463 in 6cfd1cf.
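The referenced lines are not reproduced here, so the following is only a sketch of one way a download step could surface a 404 instead of silently saving the error page. The function name `download_archive` and its arguments are hypothetical; the curl options used (`--fail`, `--location`, `--silent`, `--show-error`, `--output`) are standard curl flags.

```shell
#!/usr/bin/env bash

# Hypothetical helper, not taken from bench.sh.
# Downloads $1 to $2 and fails loudly if the server returns an HTTP error.
download_archive() {
    local url="$1"
    local dest="$2"
    # --fail: exit non-zero on HTTP errors (e.g. 404) rather than saving
    #         the server's error page as if it were the archive.
    # --location: follow redirects.
    # --silent --show-error: suppress the progress meter but keep error messages.
    if ! curl --fail --location --silent --show-error --output "$dest" "$url"; then
        # Remove any partial file so later steps do not try to extract it.
        rm -f "$dest"
        echo "download failed: $url" >&2
        return 1
    fi
}
```

With a guard like this, a missing dataset would stop the script at the download step with a clear message, rather than at tar with "Unrecognized archive format".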