Bug: [Windows] Filenames with colons cause download failure (ValueError / WinError 3)

Hello,

I have identified a platform-specific bug that prevents the download of any file with a colon (`:`) in its name on the Windows operating system. The program attempts to interpret the filename as a path, causing it to fail with a `ValueError` or a `WinError 3`, depending on the command-line arguments.

### Environment Details

*   **internetarchive version:** `5.5.1` 
*   **Operating System:** Windows 11
*   **Python Version:** 3.9

### Steps to Reproduce

The bug can be reliably reproduced on any Windows machine with any file containing a colon. For example, using the file `4:4rts-1_djvu.txt` from item `44rts1`:

**1. Attempting to download to the current directory fails:**
```shell
ia download 44rts1 --glob="*.txt"
```
**Result:** The program crashes with a security validation error.
```
ValueError: Download path 4:4rts-1_djvu.txt is outside target directory H:\internetarchive
```

**2. Attempting to download to a specified directory also fails:**
```shell
ia download 44rts1 --glob="*.txt" --destdir="./downloads"
```
**Result:** The program crashes with a Windows filesystem error.
```
error downloading file 4:4rts-1_djvu.txt, exception raised: [WinError 3] The system cannot find the path specified: '4:'
```

### Root Cause

The single source of this failure is the `internetarchive.utils.sanitize_filepath` function. This utility is fundamentally broken on Windows when it processes an input string that is a simple filename containing a colon.

Here is the chain of failure:

1.  **Flawed Parsing:** The `download` logic passes the raw filename (e.g., `'4:4rts-1_djvu.txt'`) to `sanitize_filepath`.
2.  **Incorrect `os.path` Behavior:** The utility then incorrectly uses `os.path.dirname()` and `os.path.basename()` to split this filename. On Windows, these functions misinterpret the colon as a drive letter separator:
    *   `os.path.dirname('4:4rts-1_djvu.txt')` returns `'4:'`.
    *   `os.path.basename('4:4rts-1_djvu.txt')` returns the truncated `'4rts-1_djvu.txt'`.
3.  **Sanitization Failure:** The function reassembles these parts and ultimately returns the original, unchanged, and unsanitized filename: `'4:4rts-1_djvu.txt'`. The intended percent-encoding of the colon never occurs.
4.  **Cascading Failures:** This incorrect, unsanitized filename is then passed back to the `download` method, which causes it to crash in one of two ways:
    *   **Without `--destdir`**, the security check `Path('4:4rts-1_djvu.txt').resolve()` interprets the string as a path on a non-existent "4" drive, failing the validation check and raising the `ValueError`.
    *   **With `--destdir`**, the download logic eventually tries to perform filesystem operations on the path. When it tries to create directories or write the file, the Windows API receives the path component `'4:'` and fails because a drive named "4" cannot be found, raising the `WinError 3`.

### **Proposed Code Changes**

Here are the necessary changes to fix the bug. The solution requires updating one function in `utils.py` and refactoring the security check in `files.py`.

#### **1. Update `internetarchive/utils.py`**

The core of the fix is to replace the flawed `sanitize_filepath` function with a more robust version that correctly handles simple filenames and avoids the pitfalls of `os.path` functions on Windows.

In the file `internetarchive/utils.py`, replace the entire existing `sanitize_filepath` function with the following corrected code:

```python
# In internetarchive/utils.py

def sanitize_filepath(filepath: str, avoid_colon: bool = False) -> str:
    """Sanitizes only the filename component of a path, preserving the directory structure.

    This function is robustly designed to handle two distinct cases and avoid
    platform-specific bugs on Windows:
    1. The input is a simple filename (e.g., "4:4rts-1_djvu.txt").
    2. The input is a full path (e.g., "/path/to/file:name.txt").

    Args:
        filepath (str): The file path or filename string to be sanitized.
        avoid_colon (bool): If True, colon ':' characters within the filename
            will be percent-encoded (`%3A`). Defaults to False.

    Returns:
        str: A path with its filename component made safe for filesystem use.
    """
    # FIX: First, check if the input is a simple filename (contains no separators).
    # This is the crucial step that prevents the crash.
    if os.sep not in filepath and '/' not in filepath:
        # The entire input string is treated as a filename and sanitized directly.
        return sanitize_filename(filepath, avoid_colon)

    # If we are here, the input is a full path. We must separate the directory
    # from the filename using a reliable method. We use rsplit() for this, as
    # it's a simple string operation that is not confused by colons.
    separator = '/' if '/' in filepath else os.sep
    parts = filepath.rsplit(separator, 1)

    if len(parts) == 2:
        # The path was successfully split into a directory and a filename.
        parent_dir, filename = parts
        sanitized_filename = sanitize_filename(filename, avoid_colon)
        
        # Rejoin the directory with the now-sanitized filename.
        return parent_dir + separator + sanitized_filename
    else:
        # This is a fallback for unusual cases (e.g., the path is just '/').
        # We treat the whole string as a filename to be safe.
        return sanitize_filename(filepath, avoid_colon)
```


#### **2. Update `internetarchive/files.py`**

With the utility function fixed, we should also make the security check in the `download` method more robust. This new logic is clearer and less prone to edge cases with path resolution.

In the file `internetarchive/files.py`, inside the `download` method, replace the existing security check `try...except` block with the following refactored code:

```python
# In internetarchive/files.py, inside the download() method

        # Critical security check: Prevent directory traversal attacks by ensuring
        # the download path doesn't escape the target directory.
        try:
            # First, establish the absolute base directory for downloads.
            base_dir = Path(destdir).resolve() if destdir else Path.cwd().resolve()
            
            # Now, safely join the sanitized filename to the base directory to create
            # the final, absolute target path. This is the most reliable way to build the path.
            target_path = (base_dir / file_path).resolve()

            # Does the fully resolved final path
            # still start with the fully resolved base directory path?
            target_path.relative_to(base_dir)
        except ValueError:
            # This will catch legitimate errors where a malicious filename like '../secret.txt'
            # might cause the resolved path to be outside the base directory.
            raise ValueError(f"Download path {file_path} is outside target directory {base_dir}")
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bug: [Windows] Filenames with colons cause download failure (ValueError / WinError 3) #717

Environment Details

Steps to Reproduce

Root Cause

Proposed Code Changes

1. Update `internetarchive/utils.py`

2. Update `internetarchive/files.py`

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Bug: [Windows] Filenames with colons cause download failure (ValueError / WinError 3) #717

Description

Environment Details

Steps to Reproduce

Root Cause

Proposed Code Changes

1. Update internetarchive/utils.py

2. Update internetarchive/files.py

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

1. Update `internetarchive/utils.py`

2. Update `internetarchive/files.py`