Skip to content

Bug: [Windows] Filenames with colons cause download failure (ValueError / WinError 3) #717

@polymood

Description

@polymood

Hello,

I have identified a platform-specific bug that prevents the download of any file with a colon (:) in its name on the Windows operating system. The program attempts to interpret the filename as a path, causing it to fail with a ValueError or a WinError 3, depending on the command-line arguments.

Environment Details

  • internetarchive version: 5.5.1
  • Operating System: Windows 11
  • Python Version: 3.9

Steps to Reproduce

The bug can be reliably reproduced on any Windows machine with any file containing a colon. For example, using the file 4:4rts-1_djvu.txt from item 44rts1:

1. Attempting to download to the current directory fails:

ia download 44rts1 --glob="*.txt"

Result: The program crashes with a security validation error.

ValueError: Download path 4:4rts-1_djvu.txt is outside target directory H:\internetarchive

2. Attempting to download to a specified directory also fails:

ia download 44rts1 --glob="*.txt" --destdir="./downloads"

Result: The program crashes with a Windows filesystem error.

error downloading file 4:4rts-1_djvu.txt, exception raised: [WinError 3] The system cannot find the path specified: '4:'

Root Cause

The single source of this failure is the internetarchive.utils.sanitize_filepath function. This utility is fundamentally broken on Windows when it processes an input string that is a simple filename containing a colon.

Here is the chain of failure:

  1. Flawed Parsing: The download logic passes the raw filename (e.g., '4:4rts-1_djvu.txt') to sanitize_filepath.
  2. Incorrect os.path Behavior: The utility then incorrectly uses os.path.dirname() and os.path.basename() to split this filename. On Windows, these functions misinterpret the colon as a drive letter separator:
    • os.path.dirname('4:4rts-1_djvu.txt') returns '4:'.
    • os.path.basename('4:4rts-1_djvu.txt') returns the truncated '4rts-1_djvu.txt'.
  3. Sanitization Failure: The function reassembles these parts and ultimately returns the original, unchanged, and unsanitized filename: '4:4rts-1_djvu.txt'. The intended percent-encoding of the colon never occurs.
  4. Cascading Failures: This incorrect, unsanitized filename is then passed back to the download method, which causes it to crash in one of two ways:
    • Without --destdir, the security check Path('4:4rts-1_djvu.txt').resolve() interprets the string as a path on a non-existent "4" drive, failing the validation check and raising the ValueError.
    • With --destdir, the download logic eventually tries to perform filesystem operations on the path. When it tries to create directories or write the file, the Windows API receives the path component '4:' and fails because a drive named "4" cannot be found, raising the WinError 3.

Proposed Code Changes

Here are the necessary changes to fix the bug. The solution requires updating one function in utils.py and refactoring the security check in files.py.

1. Update internetarchive/utils.py

The core of the fix is to replace the flawed sanitize_filepath function with a more robust version that correctly handles simple filenames and avoids the pitfalls of os.path functions on Windows.

In the file internetarchive/utils.py, replace the entire existing sanitize_filepath function with the following corrected code:

# In internetarchive/utils.py

def sanitize_filepath(filepath: str, avoid_colon: bool = False) -> str:
    """Sanitizes only the filename component of a path, preserving the directory structure.

    This function is robustly designed to handle two distinct cases and avoid
    platform-specific bugs on Windows:
    1. The input is a simple filename (e.g., "4:4rts-1_djvu.txt").
    2. The input is a full path (e.g., "/path/to/file:name.txt").

    Args:
        filepath (str): The file path or filename string to be sanitized.
        avoid_colon (bool): If True, colon ':' characters within the filename
            will be percent-encoded (`%3A`). Defaults to False.

    Returns:
        str: A path with its filename component made safe for filesystem use.
    """
    # FIX: First, check if the input is a simple filename (contains no separators).
    # This is the crucial step that prevents the crash.
    if os.sep not in filepath and '/' not in filepath:
        # The entire input string is treated as a filename and sanitized directly.
        return sanitize_filename(filepath, avoid_colon)

    # If we are here, the input is a full path. We must separate the directory
    # from the filename using a reliable method. We use rsplit() for this, as
    # it's a simple string operation that is not confused by colons.
    separator = '/' if '/' in filepath else os.sep
    parts = filepath.rsplit(separator, 1)

    if len(parts) == 2:
        # The path was successfully split into a directory and a filename.
        parent_dir, filename = parts
        sanitized_filename = sanitize_filename(filename, avoid_colon)
        
        # Rejoin the directory with the now-sanitized filename.
        return parent_dir + separator + sanitized_filename
    else:
        # This is a fallback for unusual cases (e.g., the path is just '/').
        # We treat the whole string as a filename to be safe.
        return sanitize_filename(filepath, avoid_colon)

2. Update internetarchive/files.py

With the utility function fixed, we should also make the security check in the download method more robust. This new logic is clearer and less prone to edge cases with path resolution.

In the file internetarchive/files.py, inside the download method, replace the existing security check try...except block with the following refactored code:

# In internetarchive/files.py, inside the download() method

        # Critical security check: Prevent directory traversal attacks by ensuring
        # the download path doesn't escape the target directory.
        try:
            # First, establish the absolute base directory for downloads.
            base_dir = Path(destdir).resolve() if destdir else Path.cwd().resolve()
            
            # Now, safely join the sanitized filename to the base directory to create
            # the final, absolute target path. This is the most reliable way to build the path.
            target_path = (base_dir / file_path).resolve()

            # Does the fully resolved final path
            # still start with the fully resolved base directory path?
            target_path.relative_to(base_dir)
        except ValueError:
            # This will catch legitimate errors where a malicious filename like '../secret.txt'
            # might cause the resolved path to be outside the base directory.
            raise ValueError(f"Download path {file_path} is outside target directory {base_dir}")

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions