Description
Your current environment
A machine with an A100 GPU, building from the 0.5.5 Dockerfile (Dockerfile0.5.5).
🐛 Describe the bug
64 running build_ext
#34 3.401 Using MAX_JOBS=8 as the number of jobs.
#34 3.405 Using NVCC_THREADS=8 as the number of nvcc threads.
#34 3.660 -- The CXX compiler identification is GNU 11.4.0
#34 3.730 -- Detecting CXX compiler ABI info
#34 4.065 -- Detecting CXX compiler ABI info - done
#34 4.090 -- Check for working CXX compiler: /usr/bin/c++ - skipped
#34 4.090 -- Detecting CXX compile features
#34 4.091 -- Detecting CXX compile features - done
#34 4.092 -- Build type: Release
#34 4.092 -- Target device: cuda
#34 4.280 -- Found Python: /usr/bin/python3 (found version "3.10.12") found components: Interpreter Development.Module Development.SABIModule
#34 4.280 -- Found python matching: /usr/bin/python3.
#34 6.273 -- Found CUDA: /usr/local/cuda (found version "12.4")
#34 7.528 -- The CUDA compiler identification is NVIDIA 12.4.131
#34 7.540 -- Detecting CUDA compiler ABI info
#34 8.908 -- Detecting CUDA compiler ABI info - done
#34 8.998 -- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
#34 9.027 -- Detecting CUDA compile features
#34 9.028 -- Detecting CUDA compile features - done
#34 9.036 -- Found CUDAToolkit: /usr/local/cuda/include (found version "12.4.131")
#34 9.043 -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
#34 9.347 -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
#34 9.351 -- Found Threads: TRUE
#34 9.383 -- Caffe2: CUDA detected: 12.4
#34 9.383 -- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
#34 9.383 -- Caffe2: CUDA toolkit directory: /usr/local/cuda
#34 9.731 -- Caffe2: Header version is: 12.4
#34 9.971 -- /usr/local/cuda/lib64/libnvrtc.so shorthash is 6d168ef8
#34 9.972 -- USE_CUDNN is set to 0. Compiling without cuDNN support
#34 9.972 -- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support
#34 9.972 CMake Warning at /usr/local/lib/python3.10/dist-packages/torch/share/cmake/Caffe2/public/utils.cmake:382 (message):
#34 9.972 In the future we will require one to explicitly pass TORCH_CUDA_ARCH_LIST
#34 9.972 to cmake instead of implicitly setting it as an env variable. This will
#34 9.972 become a FATAL_ERROR in future version of pytorch.
#34 9.972 Call Stack (most recent call first):
#34 9.972 /usr/local/lib/python3.10/dist-packages/torch/share/cmake/Caffe2/public/cuda.cmake:300 (torch_cuda_get_nvcc_gencode_flag)
#34 9.972 /usr/local/lib/python3.10/dist-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:86 (include)
#34 9.972 /usr/local/lib/python3.10/dist-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
#34 9.972 CMakeLists.txt:70 (find_package)
#34 9.972
#34 9.972
#34 9.973 -- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_89,code=sm_89;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_90,code=compute_90
#34 9.993 CMake Warning at /usr/local/lib/python3.10/dist-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
#34 9.993 static library kineto_LIBRARY-NOTFOUND not found.
#34 9.993 Call Stack (most recent call first):
#34 9.993 /usr/local/lib/python3.10/dist-packages/torch/share/cmake/Torch/TorchConfig.cmake:120 (append_torchlib_if_found)
#34 9.993 CMakeLists.txt:70 (find_package)
#34 9.993
#34 9.993
#34 9.994 -- Found Torch: /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch.so
#34 9.995 -- Enabling core extension.
#34 9.995 -- CUDA supported arches: 7.0;7.5;8.0;8.6;8.9;9.0
#34 9.996 -- CUDA target arches: 70-real;75-real;80-real;86-real;89-real;90-real;90-virtual
#34 145.6 -- CMake Version: 3.30.2
#34 145.6 -- CUTLASS 3.5.1
#34 145.6 -- CUDART: /usr/local/cuda/lib64/libcudart.so
#34 145.6 -- CUDA Driver: /usr/local/cuda/lib64/stubs/libcuda.so
#34 145.6 -- NVRTC: /usr/local/cuda/lib64/libnvrtc.so
#34 145.6 -- Default Install Location: install
#34 145.8 -- Found Python3: /usr/bin/python3.10 (found suitable version "3.10.12", minimum required is "3.5") found components: Interpreter
#34 145.8 -- Make cute::tuple be the new standard-layout tuple type
#34 145.8 -- CUDA Compilation Architectures: 70;72;75;80;86;87;89;90;90a
#34 145.8 -- Enable caching of reference results in conv unit tests
#34 145.8 -- Enable rigorous conv problem sizes in conv unit tests
#34 145.8 -- Using NVCC flags: --expt-relaxed-constexpr;-DCUTE_USE_PACKED_TUPLE=1;-DCUTLASS_TEST_LEVEL=0;-DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1;-DCUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED=1;-DCUTLASS_DEBUG_TRACE_LEVEL=0;-Xcompiler=-Wconversion;-Xcompiler=-fno-strict-aliasing
#34 145.9 fatal: not a git repository (or any of the parent directories): .git
#34 145.9 -- CUTLASS Revision: Unable to detect, Git returned code 128.
#34 145.9 -- Configuring cublas ...
#34 145.9 -- cuBLAS Disabled.
#34 145.9 -- Configuring cuBLAS ... done.
#34 146.2 -- Machete generation completed successfully.
#34 146.2 -- Machete generated sources: /workspace/csrc/quantization/machete/generated/machete_mm_bf16u4.cu;/workspace/csrc/quantization/machete/generated/machete_mm_bf16u4_impl_part0.cu;/workspace/csrc/quantization/machete/generated/machete_mm_bf16u4_impl_part1.cu;/workspace/csrc/quantization/machete/generated/machete_mm_bf16u4b8.cu;/workspace/csrc/quantization/machete/generated/machete_mm_bf16u4b8_impl_part0.cu;/workspace/csrc/quantization/machete/generated/machete_mm_bf16u4b8_impl_part1.cu;/workspace/csrc/quantization/machete/generated/machete_mm_bf16u8.cu;/workspace/csrc/quantization/machete/generated/machete_mm_bf16u8_impl_part0.cu;/workspace/csrc/quantization/machete/generated/machete_mm_bf16u8_impl_part1.cu;/workspace/csrc/quantization/machete/generated/machete_mm_bf16u8b128.cu;/workspace/csrc/quantization/machete/generated/machete_mm_bf16u8b128_impl_part0.cu;/workspace/csrc/quantization/machete/generated/machete_mm_bf16u8b128_impl_part1.cu;/workspace/csrc/quantization/machete/generated/machete_mm_f16u4.cu;/workspace/csrc/quantization/machete/generated/machete_mm_f16u4_impl_part0.cu;/workspace/csrc/quantization/machete/generated/machete_mm_f16u4_impl_part1.cu;/workspace/csrc/quantization/machete/generated/machete_mm_f16u4b8.cu;/workspace/csrc/quantization/machete/generated/machete_mm_f16u4b8_impl_part0.cu;/workspace/csrc/quantization/machete/generated/machete_mm_f16u4b8_impl_part1.cu;/workspace/csrc/quantization/machete/generated/machete_mm_f16u8.cu;/workspace/csrc/quantization/machete/generated/machete_mm_f16u8_impl_part0.cu;/workspace/csrc/quantization/machete/generated/machete_mm_f16u8_impl_part1.cu;/workspace/csrc/quantization/machete/generated/machete_mm_f16u8b128.cu;/workspace/csrc/quantization/machete/generated/machete_mm_f16u8b128_impl_part0.cu;/workspace/csrc/quantization/machete/generated/machete_mm_f16u8b128_impl_part1.cu;/workspace/csrc/quantization/machete/generated/machete_prepack_bf16u4.cu;/workspace/csrc/quantization/machete/generated/machete_
prepack_bf16u4b8.cu;/workspace/csrc/quantization/machete/generated/machete_prepack_bf16u8.cu;/workspace/csrc/quantization/machete/generated/machete_prepack_bf16u8b128.cu;/workspace/csrc/quantization/machete/generated/machete_prepack_f16u4.cu;/workspace/csrc/quantization/machete/generated/machete_prepack_f16u4b8.cu;/workspace/csrc/quantization/machete/generated/machete_prepack_f16u8.cu;/workspace/csrc/quantization/machete/generated/machete_prepack_f16u8b128.cu
#34 146.2 -- Enabling C extension.
#34 146.2 -- Enabling moe extension.
#34 146.2 -- Configuring done (142.7s)
#34 146.3 -- Generating done (0.1s)
#34 146.3 -- Build files have been written to: /workspace/build/temp.linux-x86_64-cpython-310
#34 146.4 Using MAX_JOBS=8 as the number of jobs.
#34 146.4 Using NVCC_THREADS=8 as the number of nvcc threads.
#34 147.0 [1/66] Building CXX object CMakeFiles/_core_C.dir/csrc/core/torch_bindings.cpp.o
#34 147.0 FAILED: CMakeFiles/_core_C.dir/csrc/core/torch_bindings.cpp.o
#34 147.0 sccache /usr/bin/c++ -DPy_LIMITED_API=3 -DTORCH_EXTENSION_NAME=_core_C -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -D_core_C_EXPORTS -I/workspace/csrc -isystem /usr/include/python3.10 -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -O3 -DNDEBUG -fPIC -D_GLIBCXX_USE_CXX11_ABI=0 -MD -MT CMakeFiles/_core_C.dir/csrc/core/torch_bindings.cpp.o -MF CMakeFiles/_core_C.dir/csrc/core/torch_bindings.cpp.o.d -o CMakeFiles/_core_C.dir/csrc/core/torch_bindings.cpp.o -c /workspace/csrc/core/torch_bindings.cpp
#34 147.0 [2024-08-27T13:58:57Z DEBUG sccache::config] Attempting to read config file at "/root/.config/sccache/config"
#34 147.0 [2024-08-27T13:58:57Z DEBUG sccache::config] Couldn't open config file: failed to open file /root/.config/sccache/config
#34 147.0 [2024-08-27T13:58:57Z DEBUG sccache::config] Attempting to read config file at "/root/.config/sccache/config"
#34 147.0 [2024-08-27T13:58:57Z DEBUG sccache::config] Couldn't open config file: failed to open file /root/.config/sccache/config
#34 147.0 [2024-08-27T13:58:57Z INFO sccache::server] start_server: port: 4226
#34 147.0 [2024-08-27T13:58:57Z INFO sccache::server] No scheduler address configured, disabling distributed sccache
#34 147.0 [2024-08-27T13:58:57Z DEBUG sccache::cache::cache] Init s3 cache with bucket vllm-build-sccache, endpoint None
#34 147.0 [2024-08-27T13:58:57Z DEBUG opendal::services::s3::backend] backend build started: S3Builder { config: S3Config { root: None, bucket: "vllm-build-sccache", endpoint: None, region: Some("us-west-2"), .. }, .. }
#34 147.0 [2024-08-27T13:58:57Z DEBUG opendal::services::s3::backend] backend use root /
#34 147.0 [2024-08-27T13:58:57Z DEBUG opendal::services::s3::backend] backend use bucket vllm-build-sccache
#34 147.0 [2024-08-27T13:58:57Z DEBUG reqsign::aws::config] load_via_profile_config_file failed: No such file or directory (os error 2)
#34 147.0
#34 147.0 Stack backtrace:
#34 147.0 0:
#34 147.0 1:
#34 147.0 2:
#34 147.0 3:
#34 147.0 4:
#34 147.0 5:
#34 147.0 6:
#34 147.0 7:
#34 147.0 8:
#34 147.0 [2024-08-27T13:58:57Z DEBUG reqsign::aws::config] load_via_profile_shared_credentials_file failed: No such file or directory (os error 2)
#34 147.0
#34 147.0 Stack backtrace:
#34 147.0 0:
#34 147.0 1:
#34 147.0 2:
#34 147.0 3:
#34 147.0 4:
#34 147.0 5:
#34 147.0 6:
#34 147.0 7:
#34 147.0 8:
#34 147.0 [2024-08-27T13:58:57Z DEBUG opendal::services::s3::backend] backend use region: us-west-2
#34 147.0 [2024-08-27T13:58:57Z DEBUG opendal::services::s3::backend] backend use endpoint: https://s3.us-west-2.amazonaws.com/vllm-build-sccache
#34 147.0 [2024-08-27T13:58:57Z DEBUG opendal::services::s3::backend] backend build finished
#34 147.0 [2024-08-27T13:58:57Z DEBUG opendal::services] service=s3 operation=metadata -> started
#34 147.0 [2024-08-27T13:58:57Z DEBUG opendal::services] service=s3 operation=metadata -> finished: AccessorInfo { scheme: S3, root: "/", name: "vllm-build-sccache", native_capability: { Stat | Read | Write | Delete | Copy | List | Presign | Batch }, full_capability: { Stat | Read | Write | CreateDir | Delete | Copy | List | Presign | Batch } }
#34 147.0 [2024-08-27T13:58:57Z DEBUG opendal::services] service=s3 operation=stat path=.sccache_check -> started
#34 147.0 [2024-08-27T13:58:57Z DEBUG hyper::proto::h1::io] flushed 183 bytes
#34 147.0 [2024-08-27T13:58:57Z DEBUG hyper::proto::h1::io] parsed 5 headers
#34 147.0 [2024-08-27T13:58:57Z DEBUG hyper::proto::h1::conn] incoming body is content-length (154 bytes)
#34 147.0 [2024-08-27T13:58:57Z DEBUG hyper::proto::h1::conn] incoming body completed
#34 147.0 [2024-08-27T13:58:57Z DEBUG reqsign::aws::credential] load credential via imds_v2 failed: request to AWS EC2 Metadata Services failed:
#34 147.0
#34 147.0 <title>404 Not Found</title>
#34 147.0
#34 147.0
#34 147.0
#34 147.0 404 Not Found
#34 147.0 The resource could not be found.
#34 147.0
#34 147.0
#34 147.0
#34 147.0
#34 147.0
#34 147.0
#34 147.0 Stack backtrace:
#34 147.0 0:
#34 147.0 1:
#34 147.0 2:
#34 147.0 3:
#34 147.0 4:
#34 147.0 5:
#34 147.0 6:
#34 147.0 7:
#34 147.0 8:
#34 147.0 9:
#34 147.0 10:
#34 147.0 11:
#34 147.0 12:
#34 147.0 13:
#34 147.0 14:
#34 147.0 15:
#34 147.0 16:
#34 147.0 17:
#34 147.0 18:
#34 147.0 19:
#34 147.0 20:
#34 147.0 21:
#34 147.0 22:
#34 147.0 23:
#34 147.0 24:
#34 147.0 25:
#34 147.0 [2024-08-27T13:58:57Z WARN opendal::services] service=s3 operation=stat path=.sccache_check -> PermissionDenied (permanent) at stat, context: { service: s3, path: .sccache_check } => no valid credential found and anonymous access is not allowed
#34 147.0 [2024-08-27T13:58:57Z ERROR sccache::server] storage check failed for: cache storage failed to read: PermissionDenied (permanent) at stat => no valid credential found and anonymous access is not allowed
#34 147.0
#34 147.0 Context:
#34 147.0 service: s3
#34 147.0 path: .sccache_check
#34 147.0
#34 147.0 Backtrace:
#34 147.0 0:
#34 147.0 1:
#34 147.0 2:
#34 147.0 3:
#34 147.0 4:
#34 147.0 5:
#34 147.0 6:
#34 147.0 7:
#34 147.0 8:
#34 147.0 9:
#34 147.0 10:
#34 147.0 11:
#34 147.0 12:
#34 147.0 13:
#34 147.0 14:
#34 147.0 15:
#34 147.0 16:
#34 147.0 17:
#34 147.0 18:
#34 147.0 19:
#34 147.0 20:
#34 147.0 21:
#34 147.0
#34 147.0
#34 147.0
#34 147.0 Stack backtrace:
#34 147.0 0:
#34 147.0 1:
#34 147.0 2:
#34 147.0 3:
#34 147.0 4:
#34 147.0 5:
#34 147.0 6:
#34 147.0 7:
#34 147.0 8:
#34 147.0 [2024-08-27T13:58:57Z DEBUG sccache::server] notify_server_startup(Err { reason: "cache storage failed to read: PermissionDenied (permanent) at stat => no valid credential found and anonymous access is not allowed\n\nContext:\n service: s3\n path: .sccache_check\n\nBacktrace:\n 0: \n 1: \n 2: \n 3: \n 4: \n 5: \n 6: \n 7: \n 8: \n 9: \n 10: \n 11: \n 12: \n 13: \n 14: \n 15: \n 16: \n 17: \n 18: \n 19: \n 20: \n 21: \n\n" })
#34 147.0 sccache: error: Server startup failed: cache storage failed to read: PermissionDenied (permanent) at stat => no valid credential found and anonymous access is not allowed
#34 147.0
#34 147.0 Context:
#34 147.0 service: s3
#34 147.0 path: .sccache_check
#34 147.0
#34 147.0 Backtrace:
#34 147.0 0:
#34 147.0 1:
#34 147.0 2:
#34 147.0 3:
#34 147.0 4:
#34 147.0 5:
#34 147.0 6:
#34 147.0 7:
#34 147.0 8:
#34 147.0 9:
#34 147.0 10:
#34 147.0 11:
#34 147.0 12:
#34 147.0 13:
#34 147.0 14:
#34 147.0 15:
#34 147.0 16:
#34 147.0 17:
#34 147.0 18:
#34 147.0 19:
#34 147.0 20:
#34 147.0 21:
#34 147.0
#34 147.0
#34 147.0 Run with SCCACHE_LOG=debug SCCACHE_NO_DAEMON=1 to get more information
#34 147.0 sccache: error: cache storage failed to read: PermissionDenied (permanent) at stat => no valid credential found and anonymous access is not allowed
#34 147.0
#34 147.0 Context:
#34 147.0 service: s3
#34 147.0 path: .sccache_check
#34 147.0
#34 147.0 Backtrace:
#34 147.0 0:
#34 147.0 1:
#34 147.0 2:
#34 147.0 3:
#34 147.0 4:
#34 147.0 5:
#34 147.0 6:
#34 147.0 7:
#34 147.0 8:
#34 147.0 9:
#34 147.0 10:
#34 147.0 11:
#34 147.0 12:
#34 147.0 13:
#34 147.0 14:
#34 147.0 15:
#34 147.0 16:
#34 147.0 17:
#34 147.0 18:
#34 147.0 19:
#34 147.0 20:
#34 147.0 21:
#34 147.0
#34 147.0
#34 147.0 ninja: build stopped: subcommand failed.
#34 147.0 Traceback (most recent call last):
#34 147.0 File "/workspace/setup.py", line 474, in <module>
#34 147.0 setup(
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/__init__.py", line 111, in setup
#34 147.0 return distutils.core.setup(**attrs)
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/core.py", line 184, in setup
#34 147.0 return run_commands(dist)
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/core.py", line 200, in run_commands
#34 147.0 dist.run_commands()
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 964, in run_commands
#34 147.0 self.run_command(cmd)
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 948, in run_command
#34 147.0 super().run_command(command)
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 983, in run_command
#34 147.0 cmd_obj.run()
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/command/bdist_wheel.py", line 384, in run
#34 147.0 self.run_command("build")
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py", line 316, in run_command
#34 147.0 self.distribution.run_command(command)
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 948, in run_command
#34 147.0 super().run_command(command)
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 983, in run_command
#34 147.0 cmd_obj.run()
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build.py", line 135, in run
#34 147.0 self.run_command(cmd_name)
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py", line 316, in run_command
#34 147.0 self.distribution.run_command(command)
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 948, in run_command
#34 147.0 super().run_command(command)
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 983, in run_command
#34 147.0 cmd_obj.run()
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/command/build_ext.py", line 96, in run
#34 147.0 _build_ext.run(self)
#34 147.0 File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build_ext.py", line 359, in run
#34 147.0 self.build_extensions()
#34 147.0 File "/workspace/setup.py", line 238, in build_extensions
#34 147.0 subprocess.check_call(["cmake", *build_args], cwd=self.build_temp)
#34 147.0 File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
#34 147.0 raise CalledProcessError(retcode, cmd)
#34 147.0 subprocess.CalledProcessError: Command '['cmake', '--build', '.', '-j=1', '--target=_core_C', '--target=_moe_C', '--target=_C']' returned non-zero exit status 1.
executor failed running [/bin/sh -c if [ "$USE_SCCACHE" = "1" ]; then echo "Installing sccache..." && tar -xzf sccache.tar.gz && sudo mv sccache-v0.8.1-x86_64-unknown-linux-musl/sccache /usr/bin/sccache && rm -rf sccache.tar.gz sccache-v0.8.1-x86_64-unknown-linux-musl && export SCCACHE_BUCKET=${SCCACHE_BUCKET_NAME} && export SCCACHE_REGION=${SCCACHE_REGION_NAME} && export SCCACHE_IDLE_TIMEOUT=0 && export CMAKE_BUILD_TYPE=Release && export SCCACHE_LOG=debug && export SCCACHE_NO_DAEMON=1 && sccache --show-stats && python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 && sccache --show-stats; fi]: exit code: 1
Command executed: DOCKER_BUILDKIT=1 docker build -t vllm0.5.5 -f Dockerfile0.5.5 . --build-arg USE_SCCACHE=1 --build-arg max_jobs=8
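For comparison, here is a minimal sketch of the same build with the sccache build arg left out. The log shows sccache trying to reach the S3 bucket vllm-build-sccache and failing with "no valid credential found", and the failing RUN step only executes when USE_SCCACHE is "1". It is an assumption, not something this log confirms, that the Dockerfile has a separate non-sccache build path that runs when the arg is omitted. The DRY_RUN variable is a hypothetical guard added for illustration so the command can be inspected before it is run.

```shell
#!/bin/sh
# Sketch: same image tag, Dockerfile and max_jobs as the failing command,
# but WITHOUT --build-arg USE_SCCACHE=1, so the S3-backed sccache step
# guarded by `if [ "$USE_SCCACHE" = "1" ]` should be skipped.
# Assumption: the Dockerfile compiles without sccache when the arg is unset.
build_args="-t vllm0.5.5 -f Dockerfile0.5.5 . --build-arg max_jobs=8"

# DRY_RUN is a hypothetical guard: by default only print the command.
# Set DRY_RUN=0 to actually start the build.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "DOCKER_BUILDKIT=1 docker build $build_args"
else
    # Intentionally unquoted so the string splits into separate arguments.
    DOCKER_BUILDKIT=1 docker build $build_args
fi
```

If the build then gets past the first compile step, that would suggest the failure comes from the missing AWS credentials for the cache bucket rather than from the source itself.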