I am trying to use MPI, but each node uses the full RAM. Is this how MPI is supposed to work? I didn't think it was. Here are the details.
I am on commit 1cbf561. I modified the Makefile so I could compile it like this (see #2208).
LLAMA_MPI=1 LLAMA_METAL=1 make CC=/opt/homebrew/bin/mpicc CXX=/opt/homebrew/bin/mpicxx
I run the following.
mpirun -hostfile hostfile -n 3 ./main -m airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin -n 128 -p "Q. What is the capital of Germany? A. Berlin. Q. What is the capital of France? A."
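The hostfile just lists one node per line. Mine looks like the following, with placeholder hostnames standing in for my actual machines:
node1.local
node2.local
node3.local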
This is the output. It works, but each node uses 39 GB of RAM. Each node has only 16 GB of RAM, so they swap badly.
main: build = 827 (1cbf561)
main: seed = 1689216374
main: build = 827 (1cbf561)
main: seed = 1689216374
main: build = 827 (1cbf561)
main: seed = 1689216374
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.19 MB
llama_model_load_internal: mem required = 38610.47 MB (+ 5120.00 MB per state)
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.19 MB
llama_model_load_internal: mem required = 38610.47 MB (+ 5120.00 MB per state)
llama_new_context_with_model: kv self size = 1280.00 MB
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.19 MB
llama_model_load_internal: mem required = 38610.47 MB (+ 5120.00 MB per state)
llama_new_context_with_model: kv self size = 1280.00 MB
llama_new_context_with_model: kv self size = 1280.00 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
Q. What is the capital of Germany? A. Berlin. Q. What is the capital of France? A. Paris. [end of text]
llama_print_timings: load time = 149282.74 ms
llama_print_timings: sample time = 2.15 ms / 3 runs ( 0.72 ms per token, 1397.95 tokens per second)
llama_print_timings: prompt eval time = 20222.54 ms / 25 tokens ( 808.90 ms per token, 1.24 tokens per second)
llama_print_timings: eval time = 2537.97 ms / 2 runs ( 1268.99 ms per token, 0.79 tokens per second)
llama_print_timings: total time = 22764.59 ms
[[email protected]] HYDU_sock_write (utils/sock/sock.c:256): write error (Bad file descriptor)
[[email protected]] control_cb (pm/pmiserv/pmiserv_cb.c:316): error writing to control socket
[[email protected]] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[[email protected]] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[[email protected]] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
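The 39 GB per node lines up with the log: mem required (38610.47 MB) plus the KV cache (1280.00 MB) comes to about 39 GB, so each rank appears to load the entire model rather than just its share of the layers. Against 16 GB of physical RAM per node, most of that has to page out. To confirm the swapping, I can check swap usage on each node while the job runs (placeholder hostnames again, matching the hostfile):
for host in node1.local node2.local node3.local; do
  ssh "$host" sysctl vm.swapusage
done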
If I enable Metal, it errors out.
mpirun -hostfile hostfile -n 3 ./main -m airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin -n 128 -ngl 1 -p "Q. What is the capital of Germany? A. Berlin. Q. What is the capital of France? A."
Output.
main: build = 827 (1cbf561)
main: seed = 1689216039
main: build = 827 (1cbf561)
main: seed = 1689216039
main: build = 827 (1cbf561)
main: seed = 1689216040
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.19 MB
llama_model_load_internal: mem required = 38610.47 MB (+ 5120.00 MB per state)
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.19 MB
llama_model_load_internal: mem required = 38610.47 MB (+ 5120.00 MB per state)
llama_new_context_with_model: kv self size = 1280.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/james/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x13b604a40
ggml_metal_init: loaded kernel_mul 0x13b605630
ggml_metal_init: loaded kernel_mul_row 0x13b605c20
ggml_metal_init: loaded kernel_scale 0x13b606210
ggml_metal_init: loaded kernel_silu 0x13b606800
ggml_metal_init: loaded kernel_relu 0x13b606df0
ggml_metal_init: loaded kernel_gelu 0x13b6073e0
ggml_metal_init: loaded kernel_soft_max 0x13b607cf0
ggml_metal_init: loaded kernel_diag_mask_inf 0x13b608400
ggml_metal_init: loaded kernel_get_rows_f16 0x13b608b40
ggml_metal_init: loaded kernel_get_rows_q4_0 0x12b6042f0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x12b604b70
ggml_metal_init: loaded kernel_get_rows_q2_K 0x12b605120
ggml_metal_init: loaded kernel_get_rows_q3_K 0x14b7050c0
ggml_metal_init: loaded kernel_get_rows_q4_K 0x14b705790
ggml_metal_init: loaded kernel_get_rows_q5_K 0x12b605460
ggml_metal_init: loaded kernel_get_rows_q6_K 0x12b605b30
ggml_metal_init: loaded kernel_rms_norm 0x12b606440
ggml_metal_init: loaded kernel_norm 0x12b606d50
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x12b6077e0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x12b607dd0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x12b6083d0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x12b6089d0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x12b609170
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x12b609770
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x12b609d70
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x106304490
ggml_metal_init: loaded kernel_rope 0x106305300
ggml_metal_init: loaded kernel_alibi_f32 0x106305e20
ggml_metal_init: loaded kernel_cpy_f32_f16 0x106306920
ggml_metal_init: loaded kernel_cpy_f32_f32 0x106307420
ggml_metal_init: loaded kernel_cpy_f16_f16 0x106308070
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 140.62 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8192.00 MB, offs = 0
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8192.00 MB, offs = 8442462208
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8192.00 MB, offs = 16884924416
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8192.00 MB, offs = 25327386624
ggml_metal_add_buffer: allocated 'data ' buffer, size = 2821.31 MB, offs = 33769848832, (35589.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1536.00 MB, (37125.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1282.00 MB, (38407.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 1024.00 MB, (39431.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 1024.00 MB, (40455.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
llama_new_context_with_model: kv self size = 1280.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/james/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x1080044b0
ggml_metal_init: loaded kernel_mul 0x1080051c0
ggml_metal_init: loaded kernel_mul_row 0x1080057b0
ggml_metal_init: loaded kernel_scale 0x108104330
ggml_metal_init: loaded kernel_silu 0x108104a40
ggml_metal_init: loaded kernel_relu 0x108005c80
ggml_metal_init: loaded kernel_gelu 0x108006390
ggml_metal_init: loaded kernel_soft_max 0x108006ca0
ggml_metal_init: loaded kernel_diag_mask_inf 0x107704610
ggml_metal_init: loaded kernel_get_rows_f16 0x107704e70
ggml_metal_init: loaded kernel_get_rows_q4_0 0x107705420
ggml_metal_init: loaded kernel_get_rows_q4_1 0x107705b40
ggml_metal_init: loaded kernel_get_rows_q2_K 0x1077060f0
ggml_metal_init: loaded kernel_get_rows_q3_K 0x1077066a0
ggml_metal_init: loaded kernel_get_rows_q4_K 0x1082041a0
ggml_metal_init: loaded kernel_get_rows_q5_K 0x108204870
ggml_metal_init: loaded kernel_get_rows_q6_K 0x107706b30
ggml_metal_init: loaded kernel_rms_norm 0x107706f90
ggml_metal_init: loaded kernel_norm 0x1077078a0
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x1082051c0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x1082058d0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x108205ed0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x1082064d0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x108206c70
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x108207270
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x108207870
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x108207e70
ggml_metal_init: loaded kernel_rope 0x108208bc0
ggml_metal_init: loaded kernel_alibi_f32 0x1082096e0
ggml_metal_init: loaded kernel_cpy_f32_f16 0x10820a1e0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x10820ace0
ggml_metal_init: loaded kernel_cpy_f16_f16 0x1080078f0
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 140.62 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8192.00 MB, offs = 0
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8192.00 MB, offs = 8442462208
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8192.00 MB, offs = 16884924416
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8192.00 MB, offs = 25327386624
ggml_metal_add_buffer: allocated 'data ' buffer, size = 2821.31 MB, offs = 33769848832, (35589.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1536.00 MB, (37125.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1282.00 MB, (38407.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 1024.00 MB, (39431.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 1024.00 MB, (40455.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
llama.cpp: loading model from airoboros-65B-gpt4-1.2.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.19 MB
llama_model_load_internal: mem required = 38610.47 MB (+ 5120.00 MB per state)
llama_new_context_with_model: kv self size = 1280.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/james/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x132605260
ggml_metal_init: loaded kernel_mul 0x132605e50
ggml_metal_init: loaded kernel_mul_row 0x132606440
ggml_metal_init: loaded kernel_scale 0x132606a30
ggml_metal_init: loaded kernel_silu 0x132607020
ggml_metal_init: loaded kernel_relu 0x132607610
ggml_metal_init: loaded kernel_gelu 0x132607c00
ggml_metal_init: loaded kernel_soft_max 0x132608510
llama_new_context_with_model: max tensor size = 140.62 MB
ggml_metal_init: loaded kernel_diag_mask_inf 0x1068046d0
ggml_metal_init: loaded kernel_get_rows_f16 0x106804e10
ggml_metal_init: loaded kernel_get_rows_q4_0 0x1068053c0
ggml_metal_init: loaded kernel_get_rows_q4_1 0x106805ae0
ggml_metal_init: loaded kernel_get_rows_q2_K 0x106806090
ggml_metal_init: loaded kernel_get_rows_q3_K 0x106806640
ggml_metal_init: loaded kernel_get_rows_q4_K 0x106806bf0
ggml_metal_init: loaded kernel_get_rows_q5_K 0x1068071a0
ggml_metal_init: loaded kernel_get_rows_q6_K 0x106807750
ggml_metal_init: loaded kernel_rms_norm 0x106808060
ggml_metal_init: loaded kernel_norm 0x106808970
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x106809400
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x1068099f0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x106809ff0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x10680a5f0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x10680ad90
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x10680b390
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x10680b990
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x10680bf90
ggml_metal_init: loaded kernel_rope 0x10680cce0
ggml_metal_init: loaded kernel_alibi_f32 0x10680d800
ggml_metal_init: loaded kernel_cpy_f32_f16 0x10680e300
ggml_metal_init: loaded kernel_cpy_f32_f32 0x10680ee00
ggml_metal_init: loaded kernel_cpy_f16_f16 0x10680fa50
ggml_metal_init: recommendedMaxWorkingSetSize = 10922.67 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8192.00 MB, offs = 0
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8192.00 MB, offs = 8442462208
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8192.00 MB, offs = 16884924416
ggml_metal_add_buffer: allocated 'data ' buffer, size = 8192.00 MB, offs = 25327386624
ggml_metal_add_buffer: allocated 'data ' buffer, size = 2821.31 MB, offs = 33769848832, (35589.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1536.00 MB, (37125.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 1282.00 MB, (38407.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 1024.00 MB, (39431.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 1024.00 MB, (40455.77 / 10922.67), warning: current allocated size is greater than the recommended max working set size
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
I'm guessing it fails because it runs out of memory: the warnings show 40455.77 MB allocated against a recommendedMaxWorkingSetSize of 10922.67 MB.
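To narrow it down further, a rough sketch of something I can try is sampling each rank's resident set size while the job runs (placeholder hostnames; assumes the binary is named main on every node):
for host in node1.local node2.local node3.local; do
  ssh "$host" 'pid=$(pgrep -x main | head -1); while kill -0 "$pid" 2>/dev/null; do ps -o rss= -p "$pid"; sleep 5; done' &
done
wait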