

GermanAizek
Contributor

No description provided.

@mofosyne added the labels refactoring (Refactoring) and Review Complexity : High (Generally require indepth knowledge of LLMs or GPUs) on May 14, 2024
Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 537 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8694.58ms p(95)=20112.79ms fails=, finish reason: stop=468 truncated=69
  • Prompt processing (pp): avg=97.49tk/s p(95)=430.16tk/s
  • Token generation (tg): avg=32.38tk/s p(95)=45.99tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=remove-excess-condition-checks commit=a30c3ab02c8979f07c0de3ef18c2db8a4faa571e

prompt_tokens_seconds
[Chart: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 537 iterations"; y-axis: llamacpp:prompt_tokens_seconds; x-axis: Unix timestamps 1715662129 to 1715662753]
predicted_tokens_seconds
[Chart: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 537 iterations"; y-axis: llamacpp:predicted_tokens_seconds; x-axis: Unix timestamps 1715662129 to 1715662753]


kv_cache_usage_ratio
[Chart: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 537 iterations"; y-axis: llamacpp:kv_cache_usage_ratio; x-axis: Unix timestamps 1715662129 to 1715662753]
requests_processing
[Chart: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 537 iterations"; y-axis: llamacpp:requests_processing; x-axis: Unix timestamps 1715662129 to 1715662753]

@mofosyne requested review from ngxson and phymbert on May 14, 2024, 05:31
@mofosyne added the label Review Complexity : Medium (Generally require more time to grok but manageable by beginner to medium expertise level) and removed the label Review Complexity : High (Generally require indepth knowledge of LLMs or GPUs) on May 14, 2024
@ngxson requested a review from ggerganov on May 14, 2024, 08:37
@ggerganov merged commit 359cbe3 into ggml-org:master on May 17, 2024

Labels

refactoring (Refactoring), Review Complexity : Medium (Generally require more time to grok but manageable by beginner to medium expertise level)


4 participants