
Intel MKL-optimized TensorFlow performance degradation #23238

@patelprateek

Description

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Deep Learning VM
    Version: m10
    Based on: Debian GNU/Linux 9.5 (stretch) (GNU/Linux 4.9.0-8-amd64 x86_64)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): deep-learning image
  • TensorFlow version (use command below): 1.11
  • Python version: 2.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory: N/A

You can collect some of this information using our environment capture script
You can also obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the current behavior
Running a deep model and some wide linear models. Inference performance is very bad: 2-4x slower than running the same inference without MKL.
Describe the expected behavior
Performance should actually improve with the Intel MKL-optimized build.

Code to reproduce the issue
Code for the deep and wide linear models, or the logistic regression example code from the TensorFlow examples.
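
A minimal sketch of the kind of model and timing loop involved (not the actual production code; the feature column, random data, and layer sizes below are placeholder stand-ins):

import time

import numpy as np
import tensorflow as tf

# Placeholder wide/deep columns; the real models use their own feature columns.
feature_columns = [tf.feature_column.numeric_column("x", shape=[100])]

estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=feature_columns,
    dnn_feature_columns=feature_columns,
    dnn_hidden_units=[256, 128])

def train_input_fn():
    features = {"x": np.random.rand(1000, 100).astype(np.float32)}
    labels = np.random.randint(0, 2, size=(1000,))
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(128)

def predict_input_fn():
    features = {"x": np.random.rand(1000, 100).astype(np.float32)}
    return tf.data.Dataset.from_tensor_slices(features).batch(128)

estimator.train(input_fn=train_input_fn, steps=20)

# Time CPU inference only; this is where the 2-4x slowdown shows up on the MKL build.
start = time.time()
predictions = list(estimator.predict(input_fn=predict_input_fn))
print("inference: %.3fs for %d examples" % (time.time() - start, len(predictions)))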

Other info / logs
When running on a GPU machine using the Google Deep Learning image version M9 (image: tf-latest-cu92, version M9). Note: inference runs only on the CPU because I turn off visibility of the CUDA devices, so the TensorFlow code runs on CPU only. The image family claims the packages are Intel optimized, but when I run the benchmarks with verbosity on, I do not observe any MKL-related output.
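
For reference, GPU visibility is turned off in the usual way, by hiding the CUDA devices before TensorFlow initializes (a minimal sketch of the standard approach, not the exact code):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide GPUs before TensorFlow is imported
import tensorflow as tf
print(tf.test.is_gpu_available())  # prints False when CPU-only inference is in effect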

I then start another Deep Learning image (tf-latest-cpu, version M10). Running the exact same code on this machine with the environment variable export MKL_VERBOSE=1, I observe a lot of OpenMP thread settings, KMP_xxx settings, and MKL calls logged with timing information. I didn't observe any of that on the M9 GPU image, even though in both places the command produces the following logs:
M9 GPU image:
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x55fd25117d40,1,0x55fd25117d40,1) 1.61ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
1.11.0

M10 CPU image:
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime

User settings:

KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=32

Effective settings:

KMP_ABORT_DELAY=0
KMP_ADAPTIVE_LOCK_PROPS='1,1024'
KMP_ALIGN_ALLOC=64
KMP_ALL_THREADPRIVATE=128
KMP_ATOMIC_MODE=2
KMP_BLOCKTIME=0
KMP_CPUINFO_FILE: value is not defined
KMP_DETERMINISTIC_REDUCTION=false
KMP_DEVICE_THREAD_LIMIT=2147483647
KMP_DISP_HAND_THREAD=false
KMP_DISP_NUM_BUFFERS=7
KMP_DUPLICATE_LIB_OK=false
KMP_FORCE_REDUCTION: value is not defined
KMP_FOREIGN_THREADS_THREADPRIVATE=true
KMP_FORKJOIN_BARRIER='2,2'
KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
KMP_FORKJOIN_FRAMES=true
KMP_FORKJOIN_FRAMES_MODE=3
KMP_GTID_MODE=3
KMP_HANDLE_SIGNALS=false
KMP_HOT_TEAMS_MAX_LEVEL=1
KMP_HOT_TEAMS_MODE=0
KMP_INIT_AT_FORK=true
KMP_INIT_WAIT=2048
KMP_ITT_PREPARE_DELAY=0
KMP_LIBRARY=throughput
KMP_LOCK_KIND=queuing
KMP_MALLOC_POOL_INCR=1M
KMP_NEXT_WAIT=1024
KMP_NUM_LOCKS_IN_BLOCK=1
KMP_PLAIN_BARRIER='2,2'
KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
KMP_REDUCTION_BARRIER='1,1'
KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
KMP_SCHEDULE='static,balanced;guided,iterative'
KMP_SETTINGS=true
KMP_SPIN_BACKOFF_PARAMS='4096,100'
KMP_STACKOFFSET=64
KMP_STACKPAD=0
KMP_STACKSIZE=4M
KMP_STORAGE_MAP=false
KMP_TASKING=2
KMP_TASKLOOP_MIN_TASKS=0
KMP_TASK_STEALING_CONSTRAINT=1
KMP_TEAMS_THREAD_LIMIT=32
KMP_TOPOLOGY_METHOD=all
KMP_USER_LEVEL_MWAIT=false
KMP_VERSION=false
KMP_WARNINGS=true
OMP_AFFINITY_FORMAT='OMP: pid %P tid %T thread %n bound to OS proc set {%a}'
OMP_ALLOCATOR=omp_default_mem_alloc
OMP_CANCELLATION=false
OMP_DEBUG=disabled
OMP_DEFAULT_DEVICE=0
OMP_DISPLAY_AFFINITY=false
OMP_DISPLAY_ENV=false
OMP_DYNAMIC=false
OMP_MAX_ACTIVE_LEVELS=2147483647
OMP_MAX_TASK_PRIORITY=0
OMP_NESTED=false
OMP_NUM_THREADS='32'
OMP_PLACES: value is not defined
OMP_PROC_BIND='intel'
OMP_SCHEDULE='static'
OMP_STACKSIZE=4M
OMP_TARGET_OFFLOAD=DEFAULT
OMP_THREAD_LIMIT=2147483647
OMP_TOOL=enabled
OMP_TOOL_LIBRARIES: value is not defined
OMP_WAIT_POLICY=PASSIVE
KMP_AFFINITY='verbose,warnings,respect,granularity=fine,compact,1,0'

OMP: Info #212: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #210: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: 0-31
OMP: Info #156: KMP_AFFINITY: 32 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 16 cores/pkg x 2 threads/core (16 total cores)
OMP: Info #214: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 0 core 3 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 0 core 4 thread 1

OMP: Info #250: KMP_AFFINITY: pid 8331 tid 8331 thread 0 bound to OS proc set 0
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.20GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x5622b7736500,1,0x5622b7736500,1) 2.54ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
1.11.0

So I assume Intel MKL is being used in the M10 image, whereas MKL is not being used in the M9 image (note: I have turned off visibility of the CUDA devices, so only CPU inference is being compared). I observe a 2-4x performance degradation with Intel MKL.
The MKL-suggested flags are set as recommended (a sketch of the companion session-level thread settings follows the list):
KMP_AFFINITY=granularity=fine,verbose,compact,1,0
KMP_BLOCKTIME=0
KMP_SETTINGS=1
OMP_NUM_THREADS=32
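
In addition to those environment variables, the usual companion knobs on an MKL build are the TensorFlow session thread pools; a sketch of how they can be passed in (the values are placeholders for this 16-core / 32-thread machine, not verified-good settings):

import tensorflow as tf

# Session-level threading commonly tuned alongside the KMP_/OMP_ settings above.
session_config = tf.ConfigProto(
    intra_op_parallelism_threads=16,  # roughly one per physical core
    inter_op_parallelism_threads=2)

# For estimator-based models the session config is passed via RunConfig.
run_config = tf.estimator.RunConfig(session_config=session_config)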

Any ideas on how to debug the root cause and get the maximum performance for my models?

Metadata

Labels

comp:mkl, stale, stat:awaiting response
