BLAS isn't multi-core #1883

Closed
MiXaiLL76 opened this issue Nov 23, 2018 · 34 comments

@MiXaiLL76

MiXaiLL76 commented Nov 23, 2018

When I use OpenBLAS together with dlib, multi-core processing does not work.
Previously, on another system, everything worked.

davisking/dlib#1004 (comment)
As I understand it, dlib uses multi-core openblas for multithreading.

OS: Ubuntu 18.04
Arch: arm64
Board: Orange pi
OpenBLAS version: git clone

OpenBLAS build:
make CC=aarch64-linux-gnu-gcc FC=aarch64-linux-gnu-gfortran HOSTCC=gcc TARGET=ARMV8 -j8
Config:

openblas_get_num_threads: 4
openblas_get_num_procs: 4
openblas_get_parallel: 1
openblas_get_config: NO_AFFINITY ARMV8 MAX_THREADS=4
openblas_get_corename: ARMV8
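
A minimal sketch of how the values above can be read at run time, assuming a shared library named libopenblas.so.0 is on the loader path (the library name and the helper query_openblas are assumptions for illustration; the openblas_get_* functions are the ones whose output is listed above):

```python
import ctypes


def query_openblas(lib_name="libopenblas.so.0"):
    """Return OpenBLAS runtime settings as a dict, or None if the library cannot be loaded."""
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None

    # The string-returning functions need an explicit return type;
    # the integer ones use the ctypes default of C int.
    lib.openblas_get_config.restype = ctypes.c_char_p
    lib.openblas_get_corename.restype = ctypes.c_char_p

    return {
        "num_threads": lib.openblas_get_num_threads(),
        "num_procs": lib.openblas_get_num_procs(),
        "parallel": lib.openblas_get_parallel(),
        "config": lib.openblas_get_config().decode(),
        "corename": lib.openblas_get_corename().decode(),
    }


if __name__ == "__main__":
    info = query_openblas()
    if info is None:
        print("libopenblas could not be loaded")
    else:
        for key, value in info.items():
            print(f"openblas_get_{key}: {value}")
```

Running this outside the application only shows how the library itself is configured; it does not prove that the application calls into it.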

Dlib build:
cmake -DCMAKE_C_FLAGS="-O3 -fprofile-use " -DDLIB_USE_CUDA=NO -DCMAKE_TOOLCHAIN_FILE=/mnt/c/Users/StepanOFF/Desktop/face_detect/min_core_aarch/aarch64.cmake --build --config Release ..

Both compilations have no errors and all libraries are present.

The screenshots show that only one core is busy while the program runs.
[screenshot: in work]

When nothing is being processed:
[screenshot: not in work]

@martin-frbg
Collaborator

What is your typical problem size? OpenBLAS will not (or at least should not) use multiple threads for a very small matrix, where the administrative overhead would exceed any possible gain from working in parallel.
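
The trade-off can be illustrated with a toy cost model. This is only a sketch: the throughput and overhead figures are made-up assumptions, and worth_threading is a hypothetical helper, not OpenBLAS's actual heuristic.

```python
def gemm_flops(m, n, k):
    """Floating-point operations for an m x k times k x n matrix product."""
    return 2 * m * n * k


def worth_threading(flops, n_threads, flops_per_sec=1e9, overhead_sec=20e-6):
    """Return True if splitting the work beats running it on one thread.

    flops_per_sec and overhead_sec are illustrative assumptions only.
    """
    serial = flops / flops_per_sec
    parallel = flops / (flops_per_sec * n_threads) + overhead_sec
    return parallel < serial


if __name__ == "__main__":
    for size in (8, 64, 512):
        decision = worth_threading(gemm_flops(size, size, size), n_threads=4)
        print(f"{size}x{size} GEMM on 4 threads worthwhile: {decision}")
```

Under these assumed numbers a tiny product stays single-threaded because the fixed hand-off cost dominates, while a large one benefits from being split.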

@MiXaiLL76
Author

> What is your typical problem size? OpenBLAS will not (or at least should not) use multiple threads for a very small matrix, where the administrative overhead would exceed any possible gain from working in parallel.

Running the script on one thread takes about 5-6 seconds; when multithreading works, it takes 0.8-1 second.

The problem may be a misconfiguration of dlib, but in theory that should not be the case.

@brada4
Contributor

brada4 commented Nov 23, 2018

Reference BLAS is not multicore, that is for certain.

We will need a firm test case, or at least the output of "perf record ./detect ; perf report".

Another thing: I suspect those are threads in the pictures?
Can you attach GDB to your process and identify which of the threads in the picture are OpenBLAS threads? Their CPU consumption looks very asymmetric. (With OpenBLAS, main() should use more CPU time and all the others should get equal slices.)

$ script
$ gdb
gdb> attach 1438
gdb> t a a bt
!!! Output here is of importance !!!
gdb> detach
gdb> quit
$ exit
The output file is named typescript

Try to work out which thread belongs to main(), then make some guesses about the asymmetry. If you cannot explain it, please edit any private details out of the file and attach it here.

@martin-frbg
Collaborator

Running ldd on the dlib binary would probably show whether it references OpenBLAS or any other implementation at all. (I am not sure what the default is on Ubuntu with its "alternatives" system; it could be that dlib is still picking up an operating system link that points it at the netlib reference implementation, unless your openblas_get_num_threads output above was generated from within dlib.)
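
A small sketch of that check, assuming the ldd output has been captured as text. The function find_blas_libs and the list of library-name patterns are illustrative assumptions, not an exhaustive catalogue of BLAS implementations.

```python
import re

# Library base names that commonly indicate a BLAS/LAPACK implementation.
BLAS_PATTERN = re.compile(r"lib(openblas|blas|atlas|mkl\w*|cblas|lapack)\b", re.IGNORECASE)

SAMPLE_LDD = """
        libopenblas.so.0 => /usr/local/lib/libopenblas.so.0 (0x0000007fa7397000)
        libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000007fa7353000)
        libopencv_core.so.4.0 => /usr/local/lib/libopencv_core.so.4.0 (0x0000007fa7036000)
"""


def find_blas_libs(ldd_output):
    """Return (name, resolved_path) pairs for BLAS-like libraries in ldd output."""
    hits = []
    for line in ldd_output.splitlines():
        parts = line.split("=>")
        if len(parts) != 2:
            continue  # skip lines without a resolved path, e.g. the dynamic loader
        name = parts[0].strip()
        path = parts[1].split("(")[0].strip()
        if BLAS_PATTERN.match(name):
            hits.append((name, path))
    return hits


if __name__ == "__main__":
    for name, path in find_blas_libs(SAMPLE_LDD):
        print(f"{name} resolves to {path}")
```

The resolved path matters as much as the name: a libblas.so.3 entry may point at the reference implementation or at OpenBLAS, depending on the system's alternatives setting.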

@brada4
Contributor

brada4 commented Nov 23, 2018

@martin-frbg Ubuntu libblas.so.3 alternatives priorities (the highest becomes the default when installed):
- 10 = netlib
- 30 = ATLAS
- 40 = OpenBLAS
- 41 = in FAQ

@MiXaiLL76
Author

> Reference BLAS is not multicore, that is for certain.
>
> We will need a firm test case, or at least the output of "perf record ./detect ; perf report".
>
> Another thing: I suspect those are threads in the pictures?
> Can you attach GDB to your process and identify which of the threads in the picture are OpenBLAS threads? Their CPU consumption looks very asymmetric. (With OpenBLAS, main() should use more CPU time and all the others should get equal slices.)
>
> $ script
> $ gdb
> gdb> attach 1438
> gdb> t a a bt
> !!! Output here is of importance !!!
> gdb> detach
> gdb> quit
> $ exit
> The output file is named typescript
>
> Try to work out which thread belongs to main(), then make some guesses about the asymmetry. If you cannot explain it, please edit any private details out of the file and attach it here.

(gdb) run
Starting program: /home/ubuntu/detect
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fb090ae40 (LWP 4755)]
[New Thread 0x7fb0109e40 (LWP 4756)]
[New Thread 0x7fae908e40 (LWP 4757)]
[New Thread 0x7fac107e40 (LWP 4758)]
openblas_get_num_threads: 4
openblas_get_num_procs: 4
openblas_get_parallel: 1
openblas_get_config: NO_AFFINITY ARMV8 MAX_THREADS=12
openblas_get_corename: ARMV8
Core init!
file: /samba/allaccess/i.JPG -> 1
file: /samba/allaccess/i.JPG -> open
file: /samba/allaccess/i.JPG -> exists
getRotationMatrix2D: /samba/allaccess/i.JPG -> start
getRotationMatrix2D: /samba/allaccess/i.JPG -> done
Size: /samba/allaccess/i.JPG -> start
Size: /samba/allaccess/i.JPG -> done[638 x 850]
warpAffine: /samba/allaccess/i.JPG -> start
[New Thread 0x7fab906e40 (LWP 4759)]
[New Thread 0x7fab105e40 (LWP 4760)]
[New Thread 0x7faa904e40 (LWP 4761)]
warpAffine: /samba/allaccess/i.JPG -> done
cv_image: /samba/allaccess/i.JPG -> start
cv_image: /samba/allaccess/i.JPG -> done
detector_: /samba/allaccess/i.JPG -> start
detector_: /samba/allaccess/i.JPG -> done
faces: 1 -> /samba/allaccess/i.JPG
Detecting faces: 1040 ms [ 0]
Encoded faces: 4177 ms
Total: 5671 ms

[Thread 0x7fac107e40 (LWP 4758) exited]
[Thread 0x7fb090ae40 (LWP 4755) exited]
[Thread 0x7fae908e40 (LWP 4757) exited]
[Thread 0x7fb0109e40 (LWP 4756) exited]
[Thread 0x7fab906e40 (LWP 4759) exited]
[Thread 0x7fb091e010 (LWP 4752) exited]
[Thread 0x7fab105e40 (LWP 4760) exited]
[Inferior 1 (process 4752) exited normally]

@MiXaiLL76
Author

MiXaiLL76 commented Nov 23, 2018

ldd detect

        **libopenblas**.so.0 => /usr/local/lib/libopenblas.so.0 (0x0000007fa7397000)
        **libpthread**.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000007fa7353000)
        libopencv_core.so.4.0 => /usr/local/lib/libopencv_core.so.4.0 (0x0000007fa7036000)
        libopencv_imgproc.so.4.0 => /usr/local/lib/libopencv_imgproc.so.4.0 (0x0000007fa6c81000)
        libopencv_calib3d.so.4.0 => /usr/local/lib/libopencv_calib3d.so.4.0 (0x0000007fa6b1c000)
        libopencv_videoio.so.4.0 => /usr/local/lib/libopencv_videoio.so.4.0 (0x0000007fa6aba000)
        libstdc++.so.6 => /usr/lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000007fa6923000)
        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000007fa6869000)
        libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000007fa6845000)
        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000007fa66ec000)
        /lib/ld-linux-aarch64.so.1 (0x000000557e93b000)
        libgfortran.so.4 => /usr/lib/aarch64-linux-gnu/libgfortran.so.4 (0x0000007fa65e8000)
        libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000007fa65d3000)
        libopencv_features2d.so.4.0 => /usr/local/lib/libopencv_features2d.so.4.0 (0x0000007fa651b000)
        libopencv_flann.so.4.0 => /usr/local/lib/libopencv_flann.so.4.0 (0x0000007fa64b9000)
        libopencv_imgcodecs.so.4.0 => /usr/local/lib/libopencv_imgcodecs.so.4.0 (0x0000007fa62a9000)
        libgstreamer-1.0.so.0 => /usr/lib/aarch64-linux-gnu/libgstreamer-1.0.so.0 (0x0000007fa6179000)
        libgobject-2.0.so.0 => /usr/lib/aarch64-linux-gnu/libgobject-2.0.so.0 (0x0000007fa611b000)
        libglib-2.0.so.0 => /usr/lib/aarch64-linux-gnu/libglib-2.0.so.0 (0x0000007fa600c000)
        libgstapp-1.0.so.0 => /usr/lib/aarch64-linux-gnu/libgstapp-1.0.so.0 (0x0000007fa5fec000)
        libgstriff-1.0.so.0 => /usr/lib/aarch64-linux-gnu/libgstriff-1.0.so.0 (0x0000007fa5fce000)
        libgstpbutils-1.0.so.0 => /usr/lib/aarch64-linux-gnu/libgstpbutils-1.0.so.0 (0x0000007fa5f89000)
        libdc1394.so.22 => /usr/lib/aarch64-linux-gnu/libdc1394.so.22 (0x0000007fa5f08000)
        libavcodec.so.57 => /usr/lib/aarch64-linux-gnu/libavcodec.so.57 (0x0000007fa4cc2000)
        libavformat.so.57 => /usr/lib/aarch64-linux-gnu/libavformat.so.57 (0x0000007fa4a8b000)
        libavutil.so.55 => /usr/lib/aarch64-linux-gnu/libavutil.so.55 (0x0000007fa49fa000)
        libswscale.so.4 => /usr/lib/aarch64-linux-gnu/libswscale.so.4 (0x0000007fa4986000)
        libjpeg.so.8 => /usr/lib/aarch64-linux-gnu/libjpeg.so.8 (0x0000007fa493c000)
        libpng16.so.16 => /usr/lib/aarch64-linux-gnu/libpng16.so.16 (0x0000007fa4901000)
        libtiff.so.5 => /usr/lib/aarch64-linux-gnu/libtiff.so.5 (0x0000007fa4886000)
        libgmodule-2.0.so.0 => /usr/lib/aarch64-linux-gnu/libgmodule-2.0.so.0 (0x0000007fa4870000)
        libffi.so.6 => /usr/lib/aarch64-linux-gnu/libffi.so.6 (0x0000007fa4858000)
        libpcre.so.3 => /lib/aarch64-linux-gnu/libpcre.so.3 (0x0000007fa47e6000)
        libgstbase-1.0.so.0 => /usr/lib/aarch64-linux-gnu/libgstbase-1.0.so.0 (0x0000007fa476f000)
        libgstaudio-1.0.so.0 => /usr/lib/aarch64-linux-gnu/libgstaudio-1.0.so.0 (0x0000007fa46fb000)
        libgsttag-1.0.so.0 => /usr/lib/aarch64-linux-gnu/libgsttag-1.0.so.0 (0x0000007fa46b4000)
        libgstvideo-1.0.so.0 => /usr/lib/aarch64-linux-gnu/libgstvideo-1.0.so.0 (0x0000007fa461f000)
        libraw1394.so.11 => /usr/lib/aarch64-linux-gnu/libraw1394.so.11 (0x0000007fa4603000)
        libusb-1.0.so.0 => /lib/aarch64-linux-gnu/libusb-1.0.so.0 (0x0000007fa45dd000)
        libswresample.so.2 => /usr/lib/aarch64-linux-gnu/libswresample.so.2 (0x0000007fa45b6000)
        libwebp.so.6 => /usr/lib/aarch64-linux-gnu/libwebp.so.6 (0x0000007fa455e000)
        libva.so.2 => /usr/lib/aarch64-linux-gnu/libva.so.2 (0x0000007fa452f000)
        libzvbi.so.0 => /usr/lib/aarch64-linux-gnu/libzvbi.so.0 (0x0000007fa449e000)
        libxvidcore.so.4 => /usr/lib/aarch64-linux-gnu/libxvidcore.so.4 (0x0000007fa43b7000)
        libx265.so.146 => /usr/lib/aarch64-linux-gnu/libx265.so.146 (0x0000007fa4159000)
        libx264.so.152 => /usr/lib/aarch64-linux-gnu/libx264.so.152 (0x0000007fa3ff5000)
        libwebpmux.so.3 => /usr/lib/aarch64-linux-gnu/libwebpmux.so.3 (0x0000007fa3fdc000)
        libwavpack.so.1 => /usr/lib/aarch64-linux-gnu/libwavpack.so.1 (0x0000007fa3faa000)
        libvpx.so.5 => /usr/lib/aarch64-linux-gnu/libvpx.so.5 (0x0000007fa3dfd000)
        libvorbisenc.so.2 => /usr/lib/aarch64-linux-gnu/libvorbisenc.so.2 (0x0000007fa3d4e000)
        libvorbis.so.0 => /usr/lib/aarch64-linux-gnu/libvorbis.so.0 (0x0000007fa3d18000)
        libtwolame.so.0 => /usr/lib/aarch64-linux-gnu/libtwolame.so.0 (0x0000007fa3ce9000)
        libtheoraenc.so.1 => /usr/lib/aarch64-linux-gnu/libtheoraenc.so.1 (0x0000007fa3ca7000)
        libtheoradec.so.1 => /usr/lib/aarch64-linux-gnu/libtheoradec.so.1 (0x0000007fa3c80000)
        libspeex.so.1 => /usr/lib/aarch64-linux-gnu/libspeex.so.1 (0x0000007fa3c59000)
        libsnappy.so.1 => /usr/lib/aarch64-linux-gnu/libsnappy.so.1 (0x0000007fa3c41000)
        libshine.so.3 => /usr/lib/aarch64-linux-gnu/libshine.so.3 (0x0000007fa3c27000)
        librsvg-2.so.2 => /usr/lib/aarch64-linux-gnu/librsvg-2.so.2 (0x0000007fa3be9000)
        libcairo.so.2 => /usr/lib/aarch64-linux-gnu/libcairo.so.2 (0x0000007fa3aed000)
        libopus.so.0 => /usr/lib/aarch64-linux-gnu/libopus.so.0 (0x0000007fa3aa0000)
        libopenjp2.so.7 => /usr/lib/aarch64-linux-gnu/libopenjp2.so.7 (0x0000007fa3a43000)
        libmp3lame.so.0 => /usr/lib/aarch64-linux-gnu/libmp3lame.so.0 (0x0000007fa39c7000)
        libgsm.so.1 => /usr/lib/aarch64-linux-gnu/libgsm.so.1 (0x0000007fa39ad000)
        liblzma.so.5 => /lib/aarch64-linux-gnu/liblzma.so.5 (0x0000007fa397d000)
        libz.so.1 => /lib/aarch64-linux-gnu/libz.so.1 (0x0000007fa3950000)
        libssh-gcrypt.so.4 => /usr/lib/aarch64-linux-gnu/libssh-gcrypt.so.4 (0x0000007fa38d8000)
        libopenmpt.so.0 => /usr/lib/aarch64-linux-gnu/libopenmpt.so.0 (0x0000007fa3710000)
        libbluray.so.2 => /usr/lib/aarch64-linux-gnu/libbluray.so.2 (0x0000007fa36b9000)
        libgnutls.so.30 => /usr/lib/aarch64-linux-gnu/libgnutls.so.30 (0x0000007fa355b000)
        libxml2.so.2 => /usr/lib/aarch64-linux-gnu/libxml2.so.2 (0x0000007fa33bc000)
        libgme.so.0 => /usr/lib/aarch64-linux-gnu/libgme.so.0 (0x0000007fa3365000)
        libchromaprint.so.1 => /usr/lib/aarch64-linux-gnu/libchromaprint.so.1 (0x0000007fa3343000)
        libbz2.so.1.0 => /lib/aarch64-linux-gnu/libbz2.so.1.0 (0x0000007fa3321000)
        libX11.so.6 => /usr/lib/aarch64-linux-gnu/libX11.so.6 (0x0000007fa31f7000)
        libdrm.so.2 => /usr/lib/aarch64-linux-gnu/libdrm.so.2 (0x0000007fa31d8000)
        libvdpau.so.1 => /usr/lib/aarch64-linux-gnu/libvdpau.so.1 (0x0000007fa31c4000)
        libva-x11.so.2 => /usr/lib/aarch64-linux-gnu/libva-x11.so.2 (0x0000007fa31af000)
        libva-drm.so.2 => /usr/lib/aarch64-linux-gnu/libva-drm.so.2 (0x0000007fa319c000)
        libjbig.so.0 => /usr/lib/aarch64-linux-gnu/libjbig.so.0 (0x0000007fa317d000)
        liborc-0.4.so.0 => /usr/lib/aarch64-linux-gnu/liborc-0.4.so.0 (0x0000007fa3104000)
        libudev.so.1 => /lib/aarch64-linux-gnu/libudev.so.1 (0x0000007fa30da000)
        libsoxr.so.0 => /usr/lib/aarch64-linux-gnu/libsoxr.so.0 (0x0000007fa3079000)
        libnuma.so.1 => /usr/lib/aarch64-linux-gnu/libnuma.so.1 (0x0000007fa3059000)
        libogg.so.0 => /usr/lib/aarch64-linux-gnu/libogg.so.0 (0x0000007fa3042000)
        libgdk_pixbuf-2.0.so.0 => /usr/lib/aarch64-linux-gnu/libgdk_pixbuf-2.0.so.0 (0x0000007fa3013000)
        libgio-2.0.so.0 => /usr/lib/aarch64-linux-gnu/libgio-2.0.so.0 (0x0000007fa2e97000)
        libpangocairo-1.0.so.0 => /usr/lib/aarch64-linux-gnu/libpangocairo-1.0.so.0 (0x0000007fa2e7b000)
        libpangoft2-1.0.so.0 => /usr/lib/aarch64-linux-gnu/libpangoft2-1.0.so.0 (0x0000007fa2e56000)
        libpango-1.0.so.0 => /usr/lib/aarch64-linux-gnu/libpango-1.0.so.0 (0x0000007fa2dff000)
        libfontconfig.so.1 => /usr/lib/aarch64-linux-gnu/libfontconfig.so.1 (0x0000007fa2daf000)
        libcroco-0.6.so.3 => /usr/lib/aarch64-linux-gnu/libcroco-0.6.so.3 (0x0000007fa2d6b000)
        libpixman-1.so.0 => /usr/lib/aarch64-linux-gnu/libpixman-1.so.0 (0x0000007fa2d07000)
        libfreetype.so.6 => /usr/lib/aarch64-linux-gnu/libfreetype.so.6 (0x0000007fa2c5e000)
        libxcb-shm.so.0 => /usr/lib/aarch64-linux-gnu/libxcb-shm.so.0 (0x0000007fa2c49000)
        libxcb.so.1 => /usr/lib/aarch64-linux-gnu/libxcb.so.1 (0x0000007fa2c19000)
        libxcb-render.so.0 => /usr/lib/aarch64-linux-gnu/libxcb-render.so.0 (0x0000007fa2bfe000)
        libXrender.so.1 => /usr/lib/aarch64-linux-gnu/libXrender.so.1 (0x0000007fa2be5000)
        libXext.so.6 => /usr/lib/aarch64-linux-gnu/libXext.so.6 (0x0000007fa2bc5000)
        libgcrypt.so.20 => /lib/aarch64-linux-gnu/libgcrypt.so.20 (0x0000007fa2b08000)
        libgssapi_krb5.so.2 => /usr/lib/aarch64-linux-gnu/libgssapi_krb5.so.2 (0x0000007fa2ab8000)
        libmpg123.so.0 => /usr/lib/aarch64-linux-gnu/libmpg123.so.0 (0x0000007fa2a5b000)
        libvorbisfile.so.3 => /usr/lib/aarch64-linux-gnu/libvorbisfile.so.3 (0x0000007fa2a43000)
        libp11-kit.so.0 => /usr/lib/aarch64-linux-gnu/libp11-kit.so.0 (0x0000007fa2931000)
        libidn2.so.0 => /usr/lib/aarch64-linux-gnu/libidn2.so.0 (0x0000007fa2905000)
        libunistring.so.2 => /usr/lib/aarch64-linux-gnu/libunistring.so.2 (0x0000007fa2780000)
        libtasn1.so.6 => /usr/lib/aarch64-linux-gnu/libtasn1.so.6 (0x0000007fa275f000)
        libnettle.so.6 => /usr/lib/aarch64-linux-gnu/libnettle.so.6 (0x0000007fa271e000)
        libhogweed.so.4 => /usr/lib/aarch64-linux-gnu/libhogweed.so.4 (0x0000007fa26dd000)
        libgmp.so.10 => /usr/lib/aarch64-linux-gnu/libgmp.so.10 (0x0000007fa2660000)
        libicuuc.so.60 => /usr/lib/aarch64-linux-gnu/libicuuc.so.60 (0x0000007fa248b000)
        libXfixes.so.3 => /usr/lib/aarch64-linux-gnu/libXfixes.so.3 (0x0000007fa2473000)
        **libgomp**.so.1 => /usr/lib/aarch64-linux-gnu/libgomp.so.1 (0x0000007fa2436000)
        libselinux.so.1 => /lib/aarch64-linux-gnu/libselinux.so.1 (0x0000007fa2403000)
        libresolv.so.2 => /lib/aarch64-linux-gnu/libresolv.so.2 (0x0000007fa23de000)
        libmount.so.1 => /lib/aarch64-linux-gnu/libmount.so.1 (0x0000007fa2381000)
        libharfbuzz.so.0 => /usr/lib/aarch64-linux-gnu/libharfbuzz.so.0 (0x0000007fa22df000)
        libthai.so.0 => /usr/lib/aarch64-linux-gnu/libthai.so.0 (0x0000007fa22c7000)
        libexpat.so.1 => /lib/aarch64-linux-gnu/libexpat.so.1 (0x0000007fa2288000)
        libXau.so.6 => /usr/lib/aarch64-linux-gnu/libXau.so.6 (0x0000007fa2275000)
        libXdmcp.so.6 => /usr/lib/aarch64-linux-gnu/libXdmcp.so.6 (0x0000007fa2260000)
        libgpg-error.so.0 => /lib/aarch64-linux-gnu/libgpg-error.so.0 (0x0000007fa223c000)
        libkrb5.so.3 => /usr/lib/aarch64-linux-gnu/libkrb5.so.3 (0x0000007fa216d000)
        libk5crypto.so.3 => /usr/lib/aarch64-linux-gnu/libk5crypto.so.3 (0x0000007fa212f000)
        libcom_err.so.2 => /lib/aarch64-linux-gnu/libcom_err.so.2 (0x0000007fa211b000)
        libkrb5support.so.0 => /usr/lib/aarch64-linux-gnu/libkrb5support.so.0 (0x0000007fa2101000)
        libicudata.so.60 => /usr/lib/aarch64-linux-gnu/libicudata.so.60 (0x0000007fa0746000)
        libblkid.so.1 => /lib/aarch64-linux-gnu/libblkid.so.1 (0x0000007fa06f1000)
        libgraphite2.so.3 => /usr/lib/aarch64-linux-gnu/libgraphite2.so.3 (0x0000007fa06c0000)
        libdatrie.so.1 => /usr/lib/aarch64-linux-gnu/libdatrie.so.1 (0x0000007fa06aa000)
        libbsd.so.0 => /lib/aarch64-linux-gnu/libbsd.so.0 (0x0000007fa0688000)
        libkeyutils.so.1 => /lib/aarch64-linux-gnu/libkeyutils.so.1 (0x0000007fa0672000)
        libuuid.so.1 => /lib/aarch64-linux-gnu/libuuid.so.1 (0x0000007fa065b000)

@martin-frbg
Collaborator

So it appears to be using OpenBLAS alright, but the overall running time seems to be too short to attach gdb to the running program and get a thread status report ? (With "1438" in brada's script replaced by the actual process id of the running "detect")

@MiXaiLL76
Author

> So it appears to be using OpenBLAS alright, but the overall running time seems to be too short to attach gdb to the running program and get a thread status report ? (With "1438" in brada's script replaced by the actual process id of the running "detect")

I just ran it like this:

gdb ./detect

Then I entered run to start it.

@brada4
Contributor

brada4 commented Nov 23, 2018

Now you need to run samples until you see the asymmetrical thread CPU consumption shown in the pictures, then interrupt with Ctrl-C and run t a a bt inside gdb. Then peek through the output to see which threads are OpenBLAS threads and which are not. You can return to the program with continue in gdb.

perf and attach/detach are somewhat less of a burden on the main program.

@brada4
Contributor

brada4 commented Nov 23, 2018

gomp gets loaded later, so it is quite possible that the pthread build of OpenBLAS gets called from various OMP threads, yielding ncpu^2 worker threads. That should be worked around only in very recent develop versions, soon to become 0.3.4. EDIT: #1875 is not yet applied to develop.
A perf-based profile of the program run would help to find out whether threads are being used for very small samples (which is likely with image feature recognition), yielding pessimal performance.
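
A rough sketch of the oversubscription arithmetic, plus one common way to cap it. The helper names are hypothetical; OPENBLAS_NUM_THREADS is the environment variable used elsewhere in this thread.

```python
import os


def worst_case_threads(outer_threads, blas_threads):
    """Upper bound on runnable threads when every outer thread calls a threaded BLAS."""
    return outer_threads * blas_threads


def capped_env(blas_threads=1):
    """Copy of the current environment with the OpenBLAS thread count limited."""
    env = dict(os.environ)
    env["OPENBLAS_NUM_THREADS"] = str(blas_threads)
    return env


if __name__ == "__main__":
    cores = 4  # the board in this report exposes 4 cores
    print("worst case without a cap:", worst_case_threads(cores, cores))
    print("worst case with OpenBLAS capped to 1 thread:", worst_case_threads(cores, 1))
```

Capping the inner library only makes sense when the outer layer is already parallel; otherwise it simply removes the parallelism that was wanted.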

@MiXaiLL76
Author

I use develop branch from github

@MiXaiLL76
Author

0x000000556255a544 in dlib::cpu::img2col(dlib::matrix<float, 0l, 0l, dlib::memory_manager_stateless_kernel_1<char>, dlib::row_major_layout>&, dlib::tensor const&, long, long, long, long, long, long, long) ()
(gdb) t a a bt

Thread 8 (Thread 0x7faa904e40 (LWP 4834)):
#0  0x0000007fb763322c in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5566d2a970) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  0x0000007fb763322c in __pthread_cond_wait_common (abstime=0x0, mutex=0x5566d2a910, cond=0x5566d2a948) at pthread_cond_wait.c:502
#2  0x0000007fb763322c in __pthread_cond_wait (cond=0x5566d2a948, mutex=0x5566d2a910) at pthread_cond_wait.c:655
#3  0x0000007fb74f269c in cv::WorkerThread::thread_body() () at /usr/local/lib/libopencv_core.so.4.0
#4  0x0000007fb74f2a68 in cv::WorkerThread::thread_loop_wrapper(void*) () at /usr/local/lib/libopencv_core.so.4.0
#5  0x0000007fb762d088 in start_thread (arg=0x7fffffdc8f) at pthread_create.c:463
#6  0x0000007fb6a914ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 7 (Thread 0x7fab105e40 (LWP 4833)):
#0  0x0000007fb763322c in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5566d1f130) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  0x0000007fb763322c in __pthread_cond_wait_common (abstime=0x0, mutex=0x5566d1f0d0, cond=0x5566d1f108) at pthread_cond_wait.c:502
#2  0x0000007fb763322c in __pthread_cond_wait (cond=0x5566d1f108, mutex=0x5566d1f0d0) at pthread_cond_wait.c:655
#3  0x0000007fb74f269c in cv::WorkerThread::thread_body() () at /usr/local/lib/libopencv_core.so.4.0
#4  0x0000007fb74f2a68 in cv::WorkerThread::thread_loop_wrapper(void*) () at /usr/local/lib/libopencv_core.so.4.0
#5  0x0000007fb762d088 in start_thread (arg=0x7fffffdc8f) at pthread_create.c:463
#6  0x0000007fb6a914ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 6 (Thread 0x7fab906e40 (LWP 4832)):
#0  0x0000007fb763322c in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55627001a4) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  0x0000007fb763322c in __pthread_cond_wait_common (abstime=0x0, mutex=0x5562700140, cond=0x5562700178) at pthread_cond_wait.c:502
#2  0x0000007fb763322c in __pthread_cond_wait (cond=0x5562700178, mutex=0x5562700140) at pthread_cond_wait.c:655
#3  0x0000007fb74f269c in cv::WorkerThread::thread_body() () at /usr/local/lib/libopencv_core.so.4.0
#4  0x0000007fb74f2a68 in cv::WorkerThread::thread_loop_wrapper(void*) () at /usr/local/lib/libopencv_core.so.4.0
#5  0x0000007fb762d088 in start_thread (arg=0x7fffffdc8f) at pthread_create.c:463
#6  0x0000007fb6a914ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 5 (Thread 0x7fac107e40 (LWP 4831)):
#0  0x0000007fb6a88048 in __GI___poll (fds=0x7fac107628, nfds=548537553032, timeout=<optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:41
#1  0x0000007fb48c9bc4 in  () at /lib/aarch64-linux-gnu/libusb-1.0.so.0
#2  0x0000007ffffff1a0 in  ()

Thread 4 (Thread 0x7fae908e40 (LWP 4830)):
#0  0x0000007fb763322c in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb7fd0968 <thread_status+360>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  0x0000007fb763322c in __pthread_cond_wait_common (abstime=0x0, mutex=0x7fb7fd0910 <thread_status+272>, cond=0x7fb7fd0940 <thread_status+320>)
    at pthread_cond_wait.c:502
#2  0x0000007fb763322c in __pthread_cond_wait (cond=0x7fb7fd0940 <thread_status+320>, mutex=0x7fb7fd0910 <thread_status+272>) at pthread_cond_wait.c:655
#3  0x0000007fb7858874 in blas_thread_server () at /usr/local/lib/libopenblas.so.0
#4  0x0000007fb762d088 in start_thread (arg=0x7ffffff32f) at pthread_create.c:463
#5  0x0000007fb6a914ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 3 (Thread 0x7fb0109e40 (LWP 4829)):
#0  0x0000007fb763322c in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb7fd08e8 <thread_status+232>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  0x0000007fb763322c in __pthread_cond_wait_common (abstime=0x0, mutex=0x7fb7fd0890 <thread_status+144>, cond=0x7fb7fd08c0 <thread_status+192>)
    at pthread_cond_wait.c:502
#2  0x0000007fb763322c in __pthread_cond_wait (cond=0x7fb7fd08c0 <thread_status+192>, mutex=0x7fb7fd0890 <thread_status+144>) at pthread_cond_wait.c:655
#3  0x0000007fb7858874 in blas_thread_server () at /usr/local/lib/libopenblas.so.0
#4  0x0000007fb762d088 in start_thread (arg=0x7ffffff32f) at pthread_create.c:463
#5  0x0000007fb6a914ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 2 (Thread 0x7fb090ae40 (LWP 4828)):
#0  0x0000007fb763322c in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb7fd0868 <thread_status+104>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  0x0000007fb763322c in __pthread_cond_wait_common (abstime=0x0, mutex=0x7fb7fd0810 <thread_status+16>, cond=0x7fb7fd0840 <thread_status+64>)
    at pthread_cond_wait.c:502
#2  0x0000007fb763322c in __pthread_cond_wait (cond=0x7fb7fd0840 <thread_status+64>, mutex=0x7fb7fd0810 <thread_status+16>) at pthread_cond_wait.c:655
#3  0x0000007fb7858874 in blas_thread_server () at /usr/local/lib/libopenblas.so.0
#4  0x0000007fb762d088 in start_thread (arg=0x7ffffff32f) at pthread_create.c:463
#5  0x0000007fb6a914ec in thread_start () at ../sysdeps/unix/sysv/linux/aarch64/clone.S:78

Thread 1 (Thread 0x7fb091e010 (LWP 4826)):
#0  0x000000556255a544 in dlib::cpu::img2col(dlib::matrix<float, 0l, 0l, dlib::memory_manager_stateless_kernel_1<char>, dlib::row_major_layout>&, dlib::tensor const&, long, long, long, long, long, long, long) ()
#1  0x000000556258fe2c in dlib::cpu::tensor_conv::operator()(bool, dlib::tensor&, dlib::tensor const&, dlib::tensor const&) ()
#2  0x00000055625913bc in dlib::cpu::tensor_conv::operator()(bool, dlib::resizable_tensor&, dlib::tensor const&, dlib::tensor const&) ()
#3  0x0000005562518250 in dlib::add_layer<dlib::con_<64l, 3l, 3l, 1, 1, 1, 1>, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::affine_, dlib::add_layer<dlib::con_<64l, 3l, 3l, 1, 1, 1, 1>, dlib::add_tag_layer<1ul, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::add_prev_<dlib::tag2>, dlib::add_layer<dlib::avg_pool_<2l, 2l, 2, 2, 0, 0>, dlib::add_skip_layer<dlib::tag1, dlib::add_tag_layer<2ul, dlib::add_layer<dlib::affine_, dlib::add_layer<dlib::con_<64l, 3l, 3l, 1, 1, 1, 1>, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::affine_, dlib::add_layer<dlib::con_<64l, 3l, 3l, 2, 2, 0, 0>, dlib::add_tag_layer<1ul, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::add_prev_<dlib::tag1>, dlib::add_layer<dlib::affine_, dlib::add_layer<dlib::con_<32l, 3l, 3l, 1, 1, 1, 1>, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::affine_, dlib::add_layer<dlib::con_<32l, 3l, 3l, 1, 1, 1, 1>, dlib::add_tag_layer<1ul, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::add_prev_<dlib::tag1>, dlib::add_layer<dlib::affine_, dlib::add_layer<dlib::con_<32l, 3l, 3l, 1, 1, 1, 1>, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::affine_, dlib::add_layer<dlib::con_<32l, 3l, 3l, 1, 1, 1, 1>, dlib::add_tag_layer<1ul, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::add_prev_<dlib::tag1>, dlib::add_layer<dlib::affine_, dlib::add_layer<dlib::con_<32l, 3l, 3l, 1, 1, 1, 1>, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::affine_, dlib::add_layer<dlib::con_<32l, 3l, 3l, 1, 1, 1, 1>, dlib::add_tag_layer<1ul, dlib::add_layer<dlib::max_pool_<3l, 3l, 2, 2, 0, 0>, dlib::add_layer<dlib::relu_, dlib::add_layer<dlib::affine_, dlib::add_layer<dlib::con_<32l, 7l, 7l, 2, 2, 0, 0>, dlib::input_rgb_image_sized<150ul, 150ul>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void>, void> >, void>, void>, void>, void>, 
void>, void>, void>, void>::forward(dlib::tensor const&) ()

@MiXaiLL76
Author

Am I doing the right thing?

@brada4
Contributor

brada4 commented Nov 23, 2018

> I use develop branch from github

That should be fine.

You did the right thing. The other (idle) threads look like they belong to OMP in dlib and OpenCV code instead. EDIT: threads 2, 3 and 4 are OpenBLAS workers.
Can you run it through "perf" (kernel-tools on Ubuntu)?
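
The thread-by-thread reading of the backtrace can be sketched as a small script. It assumes the "t a a bt" output has been saved as text; classify_threads and the marker list are illustrative, based only on the frame names visible in the backtrace above.

```python
import re

THREAD_HEADER = re.compile(r"^Thread (\d+) \(")

# Frame names seen in the backtrace above and the library they point to.
MARKERS = [
    ("blas_thread_server", "OpenBLAS worker"),
    ("cv::WorkerThread", "OpenCV worker"),
]

SAMPLE_BT = """\
Thread 8 (Thread 0x7faa904e40 (LWP 4834)):
#3  0x0000007fb74f269c in cv::WorkerThread::thread_body() () at /usr/local/lib/libopencv_core.so.4.0

Thread 4 (Thread 0x7fae908e40 (LWP 4830)):
#3  0x0000007fb7858874 in blas_thread_server () at /usr/local/lib/libopenblas.so.0

Thread 1 (Thread 0x7fb091e010 (LWP 4826)):
#0  0x000000556255a544 in dlib::cpu::img2col(...) ()
"""


def classify_threads(backtrace_text):
    """Map gdb thread numbers to a coarse owner label."""
    owners = {}
    current = None
    for line in backtrace_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = int(header.group(1))
            owners[current] = "other"
            continue
        if current is None or owners[current] != "other":
            continue  # no thread seen yet, or this thread is already classified
        for marker, label in MARKERS:
            if marker in line:
                owners[current] = label
                break
    return owners


if __name__ == "__main__":
    for thread_id, owner in sorted(classify_threads(SAMPLE_BT).items()):
        print(f"Thread {thread_id}: {owner}")
```

Threads labelled "other" still need a manual look; in the backtrace above, thread 1 is the main thread busy in dlib code.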

@MiXaiLL76
Author

I don't have the perf command.
I can't install kernel-tools

ubuntu@zero-plus-2:~$ uname -a
Linux zero-plus-2 3.10.65-h5-1 #1 SMP PREEMPT Sat Oct 21 15:52:38 BRST 2017 aarch64 aarch64 aarch64 GNU/Linux

ubuntu@zero-plus-2:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic

@MiXaiLL76
Author

When I compiled OpenCV with multithreading, the build output said that it would use pthreads.

Also, before that I used TBB, but it made no noticeable difference.

@brada4
Contributor

brada4 commented Nov 23, 2018

All of them use pthreads (otherwise libgomp.so would be listed below libpthread.so), as you can see in the backtraces, which is sort of good.

The kernel should be v4.15 on Ubuntu 18.04. You should ask for perf wherever your custom kernel was built (they probably don't have one).
If that does not work, we can try gprof, but it requires recompiling the code with extra debug flags.

@MiXaiLL76
Author

Compiling code of what? opencv? dlib? openblas? or my application?

@brada4
Contributor

brada4 commented Nov 23, 2018

The problem, I suspect, lies in threads spinning pointlessly while doing almost nothing, with lots of CPU time being spent in the single-threaded code that splits the work among them.
We are after libopenblas.so only, so that has to be rebuilt with both gcc -pg and gfortran -pg (see Makefile.rule for a quick place to do so). Then run the program, process the result with gprof, and get the list of called BLAS functions from its output. Such a build has a performance impact, so save the default library and put it back as soon as you have the output.
The next step is to look for missing threading thresholds in the functions found and, if there are none, for suboptimal ones. perf would be much, much quicker at getting the result.

@MiXaiLL76
Author

I built OpenBLAS with -pg, but I do not know what to do next.
It does not create gmon.out.

@brada4
Contributor

brada4 commented Nov 23, 2018

gprof writes its report to stdout by default.
gprof ./detect gmon.out
should do the trick.

@brada4
Contributor

brada4 commented Nov 23, 2018

Actually, the suspect code is shared between architectures, so perf might be more easily available on an x86_64 virtual machine, for example.

@MiXaiLL76
Author

MiXaiLL76 commented Nov 23, 2018

On another virtual machine (not on my computer), everything works.
gprof ./detect gmon.out

I cannot run this, because compiling with the -pg flags did not produce any results.
The gmon.out file does not exist.

This is the case even when linking against libopenblas.a.

@brada4
Contributor

brada4 commented Nov 23, 2018

Given that the gprof build configuration is not widely used, I meant the other tool, with a normally built library:

perf record ./detect
perf report 

OPENBLAS_NUM_THREADS=1 perf record ./detect
perf report
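
Where perf is not available, a much cruder comparison of the same two configurations is wall-clock time with and without the thread cap. The sketch below is an assumption-laden stand-in, not a replacement for a profiler; timed_run is a hypothetical helper and the placeholder command should be swapped for the real binary (for example ["./detect"]).

```python
import os
import subprocess
import sys
import time


def timed_run(cmd, extra_env=None):
    """Run cmd with optional extra environment variables; return (exit code, seconds)."""
    env = dict(os.environ)
    env.update(extra_env or {})
    start = time.perf_counter()
    result = subprocess.run(cmd, env=env, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode, time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere; replace with the real program.
    cmd = [sys.executable, "-c", "pass"]
    for label, extra in (("default", {}), ("OPENBLAS_NUM_THREADS=1", {"OPENBLAS_NUM_THREADS": "1"})):
        code, seconds = timed_run(cmd, extra)
        print(f"{label}: exit={code} time={seconds:.3f}s")
```

If the two timings are essentially equal, threading inside OpenBLAS is not where the run time goes, which is the same question the two perf runs are meant to answer in more detail.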

@MiXaiLL76
Author

I can not do so (
Are there any other options?

@MiXaiLL76
Author

profile.txt
I got a gprof output

@MiXaiLL76
Author

When using openMP, the results are depressing.
One core is used.

@brada4
Contributor

brada4 commented Nov 23, 2018

OpenBLAS takes <1% of your processing time.
A second core being busy for 0.2% of the time is not noticeable in "top".

  0.26     11.33     0.03                             dlartg_ 
  0.09     11.47     0.01                             blas_thread_server
  0.09     11.48     0.01                             daxpy_k
  0.09     11.49     0.01                             dgemv_t

The 1st function is from reference BLAS, written completely in Fortran, and does not call OpenBLAS.
The 2nd is thread management (you can check whether it goes away with OPENBLAS_NUM_THREADS=1, but it will have no effect).
The last 2 are guarded against premature multithreading (I will try to recalibrate them and report here, but I don't think you will notice a 0.1% speedup at all).

It is probably worth going after the top calls from dlib in the profile and trying to eliminate them, or looking for parallelized versions.
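
The share of time spent in these symbols can be added up from the flat profile lines quoted above. A minimal sketch, assuming the symbol name is the last whitespace-separated field on each line (true for these plain C and Fortran names, not for C++ names that contain spaces); percent_in is a hypothetical helper.

```python
OPENBLAS_SYMBOLS = {"blas_thread_server", "daxpy_k", "dgemv_t"}

PROFILE = """\
  0.26     11.33     0.03                             dlartg_
  0.09     11.47     0.01                             blas_thread_server
  0.09     11.48     0.01                             daxpy_k
  0.09     11.49     0.01                             dgemv_t
"""


def percent_in(symbols, flat_profile):
    """Sum the '% time' column of gprof flat-profile lines whose symbol is in symbols."""
    total = 0.0
    for line in flat_profile.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            percent = float(fields[0])
        except ValueError:
            continue  # header or separator line
        if fields[-1] in symbols:
            total += percent
    return total


if __name__ == "__main__":
    own = percent_in(OPENBLAS_SYMBOLS, PROFILE)
    with_lapack = percent_in(OPENBLAS_SYMBOLS | {"dlartg_"}, PROFILE)
    print(f"OpenBLAS kernels and thread server: {own:.2f}% of run time")
    print(f"including dlartg_: {with_lapack:.2f}% of run time")
```

Either way the total stays well under 1% of the run time, which is why speeding up these functions cannot change the overall picture.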

@MiXaiLL76
Author

The first few functions are dlib, can these functions be multithreaded?

@brada4
Contributor

brada4 commented Nov 23, 2018

That is within the scope of dlib programming. You already got some advice in the dlib issue; try to link it with the top calls found in the profile. OpenBLAS makes up an almost unnoticeable part of your computation. I will check whether those two functions can be improved, but you will not see any improvement in overall run time from that.
EDIT: REF: https://en.wikipedia.org/wiki/Amdahl%27s_law
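
Amdahl's law makes the point concrete. The sketch below takes the parallelizable fraction as roughly 1% (the upper bound on the OpenBLAS share of the profile) and 4 cores as on this board; both inputs are simplifying assumptions for illustration.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Overall speedup when only parallel_fraction of the run time scales with n_workers."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)


if __name__ == "__main__":
    for fraction in (0.01, 0.5, 0.99):
        speedup = amdahl_speedup(fraction, 4)
        print(f"parallel fraction {fraction:.2f} on 4 cores: speedup {speedup:.3f}x")
```

With only about 1% of the work parallelized, four cores give less than a 1% overall gain, so the remaining serial code is what decides the run time.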

@MiXaiLL76
Author

image

@brada4
Contributor

brada4 commented Nov 23, 2018

Since you already use that (OpenBLAS being the closest thing to MKL on an ARM CPU) for 1% of the code, it is probably worth looking into reworking the remaining 99% so that it is either parallel itself or offloads more to parallel OpenBLAS or to GPU OpenCL (translated from CUDA into your SoC's parlance)?
PS: you really need to show them the code you are trying to optimize, otherwise you get this kind of generic "common sense" advice.
PPS: it took tens of posts to work around the absence of a basic Linux performance troubleshooting tool.
EDIT2: the more you use matrix functions, the greater the chance that processing gets offloaded to BLAS: http://dlib.net/linear_algebra.html#matrix

@martin-frbg
Collaborator

Please reopen if you get any clear evidence that this is a problem in OpenBLAS rather than your code or dlib.
