Commit 4602dfa
[tput] empty commit with the current V100/T4 status for the paper
This is the status of the current upstream/master
NB! The EvtsPerSec[MatrixElems] values in CUDA vary wildly.
In double I saw this go from 5.9E8 to 6.7E8 today, and 7.2E8 yesterday, with the same code.
The EvtsPerSec[MECalcOnly] however is much more stable, around 1.37E9 in all those cases.
I will therefore keep the numbers in the paper from the previous measurements and vCHEP
for EvtsPerSec[MatrixElems] (namely 7.25E8 /double and 1.59E9 /float)
and add the recent EvtsPerSec[MECalcOnly] values of 1.37E9 /double and 3.28E9 /float.
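To put a number on "varying wildly", here is a minimal sketch that computes the relative spread of the three double-precision EvtsPerSec[MatrixElems] values quoted above. The helper name `relative_spread` and the choice of (max - min) / mean as the metric are assumptions made for illustration; they are not part of the benchmarking scripts.

```python
# Double-precision EvtsPerSec[MatrixElems] values quoted in the commit message
# (events per second, same code, measured on different occasions).
matrix_elems = [5.9e8, 6.7e8, 7.2e8]


def relative_spread(values):
    """Hypothetical helper: (max - min) / mean of a list of measurements."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean


spread = relative_spread(matrix_elems)
print(f"EvtsPerSec[MatrixElems] relative spread: {spread:.1%}")
```

With these three values the spread comes out at roughly 20% of the mean, which is why the commit message falls back on earlier measurements for this metric.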
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.743408e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.367534e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.743706 sec
2,588,597,917 cycles # 2.648 GHz
3,526,137,396 instructions # 1.36 insn per cycle
1.048524808 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.516160e+09 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.280236e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371686e-02 +- 3.270219e-06 ) GeV^0
TOTAL : 0.690254 sec
2,368,009,568 cycles # 2.649 GHz
3,367,981,007 instructions # 1.42 insn per cycle
0.978942760 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 48
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.301297e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.192022 sec
19,241,362,073 cycles # 2.673 GHz
48,583,081,581 instructions # 2.52 insn per cycle
7.203110555 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.532683e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.831134 sec
12,924,690,078 cycles # 2.671 GHz
29,940,147,028 instructions # 2.32 insn per cycle
4.842102414 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.593254e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.662869 sec
9,247,838,169 cycles # 2.520 GHz
16,560,392,033 instructions # 1.79 insn per cycle
3.673581024 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.892414e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.603951 sec
9,129,054,132 cycles # 2.528 GHz
16,497,282,072 instructions # 1.81 insn per cycle
3.614598246 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.738379e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.020666 sec
8,891,618,256 cycles # 2.208 GHz
13,361,398,930 instructions # 1.50 insn per cycle
4.035327755 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.211023e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371707e-02 +- 3.270376e-06 ) GeV^0
TOTAL : 7.119093 sec
19,060,830,463 cycles # 2.675 GHz
47,728,069,981 instructions # 2.50 insn per cycle
7.129763876 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 578) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.507245e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270375e-06 ) GeV^0
TOTAL : 3.351029 sec
8,971,452,327 cycles # 2.672 GHz
19,719,600,560 instructions # 2.20 insn per cycle
3.362076705 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3719) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 8.160704e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371705e-02 +- 3.270339e-06 ) GeV^0
TOTAL : 2.695110 sec
6,906,959,097 cycles # 2.556 GHz
12,504,366,929 instructions # 1.81 insn per cycle
2.706037106 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 3077) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 8.869343e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371705e-02 +- 3.270339e-06 ) GeV^0
TOTAL : 2.621328 sec
6,734,710,370 cycles # 2.562 GHz
12,522,685,250 instructions # 1.86 insn per cycle
2.631936677 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2917) (512y: 81) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 7.434054e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371705e-02 +- 3.270340e-06 ) GeV^0
TOTAL : 2.750946 sec
6,437,749,555 cycles # 2.334 GHz
10,930,642,414 instructions # 1.70 insn per cycle
2.761224316 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1559) (512y: 179) (512z: 2157)
=========================================================================
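The double-precision C++ results above can be compared more easily as speedups over the scalar build. The following is a small sketch under stated assumptions: the throughputs are copied by hand from the EvtsPerSec[MECalcOnly] lines for itscrd70.cern.ch, and the dictionary layout and variable names are illustrative only.

```python
# EvtsPerSec[MECalcOnly] for EPOCH1_EEMUMU_CPP, DOUBLE, on itscrd70.cern.ch,
# copied by hand from the log above (events per second, 1 OMP thread).
throughput = {
    "none": 1.301297e06,
    "sse4": 2.532683e06,
    "avx2": 4.593254e06,
    "512y": 4.892414e06,
    "512z": 3.738379e06,
}

# Speedup of each SIMD build relative to the scalar ('none') build.
baseline = throughput["none"]
speedups = {mode: rate / baseline for mode, rate in throughput.items()}

# Build with the highest measured throughput.
best = max(speedups, key=speedups.get)

for mode, factor in speedups.items():
    print(f"{mode:>4}: x{factor:.2f}")
print(f"fastest build: {best}")
```

In this data set the '512y' build (AVX512 instructions on 256-bit vectors) is the fastest, while the '512z' build (512-bit vectors) is slower than both 'avx2' and '512y'. The perf output shows a lower clock for '512z' (2.208 GHz against about 2.5 GHz).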
On lxplus770.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla T4]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 3.891814e+07 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 4.009506e+07 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.144038 sec
672,779,391 cycles:u # 0.552 GHz
1,362,529,910 instructions:u # 2.03 insn per cycle
1.285736848 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.377864e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 8.217691e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371686e-02 +- 3.270219e-06 ) GeV^0
TOTAL : 0.825550 sec
386,914,589 cycles:u # 0.437 GHz
828,824,446 instructions:u # 2.14 insn per cycle
0.963103649 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 64
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.275975e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.515231 sec
19,209,549,764 cycles:u # 2.561 GHz
48,627,460,774 instructions:u # 2.53 insn per cycle
7.553956776 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.542904e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.980303 sec
12,899,873,466 cycles:u # 2.584 GHz
29,995,111,387 instructions:u # 2.33 insn per cycle
5.033503232 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.508129e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.899591 sec
9,295,935,243 cycles:u # 2.391 GHz
16,619,353,171 instructions:u # 1.79 insn per cycle
3.969700585 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2746) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.775861e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.870736 sec
9,103,619,309 cycles:u # 2.376 GHz
16,556,440,384 instructions:u # 1.82 insn per cycle
3.911828159 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.381439e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 4.330337 sec
9,151,794,863 cycles:u # 2.106 GHz
13,418,541,159 instructions:u # 1.47 insn per cycle
4.463700520 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1127) (512y: 205) (512z: 2045)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.169022e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371707e-02 +- 3.270376e-06 ) GeV^0
TOTAL : 7.431409 sec
19,242,935,493 cycles:u # 2.591 GHz
47,777,821,032 instructions:u # 2.48 insn per cycle
7.485035275 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 578) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.425426e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270375e-06 ) GeV^0
TOTAL : 3.526400 sec
8,942,606,781 cycles:u # 2.588 GHz
19,782,130,775 instructions:u # 2.21 insn per cycle
3.631748518 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3719) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 8.068817e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371705e-02 +- 3.270339e-06 ) GeV^0
TOTAL : 2.782846 sec
6,900,906,289 cycles:u # 2.473 GHz
12,568,910,963 instructions:u # 1.82 insn per cycle
2.833799464 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 3077) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 8.514326e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371705e-02 +- 3.270339e-06 ) GeV^0
TOTAL : 2.790172 sec
6,805,090,790 cycles:u # 2.475 GHz
12,587,787,277 instructions:u # 1.85 insn per cycle
2.943286999 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2917) (512y: 81) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 6.798061e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371705e-02 +- 3.270340e-06 ) GeV^0
TOTAL : 2.938570 sec
6,641,119,578 cycles:u # 2.249 GHz
10,993,670,890 instructions:u # 1.66 insn per cycle
2.979389891 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 1559) (512y: 179) (512z: 2157)
=========================================================================
1 parent 795db2a commit 4602dfa