# Validation
View the validation analysis at https://nbviewer.jupyter.org/github/RRZE-HPC/OSACA/blob/master/validation/Analysis.ipynb
To reconstruct, download `validation-data.tar.gz` from a GitHub release and extract it into the base folder (`OSACA`). Alternatively, update the configuration in `build_and_run.py` and run your own measurements.
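A minimal Python sketch of these reconstruction steps; the release tag below is a placeholder and has to be replaced with the tag that actually ships the archive:

```python
# Sketch only: pick the release tag that provides validation-data.tar.gz
# on the GitHub releases page; "vX.Y.Z" is just a placeholder.
import tarfile
import urllib.request

tag = "vX.Y.Z"  # placeholder, not a real tag
url = f"https://github.com/RRZE-HPC/OSACA/releases/download/{tag}/validation-data.tar.gz"
archive = "validation-data.tar.gz"

urllib.request.urlretrieve(url, archive)
with tarfile.open(archive) as tar:
    tar.extractall(".")  # run from the OSACA base folder
```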
## Dev Stories
### ZEN2: only simple or all store AGU on port 10?
`('ZEN2','clang','O1','add')`
was predicted too slow (1.5 vs. 1.0 cy/it) with only simple address generation for stores on port 10.
With both complex and simple store address generation allowed on port 10, the prediction fits perfectly, and the change generally moved too-slow predictions into the "perfect" or "too fast" categories.
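A small Python sketch of the port-pressure reasoning behind this; the instruction mix (two loads and one store address generation per iteration) and the port names are assumptions for illustration only:

```python
# Illustrative throughput model: the bottleneck is the most loaded port,
# given fractional µop pressure per port (cy/it).
def cycles_per_iteration(pressure):
    """Return the bottleneck pressure, i.e. the predicted cy/it."""
    return max(pressure.values())

# Only simple store addressing on port 10: the complex store address must
# share the two general AGUs with the two loads -> 3 µops on 2 ports.
print(cycles_per_iteration({"8": 1.5, "9": 1.5, "10": 0.0}))  # 1.5 cy/it

# Simple and complex store address generation allowed on port 10:
# one address generation per port and iteration.
print(cycles_per_iteration({"8": 1.0, "9": 1.0, "10": 1.0}))  # 1.0 cy/it
```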
### ZEN & ZEN2: LEA on ALU or AGU?
`('ZEN','clang','O2','3d-r3-11pt')`
was predicted too slow (10 vs. 8.3 cy/it) with LEAs bound to the AGUs. Binding LEAs to the ALUs instead on ZEN and ZEN2 moved too-slow predictions into the perfect or slightly-too-fast regions.
### sumreduction and gs-2d-5pt: overlapping iterations
If the compiler was unable to remove the dependency chain in this kernel, performance can still exceed the prediction because of the way the benchmarks were run. With a short inner ("kernel") loop and a long outer ("repeat") loop, kernel iterations from consecutive repeats can overlap and lead to better-than-predicted measurements (which would normally contradict the model assumptions). The measured performance converges towards the prediction as the inner loop gets longer.
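A toy numerical sketch of this convergence; all numbers are invented purely to show the trend, assuming out-of-order execution hides a roughly constant number of cycles across consecutive repeats:

```python
# Toy model: the tail of one kernel call overlaps with the head of the next,
# hiding a roughly constant number of cycles per outer-loop repeat.
def measured_cy_per_it(inner_iters, pred_cy_per_it, overlap_cy):
    total = max(inner_iters * pred_cy_per_it - overlap_cy, 0.0)
    return total / inner_iters

pred, overlap = 4.0, 20.0  # invented numbers for illustration
for n in (10, 100, 1000, 10000):
    print(n, round(measured_cy_per_it(n, pred, overlap), 3))
# 10 -> 2.0, 100 -> 3.8, 1000 -> 3.98, 10000 -> 3.998: converges towards the prediction
```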
### Special knowledge on scheduling
`('IVB','clang','O3','3d-7pt')`
IACA predicts port 5 as the bottleneck and thus a throughput which fits the measurement. Ports 1 and 2 could theoretically take load off port 5 if scheduled perfectly, because all instructions on port 5 could also be executed on ports 1 and 2. The scheduling decision is unexplained, but since the relevant instructions are all on the critical path, this could mean that instructions reusing results of previous ones are preferably scheduled onto the same port.
### Frontend bottlenecks
IACA models the front end and therefore predicts better in scenarios where the front end is the bottleneck.
### Pre-indexed addressing
`('TX2','gcc','O2','gs-2d-5pt')`
The prediction is decent with `pre_indexed` support.
### Undetected memory dependency
`('A64FX','gcc','O1','gs-2d-5pt')`
### ZEN2: Strange load throughput
`build/ZEN2/clang/O3/3d-27pt.marked.s`
is almost perfectly predicted by LLVM-MCA, because it assigns `vaddsd` to a single port. This contradicts micro-benchmarks, where `vaddsd` has a throughput of 0.5 cy. The kernel is also heavy on loads.
Checking with the `('ZEN2','clang','Ofast','copy')` kernel, which is load/store bound with vector instructions, reveals that neither LLVM-MCA nor OSACA predicts it well (33% and 27% relative error, respectively).
1 cy/load for `('ZEN2','icc','Ofast','copy')` with `vmovupd (%r8,%rax,8), %ymm0; vmovupd %ymm0, (%rdx,%rax,8)`:
```asm
400f83: c4 c1 7d 10 04 c0    vmovupd (%r8,%rax,8),%ymm0
400f89: c5 fd 11 04 c2       vmovupd %ymm0,(%rdx,%rax,8)
400f8e: 48 83 c0 04          add    $0x4,%rax
400f92: 49 3b c1             cmp    %r9,%rax
400f95: 72 ec                jb     400f83 <kernel+0xe3>
```
2 cy/load for `('ZEN2','gcc','Ofast','copy')` with `vmovupd (%r15,%rax), %ymm1; vmovupd %ymm1, (%rdx,%rax)`:
```asm
400df0: c4 c1 7d 10 0c 07    vmovupd (%r15,%rax,1),%ymm1
400df6: c5 fd 11 0c 02       vmovupd %ymm1,(%rdx,%rax,1)
400dfb: 48 83 c0 20          add    $0x20,%rax
400dff: 48 39 d8             cmp    %rbx,%rax
400e02: 75 ec                jne    400df0 <kernel+0xd0>
```
2 cy/load for `('ZEN2','clang','Ofast','copy')` with 4x unrolled `vmovups (%rbp,%rax,8), %ymm0; vmovups %ymm0, (%rdi,%rax,8)`:
```asm
4009e0: c5 fc 10 44 c5 00    vmovups 0x0(%rbp,%rax,8),%ymm0
4009e6: c5 fc 10 4c c5 20    vmovups 0x20(%rbp,%rax,8),%ymm1
4009ec: c5 fc 10 54 c5 40    vmovups 0x40(%rbp,%rax,8),%ymm2
4009f2: c5 fc 10 5c c5 60    vmovups 0x60(%rbp,%rax,8),%ymm3
4009f8: c5 fc 11 04 c7       vmovups %ymm0,(%rdi,%rax,8)
4009fd: c5 fc 11 4c c7 20    vmovups %ymm1,0x20(%rdi,%rax,8)
400a03: c5 fc 11 54 c7 40    vmovups %ymm2,0x40(%rdi,%rax,8)
400a09: c5 fc 11 5c c7 60    vmovups %ymm3,0x60(%rdi,%rax,8)
400a0f: 48 83 c0 10          add    $0x10,%rax
400a13: 49 39 c5             cmp    %rax,%r13
400a16: 75 c8                jne    4009e0 <kernel+0x90>
```
1 cy/load for `likwid-bench -t copy_avx -w S0:8kB:1`:
```asm
bbc0: c5 fc 28 0c c6       vmovaps (%rsi,%rax,8),%ymm1
bbc5: c5 fc 28 54 c6 20    vmovaps 0x20(%rsi,%rax,8),%ymm2
bbcb: c5 fc 28 5c c6 40    vmovaps 0x40(%rsi,%rax,8),%ymm3
bbd1: c5 fc 28 64 c6 60    vmovaps 0x60(%rsi,%rax,8),%ymm4
bbd7: c5 fc 29 0c c2       vmovaps %ymm1,(%rdx,%rax,8)
bbdc: c5 fc 29 54 c2 20    vmovaps %ymm2,0x20(%rdx,%rax,8)
bbe2: c5 fc 29 5c c2 40    vmovaps %ymm3,0x40(%rdx,%rax,8)
bbe8: c5 fc 29 64 c2 60    vmovaps %ymm4,0x60(%rdx,%rax,8)
bbee: 48 83 c0 10          add    $0x10,%rax
bbf2: 48 39 f8             cmp    %rdi,%rax
bbf5: 7c c9                jl     bbc0 <copy_avx+0x40>
```
### Update kernel: front-end and branch throughput limit
On ZEN2 the update kernel is predicted 50% faster than measured, because OSACA does not see the bottleneck caused by the short loop body. When unrolled, performance increases and is well predicted.
On SKX, unrolling and wider SIMD do not give better performance.
"The fused branch instructions can execute at a throughput of two such branches per clock cycle if they are not taken, or one branch per two clock cycles if taken" (Agner Fog, 21.4).
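A short Python sketch of why a very short loop body hits this limit; the 1 cy/it port-bound prediction and the unroll factors are illustrative numbers, not measured values:

```python
# Taken-branch throughput limit from the quote above: one taken branch per 2 cycles.
TAKEN_BRANCH_CY = 2.0

def branch_limited_cy_per_it(port_bound_cy_per_it, unroll):
    """Cycles per original iteration: the branch cost is paid once per unrolled loop trip."""
    per_trip = max(unroll * port_bound_cy_per_it, TAKEN_BRANCH_CY)
    return per_trip / unroll

# With a 1 cy/it port-bound prediction, the un-unrolled loop is branch-limited
# to 2 cy/it; unrolling removes the limit and the port-bound prediction fits.
for unroll in (1, 2, 4):
    print(unroll, branch_limited_cy_per_it(1.0, unroll))
# 1 -> 2.0 cy/it, 2 -> 1.0 cy/it, 4 -> 1.0 cy/it
```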
### Store AGU usage on ZEN
When looking at predictions for the copy kernel, it became clear that the store AGUs also require two µops if a 256-bit wide store is issued and two addresses need to be generated.
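A minimal sketch of that observation, assuming 128-bit wide store data paths on ZEN so that wide stores are split:

```python
# Illustration of the note above: a 256-bit store on ZEN occupies the store
# address-generation resources with two µops instead of one.
def store_agu_uops(store_width_bits, datapath_bits=128):
    """µops needing store address generation per store (wide stores are split)."""
    return -(-store_width_bits // datapath_bits)  # ceiling division

print(store_agu_uops(128))  # 1 address to generate
print(store_agu_uops(256))  # 2 addresses to generate
```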