# Validation
View the validation analysis at https://nbviewer.jupyter.org/github/RRZE-HPC/OSACA/blob/master/validation/Analysis.ipynb
To reconstruct, download `validation-data.tar.gz` from a GitHub release and extract it into the base folder (`OSACA`). Alternatively, update the configuration in `build_and_run.py` and run your own measurements.
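
For the extraction step, a minimal sketch in Python (assuming the archive was downloaded into the `OSACA` base folder; the asset name on the release page may differ):

```python
import tarfile

# Unpack the validation data into the repository base folder ("OSACA").
with tarfile.open("validation-data.tar.gz") as tar:
    tar.extractall(".")
```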
## Dev Stories
### ZEN2: only simple or all store AGU on port 10?
`('ZEN2','clang','O1','add')`
was predicted too slow (1.5 vs. 1.0 cy/it) with only simple address generation on port 10. With port 10 restricted to store address generation, but handling both complex and simple addresses, the prediction fits perfectly. This change generally moved too-slow predictions into the "perfect" or "too fast" categories.
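
A toy calculation shows why the binding matters; the µop counts are hypothetical (e.g., two complex loads and one complex store per iteration), chosen only to reproduce the 1.5 vs. 1.0 cy/it gap, not measured ZEN2 numbers:

```python
# Toy port-binding model: the throughput bound is the number of AGU uops
# divided by the number of ports that may execute them.

def tp_bound(agu_uops, usable_ports):
    return agu_uops / usable_ports

# Port 10 limited to *simple* store addresses: the complex store address
# has to go to ports 8/9, so 3 AGU uops share only 2 ports.
print(tp_bound(3, 2))  # 1.5 cy/it -- the too-slow prediction

# Port 10 handles *all* store address generation: all 3 AGU ports usable.
print(tp_bound(3, 3))  # 1.0 cy/it -- fits the measurement
```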
### ZEN & ZEN2: LEA on ALU or AGU?
`('ZEN','clang','O2','3d-r3-11pt')`
was predicted too slow (10 vs. 8.3 cy/it) with LEAs bound to the AGUs. Binding them to the ALUs instead on both ZEN and ZEN2 moved too-slow predictions into the perfect or slightly-too-fast regions.
### sumreduction and gs-2d-5pt: overlapping iterations
If the compiler was unable to remove the dependency chain in this kernel, performance can still exceed predictions because of the way the benchmarks were run. With a short inner ("kernel") loop and a long outer ("repeat") loop, kernel iterations from different repeats can overlap and lead to better-than-predicted measurements (which would normally contradict the model assumptions). The measured performance converges towards the prediction as the inner loop gets longer.
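
A toy model (not OSACA code; latency and throughput numbers are invented) illustrates this: dependency chains of different repeats are independent, so short inner loops run closer to the throughput limit than to the latency-bound prediction:

```python
# Each inner iteration depends on the previous one within the same repeat;
# chains of different repeats are independent. A shared issue resource
# serializes iteration starts at `tp` cycles apart.

def cycles_per_it(n_inner, n_repeat, lat=4.0, tp=1.0):
    issue = 0.0   # next free issue slot (shared throughput resource)
    last = 0.0    # completion time of the latest finished iteration
    for _ in range(n_repeat):
        chain = 0.0  # every repeat starts a fresh, independent chain
        for _ in range(n_inner):
            start = max(issue, chain)
            issue = start + tp   # throughput: one start per tp cycles
            chain = start + lat  # latency: next iteration waits on this
            last = max(last, chain)
    return last / (n_inner * n_repeat)

for n in (1, 4, 16, 256):
    print(n, round(cycles_per_it(n, 1000), 2))
# -> 1.0, 3.25, 3.81, 3.99 cy/it: short inner loops beat the latency-bound
#    prediction (4.0 cy/it) and converge towards it as the loop gets longer.
```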
### Special knowledge on scheduling
`('IVB','clang','O3','3d-7pt')`
IACA predicts port 5 as the bottleneck and thus a throughput which fits the measurement. Ports 1 and 2 could theoretically take load off port 5 if scheduling were perfect, because all instructions on port 5 could also be executed on ports 1 and 2. The scheduling decision is unexplained, but since the relevant instructions are all on the critical path, this could mean that instructions reusing the results of previous ones are preferably scheduled on the same port.
### Frontend bottlenecks
IACA models the front end and therefore predicts better in scenarios where the front end is the bottleneck.
### Pre-indexed addressing
`('TX2','gcc','O2','gs-2d-5pt')`
The prediction is decent with pre-indexed addressing support enabled.
### Undetected memory dependency
`('A64FX','gcc','O1','gs-2d-5pt')`
### ZEN2: Strange load throughput
`build/ZEN2/clang/O3/3d-27pt.marked.s`
is almost perfectly predicted by LLVM-MCA, because it assigns `vaddsd` to a single port. This contradicts micro-benchmarks, where `vaddsd` has a throughput of 0.5 cy. The kernel is also heavy on loads.
Checking with the `('ZEN2','clang','Ofast','copy')` kernel, which is load/store bound with vector instructions, reveals that neither LLVM-MCA nor OSACA predicts it well (33% and 27% relative error, respectively).
1 cy/load: `('ZEN2','icc','Ofast','copy')` with `vmovupd (%r8,%rax,8), %ymm0; vmovupd %ymm0, (%rdx,%rax,8)`:

```asm
400f83: c4 c1 7d 10 04 c0    vmovupd (%r8,%rax,8),%ymm0
400f89: c5 fd 11 04 c2       vmovupd %ymm0,(%rdx,%rax,8)
400f8e: 48 83 c0 04          add    $0x4,%rax
400f92: 49 3b c1             cmp    %r9,%rax
400f95: 72 ec                jb     400f83 <kernel+0xe3>
```

2 cy/load: `('ZEN2','gcc','Ofast','copy')` with `vmovupd (%r15,%rax), %ymm1; vmovupd %ymm1, (%rdx,%rax)`:

```asm
400df0: c4 c1 7d 10 0c 07    vmovupd (%r15,%rax,1),%ymm1
400df6: c5 fd 11 0c 02       vmovupd %ymm1,(%rdx,%rax,1)
400dfb: 48 83 c0 20          add    $0x20,%rax
400dff: 48 39 d8             cmp    %rbx,%rax
400e02: 75 ec                jne    400df0 <kernel+0xd0>
```

2 cy/load: `('ZEN2','clang','Ofast','copy')` with 4x unrolled `vmovups (%rbp,%rax,8), %ymm0; vmovups %ymm0, (%rdi,%rax,8)`:

```asm
4009e0: c5 fc 10 44 c5 00    vmovups 0x0(%rbp,%rax,8),%ymm0
4009e6: c5 fc 10 4c c5 20    vmovups 0x20(%rbp,%rax,8),%ymm1
4009ec: c5 fc 10 54 c5 40    vmovups 0x40(%rbp,%rax,8),%ymm2
4009f2: c5 fc 10 5c c5 60    vmovups 0x60(%rbp,%rax,8),%ymm3
4009f8: c5 fc 11 04 c7       vmovups %ymm0,(%rdi,%rax,8)
4009fd: c5 fc 11 4c c7 20    vmovups %ymm1,0x20(%rdi,%rax,8)
400a03: c5 fc 11 54 c7 40    vmovups %ymm2,0x40(%rdi,%rax,8)
400a09: c5 fc 11 5c c7 60    vmovups %ymm3,0x60(%rdi,%rax,8)
400a0f: 48 83 c0 10          add    $0x10,%rax
400a13: 49 39 c5             cmp    %rax,%r13
400a16: 75 c8                jne    4009e0 <kernel+0x90>
```

1 cy/load: `likwid-bench -t copy_avx -w S0:8kB:1`:

```asm
bbc0: c5 fc 28 0c c6       vmovaps (%rsi,%rax,8),%ymm1
bbc5: c5 fc 28 54 c6 20    vmovaps 0x20(%rsi,%rax,8),%ymm2
bbcb: c5 fc 28 5c c6 40    vmovaps 0x40(%rsi,%rax,8),%ymm3
bbd1: c5 fc 28 64 c6 60    vmovaps 0x60(%rsi,%rax,8),%ymm4
bbd7: c5 fc 29 0c c2       vmovaps %ymm1,(%rdx,%rax,8)
bbdc: c5 fc 29 54 c2 20    vmovaps %ymm2,0x20(%rdx,%rax,8)
bbe2: c5 fc 29 5c c2 40    vmovaps %ymm3,0x40(%rdx,%rax,8)
bbe8: c5 fc 29 64 c2 60    vmovaps %ymm4,0x60(%rdx,%rax,8)
bbee: 48 83 c0 10          add    $0x10,%rax
bbf2: 48 39 f8             cmp    %rdi,%rax
bbf5: 7c c9                jl     bbc0 <copy_avx+0x40>
```
### Update kernel: front-end and branch TP limit
On ZEN2 the update kernel is predicted 50% faster than measured, because OSACA does not see a bottleneck due to the short loop body. When unrolled, the performance increases and is well predicted.
On SKX, unrolling and wider SIMD do not give better performance.
"The fused branch instructions can execute at a throughput of two such branches per clock
|
|
cycle if they are not taken, or one branch per two clock cycles if taken" (21.4., Agner Fog)
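
A back-of-the-envelope calculation (hypothetical 1 cy/it work bound, taken-branch throughput of one per two cycles as quoted above) shows why unrolling helps:

```python
# The loop body needs `work` cycles per original iteration (hypothetical),
# and the taken loop-closing branch allows at most one taken branch per
# two cycles (see the Agner Fog quote above).

def cy_per_it(unroll, work=1.0, taken_branch=2.0):
    body = unroll * work                 # cycles of work in the unrolled body
    return max(body, taken_branch) / unroll

for u in (1, 2, 4):
    print(u, cy_per_it(u))
# -> 2.0, 1.0, 1.0 cy/it: without unrolling the taken branch limits the loop
#    to 2 cy/it; unrolling amortizes it and the work bound takes over.
```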
### Store AGU usage on ZEN
When looking at predictions of the copy kernel, it became clear that a 256-bit wide store also requires two µops on the store AGUs, because two addresses need to be generated.
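
A sketch of the resulting bound (the port and µop counts are illustrative, not exact ZEN figures): if a 256-bit load and a 256-bit store each split into two halves, one copy iteration puts four address generations on two AGUs:

```python
# Illustrative AGU pressure per 256-bit copy iteration on a core with two
# AGUs (hypothetical count): the bound is AGU uops divided by AGU ports.

def agu_bound(load_addr_uops, store_addr_uops, n_agus=2):
    return (load_addr_uops + store_addr_uops) / n_agus

print(agu_bound(2, 1))  # 1.5 cy/it if a 256-bit store needed one address uop
print(agu_bound(2, 2))  # 2.0 cy/it with the two store AGU uops described above
```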