cycle-by-cycle measurements

Andreas Abel
2022-01-21 00:12:50 +01:00
parent 7f7a7eb53a
commit d3d4060be3
6 changed files with 655 additions and 83 deletions


@@ -23,6 +23,8 @@ More information about *nanoBench* can be found in the paper [nanoBench: A Low-O
### Kernel Module
*Note: The following is not necessary if you would just like to use the user-space version.*
sudo apt install python3 python3-pip
pip3 install plotly
git clone https://github.com/andreas-abel/nanoBench.git
cd nanoBench
make kernel
@@ -41,7 +43,7 @@ For obtaining repeatable results, it can help to disable hyper-threading. This c
The following command will benchmark the assembler code sequence "ADD RAX, RBX; ADD RBX, RAX" on a Skylake-based system.
sudo ./nanoBench.sh -asm "ADD RAX, RBX; add RBX, RAX" -config configs/cfg_Skylake_common.txt
sudo ./nanoBench.sh -asm "ADD RAX, RBX; ADD RBX, RAX" -config configs/cfg_Skylake_common.txt
It will produce an output similar to the following.
@@ -71,7 +73,7 @@ All other registers have initially undefined values. They can, however, be initi
### Example 2: Load Latency
sudo ./nanoBench.sh -asm_init "mov RAX, R14; sub RAX, 8; mov [RAX], RAX" -asm "mov RAX, [RAX]" -config configs/cfg_Skylake_common.txt
sudo ./nanoBench.sh -asm_init "MOV RAX, R14; SUB RAX, 8; MOV [RAX], RAX" -asm "MOV RAX, [RAX]" -config configs/cfg_Skylake_common.txt
The `-asm_init` code is executed once at the beginning. It first sets RAX to R14-8 (thus, RAX now contains a valid memory address), and then sets the memory at address RAX to its own address. Then, the `-asm` code is executed repeatedly. This code loads the value at the address in RAX into RAX. Thus, the execution time of this instruction corresponds to the L1 data cache latency.
@@ -121,8 +123,6 @@ The result that is finally reported by *nanoBench* is the difference between the
Before the first execution of `run(...)`, the performance counters are configured according to the event specifications in the `-config` file. If this file contains more events than there are programmable performance counters available, `run(...)` is executed multiple times with different performance counter configurations.
## Command-line Options
Both `nanoBench.sh` and `kernel-nanoBench.sh` support the following command-line parameters. All parameters are optional. Parameter names may be abbreviated if the abbreviation is unique (e.g., `-l` may be used instead of `-loop_count`).
@@ -173,6 +173,23 @@ The following parameter is only supported by `kernel-nanoBench.sh`.
|----------------------|-------------|
| `-msr_config <file>` | File with performance counter event specifications for counters that can only be read with the `RDMSR` instruction, such as uncore counters. Details are described [below](#msr-performance-counter-config-files). |
## Cycle-by-Cycle Measurements
The `cycleByCycle.py` script provides the option to perform cycle-by-cycle measurements on recent Intel CPUs. This is achieved by enabling the `Freeze_Perfmon_On_PMI` feature, by setting the value of the core cycles counter to *N* cycles below overflow, and by repeating the measurements multiple times with different values for *N*. This approach is based on [Brandon Falk's Sushi Roll technique](https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll.html).
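The sweep over *N* can be illustrated with a small simulation (hypothetical event counts, not the actual implementation): freezing the counters after exactly *N* cycles yields a cumulative event count, and taking differences between consecutive values of *N* reconstructs a per-cycle trace.

```python
# Illustrative simulation of the cycle-by-cycle sweep (not the real kernel code).
# Presetting the cycle counter to N below overflow freezes all counters after
# N cycles; sweeping N and differencing the results yields per-cycle counts.

def run_frozen_after(event_trace, n):
    """Cumulative event count after the first n cycles (simulated)."""
    return sum(event_trace[:n])

def cycle_by_cycle(event_trace):
    total_cycles = len(event_trace)
    cumulative = [run_frozen_after(event_trace, n) for n in range(total_cycles + 1)]
    # Per-cycle counts are the differences of consecutive cumulative values.
    return [b - a for a, b in zip(cumulative, cumulative[1:])]

uops_per_cycle = [0, 4, 4, 2, 0, 1]    # hypothetical event counts per cycle
print(cycle_by_cycle(uops_per_cycle))  # recovers the per-cycle trace
```

In the real tool, each "run frozen after n cycles" step is a full measurement repeated `n_measurements` times, with an aggregate (by default the median) taken per value of *N*.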
As an example, the script can be used as follows.
sudo ./cycleByCycle.py -asm "MOVQ XMM0, RAX; MOVQ RAX, XMM0" -config configs/cfg_Skylake_common.txt -unroll 10
`cycleByCycle.py` supports most of the same options as `kernel-nanoBench.sh`, with the following exceptions: the `-fixed_counters` and `-msr_config` options are not available; the `-basic_mode`, `-df`, and `-no_normalization` options are enabled by default; the default for the `-unroll_count` parameter is `1`; and the default aggregate function is the median.
`cycleByCycle.py` supports the following additional parameters.
| Option | Description |
|--------------------|-------------|
| `-html <filename>` | Generates an HTML file with a graphical representation of the measurement data. The filename is optional. `[Default: graph.html]` |
| `-csv <filename>` | Generates a CSV file that contains the measurement data. The filename is optional. `[Default: stdout]` |
| `-end_to_end` | By default, `cycleByCycle.py` tries to remove the overhead that comes from the instructions that enable/disable the performance counters, and from the instructions that drain the front end before/after the code of the benchmark is executed. However, this does not always work properly. In such cases, the `-end_to_end` option can be used; with this option, the output includes all of the overhead. |
## Performance Counter Config Files
@@ -198,7 +215,7 @@ can be used to count the number of last-level cache lookups in C-Box 0 on a Skyl
## Pausing Performance Counting
If the `-no_mem` option is used, nanoBench provides a feature to temporarily pause performance counting. This is enabled by including the *magic* byte sequences `0xF0B513B1C2813F04` (for stopping the counters), and `0xE0B513B1C2813F04` (for restarting them) in the code of the microbenchmark.
If the `-no_mem` option is used, nanoBench provides a feature to temporarily pause performance counting (however, this feature is not available for cycle-by-cycle measurements). This is enabled by including the *magic* byte sequences `0xF0B513B1C2813F04` (for stopping the counters), and `0xE0B513B1C2813F04` (for restarting them) in the code of the microbenchmark.
Using this feature incurs a certain timing overhead that is included in the measurement results. It is therefore primarily useful for microbenchmarks that do not measure time but, e.g., cache hits or misses, such as the microbenchmarks generated by the tools in [tools/CacheAnalyzer](tools/CacheAnalyzer).
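As a sketch, a benchmark that pauses counting around an untimed section could be assembled like this (the magic constants are the ones given above; the surrounding instructions are placeholders):

```python
# Sketch: wrapping part of a microbenchmark in the stop/restart magic byte
# sequences (only usable together with the -no_mem option).
MAGIC_STOP  = '.quad 0xF0B513B1C2813F04'  # stops the performance counters
MAGIC_START = '.quad 0xE0B513B1C2813F04'  # restarts them

def with_paused_section(measured_code, untimed_code):
    """Build an asm string in which untimed_code runs with counting paused."""
    return '; '.join([measured_code, MAGIC_STOP, untimed_code, MAGIC_START])

# placeholder instructions, chosen for illustration only
asm = with_paused_section('MOV RAX, [R14]', 'ADD R14, 64')
print(asm)
```

The resulting string can be passed to nanoBench via `-asm`; the pointer increment between loads is then excluded from the counts.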
@@ -208,6 +225,6 @@ If the debug mode is enabled, the [generated code](#generated-code) contains a b
## Supported Platforms
*nanoBench* should work with all Intel processors supporting architectural performance monitoring version ≥ 2, as well as with AMD Family 17h processors.
*nanoBench* should work with all Intel processors supporting architectural performance monitoring version ≥ 2, as well as with AMD Family 17h processors. Cycle-by-cycle measurements are only available on Intel CPUs with at least four programmable performance counters.
The code was developed and tested using Ubuntu 18.04.
The code was developed and tested using Ubuntu 18.04 and 20.04.


@@ -571,16 +571,6 @@ void create_runtime_code(char* measurement_template, long local_unroll_count, lo
*(int32_t*)(&runtime_code[rcI]) = (int32_t)local_loop_count; rcI += 4; // mov R15, local_loop_count
}
if (drain_frontend) {
strcpy(&runtime_code[rcI], "\x0F\xAE\xE8"); rcI += 3; // lfence
for (int i=0; i<192; i++) {
strcpy(&runtime_code[rcI], NOPS[1]); rcI += 1;
}
for (int i=0; i<64; i++) {
strcpy(&runtime_code[rcI], NOPS[15]); rcI += 15;
}
}
int dist = get_distance_to_code(measurement_template, templateI) + code_late_init_length;
int n_fill = (64 - ((uintptr_t)&runtime_code[rcI+dist] % 64)) % 64;
n_fill += alignment_offset;
@@ -589,6 +579,16 @@ void create_runtime_code(char* measurement_template, long local_unroll_count, lo
strcpy(&runtime_code[rcI], NOPS[nop_len]); rcI += nop_len;
n_fill -= nop_len;
}
if (drain_frontend) {
strcpy(&runtime_code[rcI], "\x0F\xAE\xE8"); rcI += 3; // lfence
for (int i=0; i<189; i++) {
strcpy(&runtime_code[rcI], NOPS[1]); rcI += 1;
}
for (int i=0; i<64; i++) {
strcpy(&runtime_code[rcI], NOPS[15]); rcI += 15;
}
}
} else if (starts_with_magic_bytes(&measurement_template[templateI], MAGIC_BYTES_PFC_START)) {
magic_bytes_pfc_start_I = templateI;
templateI += 8;
@@ -1567,6 +1567,52 @@ void measurement_RDMSR_template_noMem() {
asm(".quad "STRINGIFY(MAGIC_BYTES_TEMPLATE_END));
}
void measurement_cycleByCycle_template_Intel() {
SAVE_REGS_FLAGS();
asm(".intel_syntax noprefix \n"
".quad "STRINGIFY(MAGIC_BYTES_INIT) " \n"
"push rax \n"
"push rcx \n"
"push rdx \n"
"mov rcx, 0x38F \n"
"mov rax, 0xF \n"
"mov rdx, 0x7 \n"
"wrmsr \n"
"pop rdx \n"
"pop rcx \n"
"pop rax \n"
"lfence \n"
".quad "STRINGIFY(MAGIC_BYTES_CODE) " \n"
"lfence \n"
"mov rcx, 0x38F \n"
"mov rax, 0x0 \n"
"mov rdx, 0x0 \n"
"wrmsr \n"
".att_syntax prefix \n");
RESTORE_REGS_FLAGS();
asm(".quad " STRINGIFY(MAGIC_BYTES_TEMPLATE_END));
}
void measurement_cycleByCycle_template_Intel_noMem() {
SAVE_REGS_FLAGS();
asm(".intel_syntax noprefix \n"
".quad "STRINGIFY(MAGIC_BYTES_INIT) " \n"
"mov rcx, 0x38F \n"
"mov rax, 0xF \n"
"mov rdx, 0x7 \n"
"wrmsr \n"
"lfence \n"
".quad "STRINGIFY(MAGIC_BYTES_CODE) " \n"
"lfence \n"
"mov rcx, 0x38F \n"
"mov rax, 0x0 \n"
"mov rdx, 0x0 \n"
"wrmsr \n"
".att_syntax prefix \n");
RESTORE_REGS_FLAGS();
asm(".quad " STRINGIFY(MAGIC_BYTES_TEMPLATE_END));
}
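The `WRMSR` to `0x38F` in both templates writes `IA32_PERF_GLOBAL_CTRL` (the 64-bit value is taken from EDX:EAX). A quick check of which counters `EAX=0xF, EDX=0x7` enables; the bit layout here follows the Intel SDM and is not stated in the source itself:

```python
# Decode the IA32_PERF_GLOBAL_CTRL value written by the measurement templates:
# wrmsr takes the value from EDX:EAX, so the 64-bit value is (edx << 32) | eax.
eax, edx = 0xF, 0x7
value = (edx << 32) | eax

programmable = [i for i in range(32) if value & (1 << i)]        # bits 0..n-1: enable PMC_i
fixed        = [i for i in range(3) if value & (1 << (32 + i))]  # bits 32..34: enable fixed ctr i

print(programmable, fixed)  # four programmable and three fixed counters enabled
```

This matches the requirement stated in the README that cycle-by-cycle measurements need at least four programmable counters.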
void one_time_init_template() {
SAVE_REGS_FLAGS();
asm(".quad "STRINGIFY(MAGIC_BYTES_INIT));


@@ -80,6 +80,10 @@
#define CORE_X86_MSR_PERF_CTR 0xC0010201
#endif
#define FIXED_CTR_INST_RETIRED 0
#define FIXED_CTR_CORE_CYCLES 1
#define FIXED_CTR_REF_CYCLES 2
// How often the measurement will be repeated.
extern long n_measurements;
@@ -311,6 +315,8 @@ void measurement_RDTSC_template(void);
void measurement_RDTSC_template_noMem(void);
void measurement_RDMSR_template(void);
void measurement_RDMSR_template_noMem(void);
void measurement_cycleByCycle_template_Intel(void);
void measurement_cycleByCycle_template_Intel_noMem(void);
void one_time_init_template(void);
void initial_warm_up_template(void);
@@ -357,4 +363,4 @@ void initial_warm_up_template(void);
"pop rbx\n" \
".att_syntax noprefix");
#endif
#endif

cycleByCycle.py Executable file

@@ -0,0 +1,131 @@
#!/usr/bin/env python3
import argparse
import os
import sys
from kernelNanoBench import *
from tools.CPUID.cpuid import CPUID, micro_arch
def writeHtmlFile(filename, title, head, body, includeDOCTYPE=True):
with open(filename, 'w') as f:
if includeDOCTYPE:
f.write('<!DOCTYPE html>\n')
f.write('<html>\n'
'<head>\n'
'<meta charset="utf-8"/>'
'<title>' + title + '</title>\n'
+ head +
'</head>\n'
'<body>\n'
+ body +
'</body>\n'
'</html>\n')
def main():
parser = argparse.ArgumentParser(description='Cycle-by-Cycle Measurements')
parser.add_argument('-html', help='HTML filename [Default: graph.html]', nargs='?', const='', metavar='filename')
parser.add_argument('-csv', help='CSV filename [Default: stdout]', nargs='?', const='', metavar='filename')
parser.add_argument('-end_to_end', action='store_true', help='Do not try to remove overhead.')
parser.add_argument('-asm', metavar='code', help='Assembler code string (in Intel syntax) to be benchmarked.')
parser.add_argument('-asm_init', metavar='code', help='Assembler code string (in Intel syntax) to be executed once in the beginning.')
parser.add_argument('-asm_late_init', metavar='code', help='Assembler code string (in Intel syntax) to be executed once immediately before the code to be benchmarked.')
parser.add_argument('-asm_one_time_init', metavar='code', help='Assembler code string (in Intel syntax) to be executed once before the first measurement.')
parser.add_argument('-code', metavar='filename', help='Binary file containing the code to be benchmarked.')
parser.add_argument('-code_init', metavar='filename', help='Binary file containing code to be executed once in the beginning.')
parser.add_argument('-code_late_init', metavar='filename', help='Binary file containing code to be executed once immediately before the code to be benchmarked.')
parser.add_argument('-code_one_time_init', metavar='filename', help='Binary file containing code to be executed once before the first measurement.')
parser.add_argument('-cpu', metavar='n', help='Pins the measurement thread to CPU n.')
parser.add_argument('-config', metavar='filename', help='File with performance counter event specifications.', required=True)
parser.add_argument('-unroll_count', metavar='n', help='Number of copies of the benchmark code inside the inner loop.', default=1)
parser.add_argument('-loop_count', metavar='n', help='Number of iterations of the inner loop.')
parser.add_argument('-n_measurements', metavar='n', help='Number of times the measurements are repeated.')
parser.add_argument('-warm_up_count', metavar='n', help='Number of runs before the first measurement gets recorded.')
parser.add_argument('-initial_warm_up_count', metavar='n', help='Number of runs before any measurement is performed.')
parser.add_argument('-alignment_offset', metavar='n', help='Alignment offset.')
parser.add_argument('-avg', action='store_const', const='avg', help='Selects the arithmetic mean (excluding the top and bottom 20%% of the values) as the '
'aggregate function.')
parser.add_argument('-median', action='store_const', const='med', help='Selects the median as the aggregate function.')
parser.add_argument('-min', action='store_const', const='min', help='Selects the minimum as the aggregate function.')
parser.add_argument('-max', action='store_const', const='max', help='Selects the maximum as the aggregate function.')
parser.add_argument('-no_mem', action='store_true', help='The code for reading the perf. ctrs. does not make memory accesses.')
parser.add_argument('-remove_empty_events', action='store_true', help='Removes events from the output that did not occur.')
parser.add_argument('-verbose', action='store_true', help='Outputs the results of all performance counter readings.')
args = parser.parse_args()
uArch = micro_arch(CPUID())
detP23 = (uArch in ['SNB', 'IVB', 'HSW', 'BDW', 'SKL', 'SKX', 'CLX', 'KBL', 'CFL', 'CNL'])
setNanoBenchParameters(basicMode=True, drainFrontend=True)
setNanoBenchParameters(config=readFile(args.config),
unrollCount=args.unroll_count,
loopCount=args.loop_count,
nMeasurements=args.n_measurements,
warmUpCount=args.warm_up_count,
initialWarmUpCount=args.initial_warm_up_count,
alignmentOffset=args.alignment_offset,
aggregateFunction=(args.avg or args.median or args.min or args.max or 'med'),
noMem=args.no_mem,
verbose=args.verbose,
endToEnd=args.end_to_end)
nbDict = runNanoBenchCycleByCycle(code=args.asm, codeBinFile=args.code,
init=args.asm_init, initBinFile=args.code_init,
lateInit=args.asm_late_init, lateInitBinFile=args.code_late_init,
oneTimeInit=args.asm_one_time_init, oneTimeInitBinFile=args.code_one_time_init,
cpu=args.cpu, detP23=detP23)
if nbDict is None:
print('Error: nanoBench did not return a valid result.', file=sys.stderr)
if not args.end_to_end:
print('Try using the -end_to_end option.', file=sys.stderr)
exit(1)
if (uArch in ['TGL', 'RKL']) and (not args.end_to_end):
# on TGL and RKL, the wrmsr instruction sometimes appears to need an extra cycle
print('Note: If the results look incorrect, try using the -end_to_end option.', file=sys.stderr)
if args.remove_empty_events:
for k in list(nbDict.keys()):
if max(nbDict[k]) == 0:
del nbDict[k]
if args.csv is not None:
csvString = '\n'.join(k + ',' + ','.join(map(str, v)) for k, v in nbDict.items())
if args.csv:
with open(args.csv, 'w') as f:
f.write(csvString + '\n')
os.chown(args.csv, int(os.environ['SUDO_UID']), int(os.environ['SUDO_GID']))
else:
print(csvString)
if (args.html is not None) or (args.csv is None):
from plotly.offline import plot
import plotly.graph_objects as go
fig = go.Figure()
fig.update_xaxes(title_text='Cycle')
for name, values in nbDict.items():
fig.add_trace(go.Scatter(y=values, mode='lines+markers', line_shape='linear', name=name, marker_size=5, hoverlabel = dict(namelength = -1)))
config = {'displayModeBar': True,
'modeBarButtonsToRemove': ['autoScale2d', 'select2d', 'lasso2d'],
'modeBarButtonsToAdd': ['toggleSpikelines', 'hoverclosest', 'hovercompare',
{'name': 'Toggle interpolation mode', 'icon': 'iconJS', 'click': 'interpolationJS'}]}
body = plot(fig, include_plotlyjs='cdn', output_type='div', config=config)
body = body.replace('"iconJS"', 'Plotly.Icons.drawline')
body = body.replace('"interpolationJS"', 'function (gd) {Plotly.restyle(gd, "line.shape", gd.data[0].line.shape == "hv" ? "linear" : "hv")}')
cmdLine = ' '.join(('"'+p+'"' if ((' ' in p) or (';' in p)) else p) for p in sys.argv)
body += '<p><code>sudo ' + cmdLine + '</code></p>'
htmlFilename = args.html or 'graph.html'
writeHtmlFile(htmlFilename, 'Graph', '', body, includeDOCTYPE=False) # if DOCTYPE is included, scaling doesn't work properly
os.chown(htmlFilename, int(os.environ['SUDO_UID']), int(os.environ['SUDO_GID']))
print('Output written to ' + htmlFilename)
if __name__ == "__main__":
main()
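The CSV output built above has one line per event, of the form `name,value,value,...` with one aggregated value per measured cycle. A minimal parser sketch for reading such output back (assuming event names contain no commas):

```python
# Minimal sketch: parse the CSV produced by cycleByCycle.py back into a dict
# mapping each event name to its list of per-cycle values.
def parse_cycle_csv(csv_string):
    result = {}
    for line in csv_string.strip().split('\n'):
        name, *values = line.split(',')
        result[name] = [float(v) for v in values]
    return result

# hypothetical output lines, for illustration only
example = 'UOPS_ISSUED.ANY,0,4,4,2\nINST_RETIRED.ANY,0,0,1,2'
print(parse_cycle_csv(example))
```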


@@ -68,6 +68,10 @@ unsigned long kallsyms_lookup_name(const char* name) {
// 4 Mb is the maximum that kmalloc supports on my machines
#define KMALLOC_MAX (4*1024*1024)
// If enabled, for cycle-by-cycle measurements, the output includes all of the measurement overhead; otherwise, only the cycles between adding the first
// instruction of the benchmark to the IDQ, and retiring the last instruction of the benchmark are considered.
int end_to_end = false;
char* runtime_code_base = NULL;
size_t code_offset = 0;
@@ -299,6 +303,15 @@ static ssize_t alignment_offset_store(struct kobject *kobj, struct kobj_attribut
}
static struct kobj_attribute alignment_offset_attribute =__ATTR(alignment_offset, 0660, alignment_offset_show, alignment_offset_store);
static ssize_t end_to_end_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) {
return sprintf(buf, "%d\n", end_to_end);
}
static ssize_t end_to_end_store(struct kobject *kobj, struct kobj_attribute *attr, const char *buf, size_t count) {
sscanf(buf, "%d", &end_to_end);
return count;
}
static struct kobj_attribute end_to_end_attribute =__ATTR(end_to_end, 0660, end_to_end_show, end_to_end_store);
static ssize_t drain_frontend_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) {
return sprintf(buf, "%d\n", drain_frontend);
}
@@ -486,6 +499,8 @@ static ssize_t reset_show(struct kobject *kobj, struct kobj_attribute *attr, cha
alignment_offset = ALIGNMENT_OFFSET_DEFAULT;
drain_frontend = DRAIN_FRONTEND_DEFAULT;
end_to_end = false;
code_init_length = 0;
code_late_init_length = 0;
code_one_time_init_length = 0;
@@ -731,11 +746,244 @@ static int run_nanoBench(struct seq_file *m, void *v) {
return 0;
}
static int open_nanoBench(struct inode *inode, struct file *file) {
return single_open_size(file, run_nanoBench, NULL, (n_pfc_configs+4*use_fixed_counters)*128);
// Unlike with run_experiment(), create_runtime_code() needs to be called before calling run_experiment_with_freeze_on_PMI().
// If n_used_counters is > 0, the programmable counters from 0 to n_used_counters-1 are read; otherwise, the fixed counters are read.
// pmi_counter: 0-2: fixed counters, 3-n: programmable counters
// pmi_counter_val: value that is written to pmi_counter before each measurement
static void run_experiment_with_freeze_on_PMI(int64_t* results[], int n_used_counters, int pmi_counter, uint64_t pmi_counter_val) {
if (pmi_counter <= 2) {
set_bit_in_msr(MSR_IA32_FIXED_CTR_CTRL, pmi_counter*4 + 3);
} else {
set_bit_in_msr(MSR_IA32_PERFEVTSEL0 + (pmi_counter - 3), 20);
}
for (long ri=-warm_up_count; ri<n_measurements; ri++) {
disable_perf_ctrs_globally();
clear_perf_counters();
clear_overflow_status_bits();
if (pmi_counter <= 2) {
write_msr(MSR_IA32_FIXED_CTR0 + pmi_counter, pmi_counter_val);
} else {
write_msr(MSR_IA32_PMC0 + (pmi_counter - 3), pmi_counter_val);
}
((void(*)(void))runtime_code)();
if (n_used_counters > 0) {
for (int c=0; c<n_used_counters; c++) {
results[c][max(0L, ri)] = read_pmc(c);
}
} else {
for (int c=0; c<3; c++) {
results[c][max(0L, ri)] = read_pmc(0x40000000 + c);
}
}
}
if (pmi_counter <= 2) {
clear_bit_in_msr(MSR_IA32_FIXED_CTR_CTRL, pmi_counter*4 + 3);
} else {
clear_bit_in_msr(MSR_IA32_PERFEVTSEL0 + (pmi_counter - 3), 20);
}
}
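The counter-selection logic at the top and bottom of `run_experiment_with_freeze_on_PMI()` can be summarized as follows; this is a sketch mirroring the C code, with MSR addresses taken from the Intel SDM rather than from this source:

```python
# Which MSR/bit run_experiment_with_freeze_on_PMI() sets to enable the PMI
# for the selected counter (pmi_counter 0-2: fixed, 3-n: programmable).
MSR_IA32_PERFEVTSEL0    = 0x186  # bit 20 = INT (PMI on overflow); SDM value
MSR_IA32_FIXED_CTR_CTRL = 0x38D  # bit 4*i+3 = PMI enable for fixed ctr i; SDM value

def pmi_enable_bit(pmi_counter):
    if pmi_counter <= 2:  # fixed counter: 4-bit control field per counter
        return (MSR_IA32_FIXED_CTR_CTRL, pmi_counter * 4 + 3)
    else:                 # programmable counter: INT bit in its PERFEVTSEL MSR
        return (MSR_IA32_PERFEVTSEL0 + (pmi_counter - 3), 20)
```

The same (msr, bit) pair is set before the measurement loop and cleared after it.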
// since 5.6 the struct for fileops has changed
static uint64_t get_max_FF_ctr_value(void) {
return ((uint64_t)1 << Intel_FF_ctr_width) - 1;
}
static uint64_t get_max_programmable_ctr_value(void) {
return ((uint64_t)1 << Intel_programmable_ctr_width) - 1;
}
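These helpers feed the `get_max_..._ctr_value() - N` pattern used throughout: presetting a w-bit counter to max−N makes it reach its maximum after N further increments, so the next increment overflows and triggers the freeze. A sketch of that arithmetic, assuming a 48-bit counter width (the actual widths are read from CPUID):

```python
def max_ctr_value(width):
    # mirrors get_max_FF_ctr_value() / get_max_programmable_ctr_value()
    return (1 << width) - 1

WIDTH = 48                           # assumed width, for illustration
n = 100                              # freeze ~n events into the benchmark
preset = max_ctr_value(WIDTH) - n    # value written before the measurement

assert preset + n == max_ctr_value(WIDTH)    # counter reaches max after n increments
assert (preset + n + 1) % (1 << WIDTH) == 0  # the next increment overflows -> PMI
```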
static uint64_t get_end_to_end_cycles(void) {
run_experiment_with_freeze_on_PMI(measurement_results, 0, 0, 0);
uint64_t cycles = get_aggregate_value(measurement_results[FIXED_CTR_CORE_CYCLES], n_measurements, 1);
print_verbose("End-to-end cycles: %llu\n", cycles);
return cycles;
}
static uint64_t get_end_to_end_retired(void) {
run_experiment_with_freeze_on_PMI(measurement_results, 0, 0, 0);
uint64_t retired = get_aggregate_value(measurement_results[FIXED_CTR_INST_RETIRED], n_measurements, 1);
print_verbose("End-to-end retired instructions: %llu\n", retired);
return retired;
}
// Returns the cycle with which the fixed cycle counter has to be programmed such that the programmable counters are frozen immediately after retiring the last
// instruction of the benchmark (if include_lfence is true, after retiring the lfence instruction that follows the code of the benchmark).
static uint64_t get_cycle_last_retired(bool include_lfence) {
uint64_t perfevtsel2 = (uint64_t)0xC0 | (1ULL << 17) | (1ULL << 22); // Instructions retired
// we use counter 2 here, because the counters 0 and 1 do not freeze at the same time on some microarchitectures
write_msr(MSR_IA32_PERFEVTSEL0+2, perfevtsel2);
uint64_t last_applicable_instr = get_end_to_end_retired() - 258 + include_lfence;
run_experiment_with_freeze_on_PMI(measurement_results, 0, 3 + 2, get_max_programmable_ctr_value() - last_applicable_instr);
uint64_t time_to_last_retired = get_aggregate_value(measurement_results[1], n_measurements, 1);
// The counters freeze a few cycles after an overflow happens; additionally the programmable and fixed counters do not freeze (or do not start) at exactly
// the same time. In the following, we search for the value that we have to write to the fixed counter such that the programmable counters stop immediately
// after the last applicable instruction is retired.
uint64_t cycle_last_retired = 0;
for (int64_t cycle=time_to_last_retired; cycle>=0; cycle--) {
run_experiment_with_freeze_on_PMI(measurement_results, 3, FIXED_CTR_CORE_CYCLES, get_max_FF_ctr_value() - cycle);
if (get_aggregate_value(measurement_results[2], n_measurements, 1) < last_applicable_instr) {
cycle_last_retired = cycle+1;
break;
}
}
print_verbose("Last instruction of benchmark retired in cycle: %llu\n", cycle_last_retired);
return cycle_last_retired;
}
// Returns the cycle with which the fixed cycle counter has to be programmed such that the programmable counters are frozen in the cycle in which the first
// instruction of the benchmark is added to the IDQ.
static uint64_t get_cycle_first_added_to_IDQ(uint64_t cycle_last_retired_empty) {
uint64_t perfevtsel2 = (uint64_t)0x79 | ((uint64_t)0x04 << 8) | (1ULL << 22) | (1ULL << 17); // IDQ.MITE_UOPS
write_msr(MSR_IA32_PERFEVTSEL0+2, perfevtsel2);
uint64_t cycle_first_added_to_IDQ = 0;
uint64_t prev_uops = 0;
for (int64_t cycle=cycle_last_retired_empty-3; cycle>=0; cycle--) {
run_experiment_with_freeze_on_PMI(measurement_results, 3, FIXED_CTR_CORE_CYCLES, get_max_FF_ctr_value() - cycle);
uint64_t uops = get_aggregate_value(measurement_results[2], n_measurements, 1);
if ((prev_uops != 0) && (prev_uops - uops > 1)) {
cycle_first_added_to_IDQ = cycle + 1;
break;
}
prev_uops = uops;
}
print_verbose("First instruction added to IDQ in cycle: %llu\n", cycle_first_added_to_IDQ);
return cycle_first_added_to_IDQ;
}
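The backward search in `get_cycle_first_added_to_IDQ()` can be sketched as follows, with hypothetical cumulative uop counts: walking down from the end of the empty-benchmark run, the first cycle at which the measured `IDQ.MITE_UOPS` count drops by more than 1 relative to the next-later cycle marks the cycle in which the benchmark's instructions started entering the IDQ.

```python
# Sketch of the search loop in get_cycle_first_added_to_IDQ().
# uops_at_cycle[c] is the (simulated) cumulative uop count when frozen at cycle c.
def find_first_idq_cycle(uops_at_cycle, start_cycle):
    prev_uops = 0
    for cycle in range(start_cycle, -1, -1):
        uops = uops_at_cycle[cycle]
        if prev_uops != 0 and prev_uops - uops > 1:
            return cycle + 1          # more than one uop was added in cycle+1
        prev_uops = uops
    return 0

counts = [0, 0, 1, 5, 9, 10, 11]      # hypothetical measurements
print(find_first_idq_cycle(counts, len(counts) - 1))
```

As in the C code, the search returns the latest cycle in which more than one uop was added; a threshold of one tolerates single stray uops from the surrounding measurement code.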
// Programs the fixed cycle counter such that it overflows in the specified cycle, runs the benchmark,
// and stores the measurements of the programmable counters in results.
static void perform_measurements_for_cycle(uint64_t cycle, uint64_t* results) {
// on several microarchitectures, the counters 0 or 1 do not freeze at the same time as the other counters
int avoid_counters = 0;
if (displ_model == 0x97) { // Alder Lake
avoid_counters = (1 << 0);
} else if ((Intel_perf_mon_ver >= 3) && (Intel_perf_mon_ver <= 4) && (displ_model >= 0x3A)) {
avoid_counters = (1 << 1);
}
// the higher counters don't count some of the events properly (e.g., D1.01 on RKL)
int n_used_counters = 4;
size_t next_pfc_config = 0;
while (next_pfc_config < n_pfc_configs) {
size_t cur_pfc_config = next_pfc_config;
char* pfc_descriptions[MAX_PROGRAMMABLE_COUNTERS] = {0};
next_pfc_config = configure_perf_ctrs_programmable(next_pfc_config, true, true, n_used_counters, avoid_counters, pfc_descriptions);
run_experiment_with_freeze_on_PMI(measurement_results, n_used_counters, FIXED_CTR_CORE_CYCLES, get_max_FF_ctr_value() - cycle);
for (size_t c=0; c<n_used_counters; c++) {
if (pfc_descriptions[c]) {
results[cur_pfc_config] = get_aggregate_value(measurement_results[c], n_measurements, 1);
cur_pfc_config++;
}
}
}
}
static int run_nanoBench_cycle_by_cycle(struct seq_file *m, void *v) {
if (is_AMD_CPU) {
pr_err("Cycle-by-cycle measurements are not supported on AMD CPUs\n");
return -1;
}
if (n_programmable_counters < 4) {
pr_err("Cycle-by-cycle measurements require at least four programmable counters\n");
return -1;
}
if (!check_memory_allocations()) {
return -1;
}
kernel_fpu_begin();
disable_interrupts_preemption();
clear_perf_counter_configurations();
enable_freeze_on_PMI();
configure_perf_ctrs_FF_Intel(0, 1);
char* measurement_template;
if (no_mem) {
measurement_template = (char*)&measurement_cycleByCycle_template_Intel_noMem;
} else {
measurement_template = (char*)&measurement_cycleByCycle_template_Intel;
}
create_runtime_code(measurement_template, 0, 0); // empty benchmark
uint64_t cycle_last_retired_empty = get_cycle_last_retired(false);
uint64_t* results_empty = vmalloc(sizeof(uint64_t[n_pfc_configs]));
perform_measurements_for_cycle(cycle_last_retired_empty, results_empty);
uint64_t cycle_last_retired_empty_with_lfence = get_cycle_last_retired(true);
uint64_t* results_empty_with_lfence = vmalloc(sizeof(uint64_t[n_pfc_configs]));
perform_measurements_for_cycle(cycle_last_retired_empty_with_lfence, results_empty_with_lfence);
uint64_t first_cycle = 0;
uint64_t last_cycle = 0;
if (!end_to_end) {
first_cycle = get_cycle_first_added_to_IDQ(cycle_last_retired_empty);
}
create_runtime_code(measurement_template, unroll_count, loop_count);
if (end_to_end) {
last_cycle = get_end_to_end_cycles();
} else {
// Here, we take the cycle after retiring the lfence instruction because some uops of the lfence might retire in the same cycle
// as the last instruction of the benchmark; this way it is easier to determine the correct count for the number of retired uops.
last_cycle = get_cycle_last_retired(true);
}
uint64_t (*results)[n_pfc_configs] = vmalloc(sizeof(uint64_t[last_cycle+1][n_pfc_configs]));
for (uint64_t cycle=first_cycle; cycle<=last_cycle; cycle++) {
perform_measurements_for_cycle(cycle, results[cycle]);
}
disable_perf_ctrs_globally();
disable_freeze_on_PMI();
clear_overflow_status_bits();
clear_perf_counter_configurations();
restore_interrupts_preemption();
kernel_fpu_end();
for (size_t i=0; i<n_pfc_configs; i++) {
seq_printf(m, "%s", pfc_configs[i].description);
seq_printf(m, ",%lld", results_empty[i]);
seq_printf(m, ",%lld", results_empty_with_lfence[i]);
for (long cycle=first_cycle; cycle<=last_cycle; cycle++) {
seq_printf(m, ",%lld", results[cycle][i]);
}
seq_printf(m, "\n");
}
vfree(results_empty);
vfree(results_empty_with_lfence);
vfree(results);
return 0;
}
static int open_nanoBench(struct inode *inode, struct file *file) {
return single_open_size(file, run_nanoBench, NULL, (n_pfc_configs + n_msr_configs + 4*use_fixed_counters) * 128);
}
static int open_nanoBenchCycleByCycle(struct inode *inode, struct file *file) {
return single_open_size(file, run_nanoBench_cycle_by_cycle, NULL, n_pfc_configs * 4096);
}
// in kernel 5.6, the struct for fileops has changed
#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 6, 0)
static const struct proc_ops proc_file_fops_nanoBench = {
.proc_lseek = seq_lseek,
@@ -743,6 +991,12 @@ static const struct proc_ops proc_file_fops_nanoBench = {
.proc_read = seq_read,
.proc_release = single_release,
};
static const struct proc_ops proc_file_fops_nanoBenchCycleByCycle = {
.proc_lseek = seq_lseek,
.proc_open = open_nanoBenchCycleByCycle,
.proc_read = seq_read,
.proc_release = single_release,
};
#else
static const struct file_operations proc_file_fops_nanoBench = {
.llseek = seq_lseek,
@@ -751,6 +1005,13 @@ static const struct file_operations proc_file_fops_nanoBench = {
.read = seq_read,
.release = single_release,
};
static const struct file_operations proc_file_fops_nanoBenchCycleByCycle = {
.llseek = seq_lseek,
.open = open_nanoBenchCycleByCycle,
.owner = THIS_MODULE,
.read = seq_read,
.release = single_release,
};
#endif
static struct kobject* nb_kobject;
@@ -825,6 +1086,7 @@ static int __init nb_init(void) {
error |= sysfs_create_file(nb_kobject, &warm_up_attribute.attr);
error |= sysfs_create_file(nb_kobject, &initial_warm_up_attribute.attr);
error |= sysfs_create_file(nb_kobject, &alignment_offset_attribute.attr);
error |= sysfs_create_file(nb_kobject, &end_to_end_attribute.attr);
error |= sysfs_create_file(nb_kobject, &drain_frontend_attribute.attr);
error |= sysfs_create_file(nb_kobject, &agg_attribute.attr);
error |= sysfs_create_file(nb_kobject, &basic_mode_attribute.attr);
@@ -842,7 +1104,8 @@ static int __init nb_init(void) {
}
struct proc_dir_entry* proc_file_entry = proc_create("nanoBench", 0, NULL, &proc_file_fops_nanoBench);
if(proc_file_entry == NULL) {
struct proc_dir_entry* proc_file_entry2 = proc_create("nanoBenchCycleByCycle", 0, NULL, &proc_file_fops_nanoBenchCycleByCycle);
if(proc_file_entry == NULL || proc_file_entry2 == NULL) {
pr_err("failed to create file in /proc/\n");
return -1;
}
@@ -883,6 +1146,7 @@ static void __exit nb_exit(void) {
kobject_put(nb_kobject);
remove_proc_entry("nanoBench", NULL);
remove_proc_entry("nanoBenchCycleByCycle", NULL);
}
module_init(nb_init);

kernelNanoBench.py Executable file → Normal file

@@ -1,9 +1,11 @@
import atexit
import collections
import os
import subprocess
import sys
from collections import OrderedDict
from shutil import copyfile
PFC_START_ASM = '.quad 0xE0B513B1C2813F04'
PFC_STOP_ASM = '.quad 0xF0B513B1C2813F04'
@@ -24,7 +26,7 @@ def assemble(code, objFile, asmFile='/tmp/ramdisk/asm.s'):
code = code.replace('|13', '.byte 0x66,0x66,0x66,0x66,0x2e,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00;')
code = code.replace('|12', '.byte 0x66,0x66,0x66,0x2e,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00;')
code = code.replace('|11', '.byte 0x66,0x66,0x2e,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00;')
code = code.replace('|10', 'byte 0x66,0x2e,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00;')
code = code.replace('|10', '.byte 0x66,0x2e,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00;')
code = code.replace('|9', '.byte 0x66,0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00;')
code = code.replace('|8', '.byte 0x0f,0x1f,0x84,0x00,0x00,0x00,0x00,0x00;')
code = code.replace('|7', '.byte 0x0f,0x1f,0x80,0x00,0x00,0x00,0x00;')
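The `|n` placeholders above expand to multi-byte NOP encodings; a quick self-check that each expansion really is n bytes long (byte sequences copied from the replacements above):

```python
# Each '|n' placeholder in kernelNanoBench.py expands to an n-byte NOP encoding.
nops = {
    13: [0x66, 0x66, 0x66, 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00],
    12: [0x66, 0x66, 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00],
    11: [0x66, 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00],
    10: [0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00],
    9:  [0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00],
    8:  [0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00],
    7:  [0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00],
}
for n, enc in nops.items():
    assert len(enc) == n  # declared length matches the encoding
print('all NOP encodings have the declared length')
```

This is also why the `|10` typo fixed in this commit mattered: without the leading `.`, the assembler would have rejected `byte 0x66,...` instead of emitting a 10-byte NOP.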
@@ -53,12 +55,17 @@ def objcopy(sourceFile, targetFile):
exit(1)
def filecopy(sourceFile, targetFile):
try:
subprocess.check_call(['cp', sourceFile, targetFile])
except subprocess.CalledProcessError as e:
sys.stderr.write("Error (cp): " + str(e))
exit(1)
def createBinaryFile(targetFile, asm=None, objFile=None, binFile=None):
if asm:
objFile = '/tmp/ramdisk/tmp.o'
assemble(asm, objFile)
if objFile is not None:
objcopy(objFile, targetFile)
return True
if binFile is not None:
copyfile(binFile, targetFile)
return True
return False
# Returns the size in bytes.
@@ -87,7 +94,7 @@ paramDict = dict()
# Otherwise, reset() needs to be called first.
def setNanoBenchParameters(config=None, configFile=None, msrConfig=None, msrConfigFile=None, fixedCounters=None, nMeasurements=None, unrollCount=None,
loopCount=None, warmUpCount=None, initialWarmUpCount=None, alignmentOffset=None, codeOffset=None, drainFrontend=None,
aggregateFunction=None, basicMode=None, noMem=None, noNormalization=None, verbose=None):
aggregateFunction=None, basicMode=None, noMem=None, noNormalization=None, verbose=None, endToEnd=None):
if config is not None:
if paramDict.get('config', None) != config:
configFile = '/tmp/ramdisk/config'
@@ -174,75 +181,176 @@ def setNanoBenchParameters(config=None, configFile=None, msrConfig=None, msrConf
writeFile('/sys/nb/verbose', str(int(verbose)))
paramDict['verbose'] = verbose
if endToEnd is not None:
if paramDict.get('endToEnd', None) != endToEnd:
writeFile('/sys/nb/end_to_end', str(int(endToEnd)))
paramDict['endToEnd'] = endToEnd
def resetNanoBench():
with open('/sys/nb/reset') as resetFile: resetFile.read()
paramDict.clear()
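`setNanoBenchParameters` only touches the `/sys/nb/...` files when a value actually changed, using `paramDict` as a cache; `resetNanoBench` clears that cache so the next call rewrites everything. A toy sketch of this write-avoidance pattern (hypothetical names; a list stands in for the sysfs writes):

```python
# Sketch of the caching pattern used by setNanoBenchParameters: a sysfs
# file is only rewritten when the cached value differs.
paramCache = {}
writes = []  # stands in for writeFile('/sys/nb/...', ...)

def setParam(name, value):
    if value is not None and paramCache.get(name) != value:
        writes.append((name, value))
        paramCache[name] = value
```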
def _getNanoBenchOutput(procFile, code, codeObjFile, codeBinFile,
init, initObjFile, initBinFile,
lateInit, lateInitObjFile, lateInitBinFile,
oneTimeInit, oneTimeInitObjFile, oneTimeInitBinFile, cpu, detP23):
with open('/sys/nb/clear') as clearFile: clearFile.read()
tmpCodeBinFile = '/tmp/ramdisk/code.bin'
if createBinaryFile(tmpCodeBinFile, code, codeObjFile, codeBinFile):
writeFile('/sys/nb/code', tmpCodeBinFile)
tmpInitBinFiles = []
if detP23:
tmpP23BinFile = '/tmp/ramdisk/p23.bin'
tmpInitBinFiles.append(tmpP23BinFile)
createBinaryFile(tmpP23BinFile, asm=detP23Asm)
tmpInitMainBinFile = '/tmp/ramdisk/init_main.bin'
if createBinaryFile(tmpInitMainBinFile, init, initObjFile, initBinFile):
tmpInitBinFiles.append(tmpInitMainBinFile)
if tmpInitBinFiles:
tmpInitBinFile = '/tmp/ramdisk/init.bin'
with open(tmpInitBinFile, 'wb') as initBin:
for filename in tmpInitBinFiles:
with open(filename, 'rb') as f:
initBin.write(f.read())
writeFile('/sys/nb/init', tmpInitBinFile)
tmpLateInitBinFile = '/tmp/ramdisk/late_init.bin'
if createBinaryFile(tmpLateInitBinFile, lateInit, lateInitObjFile, lateInitBinFile):
writeFile('/sys/nb/late_init', tmpLateInitBinFile)
tmpOneTimeInitBinFile = '/tmp/ramdisk/one_time_init.bin'
if createBinaryFile(tmpOneTimeInitBinFile, oneTimeInit, oneTimeInitObjFile, oneTimeInitBinFile):
writeFile('/sys/nb/one_time_init', tmpOneTimeInitBinFile)
try:
if cpu is None:
output = readFile(procFile)
else:
output = subprocess.check_output(['taskset', '-c', str(cpu), 'cat', procFile]).decode()
except Exception as e:
print('nanoBench failed; details might be available from dmesg', file=sys.stderr)
sys.exit()
return output
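Note the order in which `_getNanoBenchOutput` builds `init.bin`: when `detP23` is set, the port-2/3 detection snippet is concatenated in front of the user-supplied init code, so it runs first in every measurement. A sketch of just that concatenation (hypothetical helper; byte strings stand in for the temporary files):

```python
# Sketch of init.bin assembly: the detP23 snippet, if any, precedes
# the user's init code.
def buildInit(userInit, detP23Blob=None):
    parts = []
    if detP23Blob is not None:
        parts.append(detP23Blob)
    if userInit:
        parts.append(userInit)
    return b''.join(parts)
```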
# At most one of code, codeObjFile, and codeBinFile may be specified (likewise for the init, lateInit, and oneTimeInit variants)
def runNanoBench(code='', codeObjFile=None, codeBinFile=None,
init='', initObjFile=None, initBinFile=None,
lateInit='', lateInitObjFile=None, lateInitBinFile=None,
oneTimeInit='', oneTimeInitObjFile=None, oneTimeInitBinFile=None, cpu=None, detP23=False):
output = _getNanoBenchOutput('/proc/nanoBench', code, codeObjFile, codeBinFile,
init, initObjFile, initBinFile,
lateInit, lateInitObjFile, lateInitBinFile,
oneTimeInit, oneTimeInitObjFile, oneTimeInitBinFile, cpu, detP23)
ret = OrderedDict()
for line in output.split('\n'):
if ':' not in line: continue
lineSplit = line.split(':')
counter = lineSplit[0].strip()
value = float(lineSplit[1].strip())
ret[counter] = value
return ret
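The loop above turns the `counter: value` lines emitted by `/proc/nanoBench` into an ordered dict. The same parsing as a standalone sketch, exercised on a made-up sample output:

```python
from collections import OrderedDict

def parseOutput(output):
    # Same "counter: value" parsing as in runNanoBench, as a pure function.
    ret = OrderedDict()
    for line in output.split('\n'):
        if ':' not in line:
            continue
        lineSplit = line.split(':')
        ret[lineSplit[0].strip()] = float(lineSplit[1].strip())
    return ret
```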
# At most one of code, codeObjFile, and codeBinFile may be specified (likewise for the init, lateInit, and oneTimeInit variants)
def runNanoBenchCycleByCycle(code='', codeObjFile=None, codeBinFile=None,
init='', initObjFile=None, initBinFile=None,
lateInit='', lateInitObjFile=None, lateInitBinFile=None,
oneTimeInit='', oneTimeInitObjFile=None, oneTimeInitBinFile=None, cpu=None, detP23=False):
prevConfig = paramDict.get('config', '')
if not paramDict.get('endToEnd'):
curConfig = prevConfig + '\n'
curConfig += '79.30 IDQ.MS_UOPS_internal\n'
curConfig += 'C0.00 INST_RETIRED_internal\n'
setNanoBenchParameters(config=curConfig)
output = _getNanoBenchOutput('/proc/nanoBenchCycleByCycle', code, codeObjFile, codeBinFile,
init, initObjFile, initBinFile,
lateInit, lateInitObjFile, lateInitBinFile,
oneTimeInit, oneTimeInitObjFile, oneTimeInitBinFile, cpu, detP23)
if not paramDict.get('endToEnd'):
setNanoBenchParameters(config=prevConfig)
nbDict = OrderedDict()
for line in output.split('\n'):
if ',' not in line: continue
lineSplit = line.split(',')
counter = lineSplit[0].strip()
valueEmpty = int(lineSplit[1])
valueEmptyWithLfence = int(lineSplit[2])
values = [int(v) for v in lineSplit[3:]]
nbDict[counter] = (valueEmpty, valueEmptyWithLfence, values)
if paramDict.get('verbose'):
print('\n'.join((k + ': ' + str(v)) for k, v in nbDict.items()))
if paramDict.get('endToEnd'):
return OrderedDict((k, v) for k, (_, _, v) in nbDict.items() if "_internal" not in k)
else:
instRetired = nbDict['INST_RETIRED_internal'][2]
if len(instRetired) < 3:
return None
if (instRetired[-1] == instRetired[-2]) or (instRetired[-2] != instRetired[-3]):
return None
cycleLastInstrRetired = min(i for i, v in enumerate(instRetired) if v == instRetired[-2])
msUops = nbDict['IDQ.MS_UOPS_internal'][2]
cycleOfLfenceUop = max((i for i, v in enumerate(msUops) if v < msUops[-1] and msUops[i] == msUops[i+1]), default=None)
if cycleOfLfenceUop is None:
return None
result = OrderedDict()
for k, (valueEmpty, valueEmptyWithLfence, values) in nbDict.items():
if "_internal" in k: continue
leftMin = values[0]
rightMax = values[-1]
if 'RETIRE' in k.upper():
leftMin = valueEmpty
if 'UOP' in k.upper():
rightMax = values[-1] - (valueEmptyWithLfence - valueEmpty)
else:
rightMax = values[cycleLastInstrRetired]
elif 'ISSUE' in k.upper():
rightMax = values[cycleLastInstrRetired-1] - (valueEmpty - values[0])
elif 'IDQ' in k:
rightMax = values[cycleOfLfenceUop - 1]
result[k] = [max(0, min(v, rightMax) - leftMin) for v in values[:cycleLastInstrRetired + 1]]
return result
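The final loop clamps each per-cycle counter value into a plausible range — `leftMin` subtracts the empty-benchmark baseline, `rightMax` cuts off overhead incurred after the benchmark code itself — and truncates the series at the cycle in which the last instruction retired. The clamping step in isolation (hypothetical helper name):

```python
def normalize(values, leftMin, rightMax, lastCycle):
    # Clamp to at most rightMax, shift down by leftMin, never go below 0,
    # and keep only the cycles up to (and including) the last retirement.
    return [max(0, min(v, rightMax) - leftMin) for v in values[:lastCycle + 1]]
```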
detP23Asm = ("push rax; push rcx; push rdx;" # save registers
"mov ecx, 0x186; rdmsr; push rax; push rdx;" # save IA32_PERFEVTSEL0
"mov ecx, 0x0C1; rdmsr; push rax; push rdx;" # save IA32_PMC0
"mov ecx, 0x38F; rdmsr; push rax; push rdx;" # save IA32_PERF_GLOBAL_CTRL
"mov ecx, 0x38F; mov eax, 0; mov edx, 0; wrmsr;" # disable all counters
"mov ecx, 0x186; mov eax, 0x4204A1; mov edx, 0; wrmsr;" # count UOPS_DISPATCHED_PORT.PORT_2 on counter 0
"mov ecx, 0x0C1; mov eax, 0; mov edx, 0; wrmsr;" # clear counter 0
"mov ecx, 0x38F; mov eax, 1; mov edx, 0; wrmsr;" # enable counter 0
"mov eax, [rsp];" # perform one memory access
"mov ecx, 0x38F; mov eax, 0; mov edx, 0; wrmsr;" # disable counter 0
"mov ecx, 0; rdpmc;" # read counter 0
"test eax, eax;"
"lfence;"
"jnz end;"
"mov eax, [rsp];" # perform another access if first access was not on port 2
"end:"
"mov ecx, 0x38F; pop rdx; pop rax; wrmsr;" # restore IA32_PERF_GLOBAL_CTRL
"mov ecx, 0x0C1; pop rdx; pop rax; wrmsr;" # restore IA32_PMC0
"mov ecx, 0x186; pop rdx; pop rax; wrmsr;" # restore IA32_PERFEVTSEL0
"pop rdx; pop rcx; pop rax;") # restore registers
def createRamdisk():
try:
subprocess.check_output('mkdir -p /tmp/ramdisk; mount -t tmpfs -o size=100M none /tmp/ramdisk/', shell=True)