CPU: AMD Opteron(tm) Processor 6378
Number of cores: 32
NUMA configuration: 8x4
TODO: likwid input here
The topology information gives us a rough understanding of the expected performance. We complement this with real measurements conducted on the hardware. For this purpose, we use ''pairwise'', a micro benchmark that ping-pongs messages between every pair of cores.
The benchmark measures the send, receive, and round-trip times. The send and receive times are the time it takes until smlt_qp_send() or smlt_qp_recv() returns, respectively.
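The ping-pong measurement described above can be sketched roughly as follows. This is a minimal illustration and not the ''pairwise'' source: it does not call Smelt at all, and two shared atomic flags stand in for smlt_qp_send() and smlt_qp_recv(), whose signatures are not shown on this page. All names (pingpong_rtt_ns, echo_thread, struct channel) are hypothetical, only the round-trip time is measured, and the pinning call pthread_setaffinity_np is Linux-specific.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a message channel: one flag per direction,
 * each on its own cache line (64 bytes assumed) to avoid false sharing. */
struct channel {
    _Alignas(64) atomic_int ping;
    _Alignas(64) atomic_int pong;
};

struct echo_args { struct channel *ch; int core; int rounds; };

/* Pin the calling thread to one core. Best effort: a failure
 * (for example, a core that does not exist) is silently ignored. */
static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    (void)pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Echo side: wait for ping number i, then answer with pong number i. */
static void *echo_thread(void *p) {
    struct echo_args *a = p;
    pin_to_core(a->core);
    for (int i = 1; i <= a->rounds; i++) {
        while (atomic_load_explicit(&a->ch->ping, memory_order_acquire) != i) { /* spin */ }
        atomic_store_explicit(&a->ch->pong, i, memory_order_release);
    }
    return NULL;
}

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Mean round-trip time in nanoseconds between core_a and core_b.
 * Returns a negative value if the echo thread cannot be created.
 * Note: the calling thread stays pinned to core_a afterwards. */
double pingpong_rtt_ns(int core_a, int core_b, int rounds) {
    assert(rounds > 0);
    struct channel ch;
    atomic_init(&ch.ping, 0);
    atomic_init(&ch.pong, 0);
    struct echo_args args = { &ch, core_b, rounds };
    pthread_t t;
    if (pthread_create(&t, NULL, echo_thread, &args) != 0)
        return -1.0;
    pin_to_core(core_a);
    uint64_t start = now_ns();
    for (int i = 1; i <= rounds; i++) {
        atomic_store_explicit(&ch.ping, i, memory_order_release);
        while (atomic_load_explicit(&ch.pong, memory_order_acquire) != i) { /* spin */ }
    }
    uint64_t end = now_ns();
    pthread_join(t, NULL);
    return (double)(end - start) / rounds;
}
```

Calling pingpong_rtt_ns(a, b, rounds) for every pair (a, b) with a != b would give a latency matrix comparable in spirit to the one this benchmark produces. Build with a C11 compiler and -pthread.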
A comparison of this benchmark can be found on this page.
We now show the results of our micro benchmarks. For reference, see ''bench/ab-bench'' in the Smelt directory.
A comparison of this benchmark can be found on this page.
We now show the results of our micro benchmarks for multicasts. For reference, see ''bench/ab-bench-scale'' in the Smelt directory.
Plots shown: ab, reduction, barriers, agreement.
A comparison of this benchmark can be found on this page.
The following is a benchmark for collective operations in MPI, OpenMP and Smelt.
A comparison of this benchmark can be found on this page.
Execution of the EPCC benchmark with gcc's unmodified OpenMP compared to an instance using Smelt's barrier.
Plot shown: csv.
A comparison of this benchmark can be found on this page.
We now show the results of our micro benchmarks measuring the performance of multicasts given a round-robin vs. a filling thread-to-core allocation strategy.
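The two allocation strategies can be illustrated with a small sketch for the 8x4 NUMA layout of this machine. It rests on two assumptions that this page does not confirm: that core IDs are numbered contiguously within each NUMA node (node = core / 4), and that "filling" means occupying one node completely before moving to the next, while "round-robin" means placing consecutive threads on consecutive nodes. The function names are hypothetical and are not part of Smelt.

```c
#include <assert.h>

/* Layout taken from the machine description: 8 NUMA nodes x 4 cores. */
enum { NUM_NODES = 8, CORES_PER_NODE = 4, NUM_CORES = NUM_NODES * CORES_PER_NODE };

/* Assumption: cores are numbered contiguously within each NUMA node. */
int numa_node_of(int core) {
    assert(core >= 0 && core < NUM_CORES);
    return core / CORES_PER_NODE;
}

/* Fill: thread i runs on core i, so a node is fully occupied
 * before the next node receives its first thread. */
int core_fill(int thread) {
    assert(thread >= 0 && thread < NUM_CORES);
    return thread;
}

/* Round-robin: consecutive threads land on consecutive NUMA nodes;
 * a node gets its second thread only after every node has one. */
int core_round_robin(int thread) {
    assert(thread >= 0 && thread < NUM_CORES);
    int node = thread % NUM_NODES;
    int slot = thread / NUM_NODES;
    return node * CORES_PER_NODE + slot;
}
```

Under these assumptions, 8 threads occupy only nodes 0 and 1 with the filling strategy but one core on each of the 8 nodes with round-robin, so a multicast among them crosses more NUMA boundaries in the round-robin case.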
A comparison of this benchmark can be found on this page.
PARSEC Streamcluster solves the online clustering problem. We execute it with various barrier implementations and report the runtime.