Comprehend that benchmarking also provides the basis for answering questions that arise during tuning, e.g.:
Which task size (big vs. small) has a positive impact on the performance of my program?
Is the use of hyper-threading technology advantageous?
What is the best mapping of processes to nodes, pinning of processes/threads to CPUs or cores, and setting memory affinities to NUMA nodes in order to speed up a parallel program?
What is the best compiler selection for my program (GCC, Intel, PGI, …), in combination with the most suitable MPI environment (Open MPI, Intel MPI, …)?
What is the best compiler generation/version for my program?
What are the best compiler options, for example the optimization level (-O2, -O3, …), for building the executable program?
Is the use of PGO (Profile-Guided Optimization) or other high-level optimization, e.g. IPA/IPO (Inter-Procedural Analysis/Inter-Procedural Optimization), helpful?
What is the performance behavior after a (parallel) algorithm has been improved, i.e. to what extent are speedup, efficiency, and scalability improved?
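The last question above can be made concrete with a small calculation. The following is a minimal sketch, assuming the usual definitions of speedup S(p) = T(1) / T(p) and efficiency E(p) = S(p) / p. The function names and the runtimes are hypothetical placeholders for illustration, not measured results; in practice the values would come from your own benchmark runs before and after the algorithmic change.

```python
def speedup(t_ref, t_par):
    """Speedup S = T_ref / T_par (reference runtime over parallel runtime)."""
    return t_ref / t_par


def efficiency(t_ref, t_par, workers):
    """Parallel efficiency E = S / p, where p is the number of workers."""
    return speedup(t_ref, t_par) / workers


def scaling_table(runtimes):
    """Return (p, speedup, efficiency) tuples, using the p = 1 run as reference."""
    t_ref = runtimes[1]
    return [
        (p, speedup(t_ref, t), efficiency(t_ref, t, p))
        for p, t in sorted(runtimes.items())
    ]


# Hypothetical wall-clock times in seconds, keyed by process count.
# These are illustrative placeholders, not benchmark results.
runtimes = {1: 120.0, 2: 64.0, 4: 36.0, 8: 24.0}

for p, s, e in scaling_table(runtimes):
    print(f"{p:>3} processes: speedup {s:5.2f}, efficiency {e:6.1%}")
```

Computing such a table for both the old and the improved version of the algorithm, on the same machine and with the same inputs, shows whether the change improved scalability or only the single-process runtime.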