The OpenMP user conference 2019 was held at the University of Edinburgh, hosted by EPCC. Over two days it featured 10 speakers in person and one remotely, for around 30 attendees.
Day 1 - Tuesday 4th June
I joined Programming Your GPU with OpenMP: A Hands-On Introduction, presented by Simon McIntosh-Smith, Professor of High Performance Computing at the University of Bristol. He started with an overview of OpenMP and the terminology of its GPU programming model: the host, the device, and the clauses used with the target construct. After that, we started the active learning by logging into Isambard, an Arm-based supercomputer. Initially, we ran a basic serial program to add vectors, then placed the #pragma omp target directive on the processing loop. Usually, the serial version took more time than the parallel job; if we see a different result, we need to check in depth why.
It was also explained how to read the output of nvprof, a profiler from the CUDA toolkit. The command used was nvprof --print-gpu-trace ./vadd. The first lines show the cost of offloading as a percentage. The transfers from the host to the device are mostly greater than the transfers from the device to the host, since the reduction calculations are done on the device and only the result comes back.
Later, the levels of parallelism were explained: the teams of threads to be distributed follow the flow target → device → compute unit → processing element, and these teams are defined for use inside the device. The jacobi_solver exercise was very useful for understanding how the teams of threads work. In this example, the data is held through pointers to fixed-size arrays of floats that need to be declared explicitly in the code.
After adding the OpenMP directives, the parallelised version ran roughly twice as fast as the serial one.
To control the memory movement, the target enter data and target exit data directives create an unstructured data region. These directives are placed around the data-swapping area, because the exchange is expensive: target enter data allocates and copies data to the device, and target exit data copies data back or deletes it.
After these changes, the optimization of the data movement is clearly noticeable: the execution time dropped to a tenth of the serial jacobi_solver's.
2. Advanced OpenMP: Performance and 5.0 Features
An interesting talk about the performance of OpenMP was given by Jim Cownie, Senior Principal Engineer at Intel Corporation in the U.K. He shared OpenMP programming knowledge, related parallel concepts, and the best practices used.
Day 2 - Wednesday 5th June
Nine talks were presented, including a remote presentation by John M. Levesque, Director of the Supercomputing Center of Excellence at Cray Inc., who explained how to read the report produced by the CrayPat tool. I would highlight two efforts from industry (pictured): Dr. Glover from the Met Office trying to accelerate UM and NEMO, and Oliver, the representative of Arm, who optimised intranode OpenMP performance.
On the academic side, the FFLUX optimisation by Benjamin Symons from the University of Manchester inspired me to make a comparison with the application I am studying. The OpenMP Parallelisation of Quantum Computing Simulators by Youssef Moawad was also impressive: he did an excellent job in his presentation (including the rainbow :)).
A panel of OpenMP experts made themselves available for questions. They are working towards OpenMP 5 and compatibility with upcoming releases such as GCC 10, collaborating to adapt this interface to different HPC architectures.
The OpenMP community in the U.K.
Thanks to Dr. Bull for the invitation that let us enhance our skills, and to Professor McIntosh-Smith. It was exciting to meet in person the author of a paper I’ve referenced.
A good event in general! It would have been nice to have a couple of workshops and 45 minutes per talk; the projects were very interesting, but some talks lasted only 15 minutes.