An OpenMP Runtime API for Profiling

Marty Itzkowitz, Oleg Mazurov, Nawal Copty, and Yuan Lin

Sun Microsystems, Inc.
16 Network Circle
Menlo Park, CA 94025

ABSTRACT

This paper describes an OpenMP Runtime API for use in profiling OpenMP applications. The paper is being submitted to the OpenMP Architecture Review Board for adoption as an official ARB White Paper. The API is designed to provide data that enables performance tools to present measurements in the user-model of OpenMP programming. The paper gives the motivation and design of the API, an example of its use, and a description of how the data provided by the API can be used to represent the performance of OpenMP programs.

The API was primarily designed to support statistical sampling of callstacks, but it also contains extensions to support tracing. It addresses two difficult problems in OpenMP performance measurement: the presentation of performance data in the user model, and the understanding of the behavior of the OpenMP runtime. The API is implemented entirely within the OpenMP runtime, and requires no compiler support. It is described generally here, and its detailed technical description is provided as an accompanying include file.

  1. Introduction

    OpenMP has become a very successful user-model for developing parallel applications, widely adopted and supported. It does not, however, provide a standardized interface that can be used for profiling. This paper describes such an interface, designed and implemented at Sun Microsystems, and the paper is being proposed for adoption by the OpenMP Architecture Review Board as an official ARB White Paper.

    OpenMP programs present a challenge for performance measurement tools because the OpenMP specification is in terms of a user-model of the behavior of a conforming application and implementation, but the specification does not specify any implementation details. Performance measurement tools typically record data in the implementation-model, and have to map that behavior back to the user-model.

    There are two main technologies used for performance measurement on OpenMP programs: tracing and statistical profiling. Each has its strengths and weaknesses.

    Tracing technologies have the advantage that they can capture all the relevant events describing the runtime behavior of the application, but the overhead of tracing can be substantial. Some operations to be traced, a loop iteration, for example, can take less time than recording the traced event. Furthermore, compiling the program to make some of the desired events traceable can impede optimization: for example, if an event is to be generated for each loop iteration, the loop cannot be unrolled, nor can the resultant instruction stream be optimized by interleaving instructions from different iterations. Traces scale poorly with increasing runtime, with no obvious mechanism for throttling the data volume; and unless all function call and return points are captured, the dynamic program graph cannot be generated.

    Many implementations of tracing methodologies have been developed, along with a rich set of data-reduction and visualization tools. See, for example, Paraver, KOJAK, CATCH, TAU, and VAMPIR, among others. One tracing methodology, POMP, was proposed as a standard for OpenMP performance measurement, but it was not adopted.

    On the other hand, statistical profiles have the advantage that they can scale very nicely to long-running programs, simply by throttling the profiling rate. By capturing complete callstacks with each event, the dynamic callgraph behavior becomes obvious. As will be discussed below, capturing callstacks for user-model OpenMP programs is a challenge. Statistical profiling has one major disadvantage: it is statistical in nature, and may miss some rare events; furthermore, if the application behavior is correlated with the profiling clock, the measurements may be misleading.

    The API proposed here was primarily intended to support statistical profiling, but, with input from various people on the OpenMP Tools email alias, it has been extended to optionally support tracing of many events. It mandates functionality in the OpenMP runtime, but requires no compiler support.

    The next section of this paper, section 2, describes a typical implementation of the OpenMP compiler and runtime, and describes the difficulties of mapping the implementation behavior to the user-model. Section 3 describes the techniques used for performance measurement of OpenMP programs; section 4 describes the proposed API, which defines a very narrow interface between the performance measurement collector and the OpenMP runtime; and section 5 describes how the data is captured using that API. Section 6 describes using the data to construct a user-model profile for OpenMP applications. The final section, 7, presents our conclusions.

  2. How OpenMP Applications Run

    The user-level execution behavior of OpenMP applications is described in the OpenMP specifications (see, for example, OpenMP Application Program Interface, Version 2.5, section 1.3). The specification does not constrain many implementation details, but the behavior of the implementation may be very important to the user's understanding of the application's performance and scalability.

    While performance measurement in general, and statistical profiling in particular, collects data in the implementation-model, that is, as the instructions are executed in the machine, the performance tools would like to present that information to the user in the context of the user-model. This section will describe the differences between the models.

    The discussion is based on the implementation details specific to the Sun Studio Compilers and Tools, but the general architecture is similar to many other implementations, for example, the Intel, IBM, and SGI implementations. In the discussion below, generic terms are used for the system-startup routine (<start>), the thread-startup routine (<thread_start>), and the OpenMP runtime library (<omp-rt>).

    As any single-threaded program runs, its callstack shows its current location, and a trace of how it got there, starting from the beginning instructions in a system-specific startup routine, which calls main, which then proceeds and calls various routines within the program. When a routine contains a loop, the program executes the code inside the loop repeatedly, until the loop exit criterion is reached. It then proceeds on to the next sequence of code, etc.

    When the program is parallelized with OpenMP (or by autoparallelization), the behavior is different. The user-model of that behavior has the main, or master, thread executing just as a single-threaded program. When it reaches a parallel loop or parallel region, additional slave threads appear, each appearing to be a clone of the master thread, with all of them executing the contents of the loop or parallel region, in parallel, each for different chunks of work. When all chunks of work are completed, all the threads are synchronized, the slave threads disappear, and the master thread proceeds.

    The implementation behavior is not so straightforward. The compiler typically extracts the code to be executed inside a parallel region or loop (or any other OpenMP construct), and makes the extracted code into an independent function, called (in Sun's implementation) an mfunction. (It may also be referred to as an outlined function, or a loop-body function.) The body of the parallel region is replaced by a call to the OpenMP runtime, passing in the address of the mfunction. The runtime then arranges for the mfunction to be called in the master and the slave threads. This difference is exposed by the tools, which will show the mfunction called from the introduced call site in the routine from which it was extracted. In the discussion below, names of mfunctions extracted from an OpenMP construct in function foo are of the form foo-OMPa, foo-OMPb, etc.
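
    The outlining transformation can be sketched schematically in C. The runtime entry point __omp_rt_run_parallel, the argument structure, and the name foo_OMPa (C identifiers cannot contain '-') are illustrative placeholders, not the actual generated code:

    /* Hypothetical runtime entry point: runs the mfunction in the master
       and the slave threads; the real name and signature are
       implementation-specific. */
    extern void __omp_rt_run_parallel(void (*mfunc)(void *), void *args);

    struct foo_args { int n; double *a; };

    /* The extracted parallel-region body: the mfunction "foo-OMPa". */
    static void foo_OMPa(void *p)
    {
        struct foo_args *args = (struct foo_args *)p;
        /* chunking and scheduling logic omitted for brevity */
        for (int i = 0; i < args->n; i++)
            args->a[i] *= 2.0;
    }

    /* What the user wrote as a parallel loop inside foo() becomes,
       conceptually, a call into the OpenMP runtime that passes the
       address of the mfunction. */
    void foo(int n, double *a)
    {
        struct foo_args args = { n, a };
        __omp_rt_run_parallel(foo_OMPa, &args);
    }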

    When a parallel region is entered (a fork event), the user-model implies that new threads magically appear, and on exit from the region (a join event), they disappear. While a runtime could very well use the UNIX fork() system call explicitly to do that work, the overhead would be prohibitive. Most implementations rely on the OpenMP runtime to create and manage the additional threads, and when those threads run, their callstacks, as seen by a profiler, do not match the user-model.

    Consider a simple OpenMP program, with a main program, calling function foo, containing a single OpenMP parallel construct, executing with four threads. Before the first entry into the parallel region, the program has only one thread, the master thread, and that thread has the same callstack in both the user-model, and the implementation-model:

    Master
    foo
    main
    <start>

    After the entry into the parallel region, there are four threads, and in the user-model, their callstacks would look like:

    Master      Slave 1     Slave 2     Slave 3
    foo-OMPa    foo-OMPa    foo-OMPa    foo-OMPa
    foo         foo         foo         foo
    main        main        main        main
    <start>     <start>     <start>     <start>

    All the leaf functions are shown as the function foo-OMPa, but the actual leaf PC would typically be different in each thread. In the implementation-model, their callstacks would look like:

    Master      Slave 1           Slave 2           Slave 3
    foo-OMPa
    <omp-rt>
    foo         foo-OMPa          foo-OMPa          foo-OMPa
    main        <omp-rt>          <omp-rt>          <omp-rt>
    <start>     <thread_start>    <thread_start>    <thread_start>

    In this representation, the function <omp-rt> represents one or more frames of execution within the OpenMP runtime library. Note that in the slave threads, there is no connection between main and, for that matter, foo, and the actual executing function, foo-OMPa.

    Around fork and join events, and around scheduling and synchronization events, the user-model callstacks and the actual implementation callstacks may diverge even more confusingly. Frames representing user code may be entirely absent, or additional calls into the OpenMP runtime may appear.

    The OpenMP runtime is responsible for many operations performed by the process: acquiring locks, executing forks and joins, scheduling iterations, synchronizing at barriers, performing reductions, and so forth. The specification does not constrain any specific implementation of these functions, neither as an algorithm nor as a program structure. However, their performance can be important to understanding the behavior and scalability of the OpenMP application, so some visibility into their behavior is needed.

    The next section of this paper describes the methodology of performance tools for OpenMP programs, the challenges to building such tools, and the data necessary to meet those challenges.

  3. Understanding OpenMP Performance

    There are two challenges to building a performance tool to help users understand OpenMP performance. The first is to reconstruct the user-model callstacks from the measured implementation-model callstacks. The second is to understand the behavior within the OpenMP runtime when it is performing operations, such as synchronizations or reductions. This section will propose solutions to those two problems, and describe the data necessary to implement the solutions.

    When an OpenMP program executes, it will generate many entries into parallel regions; each entry into a parallel region is a separate instance of that region. Knowing which instance of which parallel region a thread is executing, and knowing which thread generated the fork event marking the entry into the region, is sufficient to reconstruct user call stacks.

    Specifically, the OpenMP runtime must provide that information by assigning a unique ID to each parallel region instance, and delivering an event, as requested, to the performance tool, telling it the new ID and which thread created the instance. The tool can record the callstack of the thread generating the fork event at the time of the fork. In addition, the OpenMP runtime must support a query that provides the current parallel region ID of the calling thread, and the performance tool records that information in every profile packet. For fork events, the ID of the parent region captures nested parallelism. User-model callstacks can be synthesized from that data, as described in section 6, below.

    To solve the second challenge, the OpenMP runtime must maintain a state for each thread and must support a query that returns the current state of the calling thread.

    The functionality described above is sufficient for a statistical-sampling based profiler, but trace-based profilers need more events. The mechanism used above for the fork event can equally be used to provide other events. In collaboration with the members of the OpenMP Tools mailing list, events were defined for join, and for entry to and exit from many of the OpenMP constructs. Those events may come at significant cost in overhead and data volume, and their implementation is optional. Trace-based support will not be further described here.

  4. The Collector/OpenMP Runtime API

    The API presented here is intended to allow a performance measurement tool to garner enough information to meet those challenges. The interface is designed so that the OpenMP runtime needs to know little about the collector, and the collector needs to know little about the OpenMP runtime, giving each the freedom to evolve independently. The description here is somewhat schematic; the precise implementation specification for the API is given in the accompanying include file.

    The first problem to resolve is that of rendezvous, that is, how the OpenMP runtime and the collector can find each other. Since an OpenMP program usually runs without a collector, and a collector can run on non-OpenMP programs, each must work in the absence of the other. Rendezvous is achieved actively by the collector, and passively by the OpenMP runtime. The OpenMP runtime exports a symbol, __omp_collector_api, and the collector asks the dynamic linker if that symbol is present. If so, it determines its address, and makes an OMP_REQ_START request via that API. The OpenMP runtime does not need to do anything until and unless that request is made. It may choose to maintain states and region IDs, but it need not. All requests from the collector to the OpenMP runtime are made as subsequent requests to that routine.
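
    As an illustration, a collector might locate the entry point as follows. This is a minimal sketch assuming only the exported symbol name given above; on some systems, dlsym with RTLD_DEFAULT may require a feature-test macro:

    #include <dlfcn.h>

    typedef int (*omp_collector_api_t)(void *);

    /* Rendezvous: ask the dynamic linker whether the OpenMP runtime is
       present and exports the API.  If not, the collector proceeds as it
       would for any non-OpenMP program. */
    static omp_collector_api_t find_omp_collector_api(void)
    {
        void *sym = dlsym(RTLD_DEFAULT, "__omp_collector_api");
        return (omp_collector_api_t)sym;   /* NULL if the symbol is absent */
    }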

    The API is prototyped as:

    int __omp_collector_api( void *arg );
    The argument is a pointer to a byte array containing one or more requests. Each request has a 4-word header consisting of 32-bit words containing size, request number, error code, and return size. Each request header is followed by a character array whose size is given by the size value in the header, less the size of the header itself (16 bytes). The size must be a multiple of 4. The character array memory is used to pass parameters to the OpenMP runtime, and to hold return values from the runtime. All memory allocation is the responsibility of the caller (i.e., the collector).
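
    The request layout can be pictured with a small header structure; the declarations below are placeholders standing in for the ones defined in the accompanying include file:

    #include <stdint.h>

    typedef struct {
        uint32_t sz;     /* total request size, header plus data; a multiple of 4 */
        uint32_t r;      /* request number, e.g. OMP_REQ_START */
        uint32_t ec;     /* error code, filled in by the OpenMP runtime */
        uint32_t rsz;    /* size of the data returned in the trailing array */
    } omp_request_header;    /* 16 bytes, followed by (sz - 16) bytes of data */

    /* A request that carries no parameters, such as OMP_REQ_START, is just
       the header itself. */
    static void make_simple_request(omp_request_header *h, uint32_t req_code)
    {
        h->sz  = sizeof(*h);
        h->r   = req_code;
        h->ec  = 0;
        h->rsz = 0;
    }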

    One of the requests to that routine is an OMP_REQ_REGISTER request, that specifies an event of interest, and a callback for when that event occurs. If no register request is made, the OpenMP runtime will never generate events to the collector.

    Other requests are available: OMP_REQ_UNREGISTER, OMP_REQ_STATE, OMP_REQ_CURRENT_PRID, OMP_REQ_PARENT_PRID, OMP_REQ_STOP, OMP_REQ_PAUSE, and OMP_REQ_RESUME. Some requests are often made from signal handlers, and those requests must be safe to issue from a signal context. The include file details which ones are safe to be called from an asynchronous signal handler. Since not all requests are required, one of the possible error codes returned is that the request is not implemented.

    In the next section we will show how these requests, the data provided through them, and the generated events can be used to collect data to support constructing a complete user-model of an OpenMP application's behavior.

  5. Using the API to Collect the Data

    The above section described the API between the collector and the OpenMP runtime. This section describes how Sun's collector is implemented to use that API to record data. This represents only one use of the API; many others are possible. We will not describe here how the profiling triggers and events are handled, nor how the signal handlers are implemented, nor how the data is externalized; such issues belong to the tool developer, and not to the OpenMP API it uses.

    The collector in Sun's performance tools is a shared object that is LD_PRELOAD'ed into the target's address space. It has an init section, and that section asks the runtime linker if the OpenMP API symbol is present. If it is, the API is called with an OMP_REQ_START request, and an OMP_REQ_REGISTER request for fork events and join events. (In this usage, only the fork and join events are used.)
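
    A minimal sketch of such an init section follows. The request and event codes, the callback signature, and the packing of the OMP_REQ_REGISTER parameters are assumptions made for illustration; the real definitions come from the include file:

    #define _GNU_SOURCE 1      /* for RTLD_DEFAULT on some systems */
    #include <dlfcn.h>
    #include <stdint.h>
    #include <string.h>

    enum { REQ_START = 1, REQ_REGISTER = 2 };   /* placeholder request codes */
    enum { EV_FORK = 1, EV_JOIN = 2 };          /* placeholder event codes   */

    typedef int  (*collector_api_t)(void *);
    typedef void (*event_callback_t)(int event);   /* assumed signature */

    static collector_api_t collector_api;

    static void on_fork(int ev) { (void)ev; /* record fork data; see below */ }
    static void on_join(int ev) { (void)ev; /* record join data; see below */ }

    /* Issue an OMP_REQ_REGISTER request for one event: the event code and
       the callback address are placed in the parameter area. */
    static void register_event(int32_t event, event_callback_t cb)
    {
        uint32_t buf[8] = { sizeof(buf), REQ_REGISTER, 0, 0 };
        memcpy(&buf[4], &event, sizeof(event));
        memcpy(&buf[5], &cb, sizeof(cb));
        collector_api(buf);
    }

    /* Init section of the preloaded collector shared object. */
    __attribute__((constructor))
    static void collector_init(void)
    {
        collector_api = (collector_api_t)dlsym(RTLD_DEFAULT, "__omp_collector_api");
        if (collector_api == NULL)
            return;               /* not an OpenMP program; profile normally */

        uint32_t start_req[4] = { sizeof(start_req), REQ_START, 0, 0 };
        collector_api(start_req);

        register_event(EV_FORK, on_fork);
        register_event(EV_JOIN, on_join);
    }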

    The remainder of initialization of the collector sets up profiling triggers, either clock- or hardware-counter-based, and then allows the application to proceed, just as it would for any program, OpenMP or not.

    When a profile trigger fires, a signal is synchronously delivered to each appropriate thread. (Clock-profile signals are delivered at approximately the same time to each active thread, while HWC-profile signals can be completely uncorrelated across threads.) From the signal handler, if the application is an OpenMP application, the collector running on each thread obtains that thread's current state and parallel region ID by issuing OMP_REQ_STATE and OMP_REQ_CURRENT_PRID requests.
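
    From the signal handler, those two queries might look like the following sketch; the request codes and the layout of the returned values are placeholders (the include file identifies which requests are async-signal-safe):

    #include <stdint.h>
    #include <string.h>

    enum { REQ_STATE = 3, REQ_CURRENT_PRID = 4 };   /* placeholder codes */

    extern int (*collector_api)(void *);   /* located at rendezvous time */

    static void profile_signal_handler(int sig)
    {
        (void)sig;

        /* Query the calling thread's current OpenMP state. */
        uint32_t state_req[6] = { sizeof(state_req), REQ_STATE, 0, 0 };
        collector_api(state_req);
        uint32_t omp_state = state_req[4];          /* assumed return layout */

        /* Query the parallel region instance the thread is executing in. */
        uint32_t prid_req[6] = { sizeof(prid_req), REQ_CURRENT_PRID, 0, 0 };
        collector_api(prid_req);
        uint64_t prid;
        memcpy(&prid, &prid_req[4], sizeof(prid));  /* assumed 64-bit ID */

        /* ... both values are appended to this tick's profile packet ... */
        (void)omp_state; (void)prid;
    }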

    When the OpenMP runtime delivers a fork event, the collector issues OMP_REQ_CURRENT_PRID and OMP_REQ_PARENT_PRID requests to the runtime, and records the ID of the newly created parallel region. It also keeps a "used" bit associated with that region, initialized to zero (not used). When a thread requests the state and parallel region ID in response to a profile tick, the "used" bit is set to 1.

    At join time, the collector looks at the "used" bit. If it is zero, nothing need be done. If it is one, the collector records a fork event with the current callstack of the thread (master thread), and the newly-joined parallel region ID. The current parallel region ID of the master is also recorded to support measurements on nested parallelism. Note that the callstack at join is identical to the callstack at fork. Recording the data at join time avoids recording a large number of fork events that are never needed, since they correspond to parallel regions that are so short they did not get any profile ticks.
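
    The fork/join bookkeeping just described can be sketched as follows; the region table and the helper routines are illustrative, not part of the API:

    #include <stdint.h>
    #include <stdbool.h>

    struct region_info {
        uint64_t prid;          /* parallel region instance ID */
        uint64_t parent_prid;   /* ID of the enclosing region (nesting) */
        bool     used;          /* set when a profile tick lands in the region */
    };

    /* Hypothetical helpers internal to the collector. */
    extern struct region_info *lookup_or_add_region(uint64_t prid);
    extern void record_fork_event(const struct region_info *r);

    /* Fork callback: remember the new region with its "used" bit cleared. */
    static void handle_fork(uint64_t new_prid, uint64_t parent_prid)
    {
        struct region_info *r = lookup_or_add_region(new_prid);
        r->prid        = new_prid;
        r->parent_prid = parent_prid;
        r->used        = false;
    }

    /* Profile tick: mark the current region as used. */
    static void handle_profile_tick(uint64_t current_prid)
    {
        lookup_or_add_region(current_prid)->used = true;
    }

    /* Join callback: emit the fork record only if a tick ever landed in the
       region; the master's callstack is the same at join as at fork. */
    static void handle_join(uint64_t prid)
    {
        struct region_info *r = lookup_or_add_region(prid);
        if (r->used)
            record_fork_event(r);
    }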

    This API could equally well support an alternative tracing-based collector, although that would require additional implementation in the OpenMP runtime. If supported, such a collector could register for all available events in the life of the process and record them with their thread IDs and timestamps. It could also record parallel region IDs and any other data appropriate to those events and the particular objectives of the tracing tool.

  6. Using the Data to Construct a User-Model Callstack Profile

    This section describes the data recorded and how it is processed to produce performance measurements presented in the user-model of OpenMP programs.

    Using the API as described in the previous section, two streams of data are produced, with specific information about the OpenMP behavior of the application. One stream, generated in response to the delivered fork events, consists of a time-stamped trace of fork events, each with the current implementation-model callstack of the thread generating the event, the parallel-region ID of that thread before the fork, and the ID of the newly created region. The second stream is a trace of timestamped profile packets. Each packet has normal performance data, augmented with the current parallel-region ID within which the thread is executing and the OpenMP runtime state for that thread. The tool derives OpenMP metrics from these streams, and synthesizes the user-model callstack from the implementation-model data in the streams.
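
    Schematically, the two streams might carry records like the following; the exact formats are internal to the tools, and the fields shown are illustrative only:

    #include <stdint.h>

    #define MAX_FRAMES 64   /* arbitrary bound for this sketch */

    struct fork_record {                /* one per "used" parallel region */
        uint64_t timestamp;
        uint64_t prid;                  /* newly created region instance ID */
        uint64_t parent_prid;           /* region the forking thread was in */
        uint32_t n_frames;
        uint64_t frames[MAX_FRAMES];    /* implementation-model callstack PCs */
    };

    struct profile_packet {             /* one per profile tick per thread */
        uint64_t timestamp;
        uint32_t thread_id;
        uint32_t omp_state;             /* THR_*_STATE at the tick */
        uint64_t prid;                  /* current region instance ID */
        uint32_t n_frames;
        uint64_t frames[MAX_FRAMES];    /* implementation-model callstack PCs */
        /* ...plus the usual clock- or HW-counter-profiling data... */
    };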

    In the performance tools, each profile packet contains data specific to the profile being collected, which is used to construct metrics of performance. Those metrics are attributed to user code by mapping each PC from the packet's callstack to functions, source lines, callers, and callees. For clock-profiling, the metrics include Total-time, User-CPU-Time, System-CPU-Time, and a number of others. When OpenMP performance data is recorded, additional metrics are computed, and the attribution of data is done to a synthesized user-model callstack, not to the implementation-model callstack as recorded.

    1. OpenMP Metrics

      When recorded with clock-based profiles, the OpenMP state information is used to construct two additional metrics of performance, "OpenMP Work" and "OpenMP Wait". Together they add up to the total elapsed time (integrated over threads).

      "OpenMP Work" represents time spent executing, whether serial or parallel, as well as OpenMP library time spent doing reductions. It is computed by summing across the work states reported by the OpenMP runtime: THR_NO_STATE, THR_WORK_STATE, THR_SERL_STATE, THR_RDUC_STATE, THR_MSTR_STATE, THR_SNGL_STATE, and THR_ORDD_STATE.

      "OpenMP Wait" represents the total time waiting, and includes lock and barrier waits, ordered-waits, and OpenMP library overhead, such as scheduling, thread-management, and so forth. It accumulates either when the thread is using CPU time to wait (busy wait), or when the thread is sleeping (sleep-wait). It is computed by summing across the non-work states reported by the OpenMP runtime: THR_WAIT_STATE, THR_IBAR_STATE, THR_EBAR_STATE, THR_LKWT_STATE, THR_CTWT_STATE, THR_ODWT_STATE, THR_OVHD_STATE, and THR_ATWT_STATE.

      On Solaris, where microstate accounting can generate profile ticks in accounting states other than User-CPU, the "OpenMP Work" metric should be incremented when the OpenMP state is one of the work states, and the microstate accounting state is User-CPU. When the OpenMP state is one of the wait states, or the accounting state is not User-CPU, the "OpenMP Wait" metric is incremented.
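
      The classification can be sketched as follows, including the Solaris microstate refinement; the enum mirrors the state names above, with placeholder numeric values:

      typedef enum {
          THR_NO_STATE, THR_WORK_STATE, THR_SERL_STATE, THR_RDUC_STATE,
          THR_MSTR_STATE, THR_SNGL_STATE, THR_ORDD_STATE,
          THR_WAIT_STATE, THR_IBAR_STATE, THR_EBAR_STATE, THR_LKWT_STATE,
          THR_CTWT_STATE, THR_ODWT_STATE, THR_OVHD_STATE, THR_ATWT_STATE
      } omp_thr_state;

      static int is_work_state(omp_thr_state s)
      {
          switch (s) {
          case THR_NO_STATE:   case THR_WORK_STATE: case THR_SERL_STATE:
          case THR_RDUC_STATE: case THR_MSTR_STATE: case THR_SNGL_STATE:
          case THR_ORDD_STATE:
              return 1;
          default:
              return 0;
          }
      }

      /* tick is the profiling interval attributed to this packet; on_cpu is
         nonzero when the microstate accounting state is User-CPU (assume 1
         where microstate data is unavailable). */
      static void accumulate_omp_metrics(double *omp_work, double *omp_wait,
                                         omp_thr_state s, double tick, int on_cpu)
      {
          if (is_work_state(s) && on_cpu)
              *omp_work += tick;
          else
              *omp_wait += tick;
      }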

    2. User-Model Callstack Synthesis

      When profile data is processed, the callstack as it would appear in the user-model is synthesized for each event, and the performance data is processed as if the synthesized callstack were the actual one captured.

      Before further processing the performance data, the record of fork events is processed, and a table giving the parallel-region ID and master-callstack-preface is constructed. For simplicity, we'll first consider the case where there is no nested parallelism.

      When a fork event is generated, its recorded callstack would be:

      Master
      <omp-rt>
      foo
      main
      <start>

      where <omp-rt> refers to one or more frames from within the OpenMP runtime. Those frames are recognized because they come from the OpenMP runtime shared object whose name is known. They are deleted, and the resultant master-callstack-preface would be:

      foo
      main
      <start>

      Profile packet implementation-model callstacks are transformed into user-model callstacks as follows. First, if the OpenMP state reported in a profile event shows the thread as waiting for work, a single-frame callstack showing only <OMP-idle> replaces the entire callstack for that event.

      Otherwise, any frames at the top (leaf) of the stack coming from the OpenMP runtime are removed, as they should not appear in the user-model. Some of those frames represent thread-management activities in the library, while others represent user API calls; in either case they are removed and replaced with a pseudo-function, chosen according to the OpenMP state, that represents a user-model view of what the OpenMP runtime is doing on behalf of the process. If the state at the time the packet was recorded was an OpenMP work state, no pseudo-function is appended. Otherwise, the pseudo-function is determined from the OpenMP state of the thread, and is one of: <OMP-overhead>, <OMP-implicit_barrier>, <OMP-explicit_barrier>, <OMP-lock_wait>, <OMP-critical_section_wait>, <OMP-ordered_section_wait>, or <OMP-atomic_section_wait>.
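
      The state-to-pseudo-function mapping can be sketched as follows, reusing the omp_thr_state enum from the sketch in the previous subsection; treating THR_WAIT_STATE as the "waiting for work" case is an assumption of this sketch:

      /* omp_thr_state is the enum from the sketch in section 6.1.
         Returns the pseudo-function to append after OpenMP runtime frames
         are stripped from the leaf end of the stack; NULL means a work
         state, for which nothing is appended. */
      static const char *pseudo_function_for_state(omp_thr_state s)
      {
          switch (s) {
          case THR_OVHD_STATE: return "<OMP-overhead>";
          case THR_IBAR_STATE: return "<OMP-implicit_barrier>";
          case THR_EBAR_STATE: return "<OMP-explicit_barrier>";
          case THR_LKWT_STATE: return "<OMP-lock_wait>";
          case THR_CTWT_STATE: return "<OMP-critical_section_wait>";
          case THR_ODWT_STATE: return "<OMP-ordered_section_wait>";
          case THR_ATWT_STATE: return "<OMP-atomic_section_wait>";
          case THR_WAIT_STATE: return "<OMP-idle>";   /* waiting for work:
                                                         replaces the whole
                                                         callstack */
          default:             return NULL;           /* a work state */
          }
      }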

      If the packet is not within a parallel region, no further processing is needed; if it is, the next step is to determine where that parallel region is calling back into user code. Profile packets from the master or the slave look a bit different:

      Master     Slave
      foo-OMPa
      <omp-rt>
      foo        foo-OMPa
      main       <omp-rt>
      <start>    <thread_start>

      In either case, the call from the parallel region is determined as the first frame before the OpenMP runtime in the stack; all frames below that are removed:

      Master     Slave
      foo-OMPa   foo-OMPa

      and then from the parallel region ID, the master-callstack-preface is determined and used as the base of the stack, to yield the same resultant callstack for master and slave:

      Master     Slave
      foo-OMPa   foo-OMPa
      foo        foo
      main       main
      <start>    <start>
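
      A minimal sketch of the splice follows, assuming callstacks are held as arrays of frame names ordered leaf-first, with omp_rt_frame() a hypothetical helper that recognizes frames from the OpenMP runtime shared object by its name. For a nested region, the preface passed in is itself built by applying the same splice to the parent region's fork callstack:

      #include <stddef.h>

      extern int omp_rt_frame(const char *frame);   /* hypothetical classifier */

      /* frames[0..n-1]: implementation-model callstack, leaf first.
         preface[0..n_pref-1]: master-callstack-preface for the region.
         Writes the user-model callstack into out[] and returns its length. */
      static size_t synthesize_user_stack(const char **frames, size_t n,
                                          const char **preface, size_t n_pref,
                                          const char **out, size_t out_max)
      {
          size_t k = 0, i = 0;

          /* Keep the leaf frames down to the first OpenMP runtime frame:
             these are the frames called from within the parallel region. */
          while (i < n && !omp_rt_frame(frames[i]) && k < out_max)
              out[k++] = frames[i++];

          /* Drop everything below (runtime frames and, in a slave thread,
             <thread_start>) and append the master-callstack-preface. */
          for (i = 0; i < n_pref && k < out_max; i++)
              out[k++] = preface[i];

          return k;
      }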

      The above description explicitly made the assumption that the application only had a single level of parallelism. Nested parallelism is accommodated straightforwardly. It is recognized when a fork event shows that it came from a thread with a non-zero parent-parallel-region-ID. In that case, the master-callstack-preface is synthesized from its recorded callstack, and the master-callstack-preface of its parent, much as user-model profile callstacks are synthesized above.

      The Sun Studio Performance Tools, built with a prototype of this API in the OpenMP runtime, were used in the work reported by Gabriele Jost, Oleg Mazurov, and Dieter an Mey, Adding New Dimensions to Performance Analysis through User-Defined Objects, IWOMP 2006, Reims, France, June 12-15, 2006.

  7. Conclusions

    We have described the challenges in measuring performance of OpenMP programs and described some existing tools used for profiling. We then detailed the differences between the user-model of OpenMP programs, and the implementation-model in which they execute. We explained the data needed to map from the implementation-model to the user-model, and described an API that supports obtaining that data, both for statistical profiling, and for tracing. We then described Sun's performance tools implementation using that API, what data is recorded, and how that data is processed. We referenced the recent IWOMP paper describing the use of the methodology and tools.

    In summary, we have described an efficient, lightweight API, exported from the OpenMP runtime, that provides the support needed for performance tools to present measurements in the user-model of OpenMP programming, independent of the implementation details of the runtime.