Performance Analysis for Complex Scientific Workflows

Aliza Lisan
Department of Computer Science, University of Oregon, Eugene, OR, USA
alisan@uoregon.edu

Abstract—Although bulk-synchronous high-performance computing (HPC) applications have long been the dominant approach within scientific computing on supercomputers, complex scientific workflows are increasingly popular. Complex scientific workflows are workflows that span multiple tasks, and often include heterogeneous architectures and/or the integration of AI and machine learning. The increasing prevalence of complex scientific workflows requires fresh investigation of performance analysis and visualization approaches that were originally designed for single, tightly coupled applications. This survey examines the current ecosystem of HPC performance and visualization tools, highlighting design trade-offs between accuracy, overhead, scalability, and usability. It categorizes tools by methodological choices such as instrumentation versus sampling and profile versus trace analysis, while assessing their capability for capturing both inter- and intra-application performance behavior. The survey also discusses Workflow Management Systems (WMS) in terms of workflow orchestration, performance measurement, and visualization capabilities. Finally, it reviews visualization frameworks ranging from low-level profilers to holistic workflow visualizers, highlighting gaps in the integration of data from disparate sources. Collectively, these insights highlight the need for workflow-aware visualization and analysis tools that enable coherent, fine- and coarse-grained understanding of performance in emerging complex scientific computing workflows.

Index Terms—Scientific workflows, AI/ML-enabled workflows, performance analysis, visualization

I. INTRODUCTION

HPC has been one of the most active research areas over the last 40 years [1].
High-performance computing (HPC) bulk-parallel applications execute a single program simultaneously on many processors. These applications, often built using programming models such as MPI, have underpinned scientific computing for decades. These bulk-parallel applications dominated large-scale HPC facilities, with a focus on scalability of a single binary over many cores. However, with the growing use of AI/ML in software and heterogeneity in hardware, these monolithic applications are increasingly complemented by scientific workflows.

Scientific workflows let scientists model data processing tasks and their dependencies as directed acyclic graphs (DAGs), where nodes represent tasks and edges represent data flows [2]. HPC users can embed ML training, inference, and analytics within scientific workflows, often running on heterogeneous infrastructure or in cloud/HPC hybrid setups, hence increasing their complexity. These complex scientific workflows explore a wide range of parameters, utilize distributed data setups, and consist of multiple steps of interoperating tasks. As this complexity increases, understanding and improving how efficiently such workflows use underlying architectural resources becomes crucial. Consequently, the scope of performance analysis, the study of how well applications utilize computational resources, should expand from focusing primarily on bulk-parallel applications to address the diverse behaviors and multistage structure of scientific workflows [3].

Modern supercomputers are extremely costly to operate, and their performance can degrade significantly without careful tuning, making inefficient execution both wasteful and scientifically limiting. Because of this, performance analysis tools are critical in HPC since they enable developers and scientists to achieve high utilization and ensure that scientific codes run efficiently on large-scale HPC and cloud platforms.
They enable users and developers to understand how HPC applications interact with system hardware, reveal hidden inefficiencies, and guide optimization decisions across compute, memory, communication, and I/O layers. As HPC systems become more complex, with heterogeneous processors, deep memory hierarchies, and massive parallelism, these tools become indispensable in capturing scalable and detailed performance data.

There exists a wide variety of performance analysis tools tailored to different needs; some specialize in I/O behavior, others focus on CPU/memory profiling, and many provide multiple types of performance data. These tools differ in capability (e.g., sampling vs. full instrumentation, profiling vs. tracing, hardware counter monitoring vs. software event logging) and target different layers of the system. Some of the tools have their own analysis and visualization features, while others generate data in formats that can be ingested by existing visualization tools. However, most performance tools were originally developed with the performance analysis of a single bulk-parallel application or simulation in mind.

As scientific workflows grow in complexity, featuring heterogeneous tasks, dynamic data flows, and distributed execution across HPC and cloud platforms, the requirements placed on performance analysis tools are evolving. Many existing tools were designed for tightly coupled bulk-parallel applications and may struggle to accommodate the data movement, task orchestration, and multi-component dependencies found in modern complex workflows. Consequently, performance tools must now support richer instrumentation, cross-task correlation, heterogeneous execution models, and scalable end-to-end analysis and visualization beyond single-application runs.
This paper presents a survey of the state-of-the-art in performance analysis, evaluation, and visualization tools for large-scale parallel applications and scientific workflows running on HPC systems. By combining insights from these areas, the paper proposes a taxonomy of performance analysis and visualization tools, which constitutes its core contribution. In a nutshell, this paper contributes the following:
• Description of bulk-synchronous HPC applications and their characteristics (Section II).
• Review of the nature of complex scientific workflows, their performance analysis requirements, and Workflow Management Systems (Section III).
• Survey of state-of-the-art performance analysis tools for bulk-parallel HPC applications and examination of whether they are “workflow aware” (Section IV).
• Survey of existing performance visualization tools and discussion of their ability to handle workflow-level performance data from disparate sources (Section V).
Lastly, Section VI concludes this survey. It also discusses emerging research directions in which Large Language Models (LLMs) are applied for performance prediction in order to guide users towards optimal workflow configurations.

II. BULK-SYNCHRONOUS HPC APPLICATIONS

Bulk-synchronous HPC applications adhere to the Bulk Synchronous Parallel (BSP) model, a structured approach to parallel computing published in 1990 by Leslie Valiant. BSP is a parallel computing model used for the design and implementation of parallel algorithms and applications [4]. In the BSP model, the computation is organized into a sequence of supersteps. The processors proceed through each superstep asynchronously, while synchronization is performed by the barrier at the end of each superstep, hence the name “bulk-synchronous” parallelism [5], [6]. Each superstep comprises three distinct phases.
1) Local computations: Each processor performs computations using its local data.
2) Communication: Processors exchange data with others as needed.
3) Synchronization: A barrier synchronization ensures that all processors have completed their computations and communications before proceeding to the next superstep.
This model simplifies the design and analysis of parallel algorithms by providing clear synchronization points, which help manage data dependencies and communication overhead. Figure 1 shows a representation of the BSP model.

Fig. 1. Illustration of a superstep for the bulk-synchronous parallel model: processing, communication, and global synchronization [7].

Bulk-parallel HPC applications are particularly prevalent in fields requiring large-scale data processing and complex computations. Common examples include:
• Scientific Simulations: Modeling physical phenomena such as climate patterns, astrophysical events, or molecular interactions.
• Graph Processing: Analyzing large networks in social sciences, biology, or computer science, where operations such as traversal and shortest path calculations are common.
• Data Analytics: Processing vast datasets to extract meaningful insights, often involving operations like sorting, searching, and statistical computations.
The BSP model’s structured approach allows these applications to scale efficiently across numerous processors while maintaining performance predictability and simplifying the management of inter-processor communications.

With the evolution of computer architectures (multi-core and many-core), parallel computing has also evolved from theoretical models such as BSP and Parallel Random Access Machine (PRAM) to today’s advanced multi-core and heterogeneous architectures. Although BSP provided a foundational structure for parallel computing, contemporary applications have diversified in complexity, architecture, and execution models.
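The three superstep phases described above can be sketched in a few lines of Python. This is an illustrative shared-memory analogue (threads plus a barrier), not a distributed BSP runtime; the worker count, the data, and the sum reduction are arbitrary choices for the example.

```python
import threading

# Illustrative BSP superstep: each worker computes locally, publishes its
# partial result (communication), then waits at a barrier before the next
# superstep reads what the others published.
N_WORKERS = 4
data = [list(range(i * 10, (i + 1) * 10)) for i in range(N_WORKERS)]
partial = [0] * N_WORKERS          # shared "communication" buffer
totals = [0] * N_WORKERS           # result of the second superstep
barrier = threading.Barrier(N_WORKERS)

def worker(rank):
    # Superstep 1, local computation: reduce this worker's own chunk.
    partial[rank] = sum(data[rank])
    # Communication happened via the shared `partial` list; the barrier is
    # the synchronization phase: nobody proceeds until all partials exist.
    barrier.wait()
    # Superstep 2: every worker can now safely read all partial results.
    totals[rank] = sum(partial)

threads = [threading.Thread(target=worker, args=(r,)) for r in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(totals[0])  # sum of 0..39, i.e. 780, computed by every worker
```

Without the barrier, a fast worker could read `partial` before slower workers had written their entries; the barrier is exactly the bulk-synchronous guarantee that makes the second superstep's reads safe.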
As computational problems grew in complexity and hardware architectures became more heterogeneous, BSP’s limitations, including its rigid synchronization barriers, became apparent.

Traditionally, most scientific applications are implemented using the Message-Passing Interface (MPI), often coupled with another model for threaded on-node execution. The models MPI is commonly coupled with include OpenMP for multi-core architectures and CUDA or OpenACC for GPU architectures. However, in order to achieve improved performance, productivity, or both on exascale systems, new and emerging programming models have been introduced, including Chapel, Charm++, Liszt, and Loci. Karlin et al. [8] compared the newer programming approaches with the traditional MPI implementation and found that Loci and Chapel result in programs that are up to 80% smaller based on source lines of code (SLOC).

The following list points out some of the characteristics implemented to augment bulk-synchronous parallel applications [9].
• Heterogeneous Computing: Modern applications often leverage a mix of CPUs, GPUs, and specialized accelerators. This heterogeneity allows for optimized performance across diverse tasks, but introduces complexity in workload distribution and memory management.
• Asynchronous Execution: To mitigate the bottlenecks of global synchronization, many applications adopt asynchronous models. This approach enhances scalability and performance, especially in environments with variable processing times.
• Dynamic Load Balancing: Given the irregular nature of many modern computational problems, dynamic load balancing ensures efficient resource utilization by redistributing workloads in real-time based on current system states.
• Task-Based Parallelism: Instead of dividing data among processors, task-based models focus on decomposing the problem into discrete tasks.
• Integration with Machine Learning: The rise of AI and machine learning has influenced parallel application design. For example, the Elastic BSP [10] model adapts the traditional BSP framework to better suit the iterative and data-intensive nature of distributed deep learning.

Key Takeaways
The Bulk Synchronous Parallel (BSP) model provides a structured foundation for scalable parallel computing through well-defined phases of computation, communication, and synchronization. It remains central to many scientific applications, particularly in simulation and data analytics. As architectures and workloads have evolved, applications have incorporated heterogeneous computing, asynchronous execution, and task-based parallelism for better scalability and flexibility, driving the need for advanced performance analysis methodologies.

III. SCIENTIFIC WORKFLOWS

A scientific workflow can be described as a process designed to accomplish a scientific objective by defining a set of tasks and their dependencies [11]–[15]. Al-Saadi et al. describe workflows by grouping them into four motifs: single simulations, ensembles, analysis (experiment driven), and dynamic workflows (execution characteristics are not established a priori) [16]. Workflows are commonly represented in the form of Directed Acyclic Graphs (DAGs), where vertices represent computational tasks, while edges represent data dependencies and flows between the tasks. Scientific workflows aim to accelerate scientific discovery in many ways, including by workflow Automation, Scaling, Adoption, and Provenance support, or ASAP for short [17].

Compute-intensive workflows are often based on computational science simulations including, but not limited to, running climate and ocean models, simulations based on physics, chemistry, astronomy, biology, etc.
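Because workflows are DAGs of tasks and data dependencies, a workflow system must derive a valid execution order before scheduling anything. This can be sketched with Python's standard-library graphlib; the four task names are hypothetical, chosen only to show a fork-join shape.

```python
from graphlib import TopologicalSorter

# Hypothetical four-task workflow: a preprocessing task feeds two
# independent simulations, whose outputs are combined by an aggregation
# task. Each key lists the tasks it depends on (its incoming DAG edges).
deps = {
    "simulate_a": {"preprocess"},
    "simulate_b": {"preprocess"},
    "aggregate":  {"simulate_a", "simulate_b"},
}

# static_order() yields any order consistent with the dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)  # 'preprocess' comes first, 'aggregate' last
```

The same `TopologicalSorter` API also exposes `prepare()`/`get_ready()`/`done()`, which lets a scheduler dispatch `simulate_a` and `simulate_b` concurrently once `preprocess` completes, mirroring the parallelism a WMS extracts from the DAG.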
Unlike traditional HPC applications, workflows rely on Workflow Management Systems (WMS) to run their disparate components seamlessly on distributed resources and to provide a communication channel between these components. WMSs provide scientists with all the necessary tools required to define, execute, publish, and document their workflows. Some examples of complex compute-intensive workflows include Multiscale Machine-Learned Modeling Infrastructure (MuMMI) [18], American Heart Association Molecule Screening (AHA MoleS) [19], Autonomous MultiScale Library (AMS) [20], ICECap [21], CyberShake [22], Galactic [23], and GUIDE [24].

Data-intensive workflows are commonly used in many data-driven disciplines today, using diverse data sources and distributed computing platforms [14], [25], [26]. Workflows are a systematic way of defining the needed methods and provide an interface between domain scientists and computing infrastructures. With diversity in all disciplines and the large increase in data sources, workflows play an important role in extracting meaningful information from multiple data sources and exploiting a wide range of computational resources [27].

Traditionally, simulation workflows consist of three basic steps. First, the instructions are read from the disk or created from the beginning. Second, the simulation, which is usually some type of stepwise physics code, runs. Third, the results are written back to the disk. In some cases, checkpointing or profiling may take place, and in that case simulations spend a considerable time writing large performance data to disk [3]. Generally, the model remains the same, where a simulation reads, writes one or more times, and terminates. Figure 2 shows some basic components of a workflow based on data partitioning and aggregation. Usually, several of these structures or components are composed into complex workflows by scientific communities.
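The partitioning and aggregation components shown in Figure 2 can be sketched as three small functions; the sum-of-squares computation is only a stand-in for a real processing job, and the chunk count is arbitrary.

```python
# Data-distribution and data-aggregation jobs, sketched as plain functions:
# a distribution job partitions the input, independent process jobs handle
# each partition, and an aggregation job combines their outputs.

def distribute(data, n_parts):
    """Data-partitioning job: split `data` into n_parts roughly equal subsets."""
    k, m = divmod(len(data), n_parts)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n_parts)]

def process(chunk):
    """Stand-in process job: sum of squares of one partition."""
    return sum(x * x for x in chunk)

def aggregate(results):
    """Data-aggregation job: combine the outputs of the process jobs."""
    return sum(results)

chunks = distribute(list(range(10)), 3)
result = aggregate([process(c) for c in chunks])
print(result)  # 285 = 0^2 + 1^2 + ... + 9^2
```

Because the process jobs are independent once the distribution job has run, a WMS is free to execute them in parallel; the aggregation job then acts as the synchronization point, reducing parallelism again.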
The Workflow Patterns Initiative [28] identifies different workflow patterns based on control-flow, data, resource, and exception handling perspectives. The basic structure is the process structure, which takes some data as input, works on it, and produces an output. A sequential combination of several such processes forms a pipeline, where the output of one process is taken as input by the next. Data distribution or data partitioning jobs either produce data as output or divide larger data into subsets that multiple other jobs process. Data aggregation jobs process the outputs of several other jobs and produce the resulting data. The resulting data is often redistributed by data redistribution jobs, which act as a synchronization point. Data distribution jobs increase parallelism in a workflow, while data aggregation may represent a reduction in parallelism [29].

Fig. 2. The five basic structures in scientific workflows based on data partitioning and aggregation [29].

A. AI/ML-enabled HPC Workflows

Recently, scientists have started leveraging machine learning capabilities in their workflows, which has led to a new category of workflows, called scientific ML (SciML) workflows or AI/ML-enabled workflows [30]. The introduction of Artificial Intelligence (AI) and Machine Learning (ML) in HPC has enabled highly accurate modeling with reduced computational needs. Many workflows now integrate ML models to guide analysis, use specialized computing hardware, and directly integrate simulation, data collection, and training [31]–[33]. Figure 3 shows the three different modes in which ML can be integrated in HPC codes: ML-in-HPC, ML-out-HPC, and ML-about-HPC.
ML-in-HPC is the scenario where the ML model replaces a computational component in an HPC simulation or possibly the complete computation itself, i.e., the ML model serves as a “surrogate.” ML-out-HPC represents the situation where the ML model resides “outside” of the traditional HPC simulation loop, but it controls the progression of the HPC workflow dynamically. Lastly, ML-about-HPC denotes a situation where ML models are coupled to main HPC tasks and operate concurrently. The combined usage of all three modes, called the “learning everywhere” paradigm, can lead to effective performance for the workflow. Effective performance can be measured in terms of reduced computational cost, time-to-solution, or the achievement of the scientific objective of the workflow [34]. However, AI/ML-enabled workflows often require large quantities of training data, multi-node tasks, and heterogeneous hardware such as CPUs, GPUs, and TPUs.

Fig. 3. Three modes of integrating ML in HPC: (1) ML-in-HPC: surrogate models are used to replace part(s) of or the complete simulation. (2) ML-out-HPC: An external AI/ML algorithm (active or reinforcement learning) is used to dynamically control the campaign, or steer the complete workflow. (3) ML-about-HPC: AI/ML complements traditional computational tasks, improving their results or efficiency [34].

B. Difference between Traditional and AI/ML Workflows

Traditional HPC workflows focus on parallel processing for large-scale simulations and data-intensive tasks, while modern AI/ML workflows emphasize iterative model training, data preprocessing, and optimization. They often leverage specialized hardware (e.g., GPUs) to perform tasks like deep learning and pattern recognition on large datasets. They consist of multiple complex tasks or stages that continue to communicate and exchange data with each other. Tal et al. [3] highlight some key differences between traditional and modern HPC AI/ML workflows:
• Distributed systems. They require multiple nodes on heterogeneous resources that span CPUs, GPUs, and FPGAs. This makes network and storage I/O much more important.
• Orchestration. Workflows require scheduling different types of computations on heterogeneous infrastructures [18], [36], which can be unpredictable. We may need to dynamically initiate ML or simulation jobs for training or inference. Additionally, workloads can increase or decrease over time, which MPI ranks are not well-equipped to handle.
• Complex scheduling. A large number of jobs may be required to run on CPUs and GPUs for training and uncertainty quantification. Existing HPC schedulers like Slurm [37] cannot co-schedule heterogeneous resources and do not scale well to the millions of training jobs that may be required on simulation data.
• Monitoring. They may require a coarse level of monitoring along the critical path, given their complex nature and codebases. Each component may require unique monitoring and logging mechanisms, and their evaluation criteria may differ.
• Robustness. State management in modern workflows may be significantly different than in traditional checkpointed simulations. This means that fault tolerance strategies can vary throughout the workflow.

Fig. 4. One of the five axes, Workflow Characteristics, with corresponding terms and sub-terms that categorize workflows running on workflow management systems (adapted from [35]).

C. Workflow Management Systems (WMS)

A Workflow Management System (WMS) (or sometimes Scientific Workflow Management System, SWfMS) is software that allows scientists to define, compose, and execute computational and data processing pipelines made of interdependent tasks. Tasks may involve simulations, data transformation, analysis, visualization, etc., arranged into dependency graphs (often DAGs), with edges representing either data flow or control dependencies.
The WMS handles orchestration (scheduling, dispatching tasks), data movement, fault tolerance, provenance (recording how outputs were derived), and resource allocation, and it often supports heterogeneous compute/storage infrastructures [38].

Ferreira da Silva et al. classify WMSs in their work according to the execution model, the handling of heterogeneous environments, and the data access methods [2]. More recent work by Suter et al. codifies a set of axes to compare and describe different WMSs. Figure 4 shows one of the five axes, workflow characteristics, which is the most relevant to this study [35]. The following are brief descriptions of each of these axes:
• Workflow Characteristics. This axis captures how workflows are organized (task- vs. data-driven), how complex the components are, what kinds of dependencies connect components, and whether execution paths can change at runtime. These structural features strongly shape how a workflow system can optimize scheduling, resource use, and performance adaptivity.
• Composition. This axis refers to how workflows are defined, configured, and organized in a WMS. Overall, this axis helps users with different technical backgrounds understand how accessible WMSs are.
• Orchestration describes how a WMS implements and manages execution of workflow tasks: whether task launching and scheduling are static or dynamic, whether triggers/events at runtime can influence execution, and how resources are allocated (directly by the WMS, delegated to external resource managers, or cloud-based). These decisions affect how efficiently workflows use computing infrastructure.
• Data Management describes how workflows handle data through their lifecycle: how data moves between tasks, where and when data is stored (intermediate, final outputs), and how access patterns are optimized (caching, layout, transfer strategies).
These strategies are especially crucial in data-intensive workflows and heavily influence performance.
• Metadata Capture involves gathering contextual data during workflow execution, such as provenance (both what the workflow is supposed to do and what it actually did), runtime state, performance monitoring, and detecting anomalies. These capabilities are essential for reproducibility, debugging, and optimization.

Table I summarizes the types of workflows each WMS is able to support and execute based on the sub-terms in the first axis (Workflow Characteristics). Workflow Characteristics is more focused on the types of workflows a WMS can support than on the characteristics of the WMS itself, i.e., on what the workflow does. Hence, the first axis is the most relevant to this survey, and we explain its corresponding terms and sub-terms next, which are used in Table I.

TABLE I
CLASSIFICATION OF SOME SCIENTIFIC WORKFLOW MANAGEMENT SYSTEMS BASED ON WORKFLOW CHARACTERISTICS [35].

WMS        | Flow            | Granularity                           | Coupling     | Dynamicity             | Domain
AiiDA      | Task, Iterative | Sub-workflows, Executables, Functions | Loose        | Branches, Intervention | Agnostic
COMPSs     | Task, Iterative | Sub-workflows, Executables, Functions | Loose        | Branches               | Agnostic
Cylc       | Task, Iterative | Sub-workflows, Executables            | Loose        | Branches, Intervention | Agnostic
FireWorks  | Task            | Sub-workflows                         | Tight        | Branches               | Agnostic
Galaxy     | Data            | Sub-workflows, Executables            | Loose        | Branches, Intervention | Agnostic
Merlin     | Task, Iterative | Sub-workflows                         | Loose        | -                      | Agnostic
Nextflow   | Data            | Sub-workflows                         | Loose        | Branches               | Agnostic
Parsl      | Data            | Sub-workflows                         | Loose        | Branches               | Agnostic
Pegasus    | Data            | Sub-workflows, Executables            | Loose        | Branches               | Agnostic
Radical    | Task, Iterative | Functions                             | Tight        | -                      | Agnostic
Snakemake  | Task, Iterative | Sub-workflows, Executables, Functions | Loose, Tight | Branches               | Agnostic
Swift/T    | Task, Data      | Functions                             | Tight        | Branches               | Agnostic
Kepler     | Data            | Sub-workflows, Executables, Functions | Loose        | Branches, Intervention | Agnostic

First, flow is one of the important workflow features, as it directly impacts optimization by WMSs. Workflows have different components which get inputs, do processing, generate outputs, and then terminate. The structure of a workflow is defined by all these tasks performed by the workflow components. All these different tasks in a workflow can be executed multiple times in an iterative manner. The data that flows within the workflow components may also drive the structure and execution of the overall workflow.

Second, coupling of workflow tasks is another prominent workflow feature. Coupling refers to the degree of interdependence between workflow tasks. Tasks with tight coupling are heavily dependent on each other’s internal details (data, order, or state), while loosely coupled tasks interact through well-defined interfaces and can be changed or executed independently.

Third, the granularity of individual workflow tasks also causes differences in workflow structures. Some workflows may only have multiple function calls to get complex tasks done, while others may consist of a script of programs.
However, the most common definition of a workflow is the composition of standalone executables, each of which consists of multiple function calls with inputs and generated outputs to complete complex tasks. As the complexity of workflows increases, it is not uncommon to express them in a hierarchical format consisting of sub-workflows.

Fourth, the dynamicity of a workflow is its ability to modify its structure during execution based on conditional branches. These branches are activated when a pre-defined condition is realized or due to an external event. Another way to activate these branches is through runtime interventions, where control is given back to the user or to an automated external decision process.

Lastly, different workflow domains are served by different WMSs. Some are domain-specific while others are application-agnostic [35].

1) WMSs for AI/ML-enabled Scientific Workflows: With the growth in complexity and the introduction of AI/ML in scientific workflows, WMS designers need a broader and deeper understanding of workflow requirements and behaviors in order to improve algorithms for resource provisioning, computational job scheduling, and data management. First, in order for WMSs to support large-scale ML workflows, a shared NFS file system should be a top feature. This makes the system much more convenient compared to when file accesses are done over the network. Second, scientific WMSs will need to adapt to dynamic changes in the number of input and output files in order to support ML workflows. This means that input and output files for each job will not be known before the workflow’s execution begins [30]. The following is a list of a few WMSs that support some level of data collection, setting the stage for performance analysis.
a) Pegasus: The Pegasus WMS enables scientists to specify complex computational pipelines as high-level workflow descriptions, which are then mapped to diverse execution environments such as campus clusters, national cyberinfrastructure, and cloud platforms [39]. By abstracting workflow specification from resource details, Pegasus supports portability and scalability of scientific workflows across heterogeneous systems. In addition to core orchestration capabilities, Pegasus has been extended to support performance analysis using tools like pegasus-statistics for performance metrics, pegasus-analyzer for debugging, pegasus-plots for visualizations, and pegasus-monitord for runtime monitoring, capturing job execution times, data transfer overhead, and resource utilization [40]–[42]. This monitoring capability allows users to analyze end-to-end workflow performance, track job and data dependencies, and derive insights about bottlenecks, making Pegasus not only a workflow engine but also a platform that supports performance-aware execution.

b) Nextflow: Nextflow [43] is a WMS that provides built-in performance monitoring through execution reports, trace files, and timeline visualizations that capture CPU usage, memory consumption, task duration, and I/O operations for each workflow process. The system generates HTML execution reports using the -with-report option and creates trace files with the -with-trace option containing submission time, completion time, and resource usage statistics [44]. For real-time monitoring, Nextflow integrates with Tower (now Seqera Platform), which tracks workflow execution across any infrastructure and provides resource optimization recommendations based on actual usage patterns. These performance analysis capabilities enable scientists to identify bottlenecks, optimize resource allocation, and improve computational efficiency across distributed computing environments.
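A trace file produced with -with-trace is plain tab-separated text, so it can feed custom analysis scripts directly. The sketch below ranks tasks by runtime over a small synthetic trace; the column subset and the assumption that realtime is given in raw milliseconds are simplifications, since Nextflow's trace fields and their formatting are configurable.

```python
import csv
import io

# Synthetic, Nextflow-style tab-separated trace. The task names and the
# exact columns are illustrative assumptions; real traces are configurable.
sample_trace = """\
task_id\tname\tstatus\trealtime\tpeak_rss
1\tfastqc\tCOMPLETED\t5000\t100000
2\talign\tCOMPLETED\t90000\t900000
3\tcall_variants\tCOMPLETED\t42000\t500000
"""

def slowest_tasks(trace_text, top=2):
    """Return the `top` longest-running tasks as (name, realtime_ms) pairs."""
    rows = list(csv.DictReader(io.StringIO(trace_text), delimiter="\t"))
    rows.sort(key=lambda r: int(r["realtime"]), reverse=True)
    return [(r["name"], int(r["realtime"])) for r in rows[:top]]

print(slowest_tasks(sample_trace))  # [('align', 90000), ('call_variants', 42000)]
```

For a real run, the same function would read the trace file from disk instead of a string; the point is only that the trace format lends itself to building small, workflow-level performance analyses on top of the WMS's own monitoring output.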
c) Snakemake: Snakemake [45] is a workflow management system designed to create reproducible and scalable data analyses using a Python-based domain-specific language. Workflows are defined through rules that specify how to generate output files from input files, with dependencies automatically determined to create a directed acyclic graph of jobs that can be parallelized. For performance analysis, Snakemake provides built-in benchmarking capabilities that allow rules to capture execution metrics including runtime, memory usage, and I/O operations [46]. These benchmark results are stored in tab-separated files that can be used to build complex performance analysis pipelines.

d) Kepler: Kepler [47] is an open-source scientific workflow system that provides a graphical user interface for designing, executing, and sharing workflows using an actor-oriented modeling approach. Workflows are composed of modular components called actors connected through channels that represent data flow. For performance analysis, Kepler provides a comprehensive provenance framework that automatically captures process and data monitoring information, including workflow execution context, parameter settings, resource usage, and data lineage across distributed computing environments [48]. The provenance collection mechanism uses an event-based architecture designed for minimal performance overhead, recording actor executions, data transformations, and workflow evolution to enable reproducibility. These performance analysis capabilities also enable scientists to validate experimental procedures.

Key Takeaways
Scientific workflows have become integral to managing the complexity of modern compute- and data-intensive scientific tasks. Represented as Directed Acyclic Graphs (DAGs), workflows automate and orchestrate interdependent tasks to enhance scalability, reproducibility, and provenance.
Workflow Management Systems (WMSs) such as Pegasus, Nextflow, and Snakemake enable seamless execution, data movement, and performance monitoring across heterogeneous infrastructures. The integration of AI/ML into workflows, through modes such as ML-in-, ML-out-, and ML-about-HPC, has further expanded their capabilities, enabling adaptive, data-driven scientific discovery. These AI/ML-enabled workflows introduce new challenges in scheduling, orchestration, and monitoring, requiring WMSs to evolve with advanced support for dynamic data dependencies, heterogeneous resources, and performance-aware execution.

IV. PERFORMANCE ANALYSIS OF HPC APPLICATIONS

Over the years, various performance analysis tools have been developed to meet the diverse needs of HPC users. Figure 5 shows a classification of HPC performance analysis tools. These tools generally fall into two categories: (1) sampling-based tools, which collect data periodically or on specific events, and (2) instrumentation-based tools, which require adding measurement markers to the source code.

Fig. 5. Classification of HPC Performance Analysis Tools.

The data formats produced by these tools are typically categorized into traces and profiles. Sampling-based tools almost always produce profiles. Tools such as Vampir [49], Paraver [50], and Scalasca [51] record traces with detailed sequences of events during program execution, useful for in-depth temporal analysis. Hybrid tools combine sampling, instrumentation, and tracing to provide comprehensive performance insights. System-level monitoring tools, such as PAPI [52], [53] and LDMS [54], form a separate category of performance analysis tools that monitor overall system performance metrics, aiding in identifying system-wide bottlenecks. These tools are often incorporated as part of hybrid tools. Table II shows the different characteristics supported by performance analysis tools in a binary (supported or not supported) format. Sampling tools (e.g.,
HPCToolkit [55], Open|SpeedShop [56], mpiprof) are useful when the source code is too large or not accessible, offering profiling without code changes. However, they gather data across the entire application, which can overwhelm analysis and obscure critical regions. Their fixed sampling rate also poses challenges: too high a rate increases overhead and I/O issues, while too low a rate risks missing important behavior. Since HPC applications often have distinct phases, a uniform sampling rate is suboptimal. In contrast, instrumentation-based tools such as TAU [57], Score-P [58], Timemory [59], and Caliper [60], [61] offer more control and lower overhead by targeting specific code regions. This makes them well suited for analyzing phase behavior and critical paths. However, they require access to and familiarity with the application source code. In addition, common runtime characteristics collected by performance tools include function-level timings, call counts, processor usage, and memory usage. Traces are often visualized and capture fine-grained event timelines for each process/thread, enabling detailed analysis of communication delays, load imbalance, synchronization overhead, etc. Finally, some frameworks support continuous or online monitoring, but they typically focus on system metrics rather than application-level behavior. The contextual features of the performance analysis tools mentioned in the characterization in Figure 5 can be found in Table III.

A. Performance Analysis for Bulk-Parallel HPC Applications

In bulk-parallel HPC applications, performance analysis tools need to meet several core requirements. They must support instrumentation across tightly coupled compute, communication, memory, storage, and I/O subsystems, capturing metrics such as per-rank execution time, message volumes, and hardware counter data. Because these applications run at extreme scale, the measurement overhead must remain minimal and the data collection must be manageable.
Effective tools thus provide both profiling (aggregated metrics) and tracing (time-stamped event logs) so that bottlenecks can be uncovered either through high-level summaries or detailed temporal analysis. The performance analysis of bulk-parallel HPC applications usually involves three steps: source code instrumentation or sampling, performance data collection, and performance data analysis and visualization. There are a number of tools that provide one or more of these capabilities. This section briefly describes the functionality and support provided by state-of-the-art performance tools.

1) TAU Performance Suite: The TAU Performance System [57], [63] is a comprehensive performance analysis toolkit that supports various programming paradigms, including MPI, OpenMP, and hybrid models. It is compatible with languages such as C, C++, Fortran, Java, and Python. TAU’s architecture comprises three primary components: instrumentation, measurement, and analysis.

TAU collects performance data through profiling and tracing. Source code instrumentation can be done either manually or automatically, and can be applied at various levels, including functions, methods, basic blocks, and statements. Additionally, TAU integrates with hardware performance counters (e.g., via PAPI), enabling access to low-level metrics.

Recent developments have focused on preparing TAU for exascale computing environments by improving scalability, supporting heterogeneous architectures, and integrating with emerging programming models [64]. For workflow performance analysis, TAU improved instrumentation methods to capture data across multiple workflow components [65]. The profiles and/or traces collected by each component are aggregated with post-processing scripts that generate structured JSON output and merged traces.

2) HPCToolkit: HPCToolkit [55], [66] is an integrated suite of tools designed for performance measurement and analysis of parallel applications.
It supports a variety of programming models, including MPI, OpenMP, and CUDA, and is compatible with languages such as C, C++, and Fortran. HPCToolkit’s architecture comprises three primary components: measurement, analysis, and visualization.

HPCToolkit collects performance data by sampling and tracing. It employs asynchronous sampling to gather call path profiles, capturing metrics such as CPU cycles and cache misses. Further, HPCToolkit’s analysis tools attribute performance metrics to calling contexts, enabling developers to pinpoint where in the program bottlenecks occur and identify their underlying causes. It supports fine-grained call path tracing, allowing for detailed examination of application behavior over time. Hpcviewer and hpctraceviewer are HPCToolkit’s visualization tools. These tools offer graphical interfaces for exploring performance data, presenting call path profiles and traces in a hierarchical manner.

HPCToolkit’s capabilities have been extended to support performance analysis of GPU-accelerated applications, addressing the challenges of performance analysis on emerging exascale supercomputers [67]. Moreover, it is important to note that while HPCToolkit excels at analyzing individual parallel application components, it is not inherently designed for end-to-end analysis of complex scientific workflows composed of heterogeneous tasks or mixed-language components. In such cases, especially where workflows involve diverse execution models, complex task dependencies, and data flows, additional integration effort may be required. For comprehensive workflow-level performance insights, HPCToolkit is best used in conjunction with tools specifically tailored to scientific workflow analysis.

3) Caliper: Caliper [61] is a lightweight, modular performance analysis library designed to integrate performance measurement capabilities directly into high-performance computing (HPC) applications.
Developed at Lawrence Livermore National Laboratory, Caliper supports C, C++, and Fortran programs, and is compatible with parallel programming models such as MPI, OpenMP, CUDA, and Kokkos.

The framework enables developers to annotate the source code with performance markers, facilitating the collection of rich contextual information during program execution. Caliper’s architecture allows for flexible configuration at runtime, supporting various measurement techniques including sampling, tracing, and hardware counter integration. The collected data can be exported in multiple formats, allowing seamless integration with analysis tools such as Hatchet and visualization platforms such as Chrome’s trace viewer [60].

Caliper’s design emphasizes minimal overhead and ease of integration, making it suitable for both large-scale HPC applications and smaller-scale performance studies. Its extensible plugin system provides flexibility and customization by allowing developers to enable or disable various services at runtime.

In terms of scientific workflow performance analysis, Caliper may not provide holistic workflow-level performance insights out of the box, particularly in the case of multi-cluster and multi-binary workflows. Although Caliper excels at fine-grained performance analysis of individual components within a workflow, integrating Caliper with other tools or frameworks specifically designed for workflow-level performance analysis could offer a more comprehensive understanding of the workflow’s performance characteristics, especially in scenarios involving complex task dependencies and diverse execution models.
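Caliper’s exported data is essentially a stream of records that associate annotated region paths with metric values. As a hedged sketch (the exact schema depends on the enabled services and export configuration; the "path" and "time.duration" keys below are illustrative assumptions, not a documented format), a few lines of Python suffice to aggregate time per annotated region from such a JSON export:

```python
import json
from collections import defaultdict

# Hypothetical Caliper-style JSON export: one record per snapshot, with a
# region path and a duration metric (the schema here is an assumption).
records = json.loads("""
[
  {"path": "main",       "time.duration": 120.0},
  {"path": "main/solve", "time.duration": 80.0},
  {"path": "main/solve", "time.duration": 75.5},
  {"path": "main/io",    "time.duration": 12.25}
]
""")

def time_per_region(recs):
    """Sum the duration metric for each annotated region path."""
    totals = defaultdict(float)
    for rec in recs:
        totals[rec["path"]] += rec["time.duration"]
    return dict(totals)

print(time_per_region(records))
# {'main': 120.0, 'main/solve': 155.5, 'main/io': 12.25}
```

Post-processing of this kind is typically delegated to Hatchet, which ingests Caliper output into a call-tree dataframe; the snippet only illustrates the shape of the data.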
TABLE II
CHARACTERISTICS OF SOME OF THE PERFORMANCE PROFILING, TRACING, AND MONITORING TOOLS

Tool | Instrumentation | Sampling | Trace | Profile | Monitor | CPU | GPU | Rank | Thread | Perf Counters | Workflow Aware
TAU | ✓ | ✓ | ✓ | ✓ | p | ✓ | ✓ | ✓ | ✓ | ✓ | p
HPCToolkit | p | ✓ | ✓ | ✓ | p | ✓ | ✓ | ✓ | ✓ | ✓ | p
Caliper | ✓ | ✓ | ✓ | ✓ | p | ✓ | ✓ | ✓ | ✓ | ✓ | p
Score-P | ✓ | ✓ | ✓ | ✓ | p | ✓ | ✓ | ✓ | ✓ | ✓ | p
Open|SpeedShop | p | ✓ | ✓ | ✓ | p | ✓ | p | ✓ | ✓ | ✓ | p
MPIProf | p | ✓ | p | ✓ | p | ✓ | p | ✓ | p | p | p
timemory | ✓ | ✓ | ✓ | ✓ | p | ✓ | ✓ | ✓ | ✓ | ✓ | p
Scalasca | ✓ | ✓ | ✓ | ✓ | p | ✓ | ✓ | ✓ | ✓ | ✓ | p
Darshan | p | ✓ | ✓ | ✓ | p | ✓ | - | ✓ | p | ✓ | ✓
Intel Advisor | ✓ | ✓ | p | ✓ | p | ✓ | ✓ | p | ✓ | ✓ | p
Recorder | p | ✓ | ✓ | p | p | ✓ | - | ✓ | p | ✓ | ✓
NVIDIA Nsight | p | ✓ | ✓ | ✓ | p | ✓ | ✓ | ✓ | ✓ | ✓ | p
PerfFlowAspect | ✓ | p | ✓ | p | p | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Chimbuko | ✓ | p | ✓ | p | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
SOMA [62] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
DFTracer | ✓ | p | ✓ | p | p | ✓ | p | ✓ | ✓ | p | ✓

TABLE III
CONTEXTUAL FEATURES OF SOME OF THE PERFORMANCE ANALYSIS TOOLS

Tool | Programming Models | Languages | Scalability | Overhead | Visualization / Output
TAU | MPI, OpenMP, CUDA, HIP, Pthreads, OpenCL | C, C++, Fortran, Python, Java, etc. | High | Medium | PerfExplorer, Vampir, Jumpshot, ParaProf, Paraver, OTF2
HPCToolkit | MPI, OpenMP, CUDA, HIP, Pthreads | C, C++, Fortran, Python | High | Low | hpcviewer, hpctraceviewer
Caliper | MPI, OpenMP, HIP, CUDA | C, C++, Fortran, Python | High | Low | Hatchet, JSON, CSV exports
Score-P | MPI, OpenMP, Pthreads, CUDA, OpenCL | C, C++, Fortran, Python | Medium | Medium | Vampir, Scalasca, Cube4, OTF2
Open|SpeedShop | MPI, Pthreads, OpenMP | C, C++, Fortran | Medium | Medium | OSS GUI, CLI
MPIProf | MPI | C, Fortran | Very High | Very Low | Text/CSV logs
timemory | MPI, OpenMP, CUDA, HIP | C, C++, Fortran, Python | Very High | Low | Hatchet, Perfetto, JSON, XML
Scalasca | MPI, OpenMP, Pthreads | C, C++, Fortran | High | Medium | Scalasca GUI, Cube4, OTF2
Darshan | MPI-I/O, POSIX I/O, HDF5 | C, C++, Fortran | Very High | Very Low | PyDarshan (PDF reports), Darshan logs
Intel Advisor | OpenMP, MPI, OpenCL, SYCL | C, C++, Fortran, Python | High | Medium | Intel Advisor GUI, Roofline
Recorder | MPI-I/O, POSIX I/O, HDF5 | C, C++, Fortran | High | Low | I/O traces, Plots/Parsers
NVIDIA Nsight | MPI, OpenMP, CUDA, OpenACC | C, C++, Fortran | High | Medium | Nsight GUI, Roofline
PerfFlowAspect | MPI, OpenMP, CUDA, HIP | C, C++, Python | High | Low | Perfetto, Chrome trace
DFTracer | POSIX I/O | C, C++, Python | High | Low | Perfetto, DFAnalyzer

4) Score-P: Score-P [58] is a scalable and modular performance measurement infrastructure developed collaboratively. It provides a unified instrumentation and measurement system for profiling and tracing parallel applications, supporting various programming models such as MPI, OpenMP, CUDA, OpenCL, and OpenACC. Since standard Python performance analysis tools cannot cope with highly parallel programs, Gocht et al. demonstrate advanced Python performance monitoring with Score-P [68].

The framework enables developers to collect performance data through automatic instrumentation, manual annotations, or sampling methods. The collected data can be stored in standardized formats like OTF2 [69] for traces and CUBE4 for profiles, facilitating interoperability with analysis tools such as Vampir and Scalasca. Lastly, Score-P’s flexible configuration allows users to control measurement granularity and overhead through environment variables, enabling tailored performance analyses.

Although Score-P excels at detailed performance analysis of individual components within HPC applications, it may not inherently provide holistic insights into complex scientific workflows composed of multiple heterogeneous tasks. In such scenarios, integrating Score-P with workflow-aware performance analysis tools or frameworks could offer a more comprehensive understanding of the workflow’s performance characteristics.

5) Open|SpeedShop: Open|SpeedShop (O|SS) [56] is an open-source performance analysis framework designed to support a wide range of profiling and tracing experiments for parallel applications. It provides a unified interface for defining “experiments,” which encapsulate what performance metrics to collect and how to present them.
O|SS supports program counter (PC) sampling, call-path profiling, hardware counter metrics, MPI profiling/tracing, I/O tracing, and floating-point exception detection. One of the core philosophies of O|SS is to apply instrumentation and monitoring to unmodified application binaries whenever possible [70]. O|SS is built atop components such as Dyninst/DPCL for binary instrumentation, libmonitor for intercepting I/O and system calls, PAPI for hardware counter access, and MRNet for data aggregation in parallel settings. In the O|SS workflow, a user selects an “experiment” (e.g., pcsamp, usertime, hwc, io, mpi), which drives specific collectors (sampling, tracing, or counter-based) and result views. Moreover, it supports multiple user interfaces (GUI, command line, Python API) and provides a plugin infrastructure so new measurement or analysis modules can be added modularly.

While O|SS offers extensive support for instrumentation, sampling, and tracing that could probe individual components or modules of scientific workflows, it is not inherently designed to treat a workflow as a holistic unit. Adapting O|SS for full scientific workflow analysis would require custom orchestration of experiments across all workflow components, synchronized trace merging, and a higher-level abstraction to relate performance data across task dependencies. Thus, for end-to-end workflow performance evaluation, combining O|SS with workflow-level performance tools is likely necessary.

6) MPIProf: MPIProf [71] is a lightweight, profile-based tool developed by NASA HECC for MPI applications, using the PMPI interface to collect aggregated statistics on MPI calls, I/O, and memory usage without requiring source code changes. It reports metrics such as the number of calls, message sizes, and time spent in point-to-point and collective MPI routines, and optionally captures call-path context when compiled with debugging flags.
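The PMPI mechanism that MPIProf relies on works by interposing wrapper functions that record statistics and then forward to the real MPI routine. The Python sketch below mimics that interposition pattern on a stand-in send function (names such as fake_mpi_send are illustrative only; real PMPI shims are written in C against the MPI profiling interface):

```python
from functools import wraps

call_stats = {"count": 0, "bytes": 0}

def fake_mpi_send(buf, dest):
    """Stand-in for MPI_Send; a real shim would forward to PMPI_Send."""
    return len(buf)

def profiled(fn):
    """Interpose a wrapper that records call counts and message volume
    before forwarding to the underlying routine (the PMPI pattern)."""
    @wraps(fn)
    def wrapper(buf, dest):
        call_stats["count"] += 1
        call_stats["bytes"] += len(buf)
        return fn(buf, dest)
    return wrapper

# Interposition: the caller's source is unchanged, only the binding is.
fake_mpi_send = profiled(fake_mpi_send)

fake_mpi_send(b"abcd", dest=1)
fake_mpi_send(b"xy", dest=0)
print(call_stats)   # {'count': 2, 'bytes': 6}
```

Because the interception happens at link/binding level rather than in the application source, this is how MPIProf avoids recompilation of the profiled code.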
MPIProf also supports profiling POSIX and MPI I/O operations, providing per-rank summaries of I/O behavior. While MPIProf can help pinpoint MPI and I/O hotspots within components of scientific workflows, it is not inherently suited for full workflow-level performance analysis.

7) Timemory: Timemory [59] is a modular and extensible toolkit for performance measurement and analysis in HPC, designed to allow users to combine different performance “components” flexibly with minimal overhead. Although implemented in C++, timemory supports instrumentation in C, Fortran, Python, and CUDA, offering a unified interface across these languages. Users can create measurement bundles that mix timers, hardware counters, memory metrics, GPU counters, and more. These bundles can be enabled, disabled, or interleaved dynamically at runtime. One of timemory’s design goals is that disabled instrumentation should impose near-zero cost, making it safe to integrate broadly without disturbing performance. The framework is built to interface with external tools and to be extended by users: if timemory does not natively support a measurement type, the user can plug in custom components.

Timemory is well suited to profiling and analyzing individual tasks or modules within scientific workflows, especially when these modules are implemented in languages and paradigms that timemory supports. Its flexibility in combining measurement types allows detailed characterization of computation, memory, and hardware counter behavior. However, timemory does not inherently provide workflow-level orchestration, merging of performance traces across multiple distinct binaries or languages, or correlation of performance data with task dependencies or data flow in workflows. To perform holistic workflow-level performance analysis, timemory would likely need to be paired with workflow-aware tools that can aggregate, align, and interpret performance data across task boundaries.
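The measurement-bundle idea can be sketched in plain Python. This is an illustration of the pattern only, not timemory’s actual API: a context manager activates a set of components, each contributing one metric, and a disabled bundle does essentially nothing.

```python
import time
import tracemalloc
from contextlib import ContextDecorator

class MeasurementBundle(ContextDecorator):
    """Toy bundle combining a wall-clock timer and a Python-allocation
    peak tracker; illustrates the component-bundle concept only."""

    def __init__(self, label, enabled=True):
        self.label = label
        self.enabled = enabled
        self.results = {}

    def __enter__(self):
        if self.enabled:                      # a disabled bundle is near-free
            tracemalloc.start()
            self._t0 = time.perf_counter()
        return self

    def __exit__(self, *exc):
        if self.enabled:
            self.results["wall_clock_s"] = time.perf_counter() - self._t0
            _, peak = tracemalloc.get_traced_memory()
            self.results["peak_mem_bytes"] = peak   # traced allocations, not RSS
            tracemalloc.stop()
        return False

with MeasurementBundle("solve") as m:
    data = [i * i for i in range(100_000)]    # stand-in for real work

print(sorted(m.results))   # ['peak_mem_bytes', 'wall_clock_s']
```

In timemory itself the components would be real hardware counters, GPU metrics, and so on, combined at compile time in C++ or via its Python bindings; the sketch only conveys why bundling and runtime enable/disable are convenient.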
8) Scalasca: Scalasca [51] is a performance analysis tool developed for optimizing parallel applications that use MPI and/or OpenMP. It offers two primary modes: profiling and event tracing. In profiling mode, Scalasca generates aggregate call-path metrics and hardware counter data to identify hotspots. In tracing mode, it records detailed time-stamped events to detect wait states and communication/synchronization inefficiencies. The reports can be explored via the CUBE graphical browser or external timeline tools such as Vampir.

Scalasca can be applied to scientific workflows insofar as those workflows consist of large-scale MPI/OpenMP tasks or kernels: it can profile individual components and find communication bottlenecks, synchronization inefficiencies, and load imbalances. However, Scalasca is not inherently workflow-aware.

9) Intel Advisor: Intel Advisor [72] is a performance design and analysis tool aimed at helping developers optimize vectorization, threading, memory usage, and accelerator offloading in applications written in C, C++, Fortran, SYCL, Python, and others. It offers multiple “perspectives” (e.g., vectorization, roofline, offload modeling) to guide where performance improvements are possible. One of its key features is Roofline Analysis, which plots application performance (FLOP/s) against arithmetic intensity to help identify whether kernels are compute-bound or memory-bound. Intel Advisor also supports offload modeling, which helps estimate which parts of a CPU code might benefit from being offloaded to a GPU and predicts the performance gains and bottlenecks.

In the context of scientific workflows, Intel Advisor is useful for profiling and optimizing individual computational kernels or components, especially those involving vectorization, threading, or offload to accelerators.
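The roofline bound itself is simple to compute. The sketch below (the peak numbers are made-up placeholders, not measurements of any real machine) classifies a kernel by comparing its arithmetic intensity against the machine’s ridge point:

```python
def roofline_bound(ai, peak_flops, peak_bw):
    """Attainable FLOP/s for a kernel with arithmetic intensity `ai`
    (FLOPs per byte moved), under the classic roofline model."""
    return min(peak_flops, ai * peak_bw)

# Placeholder machine: 2 TFLOP/s peak compute, 100 GB/s memory bandwidth.
PEAK_FLOPS = 2.0e12
PEAK_BW = 1.0e11
ridge = PEAK_FLOPS / PEAK_BW    # 20 FLOPs/byte: the compute/memory crossover

for name, ai in [("stream-like kernel", 0.25), ("dense matmul", 64.0)]:
    bound = roofline_bound(ai, PEAK_FLOPS, PEAK_BW)
    kind = "memory-bound" if ai < ridge else "compute-bound"
    print(f"{name}: attainable {bound:.2e} FLOP/s ({kind})")
```

Advisor measures both the intensity and the machine ceilings automatically; the formula is shown only to make the compute-bound versus memory-bound distinction concrete.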
10) NVIDIA Nsight: Nsight Systems [73] is a system-wide performance analysis tool from NVIDIA, designed to capture CPU, GPU, OS, library, and interconnect activity across a unified timeline. It helps developers visualize dependencies, identify bottlenecks (e.g., between CPU and GPU work), monitor hardware utilization (SM occupancy, memory bandwidth, PCIe/NVLink throughput), and trace CUDA API calls. Nsight Compute complements this as a detailed kernel profiler: it provides per-kernel metrics, instruction throughput, memory behavior, occupancy, and, more recently, Roofline analysis to help determine whether kernels are compute-bound or memory-bandwidth-bound.

Nsight is quite strong at analyzing individual tasks or kernels within scientific workflows, especially when those tasks make use of CUDA or GPU offload, or when CPU-GPU interactions are significant. Because Nsight Systems supports multi-node profiling, it can also scale to cluster settings to some degree. However, Nsight does not inherently capture workflow-level metadata such as task dependencies, data flow between workflow stages, or cross-language orchestration.

B. Performance Analysis for Complex Scientific Workflows

The complex nature of modern scientific workflows with AI/ML-assisted science requires different features in performance tools for their effective analysis. In this section, we discuss some of them. First, considering that workflows consisting of multiple programs have multiple source code components, the tool should preserve the readability of the complex code base, should be minimally intrusive in terms of the amount of instrumentation code, and should minimize introducing any instrumentation-related errors. It should be user-friendly without a steep learning curve, and preferably have a human-readable trace file format that allows users to easily understand and validate the data.
Second, given the emergence of complex AI/ML workflows that rely on both C/C++ and Python interactions, the tool should be able to support instrumentation and measurement across multiple languages with ease. Third, a tool that profiles scientific workflows effectively needs to support disparate data sources and generate trace files that are easily composable across different components (multiple binaries) and across heterogeneous systems (multiple HPC clusters). Fourth, given the sheer amount of data that a workflow can generate, the tool should be able to provide both coarse-grained, high-level analysis and finer-grained analysis with hardware performance counters if and when desired. Finally, the tool should be able to generate trace files that can be easily visualized with existing and commonly available visualization tools [75].

This section presents a discussion of some projects, techniques, and tools from the literature for the analysis and optimization of scientific workflows. Complementing this perspective, a recent survey on I/O analysis and tooling [76] underscores that AI/ML integration is making application behavior increasingly complex, which also strengthens the need for detailed I/O tracking and profiling.

1) Workflow Performance Profiles: Król et al. present a holistic process for the development and analysis of Workflow Performance Profiles. A performance profile for a given job and a metric can be described as a time series with values of the metric measured at equidistant points in time during the job execution [77]. When data is collected from multiple executions of the same workflow configuration, statistically significant performance profiles for each job in a workflow can be computed. In order to generate workflow performance profiles, a workflow and a set of distinct input parameters are taken as inputs. For each possible combination of these parameters, a workflow execution is performed.
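Averaging the resulting profiles reduces to a point-wise aggregation of equidistant time series. A minimal sketch of that step (the metric values are invented for illustration):

```python
def average_profile(runs):
    """Point-wise mean across repeated executions of the same job.
    Each run is a list of metric samples taken at equidistant times."""
    assert len({len(r) for r in runs}) == 1, "runs must be equally sampled"
    return [sum(vals) / len(vals) for vals in zip(*runs)]

# Two executions of one job, e.g. CPU utilization sampled at t = 0, 1, 2.
runs = [
    [0.25, 1.0, 0.5],
    [0.75, 0.5, 0.5],
]
print(average_profile(runs))  # [0.5, 0.75, 0.5]
```

In the actual methodology, statistical significance across runs matters as much as the mean, so variance or confidence intervals would be computed alongside it.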
The workflow performance data collected using the monitoring tools is then used to build average profiles.

2) Workflow Roofline Model: In their work, Ding et al. introduce a Workflow Roofline Model, which ties a workflow’s end-to-end performance to peak node- and system-performance constraints [74]. Using this new Workflow Roofline model, both node performance (FLOPS, data movement in CPU/GPU, etc.) and system performance can be analyzed, which is an expansion of the original Roofline model to the workflow domain. It also allows for the analysis of potential performance bottlenecks (node-bound vs. system-bound), thereby guiding workflow performance optimizations. Figure 6 shows the Workflow Roofline interpretation for time- and throughput-sensitive workflows.

3) Top-down Performance Analysis: Tschüter et al. present a top-down performance analysis methodology for scientific workflows by presenting event logs for both individual job steps and derived statistics for the workflow levels [78]. This work widens the scope of performance analysis beyond computational performance by including queue waiting times, storage capacities, and data transfer speeds. The analysis starts with an overview of the entire workflow and then proceeds down the hierarchical structure of workloads, thereby increasing analysis detail. First, aggregated results are provided at the workflow level through visualization, which allows users to pinpoint problematic jobs. Once a job is selected, detailed information about that job is provided.

4) Stampede: Complex scientific workflows often experience failures due to temporary or localized resource problems. For example, a compute node may have a bad disk, or a network file server may be overloaded and time out.
The Stampede (Synthesized Tools for Archiving, Monitoring Performance, and Enhanced DEbugging) project applies offline workflow log analysis to address reliability and performance problems for large, complex scientific workflows [79]. Specifically, Stampede integrates NetLogger [80] and Pegasus-WMS into a general-purpose framework to analyze the performance and failure states of running workflows. In general, Stampede has algorithms for workflow failure prediction, probabilistic anomaly detection, and a web-based dashboard to visually analyze and monitor workflows.

5) RAMSES Project: The RAMSES (Robust Analytic Models for Science at Extreme Scales) project, launched by the U.S. DOE in 2014, seeks to develop end-to-end analytical performance models for complex scientific workflows in extreme-scale environments. The project goals include building tools that can explain, predict, and optimize workflow performance by combining first-principles modeling (networks, storage, and compute) with data-driven calibration and model synthesis. A performance advisor tool and a database of observed behavior and parameter estimates are also part of its infrastructure.

6) PerfFlowAspect: PerfFlowAspect is a lightweight, open-source performance analysis tool specifically designed to address the challenges posed by complex scientific workflows [75], [81], [82]. Unlike traditional performance tools that primarily target single-binary, bulk-synchronous applications, PerfFlowAspect is tailored for modern, heterogeneous workflows composed of multiple binaries, programming languages, and clusters.

Fig. 6. The Workflow Roofline interpretation for time- and throughput-sensitive workflows [74].

Fig. 7. The top-down workflow analysis methodology follows the workflows’ hierarchy. The top level provides a general overview of the entire workflow while subsequent levels provide more detailed information for the individual workflow components [78].
import PerfFlowAspect
import PerfFlowAspect.aspect

@PerfFlowAspect.aspect.critical_path("around")
def foo(msg):
    do_work_here()
    return 1 if msg == "hello" else 0

def main():
    foo("hello")

Listing 1. Python code with PerfFlowAspect-annotated functions.

{"name": "foo", "cat": "__main__", "pid": 26356, "tid": 26356, "ts": 1712881071887511.8, "ph": "X", "dur": 3043.0},

Listing 2. PerfFlowAspect-generated trace file with compact logging.

Built on aspect-oriented programming (AOP) principles, PerfFlowAspect introduces minimal, non-intrusive instrumentation, typically requiring only a single line of annotation per function (Listing 1), thereby maintaining code readability and reducing user error. It supports C/C++, CUDA, and Python, and uses language-appropriate mechanisms (e.g., Python decorators and LLVM [83] passes for C++) to enable performance monitoring without burdening the user with steep learning curves or extensive code modifications. Moreover, PerfFlowAspect generates performance trace files (Listing 2) in Chrome Tracing Format (CTF), making it composable and easily visualizable through tools like Perfetto. It also supports integration with Caliper, enabling the collection of low-level hardware counters without requiring additional annotations. This allows users to start with high-level, coarse-grained analysis and drill down into fine-grained metrics as needed.

PerfFlowAspect effectively captures detailed performance insights with minimal overhead (averaging 0.9%), supporting both holistic and application-level analysis. Its trace composability across binaries and systems, support for multiple languages, and flexible instrumentation make it particularly well suited for emerging AI- and ML-driven scientific workflows operating in distributed HPC environments.

Fig. 8. Chimbuko’s architectural diagram with its four major components. This image shows two applications running concurrently [85].
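Because each workflow component emits Chrome Tracing Format events, traces from separate binaries can in principle be merged by concatenating their event lists into one timeline. The sketch below is a hedged illustration of that composability (the component names and the tagging scheme are invented, not part of PerfFlowAspect):

```python
import json

def merge_ctf(traces):
    """Merge Chrome-Tracing-Format event lists from several workflow
    components into one stream viewable in Perfetto. Each component's
    name is recorded in the event's 'args' so its origin stays visible."""
    merged = []
    for component, events in traces.items():
        for ev in events:
            ev = dict(ev)                      # do not mutate the input
            ev.setdefault("args", {})["component"] = component
            merged.append(ev)
    # Sort by timestamp so the combined timeline is coherent.
    merged.sort(key=lambda ev: ev["ts"])
    return merged

# Illustrative events from two separate binaries of one workflow.
sim = [{"name": "solve", "ph": "X", "pid": 1, "tid": 1, "ts": 10.0, "dur": 5.0}]
ml  = [{"name": "train", "ph": "X", "pid": 2, "tid": 1, "ts": 8.0,  "dur": 4.0}]

merged = merge_ctf({"simulation": sim, "training": ml})
print(json.dumps([e["name"] for e in merged]))  # ["train", "solve"]
```

In practice the timestamps of traces recorded on different clusters would first need clock alignment; the sketch assumes a shared time base.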
7) Chimbuko: Chimbuko [85] is a performance-analysis framework designed for large-scale HPC applications and scientific workflows, with a particular emphasis on online, distributed anomaly detection. Instead of collecting full traces, Chimbuko uses a combination of TAU instrumentation, provenance tracking, and an in situ analytics pipeline to identify anomalous events during execution and capture only their associated context. This significantly reduces trace data volume while enabling detailed, workflow-level diagnosis of performance issues such as load imbalance, unexpected communication behavior, or abnormal task runtimes. Figure 8 shows all the components in Chimbuko’s architecture. Chimbuko supports interactive exploration through its web-based visualization interface and is designed to operate at extreme scales, making it suitable for modern workflow environments where applications are highly parallel, multi-component, and dynamically interacting. As such, it represents a shift toward performance-aware workflow monitoring that integrates provenance, anomaly detection, and trace analytics into a unified runtime system.

TABLE IV
CHARACTERISTICS OF SOME OF THE VISUALIZATION TOOLS [84]

Tool | Trace | Profile | Architecture (Roofline) | Distributed Memory: Thread | Distributed Memory: Rank | Loop | Function: Instrumentation | Function: Sampling | Workflow Aware
Intel Advisor | p | p | p | p | ✓ | p | ✓ | ✓ | p
Intel VTune | ✓ | ✓ | ✓ | ✓ | p | p | p | p | p
NSight | p | p | p | p | ✓ | p | ✓ | ✓ | p
NSight Systems | ✓ | ✓ | p | ✓ | ✓ | p | p | p | p
Paraprof | ✓ | ✓ | ✓ | ✓ | p | p | p | ✓ | p
Perfetto | ✓ | ✓ | ✓ | ✓ | p | p | ✓ | ✓ | p
Hpcviewer | ✓ | ✓ | p | ✓ | p | p | ✓ | p | p
Boxfish | p | p | p | p | p | ✓ | p | ✓ | p
Vampir | ✓ | ✓ | ✓ | ✓ | p | ✓ | p | ✓ | p
Jumpshot | ✓ | ✓ | p | p | p | ✓ | p | p | p
Grafana | ✓ | ✓ | ✓ | p | p | p | p | p | p
Hatchet | p | p | ✓ | ✓ | p | p | p | ✓ | p
Thicket | p | p | ✓ | ✓ | p | p | p | ✓ | p
Drishti | ✓ | ✓ | p | p | p | p | - | - | p

8) DFTracer: DFTracer [86] is a multi-level dataflow tracing tool designed to capture both application-level and low-level I/O events in complex, AI-driven scientific workflows.
Traditional I/O profilers often lack the ability to correlate application behavior with system-level I/O operations, making it challenging to analyze workflows that integrate simulation, data analytics, and machine learning components. DFTracer addresses this gap by providing a unified tracing interface that captures events across different layers of the software stack.

The tool outputs trace files in Chrome Tracing Format (CTF), facilitating visualization and analysis using tools like Perfetto. DFTracer’s optimized trace format and compression mechanisms enable efficient loading and analysis of large-scale workflows, with reported runtime overheads between 1-5% and trace sizes up to 7.1 times smaller compared to state-of-the-art tools. For analyzing the overlap between I/O and computation, DFTracer integrates with DFAnalyzer, which processes the generated traces to identify periods where I/O and computation are not effectively overlapped. This analysis helps pinpoint I/O bottlenecks and optimize data loading strategies in workflows that utilize frameworks like PyTorch.

DFTracer has been demonstrated on real-world AI-driven workflows, such as MuMMI and Megatron-DeepSpeed, showcasing its ability to capture multi-level performance data with minimal overhead. Its support for multiple programming languages, including C++, Python, and CUDA, along with its flexible instrumentation capabilities, makes it a valuable tool for performance analysis in modern scientific workflows.

9) Recorder: Recorder [87] is a multi-level I/O tracing framework developed for HPC applications that captures I/O function calls from different layers of the stack (POSIX, MPI-IO, HDF5, NetCDF, PnetCDF) via function interposition, without requiring modifications or recompilation of the application. Users can control which I/O layers are traced, enabling flexible trade-offs between detail and overhead.
One of Recorder’s key innovations is its pattern-recognition-based compression algorithm, which identifies repeated I/O patterns both within individual processes and across multiple processes. This allows large-scale runs to generate richly detailed traces while managing trace size growth and storage overhead.

Recorder is well suited for I/O-centric performance analysis in scientific workflows, especially when workflows involve heterogeneous I/O libraries (e.g., mixing HDF5, MPI-IO, POSIX) and layered tracing is required. It captures detailed I/O metadata (offsets, parameters, timing) that helps in diagnosing I/O bottlenecks, overlapping I/O, and pattern dependencies across tasks. However, Recorder focuses on I/O tracing and does not by itself provide workflow task dependency metadata, nor automatic correlation across multiple binaries or languages in a workflow.

10) Darshan: Darshan [88] is a widely used HPC I/O characterization tool designed to capture the I/O behavior of applications with very low overhead. It intercepts I/O calls (POSIX, MPI-IO, etc.), records per-file/per-process statistics such as operation counts, access sizes, timing, and access patterns, and compresses and logs them for postmortem analysis.

Darshan is well suited for analyzing I/O behavior within scientific workflows, especially for understanding I/O bottlenecks in component binaries or for comparing I/O patterns across tasks. Because Darshan is lightweight, it can be enabled broadly without overly perturbing application performance.

Key Takeaways

The HPC community maintains a diverse set of tools, each optimized for a specific balance of accuracy, overhead, scalability, and context of use. Tools differ in their design choices, such as the use of instrumentation or sampling. Instrumentation provides precise, event-level detail at higher cost, while sampling offers scalable, low-overhead insights with less precision.
Similarly, profile-based and trace-based tools cater to different analysis needs: aggregated summaries versus temporal behavior. These methodological and architectural choices reflect the target use case of each tool. An emerging use case is that of complex scientific workflows, which have different requirements than bulk-parallel applications when it comes to performance analysis tools. These requirements include, but are not limited to, minimal intrusion, ease of use, multi-language support, data composability, coarse- and fine-grained analysis, and visualization. V. VISUALIZATION Analyzing and interpreting performance data from parallel applications is inherently challenging. The situation becomes even more complex in scientific workflows, which generate vast and heterogeneous performance data across multiple components and execution stages. This complexity highlights the critical role of effective performance visualization techniques and tools to facilitate exploration, understanding, and optimization. Tools such as VisIt [89] have long focused on providing scientific visualization capabilities for extremely large datasets generated on parallel clusters. However, the visualization requirements for complex scientific workflows are different. It is important to ingest data coming from disparate sources and to visualize it holistically in the form of DAGs, Gantt charts, etc. This allows users to analyze workflow data at a coarser, higher level before looking at task- or application-level data, if required. Isaacs et al. organize a taxonomy of performance visualization into four contexts: hardware, software, tasks, and application [90]. The hardware context covers individual hardware elements, i.e., performance data directly related to nodes, cores, etc. The software context covers the source code or the application being run. The tasks context encompasses the individual tasks performed as part of the computation.
Lastly, the application context refers to what is actually being computed, i.e., the mathematical details, etc. Some visualizations may be characterized under more than one of these contexts. The contexts provided by Isaacs et al. help us understand what kinds of performance visualizations and tools already exist. The authors also highlight the challenges and opportunities for improvements in the performance visualization space. Table IV shows the different characteristics of the performance visualization tools in compact and binary format. In this section, we discuss the properties and capabilities of some of these visualization tools and techniques, especially in the context of complex scientific workflows. We begin by looking at a few WMS that provide workflow visualization support.
Fig. 9. Workflow Statistics displayed in pegasus-dashboard [39].
Fig. 10. Workflow Gantt Chart and Job Distribution Pie Charts displayed in pegasus-dashboard [39].
A. Pegasus WMS Visualizations Pegasus WMS provides built-in visualization and monitoring capabilities that help users track the execution and performance of complex workflows across distributed resources [39], [91]. Its dashboard and monitoring tools visualize the directed acyclic graph (DAG) of the workflow, showing task dependencies, execution states, durations, and resource mappings, thereby offering a holistic view of workflow progress, as shown in Figures 9 and 10. These visualizations are primarily designed for workflow-level provenance and performance monitoring rather than low-level performance profiling. Consequently, Pegasus effectively supports end-to-end workflow visualization and diagnosis but must be coupled with external profilers to capture detailed performance metrics at the application or kernel level. Fig. 11.
Top-down performance analysis and visualization methodology: Workflow Visualizer main window consisting of Workflow Graph (A), Job Summary (B1), Function Runtimes (B2), and Info Table (B3) [78]. B. Top-down Visualization The top-down performance analysis methodology (discussed in Section IV-B3) provides visualization for different data granularities. Figure 11 shows the top-level Workflow View window consisting of the Workflow Graph at the top and the Info View at the bottom. Users can choose to open a single job log file or a folder that contains all workflow job log and profile data files as they move down in their workflow analysis. C. Intel Advisor Intel Advisor (see IV-A9) is a performance analysis and optimization tool that helps identify vectorization, threading, memory, and GPU offload opportunities in C, C++, Fortran, and Python applications [72], [92]. Its roofline visualization provides an intuitive view of computational intensity and hardware utilization. However, Intel Advisor operates at the level of a single application or binary: the user instruments or runs the executable under Advisor, collects metrics, and then performs interactive visualization for each application run. Although it supports MPI and hybrid parallelism, it is not designed to integrate or visualize performance data from multiple, heterogeneous workflow components. In the context of complex scientific workflows comprising many interdependent steps, the tool may therefore be useful for analyzing individual program instances but less suitable for holistic, cross-step workflow performance visualization unless manual aggregation is applied. D. Intel VTune Intel VTune Profiler is a comprehensive performance analysis tool that collects and visualizes detailed hardware and software performance data to identify bottlenecks in parallel applications [93].
It supports C, C++, Fortran, and Python programs, offering timeline and hierarchical views of CPU utilization, threading behavior, memory bandwidth, vectorization efficiency, and GPU offload performance. VTune's visualization environment enables developers to explore performance metrics interactively within a single application or binary. However, like Intel Advisor, VTune operates at the program level and does not natively integrate or visualize performance data from multiple, heterogeneous workflow components. As such, it is effective for deep per-application profiling, but less suitable for holistic, workflow-level performance visualization without external data aggregation. E. Nsight Compute NVIDIA Nsight Compute is a kernel-level performance analysis tool designed for CUDA applications running on NVIDIA GPUs [94], [95]. It provides detailed performance metrics, source-level correlation, and interactive visualization of GPU kernel execution, including memory throughput, instruction mix, occupancy, and warp efficiency. Nsight Compute's GUI and command-line interfaces allow users to analyze bottlenecks at the kernel and instruction levels within a single GPU-enabled application. However, it is limited to profiling individual executables or kernels and does not natively support integrated visualization of performance data from multiple, heterogeneous workflow components. For multi-application scientific workflows, Nsight Compute is best suited for component-level GPU performance characterization rather than holistic workflow-level performance visualization. F. Nsight Systems NVIDIA Nsight Systems is a system-wide performance analysis tool that captures and visualizes the runtime behavior of heterogeneous applications across CPUs and GPUs [73], [96].
It provides timeline-based visualization showing interactions between processes, threads, CUDA kernels, memory transfers, and operating system activities, enabling users to identify bottlenecks and optimize concurrency between CPU and GPU workloads. Unlike Nsight Compute, which focuses on single-kernel analysis, Nsight Systems offers a broader execution view across multiple processes and libraries. However, while it can correlate activity across distributed or multi-process applications, it is primarily designed for profiling a single workflow run or binary ensemble rather than integrating performance data from diverse workflow systems. As such, it serves well for end-to-end tracing of GPU-accelerated workflow components but is not a native solution for holistic workflow performance visualization. G. ParaProf and PerfExplorer TAU offers tools such as ParaProf and PerfExplorer for performance data analysis [63], [97], [98]. ParaProf provides a graphical interface for viewing performance profiles, including call-path trees, bar charts, and scatter plots, enabling users to explore metrics across threads, processes, and functions. PerfExplorer complements this with advanced data mining, statistical analysis, and comparative studies across multiple runs or components. When performance data from multiple workflow steps or applications is collected into TAUdb, these tools can aggregate and visualize the metrics together, providing a semi-holistic view of workflow performance. However, ParaProf and PerfExplorer are not inherently workflow-aware: they do not visualize DAG structure or task dependencies, and effective workflow-level analysis requires that all components be instrumented and their data imported into TAUdb. These tools are therefore best suited for analyzing aggregated performance metrics across heterogeneous workflow components, rather than providing structural workflow visualization. H.
Perfetto Perfetto is an open-source performance analysis and visualization framework developed by Google for profiling and tracing complex software systems [99]. It provides a browser-based interface for interactive timeline visualization of events collected from CPUs, GPUs, and system processes, enabling users to explore correlations between tasks, threads, and I/O activities. Perfetto supports the Chrome Tracing Format and can aggregate traces from multiple applications or processes, offering a global system view similar to Nsight Systems. Although originally designed for Android and Chrome performance analysis, it has evolved into a general-purpose tracing framework applicable to HPC and workflow environments where distributed tracing is available. However, Perfetto itself does not provide native workflow-level abstraction (e.g., DAG or step-level visualization); such integration must be achieved through custom trace merging or higher-level tooling. I. HPCViewer HPCViewer is the interactive visualization component of the HPCToolkit performance analysis suite, providing a rich GUI to explore hierarchical performance measurements and calling context trees collected from parallel applications [55]. It enables users to navigate through calling contexts, correlate performance metrics such as time, hardware counters, and memory usage, and compare different program executions. Its top-down and bottom-up tree views, along with flame-graph-like visualizations, make it effective for identifying performance bottlenecks within a single application. However, HPCViewer operates primarily at the per-application or per-binary level; it visualizes data from individual HPCToolkit measurement databases rather than aggregating across multiple, heterogeneous workflow components. For workflow-oriented analysis, it could assist in profiling each step's performance but lacks inherent support for holistic, cross-component workflow visualization. J.
Boxfish Boxfish is a domain-aware performance visualization platform developed by Lawrence Livermore National Laboratory that enables interactive exploration of performance data across multiple domains, for example, application communication patterns, hardware network topology, and system link traffic [100]. It supports filtering, querying, and projecting data from one domain into another, allowing users to correlate communication, compute, and topology in a unified view. Although Boxfish is powerful for multi-domain correlation and supports large-scale system visualization, it is not explicitly designed for workflow-step abstraction; it assumes that the data pertains to one application's execution context rather than automatically integrating disparate workflow components. K. Vampir Vampir is a performance visualization tool designed for postmortem analysis of parallel program traces collected from large-scale HPC applications [49]. It provides an interactive GUI for exploring time-based behavior, communication, and synchronization events across thousands of processes or threads. Using traces generated by performance tools such as Score-P or TAU (through OTF/OTF2 formats), Vampir enables a detailed examination of computation–communication overlap, load imbalance, and collective operation timing. Its timeline and statistical displays allow users to correlate events across multiple application components or workflow stages, giving a coarse workflow-level view when traces from distinct binaries are merged. However, Vampir does not natively represent workflow dependencies or DAG structures, and full cross-step workflow visualization requires external trace aggregation and annotation. L. Jumpshot Jumpshot is a postmortem performance visualization tool originally developed at Argonne National Laboratory for profiling parallel MPI and threaded applications [101].
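Jumpshot-style timeline (Gantt) displays can be sketched compactly; the snippet below renders invented task records as an ASCII timeline, one row per event, purely to illustrate the display idiom rather than any tool's actual rendering.

```python
# Render a toy ASCII Gantt chart from (name, start, duration) records --
# a minimal stand-in for Jumpshot-style timeline views. The task names
# and times are invented for illustration.
tasks = [
    ("compute", 0, 6),
    ("MPI_Send", 6, 2),
    ("compute", 8, 4),
    ("MPI_Recv", 12, 3),
]

def gantt(tasks):
    # Total timeline width is the latest end time across all records.
    end = max(start + dur for _, start, dur in tasks)
    rows = []
    for name, start, dur in tasks:
        bar = " " * start + "#" * dur  # offset, then the event's extent
        rows.append(f"{name:>10} |{bar:<{end}}|")
    return "\n".join(rows)

print(gantt(tasks))
```

Real trace visualizers add zooming, per-rank rows, and event coloring on top of exactly this start/duration layout.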
It displays execution as timelines (Gantt charts) and histograms based on SLOG-2 or CLOG trace formats, enabling users to zoom into event sequences, detect anomalous durations, and view communication or computation phases over time. While Jumpshot excels at visualizing behavior within a single program's execution (especially communication and synchronization patterns), it is not inherently designed to aggregate or visualize performance data across multiple distinct workflow components or heterogeneous applications without manual integration. M. Grafana Grafana is an open-source visualization and analytics platform designed for real-time monitoring of metrics from diverse data sources, including Prometheus [102], InfluxDB, Elasticsearch, and proprietary telemetry systems [103]. It provides interactive dashboards, time series plots, and alerting capabilities that make it well suited for observing system performance and resource utilization. Grafana's plugin architecture enables integration with HPC monitoring stacks such as LDMS and Prometheus, allowing visualization across compute nodes, jobs, and applications. However, its focus is on metrics aggregation and system-level observability rather than program- or workflow-level performance analysis. While it can be configured to visualize workflow-related data through custom exporters or APIs, it lacks native support for linking fine-grained execution traces or profiling information across multiple workflow components. N. Hatchet Hatchet is a Python-based framework for analyzing hierarchical performance data collected from tools such as HPCToolkit, Caliper, or TAU [104], [105]. It provides a flexible data model for representing calling context trees (CCTs) or DAGs of performance metrics and supports powerful query, aggregation, and comparison operations through a Pandas-like API. Figure 12 shows the visualization supported by Hatchet.
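Hatchet's filter/aggregate style of CCT analysis can be approximated with a standard-library toy; the tree shape, metric values, and the 25% threshold below are all invented, and Hatchet's real API (GraphFrame objects backed by Pandas dataframes) differs.

```python
# Toy calling-context tree (CCT) with an inclusive-time metric per node,
# mimicking the kind of filter query Hatchet supports over real profiles.
# The tree shape and timings are invented for illustration.
cct = {
    "name": "main", "time": 100.0, "children": [
        {"name": "solve", "time": 70.0, "children": [
            {"name": "mpi_allreduce", "time": 25.0, "children": []},
            {"name": "stencil", "time": 45.0, "children": []},
        ]},
        {"name": "io_write", "time": 20.0, "children": []},
    ],
}

def walk(node, path=()):
    """Yield (call path, node) pairs in depth-first order."""
    path = path + (node["name"],)
    yield path, node
    for child in node["children"]:
        yield from walk(child, path)

# Hatchet-style filter: which call paths spend at least 25% of total runtime?
total = cct["time"]
hot = {"/".join(p): n["time"] for p, n in walk(cct) if n["time"] / total >= 0.25}
for path, t in sorted(hot.items(), key=lambda kv: -kv[1]):
    print(f"{t:6.1f}  {path}")
```

Here the filter keeps `main`, `main/solve`, `main/solve/stencil`, and `main/solve/mpi_allreduce` while pruning the cheap `io_write` subtree, which is the essence of CCT-based hot-path analysis.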
Fig. 12. Three different visualizations (the call graph, a tree-based call path, and a flame graph) generated by Hatchet for the same data [104]. These visualizations are well-suited for a single application or a single component of a workflow.
Hatchet's strength lies in post hoc analysis and data reduction across multiple runs or configurations, making it useful for comparative performance studies. However, it operates at the level of performance data extracted from individual applications rather than full workflow systems. Although it can merge data from several program executions, it lacks inherent mechanisms to visualize or analyze performance across heterogeneous workflow components with distinct execution semantics, unless the data is manually integrated. O. Thicket Thicket is a Python-based framework built on top of Hatchet to enable scalable analysis of ensembles of performance profiles collected from multiple runs, configurations, or platforms [106]. It extends Hatchet's data model to represent collections of CCTs, supporting statistical analysis, correlation, and machine learning workflows on performance data. Thicket provides capabilities for grouping, filtering, and visualizing multi-run trends, making it particularly useful for comparative performance studies or identifying variability across large experiment sets. However, similar to Hatchet, Thicket operates at the application level and is not designed for workflow-level integration, where data from distinct, heterogeneous workflow components are automatically linked. Its visualization and analytics focus on ensembles of program executions rather than orchestrated workflow structures. P. Drishti Drishti [107]–[109] is an I/O performance visualization tool developed at Lawrence Berkeley National Laboratory to assist in the analysis of large-scale parallel I/O behavior.
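The access-pattern detection that I/O tools such as Darshan, Recorder, and Drishti automate can be illustrated by a toy classifier over per-file (offset, size) request streams; the traces below are invented and real tools use far richer heuristics.

```python
# Classify a per-file stream of (offset, size) read requests as sequential,
# strided, or random -- a toy version of the access-pattern detection that
# I/O characterization tools automate. The request traces are invented.
def classify(requests):
    # Gap between the end of one request and the start of the next.
    gaps = [requests[i + 1][0] - (requests[i][0] + requests[i][1])
            for i in range(len(requests) - 1)]
    if all(g == 0 for g in gaps):
        return "sequential"      # back-to-back accesses
    if len(set(gaps)) == 1:
        return "strided"         # constant non-zero gap
    return "random"              # no regular structure

seq_trace     = [(0, 4096), (4096, 4096), (8192, 4096)]
strided_trace = [(0, 1024), (4096, 1024), (8192, 1024)]
random_trace  = [(0, 512), (900000, 512), (128, 512)]

for label, trace in [("A", seq_trace), ("B", strided_trace), ("C", random_trace)]:
    print(label, classify(trace))
```

Flagging files whose requests classify as random is one simple way such tools steer users toward buffering or collective-I/O fixes.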
It provides a scalable and interactive environment for visualizing I/O traces collected from HPC applications, helping users identify access patterns, bottlenecks, and inefficiencies in parallel file system usage. Drishti focuses on presenting correlations between application-level events and I/O activity, making it easier to interpret complex performance data across multiple layers of the I/O stack. Drishti provides fine-grained visualization of I/O traces from individual HPC applications, offering detailed intra-application insight into data movement and access patterns. While it can be used to compare I/O behavior across multiple runs, it is not inherently workflow-aware. Key Takeaways: Traditional visualization tools are optimized for detailed, intra-application analysis of tightly coupled parallel programs, focusing on timelines, communication patterns, or per-rank metrics. However, complex scientific workflows introduce new challenges that extend visualization beyond individual applications. Workflow-level visualization must capture both intra-application behavior and inter-application relationships across heterogeneous components, execution environments, and time scales. It must integrate data from disparate sources, linking system-level and application-level metrics while preserving contextual information about data and control dependencies. Consequently, workflow-aware visualization tools prioritize scalability, data composability, interactivity, and cross-component correlation to provide holistic insight into end-to-end workflow behavior rather than just isolated application performance. VI. CONCLUSION Complex scientific workflows, integrated with artificial intelligence and machine learning, are becoming central to modern scientific research. Ensuring that these workflows utilize expensive HPC resources efficiently requires robust performance analysis and visualization.
While numerous tools exist in the HPC ecosystem, most were originally designed for single, bulk-synchronous applications. Consequently, they fall short of addressing the performance and visualization needs of complex workflows that involve multiple interdependent parallel tasks and significant data movement. To support such workloads effectively, tools must extend their capabilities beyond per-task or per-application analysis to provide a holistic view of workflow execution and dependencies, enabling the identification of bottlenecks and critical paths. This survey has reviewed state-of-the-art performance analysis and visualization tools, outlining their core characteristics and assessing their suitability for workflow-level performance evaluation. Our findings indicate that while these tools excel at fine-grained analysis, few support integrated coarse-grained visualization across entire workflows. This gap underscores the need for new tools, or extensions to existing ones, that explicitly target workflow-aware performance analysis. Workflow Management Systems (WMS) play a complementary role by enabling workflow design, orchestration, and execution. However, workflow characteristics remain constrained by the underlying WMS. Some systems provide limited mechanisms for performance data collection, while others, such as Pegasus, have made progress toward holistic workflow visualization. These developments signal both a gap and an opportunity for deeper integration between WMS and performance analysis frameworks. Looking forward, one promising direction is workflow performance prediction, particularly for sub-tasks that lie on the critical path. Given the high cost and setup time of scientific workflows, predicting the performance of workflow tasks with different configurations could significantly reduce experimentation overhead. Large Language Models (LLMs) offer a new avenue for such predictive modeling.
Recent studies [110]–[113] have proposed and explored the extent to which LLMs can understand and reason about HPC performance behavior, reflecting a growing research interest in their potential for optimization and prediction. Incorporating these capabilities into workflow analysis pipelines could guide scientists toward optimal configurations before execution, ultimately accelerating discovery and improving resource efficiency. REFERENCES [1] D. Qian, “High performance computing: a brief review and prospects,” National Science Review, vol. 3, no. 1, pp. 16–23, 2016. [2] R. Ferreira da Silva, R. Filgueira, I. Pietri, M. Jiang, R. Sakellariou, and E. Deelman, “A characterization of workflow management systems for extreme-scale applications,” Future Generation Computer Systems, vol. 75, pp. 228–238, 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167739X17302510 [3] T. Ben-Nun, T. Gamblin, D. S. Hollman, H. Krishnan, and C. J. Newburn, “Workflows are the new applications: Challenges in performance, portability, and productivity,” in 2020 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), 2020, pp. 57–69. [4] P. Krzyzanowski, “Bulk synchronous parallel and pregel,” 2022, accessed: April 3, 2025. [Online]. Available: https://people.cs.rutgers.edu/~pxk/417/notes/pregel.html [5] R. C. Calinescu, The Bulk-Synchronous Parallel Model. London: Springer London, 2000, pp. 5–12. [Online]. Available: https://doi.org/10.1007/978-1-4471-0763-7_2 [6] F. Ferrarotti, S. González, and K.-D. Schewe, “BSP abstract state machines capture bulk synchronous parallel computations,” Sci. Comput. Program., vol. 184, no. C, Oct. 2019. [Online]. Available: https://doi.org/10.1016/j.scico.2019.102319 [7] C. Navarro, N. Hitschfeld, and L. Mateu, “A survey on parallel computing and its applications in data-parallel problems using gpu architectures,” Communications in Computational Physics, vol. 15, pp.
285–329, 09 2013. [8] I. Karlin, A. Bhatele, J. Keasler, B. Chamberlain, J. Cohen, Z. DeVito, R. Haque, D. Laney, E. Luke, F. Wang, D. Richards, M. Schulz, and C. Still, “Exploring traditional and emerging parallel programming models using a proxy application,” 05 2013, pp. 919–932. [9] T. Adefemi, “What every computer scientist needs to know about parallelization,” 2025. [Online]. Available: https://arxiv.org/abs/2504.03647 [10] X. Zhao, M. Papagelis, A. An, B. X. Chen, J. Liu, and Y. Hu, “Elastic bulk synchronous parallel model for distributed deep learning,” 2020. [Online]. Available: https://arxiv.org/abs/2001.01347 [11] E. Deelman, D. Gannon, M. Shields, and I. Taylor, “Workflows and e-science: An overview of workflow system features and capabilities,” Future Generation Computer Systems, vol. 25, no. 5, pp. 528–540, 2009. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167739X08000861 [12] B. Ludäscher, S. Bowers, and T. McPhillips, Scientific Workflows. Boston, MA: Springer US, 2009, pp. 2507–2511. [Online]. Available: https://doi.org/10.1007/978-0-387-39940-9_1471 [13] S. B. Davidson and J. Freire, “Provenance and scientific workflows: challenges and opportunities,” in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’08. New York, NY, USA: Association for Computing Machinery, 2008, pp. 1345–1350. [Online]. Available: https://doi.org/10.1145/1376616.1376772 [14] E. Deelman, T. Peterka, I. Altintas, C. D. Carothers, K. K. van Dam, K. Moreland, M. Parashar, L. Ramakrishnan, M. Taufer, and J. Vetter, “The future of scientific workflows,” The International Journal of High Performance Computing Applications, vol. 32, no. 1, pp. 159–175, 2018. [Online]. Available: https://doi.org/10.1177/1094342017704893 [15] G. Juve, A. Chervenak, E. Deelman, S. Bharathi, G. Mehta, and K. Vahi, “Characterizing and profiling scientific workflows,” Future Generation Computer Systems, vol. 29, no. 3, pp.
682–692, 2013, Special Section: Recent Developments in High Performance Computing and Security. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167739X12001732 [16] A. Al-Saadi, D. H. Ahn, Y. Babuji, K. Chard, J. Corbett, M. Hategan, S. Herbein, S. Jha, D. Laney, A. Merzky, T. Munson, M. Salim, M. Titov, M. Turilli, T. D. Uram, and J. M. Wozniak, “Exaworks: Workflows for exascale,” in 2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS), 2021, pp. 50–57. [17] V. Cuevas-Vicenttín, S. Dey, S. Köhler, S. Riddle, and B. Ludäscher, “Scientific workflows and provenance: Introduction and research opportunities,” Datenbank Spektrum, vol. 12, no. 4, pp. 193–203, 2012. [Online]. Available: https://link.springer.com/article/10.1007/s13222-012-0100-z [18] F. Di Natale, H. Bhatia, T. S. Carpenter, C. Neale, S. Kokkila-Schumacher, T. Oppelstrup, L. Stanton, X. Zhang, S. Sundram, T. R. W. Scogland, G. Dharuman, M. P. Surh, Y. Yang, C. Misale, L. Schneidenbach, C. Costa, C. Kim, B. D’Amora, S. Gnanakaran, D. V. Nissley, F. Streitz, F. C. Lightstone, P.-T. Bremer, J. N. Glosli, and H. I. Ingólfsson, “A massively parallel infrastructure for adaptive multiscale simulations: modeling ras initiation pathway for cancer,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’19. New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available: https://doi.org/10.1145/3295500.3356197 [19] D. H. Ahn, X. Zhang, J. Mast, S. Herbein, F. Di Natale, D. Kirshner, S. A. Jacobs, I. Karlin, D. J. Milroy, B. De Supinski, B. Van Essen, J. Allen, and F. C. Lightstone, “Scalable composition and analysis techniques for massive scientific workflows,” in 2022 IEEE 18th International Conference on e-Science (e-Science), 2022, pp. 32–43. [20] H. Bhatia, T. A. Patki, S. Brink, L. Pottier, T. M. Stitt, K. Parasyris, D. J. Milroy, D. E. Laney, R. C. Blake, J.-S. Yeom, P.-T.
Bremer, and C. Doutriaux, “Autonomous multiscale library,” Accessed: 2023-10-15, 2023, version 1.0.0. [Online]. Available: https://github.com/LLNL/AMS [21] J. L. Peterson, T. Bender, R. Blake, N.-Y. Chiang, M. G. Fernández-Godino, B. Garcia, A. Gillette, B. Gunnarson, C. Hansen, J. Hill et al., “Toward digital design at the exascale: An overview of project icecap,” Physics of Plasmas, vol. 31, no. 6, 06 2024. [Online]. Available: https://www.osti.gov/biblio/2407209 [22] R. S. Graves, T. H. Jordan, S. Callaghan, E. Deelman, E. Field, G. Juve, C. Kesselman, P. Maechling, G. Mehta, K. Milner, D. Okaya, P. Small, and K. Vahi, “Cybershake: A physics-based seismic hazard model for southern california,” Pure and Applied Geophysics, vol. 168, no. 3, pp. 367–381, 2010. [Online]. Available: https://doi.org/10.1007/s00024-010-0161-6 [23] M. Rynge, G. Juve, J. Kinney, J. Good, B. Berriman, A. Merrihew, and E. Deelman, “Producing an infrared multiwavelength galactic plane atlas using montage, pegasus, and amazon web services,” in Astronomical Data Analysis Software and Systems XXII, M. W. Wertheimer and D. A. Bohlender, Eds., vol. 475. Astronomical Society of the Pacific, 2013, pp. 345–348. [Online]. Available: https://pegasus.isi.edu/wordpress/wp-content/papercite-data/pdf/rynge2013adass.pdf [24] T. A. Desautels, K. T. Arrildt, A. T. Zemla, E. Y. Lau, F. Zhu, D. Ricci, S. Cronin, S. J. Zost, E. Binshtein, S. M. Scheaffer, B. Dadonaite, B. K. Petersen, T. B. Engdahl, E. Chen, L. S. Handal, L. Hall, J. W. Goforth, D. Vashchenko, S. Nguyen, D. R. Weilhammer, J. K.-Y. Lo, B. Rubinfeld, E. A. Saada, T. Weisenberger, T.-H. Lee, B. Whitener, J. B. Case, A. Ladd, M. S. Silva, R. M. Haluska, E. A. Grzesiak, C. G. Earnhart, S. Hopkins, T. W. Bates, L. B. Thackray, B. W. Segelke, A. M. Lillo, S. Sundaram, J. Bloom, M. S. Diamond, J. E. Crowe, R. H. Carnahan, and D. M.
Faissol, “Computationally restoring the potency of a clinical antibody against sars-cov-2 omicron subvariants,” bioRxiv, 2023. [Online]. Available: https://www.biorxiv.org/content/early/2023/04/24/2022.10.21.513237 [25] T. Coleman, H. Casanova, L. Pottier, M. Kaushik, E. Deelman, and R. F. da Silva, “Wfcommons: A framework for enabling scientific workflow research and development,” 2021. [Online]. Available: https://arxiv.org/abs/2105.14352 [26] C. S. Liew, M. P. Atkinson, M. Galea, T. F. Ang, P. Martin, and J. I. V. Hemert, “Scientific workflows: Moving across paradigms,” ACM Comput. Surv., vol. 49, no. 4, Dec. 2016. [Online]. Available: https://doi.org/10.1145/3012429 [27] M. Atkinson, S. Gesing, J. Montagnat, and I. Taylor, “Scientific workflows: Past, present and future,” Future Generation Computer Systems, vol. 75, pp. 216–227, 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167739X17311202 [28] Unknown, “Workflow Patterns,” http://www.workflowpatterns.com, 2010-2023. [29] S. Bharathi, A. Chervenak, E. Deelman, G. Mehta, M.-H. Su, and K. Vahi, “Characterization of scientific workflows,” in 2008 Third Workshop on Workflows in Support of Large-Scale Science, 2008, pp. 1–10. [30] P. Krawczuk, G. Papadimitriou, R. Tanaka, T. M. Anh Do, S. Subramanya, S. Nagarkar, A. Jain, K. Lam, A. Mandal, L. Pottier, and E. Deelman, “A performance characterization of scientific machine learning workflows,” in 2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS), 2021, pp. 58–65. [31] R. Ferreira da Silva, H. Casanova, K. Chard, I. Altintas, B. Balis, T. Coleman, F. Coppens, F. Natale, B. Enders, T. Fahringer, R. Filgueira, G. Fursin, D. Garijo, C. Goble, D. Howell, S. Jha, D. Katz, D. Laney, and M. Wolf, “A community roadmap for scientific workflows research and development,” 11 2021, pp. 81–90. [32] R. F. da Silva, “MAGIC: Machine Learning Guided Scientific Discovery,” 2023, presentation at NITRD MAGIC, May 3, 2023.
[Online]. Available: https://www.nitrd.gov/nitrdgroups/images/2023/MAGIC-Rafael-Ferreira-da-Silva-05032023.pdf [33] R. M. Badia, L. Berti-Equille, R. F. da Silva, and U. Leser, “Integrating HPC, AI, and Workflows for Scientific Data Analysis (Dagstuhl Seminar 23352),” Dagstuhl Reports, vol. 13, no. 8, pp. 129–164, 2024. [Online]. Available: https://drops.dagstuhl.de/entities/document/10.4230/DagRep.13.8.129 [34] S. Jha, V. R. Pascuzzi, and M. Turilli, “Ai-coupled hpc workflows,” 2022. [Online]. Available: https://arxiv.org/abs/2208.11745 [35] F. Suter, T. Coleman, İlkay Altintaş, R. M. Badia, B. Balis, K. Chard, I. Colonnelli, E. Deelman, P. Di Tommaso, T. Fahringer, C. Goble, S. Jha, D. S. Katz, J. Köster, U. Leser, K. Mehta, H. Oliver, J.-L. Peterson, G. Pizzi, L. Pottier, R. Sirvent, E. Suchyta, D. Thain, S. R. Wilkinson, J. M. Wozniak, and R. Ferreira da Silva, “A terminology for scientific workflow systems,” Future Generation Computer Systems, vol. 174, p. 107974, 2026. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167739X25002699 [36] N. Dube, D. Roweth, P. Faraboschi, and D. Milojicic, “Future of hpc: The internet of workflows,” IEEE Internet Computing, vol. 25, no. 5, pp. 26–34, 2021. [37] A. B. Yoo, M. A. Jette, and M. Grondona, “Slurm: Simple linux utility for resource management,” in Job Scheduling Strategies for Parallel Processing, D. Feitelson, L. Rudolph, and U. Schwiegelshohn, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 44–60. [38] J. Liu, E. Pacitti, P. Valduriez, and M. Mattoso, “A survey of data-intensive scientific workflow management,” Journal of Grid Computing, vol. 13, 03 2015. [39] E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan, P. J. Maechling, R. Mayani, W. Chen, R. Ferreira da Silva, M. Livny, and K. Wenger, “Pegasus, a workflow management system for science automation,” Future Generation Computer Systems, vol. 46, pp. 17–35, 2015. [40] D. Gunter, B.
Tierney et al., “Online workflow management and performance analysis with the pegasus workflow management system,” in Proceedings of the 2011 IFIP/IEEE International Symposium on Integrated Network Management (IM’11), 2011, pp. 988–995. [41] G. Papadimitriou, C. Wang, K. Vahi, R. Ferreira da Silva, A. Mandal, Z. Liu, R. Mayani, M. Rynge, M. Kiran, V. Lynch, R. Kettimuthu, E. Deelman, J. Vetter, and I. Foster, “End-to-end online performance data capture and analysis for scientific workflows,” Future Generation Computer Systems, vol. 117, pp. 170–185, 2021. [42] Pegasus Team, “Pegasus workflow management system documentation,” https://pegasus.isi.edu/documentation/, 2025, accessed: 2025-01-22. [43] P. Di Tommaso, M. Chatzou, E. W. Floden, P. P. Barja, E. Palumbo, and C. Notredame, “Nextflow enables reproducible computational workflows,” Nature Biotechnology, vol. 35, no. 4, pp. 316–319, 2017. [44] Nextflow. Reports — nextflow documentation. Accessed: 2025-10-22. [Online]. Available: https://www.nextflow.io/docs/latest/reports.html [4