Project 81: Reproducibility of Computational Results
------------------------------------------------------------------------------------------------------

According to the US National Academy of Sciences (http://nap.edu/25303): We define reproducibility to mean computational reproducibility: obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis; and replicability to mean obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data.

In short, reproducibility involves the original data and code; replicability involves new data collection and methods similar to those of previous studies. A third concept, generalizability, refers to the extent to which the results of a study apply in other contexts or populations that differ from the original one.

In this project we consider reproducibility in the context of numerical software. Of concern are:

* Techniques and tools that enable the sharing of research objects, in our case numerical software and related data, which are prerequisites for reproducibility.

* Cases where it is beneficial to have exactly reproducible results, and the techniques and tools needed to enable it.

* How one defines and achieves reproducibility in cases where the underlying computations can be expected to exhibit variation due to floating-point characteristics, problem conditioning, algorithmic choices, and stochastic aspects of the underlying model.

------------------------------------------------------------------------------------------------------

Shanghai 2013:
==============

A scientific result cannot be said to be completely established until it can be independently verified. Unfortunately, reproduction of results is rarely done in computational science. Algorithms and experimental conditions are often not fully described in published papers. Numerical software, which represents the most precise documentation of a computational experiment, is often inaccessible to other researchers. Not only does this make replication much more difficult, but it results in lost opportunities to build upon existing work. This topic was the focus of a workshop at the Institute for Computational and Experimental Research in Mathematics (ICERM) at Brown University in December 2012; see http://www.siam.org/news/news.php?id=2078. In this project we will monitor developments in this area and consider how WG 2.5 can contribute to improving the environment for reproducible research in computational science.

A report "INTEL MATH KERNEL LIBRARY and REPRODUCIBLE COMPUTING" by Andrew Dienstfrey is available at https://wg25.taa.univie.ac.at/ifip/intern/ReproducibleComputing.pdf

Vienna 2014:
============

Ron Boisvert briefed the group on a project on replicated computational results (RCR) being undertaken by the ACM Transactions on Mathematical Software (TOMS). The effort is being led by TOMS Editor-in-Chief Mike Heroux, who is also a WG 2.5 member. TOMS is initiating an optional second level of review for its papers, an RCR review. A single RCR reviewer, whose identity will be known to the authors, will be appointed for such papers. The RCR reviewer will attempt to replicate the computational results of the paper, possibly with the assistance of the authors. For this purpose, "replicate" does not necessarily mean to bitwise reproduce the computational results of the paper.
A more important criterion is that the computations done by the RCR reviewer yield the same conclusions as the paper under review. In some cases papers may be revised as a result of the RCR review to enable or ease replicability. Papers whose results are successfully replicated will receive a special designation. The RCR reviewer will be asked to write a brief account of the process, which will be published as a separate paper in the same issue; it is hoped that the publication of the RCR review will provide sufficient incentive to enlist such reviewers. The process is being tried out on a few papers currently under review in order to vet the proposed procedure.

Halifax 2015:
=============

Report by Dienstfrey and Boisvert:

Questions of reproducibility continue to be a concern across a wide range of scientific disciplines. For example, in February 2015 the meeting "Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results: A Workshop" was convened by the U.S. National Academies of Sciences to address this issue [1]. Notably, Ron Boisvert served as an invited panelist representing both NIST and ACM. Also in this last year a reproducibility policy statement was issued by the journal Science [2], echoing similar policies established previously by Nature [3].

Ron Boisvert presented the ACM position on reproducibility at the 2015 workshop "Applied and Computational Mathematics Days" in Halifax. The ACM Transactions on Mathematical Software (TOMS) has unveiled a new refereeing process aimed at improving the reproducibility of computational results published in the journal. The main idea is to provide incentives for both authors and reviewers to encourage independent reproduction of scientific results. The new process, along with the first example of its use, was published in the June 2015 issue of TOMS, http://dl.acm.org/citation.cfm?id=2786970. Under the new scheme, papers which have survived the first round of regular refereeing may opt for an additional "replicability review." Here, a single reviewer, who is expected to be a software professional rather than a researcher, attempts to verify the main conclusions of the paper, typically by replicating the computations leading to its main results. The reviewer may use software provided by the author in order to accomplish this task. The identity of the reviewer is known to the author in order to enable free exchange of questions and answers during the process. In the end the reviewer writes a short paper describing the experience, which is published along with the paper under review in TOMS. Papers whose results are successfully replicated are appropriately branded as "Replicated Computational Results" in the published PDF and in the page associated with the paper in the ACM Digital Library. The process is completely described in the TOMS Editorial at http://dx.doi.org/10.1145/2743015; the first paper through the process is at http://dx.doi.org/10.1145/2764454, and the corresponding replicability report is at http://dx.doi.org/10.1145/2738033.

Within NIST, the reproducibility concept is expanding from its historical and limited role as a defined term within measurement science to the broader understanding used by several journals and the National Academies. Of relevance to Working Group 2.5, this expansion includes increased focus on reproducibility of scientific computation.
Contributions of this sort include investigations into the use and promotion of simulation workflow tools such as Sumatra [4], and a growing number of projects emphasizing the creation and dissemination of application-specific computational benchmark results. The project page describing benchmark problem development for density functional theory computations [5] is an example of the latter sort.

It should be emphasized that this attention to the reproducibility of scientific computation is not strictly an academic exercise. Recent instances in which erroneous software was the source of costly, potentially misplaced, investment include a mis-indexed spreadsheet computation by economists [6], and the retraction of a series of papers on protein crystal structure [7]. The source of the latter was a sign error in an analysis code circulated freely within the community. These influential papers impacted public policy in the first case, and several research groups across multiple institutions in the latter.

No immediate action is required of Working Group 2.5 at this time. That said, the high level of interest across the community suggests that the project continue. In the coming year I draw attention to the workshop "Numerical Reproducibility at Exascale" [8], which will take place as a satellite meeting at SC15, to be held November 2015 in Austin, TX. Siegfried Rump is a member of the Steering Committee.

References:

[1] U.S. National Academy of Sciences, "Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results: A Workshop," February 2015. [Online]. Available: http://sites.nationalacademies.org/DEPS/BMSA/DEPS_153236. [Accessed July 2015].
[2] M. McNutt, "Data, eternal," Science, vol. 347, no. 6217, p. 7, 2015.
[3] Nature, "Challenges in irreproducible research," 2013. [Online]. Available: http://www.nature.com/nature/focus/reproducibility/index.html. [Accessed July 2015].
[4] NIST, "Simulation management tools," 2015. [Online]. Available: https://mgi.nist.gov/simulation-management-tools. [Accessed July 2015].
[5] NIST, "Density functional theory (DFT) informatics and repositories," 2015. [Online]. Available: https://mgi.nist.gov/density-functional-theory-dft-informatics-and-repositories. [Accessed July 2015].
[6] P. Coy, "FAQ: Reinhart, Rogoff, and the Excel Error that Changed History," Bloomberg Business, 18 April 2013. [Online]. Available: http://www.bloomberg.com/bw/articles/2013-04-18/faq-reinhart-rogoff-and-the-excel-error-that-changed-history. [Accessed July 2015].
[7] G. Miller, "A Scientist's Nightmare: Software Problem Leads to Five Retractions," Science, vol. 314, no. 5807, pp. 1856-1857, 2006.
[8] NIST Information Technology Laboratory, "Numerical Reproducibility at Exascale (NRE2015)," November 2015. [Online]. Available: http://www.nist.gov/itl/ssd/is/numreprod2015.cfm. [Accessed July 2015].

Oslo 2017:
==========

Report by Dienstfrey and Boisvert:

Questions of reproducibility continue to attract attention across a broad range of scientific disciplines and their stakeholders. Previous updates to this Working Group 2.5 project pointed to activities being initiated within the scientific computing community to address issues related to numerical reproducibility. In 2015 the ACM Transactions on Mathematical Software (TOMS) launched the Replicated Computational Results (RCR) Initiative as an additional voluntary step of the review process [1].
Under this initiative, a software professional attempts to verify the main conclusions of the paper by replicating key aspects of the computation. If successful, both the paper and the replication study are published, and the paper receives a special designation indicating that it was replicated. The idea is to provide positive incentives for this activity. In the two years since the initiative was announced only three published papers have completed this process, so additional work may be required to encourage participation. On a positive note, the example provided by TOMS has led two other ACM journals to adopt the RCR process: the Transactions on Modeling and Computer Simulation (TOMACS) and the Transactions on Graphics (TOG). The former has already published three papers that underwent RCR review.

Because of these grassroots efforts, and related efforts such as Artifact Evaluation for Software Conferences [5], the ACM Publications Board developed and approved standardized badges that can be awarded to papers [6, 7]. Two badges are available for papers whose artifacts (e.g., software, data) undergo a review that goes beyond traditional paper refereeing (only one is awarded in each case):

  Artifacts Evaluated - Functional
  Artifacts Evaluated - Reusable

A separate badge can be awarded to papers for which such artifacts have been made freely (and permanently) accessible to the public:

  Artifacts Available

A third category of badge is available for papers in which the computational results have been independently verified:

  Results Replicated
  Results Reproduced

The former is awarded if, as in the TOMS case, artifacts provided by the author were used. ACM is now applying these badges to its published papers. It should be noted that all TOMS algorithm papers will automatically be awarded the "Artifacts Evaluated - Reusable" and "Artifacts Available" badges based on the journal's standard review and dissemination policies for such papers.

Also in 2015, the inaugural workshop "Numerical Reproducibility at Exascale: NRE2015" was held as a satellite to the annual supercomputing conference SC'15 [2]. Many of the talks in this workshop focused on the reproducibility of numerical linear algebra, arising from the combination of two facts: (1) floating-point arithmetic is not associative, and (2) the execution order of operations is not generally controlled in high-end (i.e., parallel) computational systems. The success of this opening workshop led to a follow-up workshop of the same name held as a satellite to SC'16 [3]. The third in the series will be held as part of SC'17, taking place November 12, 2017 in Denver [4]. A select group of papers from this latest iteration will be published in a special issue of the International Journal of High Performance Computing Applications.

Members of Working Group 2.5 are highly visible in each of the activities referred to above. Michael Heroux (former TOMS Editor-in-Chief) created the RCR process for TOMS and shepherded its first few papers through the process. Even though his term as EiC ended in the spring, he has been asked to stay on as a Senior Associate Editor to continue to manage the RCR process. Ron Boisvert provided advice and guidance on the TOMS initiative, and then led the development of badges for the ACM Publications Board. Siegfried Rump has participated on the steering committee of all three NRE workshops, and Michael Heroux has participated on the steering committee since 2016.
Andrew Dienstfrey has consulted with members of a NIST working group that was established to research this issue from a measurement and standards (i.e., NIST) perspective.

[1] M. A. Heroux, "Editorial: ACM TOMS Replicated Computational Results Initiative," ACM Transactions on Mathematical Software, vol. 41, June 2015.
[2] W. Keyrouz and M. Mascagni, "Numerical Reproducibility at Exascale (NRE2015)," November 2015. [Online]. Available: http://www.nist.gov/itl/ssd/is/numreprod2015.cfm. [Accessed October 2017].
[3] M. Mascagni and W. Keyrouz, "Numerical Reproducibility at Exascale: NRE2016," November 2016. [Online]. Available: http://www.cs.fsu.edu/~nre/. [Accessed October 2017].
[4] M. Mascagni, W. Keyrouz, and M. Lesser, "Computational Reproducibility at Exascale (CRE2017)," November 2017. [Online]. Available: https://sc17.supercomputing.org/presentation/?id=wksp144&sess=sess132. [Accessed October 2017].
[5] Artifact Evaluation for Software Conferences. Available: http://www.artifact-eval.org/. [Accessed October 2017].
[6] R. Boisvert, "Incentivizing Reproducibility," Communications of the ACM, vol. 59, no. 10, p. 5, 2016. Available: https://doi.org/10.1145/2994031.
[7] Association for Computing Machinery, Artifact Review and Badging. Available: http://www.acm.org/publications/policies/artifact-review-badging. [Accessed October 2017].

Sydney 2018:
============

Report by Boisvert

Two WG 2.5 members, Ron Boisvert and Mike Heroux, have been working with the Association for Computing Machinery (ACM) to develop policies and procedures within the ACM publications program which enhance the reproducibility of results published within the ACM Digital Library. As Chair of the ACM Publications Board's Digital Library Committee, Boisvert helped to launch an ACM Task Force on Software, Data and Reproducibility. Boisvert and Heroux are members of this Task Force. The Task Force held three workshops from 2015-2017 to share information, set goals, and work on joint projects. The main work products of the group are (a) a document describing best practices for reproducibility in computing research, and (b) a system of badges to be applied to papers to incentivize and reward papers taking positive steps toward reproducibility.

Best Practices Guide ... The best practices guide is still under development. The document will suggest best practices for five different constituencies: Authors, Referees, Editors, Publishers, Vendors. In each case, a sequence of recommendations is given, with levels indicated as "Good", "Better" and "Best." The main ideas in the document were developed at a workshop held in New York in December 2017.

ACM Badges ... Ron Boisvert led the development of the badging system within the Task Force. The details of the badging system have been described here previously. The Task Force's recommendations were adopted by the ACM Publications Board in 2016, and are now ACM policy [1,2]. There has been swift adoption of the badging system among ACM groups:

* All TOMS Algorithm papers have been assigned the "Artifacts Evaluated - Reusable" and "Artifacts Available" badges, and will be routinely assigned them in the future based on the TOMS policy for refereeing such papers.

* The Replicated Computational Results process, pioneered by Mike Heroux as TOMS EiC, is now also being followed by the ACM Transactions on Modeling and Computer Simulation (TOMACS) and the ACM Transactions on Graphics (TOG). All papers successfully completing that process will be assigned the "Results Replicated" badge.

* Many ACM conferences have now adopted an artifact evaluation process, with successful papers being assigned one of the "Artifacts Evaluated" badges. These include the following ACM conferences:

  SC17 and SC18 (Supercomputing)
  MultiMedia Systems 2017 (MMSys17)
  Principles of Programming Languages (POPL) 2018
  SIGSIM Principles of Advanced Discrete Simulation (PADS) 2018
  Programming Language Design and Implementation (PLDI) 2018
  International Conference on Performance Engineering (ICPE) 2018
  International Conference on Software Language Engineering (SLE) 2018
  Symposium on Principles and Practice of Parallel Programming (PPoPP) 2018
  International Conference on Functional Programming (ICFP) 2018
  SIGMOD (Management of Data) 2018

[1] https://www.acm.org/publications/policies/artifact-review-badging
[2] https://cacm.acm.org/magazines/2016/10/207757-incentivizing-reproducibility/fulltext

Valencia 2019:
==============

Report by Dienstfrey and Boisvert

SUMMARY

Reproducibility of scientific results continues to gain a great deal of attention from both researchers and the press. There remains considerable confusion about what reproducibility means and about its ultimate goals and benefits. Among the observations that were expressed during the WG 2.5 meeting in Valencia are the following.

1. Bitwise reproducibility of scientific results can be required in specific cases. For example, it can be useful in software testing. Additionally, there are claims that bitwise reproducibility may be required by policy makers in high-consequence contexts. For example, Jennifer Scott (not in attendance at the Valencia meeting) in her talk at the Working Group meeting in Vienna indicated that stakeholders in the United Kingdom require bitwise reproducibility for climate simulations. However, full details on such requirements are hard to find. Arranging for bitwise reproducibility is very computationally demanding, and hence it should only be considered a goal in very specialized circumstances. Peter Tang noted that the Intel Math Kernel Library has recently provided an optional feature to obtain bitwise reproducibility under certain restrictive conditions. Generally speaking, however, we expect that most results in scientific computing do not require such stringency to be considered reproducible.

2. In much of the scientific computing community, a result is termed reproducible if the main computations involved can be successfully rerun by an independent party using data and software from the original study. Here one judges a result reproducible if the results are the same to within an expected tolerance. A reasonable tolerance might be dictated by the inherent conditioning of the problem. Estimation of reasonable tolerances for a computation is a facet of a larger field of study referred to as uncertainty quantification for scientific computing. An ICIAM minisymposium (see below) was organized by Working Group members to explore the relationship between these two concepts. Of course, such reproduction does not prove that the results are correct, and some argue that giving the naive the impression that it does may actually be harmful. Proponents argue that even this weak form of reproduction does have value in demonstrating that the computations involved are robust under some change in experimental conditions. There may also be value in having an incidental independent inspection of the code and data.

3. A narrow focus on reproducibility may be obscuring the ultimate goal, which should be to obtain a higher level of confidence in the results of a scientific study. There are many incremental means of doing this, from deeper inspection of the experiment, to establishing robustness to changes in experimental conditions that one expects should have no effect, to a completely independent study aimed at answering the same research question. (While all of these have value, the latter, which is known as "replication," may be warranted only for the most controversial of results.)

4. It is also the case that a failure to replicate a result is not a symptom of the failure of science, except, of course, in cases of scientific fraud, which are thankfully rare. A failure to replicate may indicate that a variable that was not previously recognized as important was not adequately controlled during an experiment. Subsequent studies can learn from this and correct for it. This is one way in which science advances.

5. A related goal is that of "open science," i.e., to facilitate the advancement of science through sharing and reuse of research artifacts, such as software and data.

RECENT RELEVANT ACTIVITIES OF WG PARTICIPANTS

Regarding items 3-5 above, Boisvert was a co-author on a paper "How measurement science can improve confidence in research results," which was published in PLOS Biology. See https://doi.org/10.1371/journal.pbio.2004299

Michael Mascagni, who participated in the WG 2.5 meeting in Valencia, has been co-organizer of a series of workshops on the topic of "Computational Reproducibility at Exascale," which have been held since 2015 at the yearly Supercomputing conference. Mike Heroux of WG 2.5 has been a frequent participant. Information for the 2019 edition is here: http://www.cs.fsu.edu/~cre/cre-2019/index.html

Boisvert briefed WG 2.5 on an initiative within ACM, which he led, to issue badges to papers achieving goals of reproducibility and open science. The slides from his talk can be found here: https://wg25.taa.univie.ac.at/ifip/intern/P81-Briefing-WG25-2019.pdf

Dienstfrey and Boisvert organized a minisymposium at the International Congress on Industrial and Applied Mathematics (ICIAM), with which the WG 2.5 meeting was co-located in Valencia. Eight talks were presented by speakers from five countries in two 2-hour sessions. Some 30 persons attended the minisymposium (which was impressive given that there were many dozens of competing parallel sessions). A description of the minisymposium follows.

----------------------------------------------------------------------

Title: Uncertainty Quantification and Reproducibility

Organizer: Andrew Dienstfrey, NIST, Boulder, CO, USA
Co-Organizer: Ronald Boisvert, NIST, Gaithersburg, MD, USA

Abstract: Reproducibility of scientific results has been called into question recently. Although most attention has focused on biomedicine and psychology, such questions have led to a great deal of self-reflection in the computational science community as well. At the same time, there has been a surge of interest in uncertainty quantification in scientific computing as a process to render computational results actionable for decision makers. In this minisymposium we will explore these two concepts in relationship to each other, and their respective roles in establishing credibility of computational results. [This minisymposium is sponsored by IFIP Working Group 2.5, https://wg25.taa.univie.ac.at/.]

Speakers:

Maurice Cox, Data Science Department, National Physical Laboratory (United Kingdom). The Influence of Methods of Uncertainty Quantification on the Comparability of Research Results

Siegfried Rump, Institute for Reliable Computing, Hamburg University of Technology (Germany). Reproducibility - Algorithms, Pros and Cons

Tristan Glatard, Concordia University (Canada). Numerical Stability of Neuroimaging Analysis Pipelines

Christopher Drummond, National Research Council (Canada). Reproducible Research and the Illusion of the Scientific Method

Michael Mascagni, Florida State University (USA). Stochastic Modeling of Numerical Reproducibility

Eric Petit, Intel Corp. (France). Verificarlo: Floating-point Computing Validation on New Architecture and Large-scale Systems

Jason Riedy, Georgia Institute of Technology (USA). Reproducible Linear Algebra from Application to Architecture

Sarah Michalak, Los Alamos National Laboratory (USA). Using Probabilistic Hardware to Increase Energy Efficiency of Computation

2020 (online meeting):
======================

Report by Boisvert

Here we provide an update on reproducibility in scientific computing. We highlight a new report by the US National Academies, present the status of reproducibility badging efforts as implemented by ACM and other research journals, introduce a new international standards effort recently initiated in this field, and provide a brief summary of the Computational Reproducibility at Exascale Workshop held in conjunction with the annual SC19 supercomputing conference in Denver.

REPORT BY THE US NATIONAL ACADEMIES

The US National Academies released a report in 2019 entitled "Reproducibility and Replicability in Science." The panel which produced the report represented a wide variety of scientific disciplines, including computational science. It provides a well-balanced view of the subject, recognizing the role that reproducibility and associated ideas play in establishing trust in science. Conversely, lack of reproducibility may inspire new theory development. In response to growing concerns over lack of reproducibility as expressed in both scientific and popular media, the United States Congress requested that the National Academies of Sciences, Engineering, and Medicine conduct this study. We draw attention to the definitions of terms established by this report, as these definitions were discussed in previous WG meetings. Many other topics are discussed in the full report, which can be obtained here: https://www.nationalacademies.org/our-work/reproducibility-and-replicability-in-science.

One of the problems with this field is that the terms "repeatability," "reproducibility," and "replicability" have sometimes been used interchangeably, or sometimes used with specific meanings in one community that conflict with how they are used by another community. The National Academies report referenced above has tried to put that to rest, offering these definitions:

Reproducibility: obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis (i.e., computational reproducibility).

Replicability: obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data. Two studies may be considered to have replicated if they obtain consistent results given the level of uncertainty inherent in the system under study.

ACM had used these two terms with their meanings interchanged.
They are now in the process of reversing the usage of these terms, which includes their use in ACM's reproducibility badges (see below).

BADGES

Ron Boisvert briefed the Working Group on the status of reproducibility badging. A summary is provided below.

The practice of assigning "badges" to published papers to incentivize good reproducibility practices has been gaining traction. Probably the most widely adopted examples are the Center for Open Science's (COS) Open Science Badges and the Association for Computing Machinery (ACM) badges. The COS badges, "Open Data," "Open Materials," and "Preregistered," are described here: https://www.cos.io/our-services/badges. They have been adopted by some 80 journals across a wide range of publishers, primarily in the fields of psychology and neuroscience.

The ACM badges are now being assigned by dozens of conferences and a handful of journals published by ACM. The badges, along with the number of papers to which they have been assigned, are as follows:

  "Artifacts Available" - 1419
  "Artifacts Evaluated - Functional" - 676
  "Artifacts Evaluated - Reusable" - 860
  "Results Reproduced" - 226
  "Results Replicated" - 0

Note that there are some 600,000 full-text articles in the ACM collection, although the badges have only been assigned in the last three years. A description of the badges can be found here: https://www.acm.org/publications/policies/artifact-review-badging

A variety of other publishers have pilot projects with small numbers of their journals to assign badges. Examples include:

  IEEE ("Code Available," "Code Reviewed," "Datasets Available," "Datasets Reviewed")
  Springer Nature ("Open Data")
  Elsevier, Wiley (Open Science Badges)

STANDARDIZATION

Due to the inconsistencies in terminology and badge usage among publishers, the National Information Standards Organization (NISO) created a Working Group on Reproducibility Badging and Definitions in the spring of 2019. NISO is an ANSI-accredited standards development organization which also represents ANSI to the ISO Technical Committee 46 on Information and Documentation. The Working Group was charged with developing a Recommended Practice document to promote "common recognition practices, vocabulary, and iconography to facilitate sharing of data and methods" for reproducibility badging. Co-chairs of the Working Group are Gerry Grenier (IEEE), Wayne Graves (ACM), Lorena Barba (GWU), and Mike Heroux (Sandia National Labs, and IFIP WG 2.5 member). In addition, there was representation from the Center for Open Science, Elsevier, figshare, the International Association of STM Publishers, Springer Nature, and Wiley. There were additional members from NASA JPL, NIST (Ron Boisvert, an IFIP WG 2.5 member), Rutgers, the University of Illinois at Urbana-Champaign, and the University of Edinburgh. The broad participation provides hope that badges will be normalized among publishers in the future.

The NISO Working Group issued a draft for public comment in the spring of 2020, and hopes to issue its final document in the fall. The recommended badge names and definitions are as follows.

Open Research Objects (ORO): author-created digital objects used in the research (including data and code) are permanently archived in a public repository that assigns a global identifier and guarantees persistence.

Research Objects Reviewed (ROR): signals that all relevant author-created digital objects used in the research (including data and code) were reviewed according to the criteria provided by the badge issuer.
The badge metadata should link to the award criteria.

Results Reproduced (ROR-R): an additional step was taken or facilitated by the badge issuer (e.g., publisher, trusted third-party certifier) to certify that an independent party has regenerated the computational results before publication, using the author-created research objects, methods, code, and conditions of analysis.

Results Replicated (RER): an independent study, aimed at answering the same scientific question, has obtained consistent results leading to the same findings (potentially using new artifacts or methods). The badge links to the persistent identifier for that secondary publication. This badge is awarded by the publisher of the original work that is being badged.

Note that the badges are similar in structure to the ACM badges, though different, and more common, terminology is used. An interesting aspect is that the Results Reproduced badge is a special case of the Research Objects Reviewed badge, reflecting the fact that reproduction is a form of object review. The group has also said that other "decorations" could be applied to the ROR badge to signify differing levels of review (such as the "functional" and "reusable" definitions used by ACM). The NISO document also provides recommendations on badge placement and discovery, badge metadata, badge validation, and badge revocation. The NISO group decided not to provide recommendations on badge design, though that may be the subject of follow-on efforts. The group also does not specify detailed criteria for object review, since norms in different communities may vary. It does insist that badges be "active," providing links to a detailed description of the criteria used by the review team in the assignment of each instance of the badge.

COMPUTATIONAL REPRODUCIBILITY AT EXASCALE

In calendar year 2015, Michael Mascagni (Florida State University and NIST) and Walid Keyrouz (NIST) received $123,000 to begin working on numerical reproducibility at NIST. Experimental reproducibility is a cornerstone of the scientific method. As computing has grown into a powerful tool for scientific inquiry, computational reproducibility has been one of the core assumptions underlying scientific computing. With "traditional" single-core CPUs, documenting a numerical result was relatively straightforward. However, hardware developments over the past several decades have made it almost impossible to ensure computational reproducibility, or even to fully document a computation, without incurring a severe loss of performance. This loss of reproducibility started when systems combined parallelism (e.g., clusters) with non-determinism (e.g., single-core CPUs with out-of-order execution). It has accelerated with recent architectural trends towards platforms with increasingly large numbers of processing elements, namely multicore CPUs and compute accelerators (GPUs, Intel Xeon Phi, FPGAs). This was the motivation that led them to start a series of workshops at the annual high-performance computing (HPC) conferences. The previous incarnations of these workshops included the first two Numerical Reproducibility at Exascale workshops (conducted in 2015 and 2016 at SC) and the Panel on Reproducibility held at SC'16 (originally a BOF at SC'15), which addressed several different issues in reproducibility that arise when computing at exascale.
Starting at SC17 they changed the name to Computational Reproducibility at Exascale (CRE) to accommodate some of the broader aspects of reproducibility, and presented CRE2017, CRE2018, and CRE2019 at the SC conferences. In 2019, they also resurrected the NRE series by offering NRE2019 at the European HPC conference ISC-HPC 2019. During this period of time, the NIST researchers published the following papers related to numerical reproducibility:

W. Blanco, P. H. Lopes, A. A. de S. Souza and M. Mascagni (2020), "Non-replicability circumstances in a neural network model with Hodgkin-Huxley-type neurons," Journal of Computational Neuroscience, 48(3): 1-7.

M. Mascagni (2019), "Three Numerical Reproducibility Issues That Can Be Explained as Round-Off Error," ISC-HPC Workshops 2019, Lecture Notes in Computer Science, 11887: 452-462.

2022 (online meeting):
======================

Report by Boisvert and Dienstfrey

THE STATE OF REPRODUCIBILITY

As documented previously, in 2019 the US National Academies produced a report entitled "Reproducibility and Replicability in Science" (https://www.nationalacademies.org/our-work/reproducibility-and-replicability-in-science). This report highlighted the growing concerns about reproducibility and replicability expressed by a wide range of stakeholders including computational scientists, policy makers, and society at large. To recall the context, reproducibility is a fundamental tenet of the scientific process. Thus, a lack of reproducibility potentially undermines trust and confidence in claims and recommendations generated by this process. As this report documents, possible sources of non-reproducibility are diverse, and common suggestions include data irregularities, measurement noise, incomplete models, and choices in statistical analysis. However, this report strives to draw attention to "scientific computing" as another possible contributor.

More specifically, while the public at large may expect that the output of a deterministic scientific computation is both extremely precise and well-defined up to the reported precision, numerical analysts understand that this is not always the case. For one, the non-associativity of floating-point arithmetic implies that the order of a summation can impact its value. This, coupled with the delegation of operation scheduling to compilers designed to optimize speed in parallel computational environments, means that even something as elementary as computing the sum of a list of numbers can yield different results (a small illustration is sketched below). Moreover, most scientific computations involve more than linear algebra and thereby require calls to elementary and special function libraries, each with its own (often implicit) specification of accuracy. These small variations, generically referred to as "round-off error", can be further compounded through iterative processes exhibiting potentially chaotic dynamics. The result is hard-to-predict differences in the numerical outputs of what might otherwise be considered deterministic simulations. While differences in estimated quantities are expected in traditional measurement contexts, their appearance in computation can be surprising to the uninitiated. This report served as a first concerted effort by the US National Academies to define terms used to describe this phenomenon, and to propose mechanisms to address reproducibility in the context of scientific computing.
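
To make the order dependence concrete, the following is a minimal sketch in standard Python; it is an illustration added to this project page, not material from the meeting reports. It sums one list of values in three different orders, which typically yields results that differ in their low-order digits, and then applies compensated (Kahan) summation, one algorithmic remedy of the kind alluded to below for the non-associativity of floating-point addition, which largely removes the order dependence.

    # Illustrative sketch only (standard Python, nothing library-specific).
    # Summing the same values in different orders generally gives slightly
    # different floating-point results because addition is not associative.
    import random

    random.seed(42)
    # Values spanning many orders of magnitude make rounding effects visible.
    values = [random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-8, 8)
              for _ in range(100_000)]

    def naive_sum(xs):
        total = 0.0
        for x in xs:
            total += x
        return total

    def kahan_sum(xs):
        # Compensated summation: carry the rounding error forward explicitly.
        total, c = 0.0, 0.0
        for x in xs:
            y = x - c
            t = total + y
            c = (t - total) - y
            total = t
        return total

    shuffled_values = values[:]
    random.shuffle(shuffled_values)

    forward = naive_sum(values)
    backward = naive_sum(reversed(values))
    shuffled = naive_sum(shuffled_values)

    print("forward order  :", repr(forward))
    print("reverse order  :", repr(backward))
    print("shuffled order :", repr(shuffled))
    print("max discrepancy:",
          max(forward, backward, shuffled) - min(forward, backward, shuffled))
    # Kahan summation typically agrees to (nearly) all digits regardless of order.
    print("Kahan, forward :", repr(kahan_sum(values)))
    print("Kahan, shuffled:", repr(kahan_sum(shuffled_values)))

Because the random seed is fixed, rerunning the script reproduces its own output exactly; the discrepancies arise solely from the order in which the additions are performed, which is precisely the degree of freedom that optimizing compilers and parallel execution exploit.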

According to Google Scholar, as of October 2022 the US National Academies report has been cited 475 times since its publication in late 2019. Many of these citations come from psychology and the biomedical sciences, for which data management and choices of statistical analysis can lead to wildly divergent scientific claims from the same experimental dataset. This is a very understandable concern given the significant financial resources and immediacy of healthcare decision-making. In fact, it should be noted that a significant paper by Ioannidis entitled "Why most published research findings are false," appearing in PLOS Medicine in 2005, served as a primary driver for the National Academies report.

However, there is an increasing focus on computational considerations which are relevant to the Working Group. In some cases this is being met by calls for open-source code development. The idea here is that sharing of source code will surface algorithmic choices that would otherwise be implicit and hidden from inspection by outside users. Note that this "solution" does not quantify the variability of scientific computation so much as it attempts to hide this variability from view by standardizing some of the components that generate it. Increasing calls for data sharing have a similar motivation, although in this case the concept of a benchmark dataset provides, to varying degrees, the idea that there is a solution to a given problem that can be used as a numerical reference against which other solutions can be quantifiably compared. Algorithms for reproducible linear algebra, which accommodate the non-associativity of floating-point addition through algorithmic or hardware extensions, continue to be advanced. Finally, interval arithmetic may be viewed as a rigorous hedge against computational "noise" in the sense that tools developed by this research community may be understood as providing theorems in the form of a range in which the solution to a problem must lie. Any computational result that lies outside this interval may immediately be considered problematic and should be set aside for further analysis. To the degree that Working Group members are active in creating reference results for numerical computation and tools of this nature, the topic of reproducibility remains relevant.

As a relatively new development since the last project update in 2020, we draw attention to the topic of reproducibility in machine learning. As with the research areas covered by the Academies report, the sources of non-reproducibility in machine learning are numerous, and most discussion focuses on the creation and management of the large data repositories that serve as fuel for training large neural networks. However, there is a growing recognition of the tension between two facts: (1) machine learning algorithms are being deployed in extremely consequential decision-making contexts, and (2) these models have failure modes that are hard to predict and surprising. Driven by these facts, the reproducibility discussion is making rapid advances into this field which, in turn, is generating calls for open-source code development and data sharing. Notions of reference models and computational results are also being considered.

TERMINOLOGY AND BADGING

In January 2021 the National Information Standards Organization (NISO) issued a recommended practice entitled "Reproducibility Badging and Definitions" (NISO RP-31-2021).
The report recommends badge types, and associated terminology, to be used when awarding badges to research papers for certain reproducibility practices. See https://www.niso.org/publications/rp-31-2021-badging. The working group for this report included major publishers in computing (ACM, IEEE, Elsevier, Springer Nature, and figshare), as well as representatives from NIST, Sandia Labs, NASA JPL, George Washington University, Boston University, the University of Illinois, and Rutgers University. Ron Boisvert and Mike Heroux of WG 2.5 participated. The badges and terms recommended are:

* Open Research Objects (ORO)
* Research Objects Reviewed (ROR)
* Results Reproduced (ROR-R)
* Results Replicated (RER)

See the 2020 project 81 report for a fuller explanation of these badges.

A new NISO standing committee on Taxonomy, Definitions, and Recognition Badging Schemes has been established as a follow-on. See https://www.niso.org/standards-committees/reproducibility-badging. Boisvert and Heroux are members of this committee. They will work on propagating the recommendation and answering queries, and may consider graphical designs for badges.

REPRODUCIBILITY BADGING AT ACM

The Association for Computing Machinery (ACM) was one of the first publishers to develop a comprehensive scheme for badging. While ACM terminology for their badges is different, the concepts match the NISO scheme closely. One issue with ACM's original badges is that their use of the terms "reproduce" and "replicate" was reversed in comparison with the National Academies report (which came later). NISO convinced ACM that they should be consistent with the National Academies and with NISO, which was in the process of adopting the National Academies terminology. ACM has since switched the labels on their badges.

Reproducibility badges have seen a good deal of adoption within ACM. The table below shows the counts of the various ACM badges in the ACM Digital Library in August 2020 and August 2022.

                                       2020    2022    % increase
  ----------------------------------------------------------------
  Artifacts Available                  1419    2722        92%
  Artifacts Evaluated - Functional      676    1424       110%
  Artifacts Evaluated - Reusable        860    1393        62%
  Results Reproduced                    226     713       215%
  Results Replicated                      0      36
  ----------------------------------------------------------------

Note that all algorithm papers published in TOMS receive the "Artifacts Evaluated - Reusable" designation based on the high level of scrutiny in the TOMS review process. Mike Heroux continues to serve as the TOMS Reproduced Computational Results Editor, who manages the review of regular research papers whose authors opt to have the reproducibility of their results evaluated.

Amsterdam 2023:
===============

Report by Boisvert

Two recent developments were noted:

1. The SIAM Journal on Scientific Computing (SISC) has begun the practice of awarding badges to papers for which code and data have been made available. For details, see https://epubs.siam.org/journal/sisc/instructions-for-authors

2. ACM has established an Emerging Interest Group (EIG) on Reproducibility and Replicability. EIGs are precursors to the establishment of ACM Special Interest Groups (SIGs). The new EIG is sponsoring its first conference in June 2023. See https://reproducibility.acm.org/.