Reproducible Analytical Pipelines (RAP) that meet the minimum criteria

Version control software

This criterion has been met.

Git is available on Office for National Statistics (ONS) computers and the Data Access Platform (DAP), and is specified as a requirement for the Integrated Data Platform (IDP).

Some legacy systems do not permit the use of Git but are in scope for legacy transformation.

Open-source programming languages and flexibility to add more

For example: Python, R, Julia, JavaScript, C++, Java or Scala

This criterion has been met.

Python and R are available on ONS computers and in DAP, and are specified as a requirement for IDP. However, users do not always know who to contact for technical support.

Package and environment managers for each of the available languages

This criterion has been met.

Python and R both have toolchains for managing environments and packages (for example, pip and renv) in all ONS environments.

Packages and libraries for open source programming languages

For example: either through direct access to well-known libraries (npm, PyPI, CRAN) or through a proxy repository system (Artifactory).

This criterion has been met.

Python and R users can install packages and libraries through the Artifactory proxy repository system on desktops and platforms.

Individual storage

For example: home directory

This criterion has been met.

Individual storage is available on ONS computers and platforms.

Shared storage with fine-grained access control, accessible programmatically

For example: S3, cloud storage

This criterion has been met.

Shared drives are available on local computers and platforms. These have secure access control.

Integrated development environments suitable for the available languages

For example: RStudio for R, Visual Studio Code for Python

This criterion has been partially met.

RStudio, Visual Studio Code and Jupyter notebooks are available on ONS computers. IDEs should be ready to use out of the box, even for beginners. We expect our cloud transformation programme to ensure that all suitable IDEs are available on all platforms by the end of 2024.

For further development

Source control platforms

For example: GitHub, GitLab or BitBucket

This criterion has been met.

GitLab is available internally on the local network and in DAP. GitHub is available for ONS computers and other cloud environments. GitHub has been specified for version control of Integrated Data Service (IDS) analysis.

Continuous integration tools

For example: GitHub Actions, GitLab CI, Jenkins, Concourse

This criterion has been partially met.

Internal GitLab instances do not have "runners" connected for continuous integration (CI). Jenkins is available on the ONS local network, but its integration with GitLab and GitHub is limited. GitHub Actions can be used wherever GitHub is available.

Make-like tools for reproducible workflows

For example: make

This criterion has been partially met.

Make-like tools are not available on ONS Windows machines, with the exception of dedicated Python or R tools such as snakemake and targets. Make is available in DAP.
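To illustrate what "make-like" means here, the following is a minimal Python sketch of the core rule such tools apply: a target is rebuilt only when it is missing or older than one of its sources. It is not how make, snakemake or targets are implemented, and the names `is_stale` and `build` are hypothetical, chosen only for this example.

```python
from pathlib import Path
from typing import Callable, Iterable


def is_stale(target: Path, sources: Iterable[Path]) -> bool:
    """Return True if the target is missing or older than any source.

    Raises FileNotFoundError if a source file does not exist.
    """
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(src.stat().st_mtime > target_mtime for src in sources)


def build(target: Path, sources: Iterable[Path], recipe: Callable[[], None]) -> bool:
    """Run the recipe only when the target is stale; return True if it ran."""
    sources = list(sources)  # allow any iterable, including generators
    if is_stale(target, sources):
        recipe()
        return True
    return False
```

Calling `build` twice in a row with unchanged sources runs the recipe once; the second call is skipped because the target is already up to date. Real tools add dependency graphs, pattern rules and parallel execution on top of this check.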

Relational database management software that is available to users

For example: PostgreSQL

This criterion has been partially met.

Relational databases are available on ONS computers and platforms through requests to the central IT team, but most data are stored on networked drives instead. Relational databases are not well supported for individual projects in DAP, though the Reference Data Management Framework (RDMF) does use them.
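As an illustration of the difference between flat files on a drive and a relational store, the sketch below loads CSV text into a table with a primary key and queries it with SQL. It uses SQLite from the Python standard library purely so that it is self-contained; it is not PostgreSQL and does not represent any ONS system. The table, column names and data are hypothetical.

```python
import csv
import io
import sqlite3
from typing import List, Tuple

# Hypothetical flat-file content, standing in for a CSV on a networked drive.
CSV_TEXT = "region,year,value\nNorth,2022,10\nNorth,2023,12\nSouth,2023,7\n"


def load_values(conn: sqlite3.Connection, csv_text: str) -> int:
    """Create the table if needed, insert the CSV rows, return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        "region TEXT NOT NULL, year INTEGER NOT NULL, value INTEGER NOT NULL, "
        "PRIMARY KEY (region, year))"
    )
    rows = [
        (r["region"], int(r["year"]), int(r["value"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", rows)
    return len(rows)


def totals_by_region(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Return (region, summed value) pairs, ordered by region."""
    return conn.execute(
        "SELECT region, SUM(value) FROM observations "
        "GROUP BY region ORDER BY region"
    ).fetchall()
```

The primary key means a duplicate (region, year) row is rejected at load time, a guarantee that a CSV on a shared drive cannot give.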

Orchestration systems for pipelines and workflows

For example: Airflow, NiFi

This criterion has been partially met.

No orchestration systems are available to users on ONS computers, despite many analysis pipelines being run locally. Cloudera Data Science Workbench (CDSW) Jobs is available in DAP, but this offers very limited functionality compared with dedicated orchestration tools. Airflow should be available in the updated version of DAP.
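Dedicated orchestration tools also provide scheduling, retries and monitoring, but their central job is to run tasks in dependency order. The following is a minimal sketch of that one idea using `graphlib` from the Python standard library (Python 3.9 or later). It is not the Airflow or NiFi API; `run_pipeline` and the task names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_pipeline(
    tasks: Dict[str, Callable[[], None]],
    depends_on: Dict[str, Set[str]],
) -> List[str]:
    """Run each task after the tasks it depends on; return the execution order.

    Assumes every name in depends_on also appears in tasks.
    """
    graph = {name: depends_on.get(name, set()) for name in tasks}
    # static_order() raises graphlib.CycleError if dependencies are circular.
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        tasks[name]()
    return order
```

For example, with tasks named `ingest`, `clean` and `publish`, where `clean` depends on `ingest` and `publish` depends on `clean`, the function runs them in that order regardless of the order in which they were declared.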

Internal-facing servers to host HTML-rendered documentation

This criterion has been partially met.

There is no dedicated location for hosting internal documentation that is generated by standard code package documentation tools (for example, Sphinx and pkgdown). Documentation can be manually added to Confluence or SharePoint. GitHub Pages can be used to host documentation for open-source code.

External-facing servers with authentication to host end-products such as web applications or application programming interfaces (APIs)

This criterion has been met.

External-facing servers can be deployed by the central IT team when there is a clear business need. A beta API exists for data published by the ONS.

Big data tool or access to large memory capability

For example: Presto, Athena, Spark, Dask

This criterion has been partially met.

Spark is available in DAP, and Spark code can be developed on local machines (without distributed execution). IDS and the updated version of DAP will include big data tools.

Reproducible infrastructure and containers

For example: Docker

This criterion has not been met.

Reproducible infrastructure is not available on ONS computers or DAP. It could be made available on a cloud platform and is specified for the IDP. It should be available in IDS.