1. Introduction
Our Data Standards define how we carry out our statistical and business activities. They ensure that the Office for National Statistics (ONS) adopts a consistent, rule-based approach to data management and that our activities comply with our Data Policies.
Our current landscape of Data Standards includes the following categories:
- Classification Standards
- Data File Format Standards
- Data Format Standards
- Data Inflow Standards
- Data Management Standards
- Data Organisation Standards
- Data Provider Standards
- Data Sharing Standards
- Geospatial Data Standards
- Governance Standards
- Linking and Matching Standards
- Metadata Standards
- Standardised Variable Standards
- Statistical Unit Definition Standards
This page contains 10 examples of Data Standards that have been shortened and adapted for the website.
All of these standards are aimed at those working on data-related initiatives and projects or who have an interest in the acquisition, management, provision and subsequent dissemination of data.
For further information or to request a copy of a standard, please email data.architecture@ons.gov.uk.
2. Currency
The ONS collects and stores information about money in many places and in various forms, such as cost, price and amount. The associated currency is also needed to determine the true value of an amount. The ONS largely deals with British pounds and euros, but data exchanged with a wide range of third parties may include many other currencies. This standard was developed primarily for statistical data in the ONS, but it can also be used when exchanging data with third parties.
The ONS uses ISO 4217 as its standard classification for currency. It is the most widely used standard for currencies and facilitates working with third parties. It provides a comprehensive list of internationally recognised currency names, each with a three-letter alphabetic code and a three-digit numeric code. Use of the three-letter alphabetic currency codes is recommended over other formats, as they are the most widely recognised short forms. They also provide unique representations of currencies and are well suited for statistical purposes.
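As a minimal sketch of how the alphabetic-code recommendation might be applied, the Python below checks a value against a small, illustrative subset of ISO 4217. The function names and the four-entry lookup table are assumptions made for this example; a real implementation would load the full published list.

```python
# Illustrative subset of ISO 4217 (alphabetic code -> numeric code).
# A real system would load the complete list published for the standard.
ISO_4217_SUBSET = {
    "GBP": "826",  # Pound sterling
    "EUR": "978",  # Euro
    "USD": "840",  # US dollar
    "JPY": "392",  # Japanese yen
}


def normalise_currency_code(raw: str) -> str:
    """Return the three-letter alphabetic code, or raise ValueError if unknown."""
    code = raw.strip().upper()
    if code not in ISO_4217_SUBSET:
        raise ValueError(f"not a recognised ISO 4217 alphabetic code: {raw!r}")
    return code


def alphabetic_from_numeric(numeric: str) -> str:
    """Map a three-digit numeric code to the preferred alphabetic form."""
    for alpha, num in ISO_4217_SUBSET.items():
        if num == numeric:
            return alpha
    raise ValueError(f"not a recognised ISO 4217 numeric code: {numeric!r}")
```

For example, `normalise_currency_code(" gbp ")` returns `"GBP"`, and `alphabetic_from_numeric("978")` returns `"EUR"`.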
3. Money
The ONS collects and stores information about money in many places and in various forms, such as cost, price and amount. A common standard for recording this information is important for maintaining data consistency and quality. This standard was developed primarily for statistical data in the ONS. It can also be used when exchanging data with third parties and is suitable for use in other domains.
We have decided to use a numeric format with a minimum of two decimal places, in which negative numbers are allowed. A separate variable should be used to capture the currency code unless the value is in British pounds. A separate, optional variable may also be used to indicate the unit of quantity, such as millions.
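The rules above could be sketched in Python as follows. This is an illustration under stated assumptions, not the ONS implementation: the names `to_money_amount` and `MoneyValue`, and the choice of `Decimal` as the numeric type, are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

TWO_PLACES = Decimal("0.01")


def to_money_amount(value: str) -> Decimal:
    """Parse a numeric string, padding to a minimum of two decimal places.

    Negative values are allowed. Values that already carry more than two
    decimal places are kept as supplied. Text that is not numeric raises
    decimal.InvalidOperation from the Decimal constructor.
    """
    amount = Decimal(value)
    if not amount.is_finite():
        raise ValueError(f"not a finite amount: {value!r}")
    if amount.as_tuple().exponent > -2:
        amount = amount.quantize(TWO_PLACES)
    return amount


@dataclass(frozen=True)
class MoneyValue:
    amount: Decimal
    currency_code: Optional[str] = None  # assumption: None is read as British pounds
    unit: Optional[str] = None           # optional unit of quantity, e.g. "millions"


record = MoneyValue(to_money_amount("-1250.5"), currency_code="EUR", unit="millions")
print(format(record.amount, "f"))  # -1250.50
```

Using `Decimal` rather than a binary float keeps the stored number of decimal places exact, which is one reasonable way to meet the two-decimal-place minimum.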
4. Dataset storage
The Hadoop data processing platform allows data to be stored in a variety of different storage formats including plain text, images and binary formats. Common file formats used for processing data on Hadoop include Avro and Parquet, the choice of which depends on the processing that will be carried out.
When each query scans or retrieves all columns in a row, or when a dataset has only a few columns, Avro is usually the best choice.
- Avro should be used for raw data ingested into the Data Access Platform (DAP), which is the ONS’s Hadoop-based platform.
- Avro is a “self-describing” storage format, meaning that it embeds the data and the schema (structural metadata) in the same file; this allows users to explore the data without any dependency on schema information defined elsewhere. It is a compact, row-based format, ideal for validation and standardisation processing as part of data preparation. A detailed specification of this standard is available in the Apache Avro Documentation.
For a dataset with many columns where a subset of those columns is required, Parquet is usually the best choice.
- Parquet should be used for storing standardised data on the DAP. This data format is better suited than Avro for analysis and statistical production.
- Parquet is a “self-describing” storage format, as per Avro. It is a compact, column-based format, ideal for analytical queries, which typically use only a few columns in a wide dataset. A detailed specification of this standard is available in the Apache Parquet Documentation.
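The difference between row-based and column-based layouts can be illustrated with a toy example in plain Python. This does not read or write actual Avro or Parquet files; the records, field names and helper functions are invented purely to show why a column-based layout suits queries that need only a few columns.

```python
# Hypothetical records, used only to illustrate the two layouts.
rows = [
    {"id": 1, "region": "Wales", "turnover": 120.0},
    {"id": 2, "region": "England", "turnover": 340.0},
    {"id": 3, "region": "Wales", "turnover": 90.0},
]

# Row-based layout (the arrangement Avro uses): each record is kept together,
# which suits validation that inspects every field of a record.
row_store = [tuple(record.values()) for record in rows]

# Column-based layout (the arrangement Parquet uses): each column's values
# are kept together, so a query can read one column and skip the rest.
column_store = {name: [record[name] for record in rows] for name in rows[0]}


def total_turnover_from_rows(store):
    # Every full record has to be visited to pick out a single field.
    return sum(record[2] for record in store)


def total_turnover_from_columns(store):
    # Only the "turnover" column is touched.
    return sum(store["turnover"])
```

Both functions return the same total; the column-based version simply reads less data to get there, and that saving grows with the number of columns in the dataset.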
5. Date and time
Date and time information is part of most ONS datasets. A common standard for recording this information is important for maintaining data consistency and quality. A standard format for date and time has been defined in this document to address this need.
This standard was developed primarily for statistical data in the ONS. It can also be used when exchanging data with third parties and is suitable for adoption in other domains.
The Government Digital Service recommends the use of the ISO 8601 standard for representing date and time. Fundamentally, there are two choices:
- a basic format, with no separation between different components of date and time (for example, YYYYMMDD or HHMMSS)
- an extended format that includes separators (for example, YYYY-MM-DD or HH:MM:SS)
This data standard mandates the second option, the extended format, for improved clarity.
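A minimal Python sketch of producing and checking the extended format is shown below. The function names are assumptions made for this example. A regular expression is used ahead of `strptime` because `strptime` on its own also accepts values that are not zero-padded.

```python
import re
from datetime import date, datetime, time

_EXTENDED_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
_EXTENDED_TIME = re.compile(r"[0-9]{2}:[0-9]{2}:[0-9]{2}")


def parse_extended_date(text: str) -> date:
    """Accept only YYYY-MM-DD; reject the basic form YYYYMMDD."""
    if _EXTENDED_DATE.fullmatch(text) is None:
        raise ValueError(f"not in extended date format YYYY-MM-DD: {text!r}")
    # strptime raises ValueError for impossible dates such as 2024-02-30.
    return datetime.strptime(text, "%Y-%m-%d").date()


def parse_extended_time(text: str) -> time:
    """Accept only HH:MM:SS; reject the basic form HHMMSS."""
    if _EXTENDED_TIME.fullmatch(text) is None:
        raise ValueError(f"not in extended time format HH:MM:SS: {text!r}")
    return datetime.strptime(text, "%H:%M:%S").time()


# Writing values out: isoformat() already produces the extended format.
print(date(2024, 3, 5).isoformat())  # 2024-03-05
print(time(9, 30, 0).isoformat())    # 09:30:00
```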
6. Flag
Binary indicators, such as flags, are the most widely used variables for recording simple responses to questions that have two possible answers. There are many ways to record the responses: for example, true or false, yes or no, 1 or 0, and agree or disagree. While these all equate to “true” or “false”, it is important for consistency and data quality that the same set of values is used to record responses to variables of binary indicator type.
The scope of this standard is:
- data and metadata
- for use within the ONS
- for use while sharing data with third parties
- any binary indicators that hold the response as a Boolean value
The standard is to use a Boolean variable to record binary indicators, with lower case “true” or “false” being the only allowed values. Spaces and nulls are allowed and are treated as no response or not supplied.
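The rule could be sketched in Python as below. The function name `parse_flag` is hypothetical, and rejecting every other spelling (such as "True", "yes" or "1") is this example's reading of "the only allowed values".

```python
from typing import Optional


def parse_flag(raw: Optional[str]) -> Optional[bool]:
    """Convert a recorded flag to a Boolean.

    Nulls and values made up only of spaces mean "no response or not
    supplied" and are returned as None.
    """
    if raw is None or raw.strip() == "":
        return None
    if raw == "true":
        return True
    if raw == "false":
        return False
    raise ValueError(f"flag must be lower case 'true' or 'false': {raw!r}")
```

For example, `parse_flag("true")` returns `True`, `parse_flag("   ")` returns `None`, and `parse_flag("Yes")` raises `ValueError`.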
7. ONS character set
The ONS needs to standardise data to achieve consistency and improve quality and reuse, in line with the ONS Data Principles. The ONS must cater for multiple languages, including English and Welsh. Consequently, a standard character set has been defined based on the ONS’s requirements, industry best practice and the UK Government standard.
In line with the Government Digital Service Cross platform character encoding profile, the standard character set is UTF-8, with ASCII text being used where possible.
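As a short illustration in Python, the sketch below encodes text as UTF-8 and checks whether a value stays within ASCII. The helper names and the sample Welsh word are assumptions made for this example.

```python
def to_utf8(text: str) -> bytes:
    """Encode text using the standard character set, UTF-8."""
    return text.encode("utf-8")


def from_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes; invalid byte sequences raise UnicodeDecodeError."""
    return data.decode("utf-8", errors="strict")


def is_plain_ascii(text: str) -> bool:
    """True when every character falls within the ASCII range."""
    return text.isascii()


welsh = "Tŷ"  # contains U+0177, which lies outside ASCII
print(is_plain_ascii("House"))  # True
print(is_plain_ascii(welsh))    # False
print(to_utf8(welsh))           # b'T\xc5\xb7'
```

ASCII text is encoded identically in UTF-8, so preferring ASCII where possible does not conflict with using UTF-8 as the standard character set.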
8. ONS asset description metadata schema
The ONS has several metadata schemas in place that are used to describe data and metadata at various points on the ONS data journey.
A set of metadata schemas has been identified as being suitable for use by the ONS:
- Stat-DCAT (Statistical domain data catalog vocabulary)
- ADMS (Asset Description Metadata Schema)
- SDMX (statistical data and metadata exchange)
- CSVW (CSV on the web)
The ADMS is used and recommended by the European Commission for interoperability solutions by public administrations, businesses and citizens. The ADMS provides an extensive schema for information assets and is well aligned with the ONS’s requirements. Currently, the ADMS supports JSON, RDF and XML file formats. At the ONS, the ADMS is recommended as the metadata schema for JSON files.
9. Naming standards for the Data Access Platform
The DAP is a Hadoop-based shared central platform for many business areas in the ONS. Standards for naming objects on the platform are important for ensuring consistency and operational efficiency. This standard defines naming standards for the following objects:
- folders
- files
- databases
Role-based access control (RBAC) is the approach taken to manage access to various resources on the DAP. This uses Active Directory (AD) groups and user roles defined in the Cloudera Sentry Service, which is the access management and authorisation system for Hadoop. This document therefore includes standards for naming the following objects, which manage access to the various resources on the platform:
- AD groups
- Sentry roles
This standard is developed for the DAP and intended for use by technical development and operational teams.
We have decided to manage access to each data object with a corresponding AD group or Sentry role. Note that there are length limitations for AD group names owing to their use across different applications within the DAP.
10. Standard variable format – duration
Duration or period is the amount of intervening time in a time interval and is measured in date- and time-based elements (that is, seconds, minutes, hours, days, weeks, months and years).
The scope of this standard is:
- data and metadata
- for use within the ONS
- for use while sharing data with third parties
- any variable that holds duration or period values
The Government Digital Service recommends the use of ISO 8601 for representing date and time, including time intervals. We have therefore decided to adopt this recommendation for recording duration. We have specified the use of a string variable to record various elements of date and time:
- values specified for the date or time elements must be positive integers
- a decimal value may be specified only for the lowest order date or time element
- spaces and nulls are allowed and treated as no response or not supplied
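The rules above could be checked with a sketch like the following, which assumes the ISO 8601 duration notation (for example, P3Y6M4DT12H30M5S or P2W). The function name, the returned dictionary shape and the use of a full stop as the decimal separator are choices made for this example, not part of the standard as quoted here.

```python
import re
from typing import Dict, Optional, Union

_NUM = r"[0-9]+(?:\.[0-9]+)?"
_PATTERN = re.compile(
    rf"P(?:(?P<weeks>{_NUM})W"
    rf"|(?:(?P<years>{_NUM})Y)?(?:(?P<months>{_NUM})M)?(?:(?P<days>{_NUM})D)?"
    rf"(?:T(?:(?P<hours>{_NUM})H)?(?:(?P<minutes>{_NUM})M)?(?:(?P<seconds>{_NUM})S)?)?)"
)
# Elements from highest to lowest order.
_ORDER = ("years", "months", "weeks", "days", "hours", "minutes", "seconds")


def parse_duration(value: Optional[str]) -> Optional[Dict[str, Union[int, float]]]:
    """Parse a duration string into its elements.

    Nulls and values made up only of spaces return None (not supplied).
    """
    if value is None or value.strip() == "":
        return None
    match = _PATTERN.fullmatch(value)
    if match is None or value.endswith("T"):
        raise ValueError(f"not a valid duration: {value!r}")
    present = [(n, match.group(n)) for n in _ORDER if match.group(n) is not None]
    if not present:
        raise ValueError(f"duration has no elements: {value!r}")
    # A decimal value is allowed only on the lowest order element present.
    for name, text in present[:-1]:
        if "." in text:
            raise ValueError(f"decimal not allowed for {name} in {value!r}")
    return {n: float(t) if "." in t else int(t) for n, t in present}
```

For example, `parse_duration("PT1.5H")` returns `{"hours": 1.5}`, whereas `parse_duration("P1.5YT2H")` raises `ValueError` because the decimal is not on the lowest order element.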