https://onsdigital.blog.gov.uk/2023/08/31/delivering-the-government-cloud-first-policy-what-it-looks-like-in-practice-at-the-ons/

Delivering the Government cloud first policy – what it looks like in practice at the ONS

Categories: Automation, Cloud, DevSecOps, Technology

Image: Screenshot of the Government cloud first policy

Our Cloud Services and Support team (CSS) are the infrastructure operations and best practice hub for the ONS Digital Services and Technology Cloud Division. We were created in 2020 in response to the ONS’s growing need to operate sustainable, scalable and secure cloud services. We now comprise 11 people working across a variety of areas and geographies.

A new DevSecOps engineer role

We have a broad range of expertise in our community: Infrastructure Engineering, Software Engineering, Technical Architecture, Security, Delivery Management and Service Management. Given the pace at which we are delivering cloud services to the organisation, the breadth and agility of skills we need is constantly changing. As a result, we pioneered the creation and recruitment of a new role: the DevSecOps Engineer! This role embeds thinking around security risk and mitigation and places it squarely in the middle of our ‘DevOps’ working practice.

We work very closely with the architecture teams in our Cloud Division (Cloud Infrastructure and Architecture, and Cloud Architecture and Technical Design), and our work spans three major public clouds: Amazon Web Services, Google Cloud and Microsoft Azure. That is a lot of ground to cover. One question I posed to myself when writing this was “how are we doing this?”, but I think the more accurate question would be “how can we adapt and anticipate what needs to be done?”

Our approach

We use Agile combined with the best bits of Site Reliability Engineering, ITIL and DevOps. We hold stand-ups and planning sessions which are very wide ranging. We love a good ‘hackathon’, where we surge together to solve a problem. We ask colleagues, including across Government, what challenges and adaptation journeys they have been on and learn from their experience. We ask for help; we try things and get creative. Sometimes we even fail, and that’s a MASSIVELY good thing, allowing us to learn fast and pivot.

Some of the approaches we use to make sense of this journey are:

Identity and cloud projects

Our team manages the identity platforms for users on AWS and Google, and the lifecycle of project spaces (or ‘cloud warehouses’) across AWS, Google and Azure, for most divisions across the ONS, such as the Integrated Data Service, Macroeconomic Stats and Surveys. We build and manage the secure cloud ‘warehouses’ from within which the business operates its specific cloud services; it is a bit like a business park, but with 1000+ separate units! All projects are built using ‘infrastructure as code’, and we work with our users to ensure they have access to the policies, principles and guidance they need to keep their warehouse secure.

Cloud ecosystem

Our colleagues and customers depend on a wide variety of products within the cloud ‘ecosystem’ that we support including:

  • Collaboration platforms
  • Source code control systems
  • Observability platforms
  • Developer Workstations
  • Container Management Platforms
  • Data Analysis tools

Track everything

We track everything that we can with our use of the ServiceNow ITSM toolset, including Cloud Identities, Cloud Projects, and the Cloud Ecosystem. This platform allows us to track, prioritise and audit incidents, changes, requests, and problems.

One of the key challenges we face with the pace of cloud adoption is making sense of who owns and manages what, where and why in cloud – ServiceNow is helping us with this capability. We have built a ‘Cloud Account Management System’ (CAMS) in ServiceNow that holds metadata around each cloud project account and is updated automatically when a new project is created. It also surveys all cloud account owners regularly to ensure information is up to date.
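As a purely illustrative sketch of the idea (not the actual CAMS schema or any ServiceNow API), the snippet below models one metadata record per cloud project account and picks out the owners who are due a survey. The field names, the `owners_to_survey` helper, the 90-day interval and the example.com addresses are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CloudAccountRecord:
    """Hypothetical metadata for one cloud project account (not the real CAMS schema)."""
    project_id: str
    provider: str         # e.g. "aws", "gcp" or "azure"
    owner_email: str
    division: str
    last_confirmed: date  # date the owner last confirmed these details


def owners_to_survey(records, today, max_age_days=90):
    """Return the records whose details were last confirmed more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return [r for r in records if r.last_confirmed < cutoff]


# Made-up records for illustration only.
records = [
    CloudAccountRecord("proj-a", "aws", "owner.a@example.com", "Surveys", date(2023, 1, 10)),
    CloudAccountRecord("proj-b", "gcp", "owner.b@example.com", "Integrated Data Service", date(2023, 8, 1)),
]

due = owners_to_survey(records, today=date(2023, 8, 31))
print([r.project_id for r in due])  # prints ['proj-a']
```

Keeping a "last confirmed" date on each record is one simple way to decide who needs surveying; a real system would hold far more metadata and would send the survey itself.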

The MOT test

As we are working in a very fluid and organic space (ironic, given the amount of tech at work!), we acknowledged that a process that was suitable, secure and scalable at one point in time, perhaps when a service went live, will not necessarily stay that way. The nature of change in cloud means that processes warrant ongoing consideration and iteration, and in some cases automation, so we developed the ‘MOT Check’. The whole team gathers monthly and runs down the list of cloud processes we operate to see if they are still valid, could be optimised or, in some cases, are no longer needed. It’s a great way to look across our estate and identify where we can automate things.

Automate the boring stuff

‘Toil’ (not ‘time off in lieu’!) is the term we use for the effort spent on repeatable tasks, such as creating a user account in Google Cloud or AWS, or creating a Data Analysis project for the Integrated Data Service; it’s a term we borrowed from the ‘Site Reliability Engineering’ principles. As we move further along our cloud journey, we look at these repeatable tasks and ask whether there are more automatable or bot-driven ways of undertaking them, so that we make the best use of our time and add value. Some of our successes so far are automating account creation and deletion across the cloud ecosystem and notifying users of dormant cloud accounts. As a central operational hub, we are also responsible for asking ‘can this operational task be automated?’ as early as possible in the design phases of the products we support, so that we can help get that automation built quickly. This is going to be an exciting area as AI-driven tooling becomes more reliable and adoptable.
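To give a flavour of what automating one of these tasks can look like, here is a minimal sketch of dormant-account detection. It is an assumption-laden illustration rather than our production tooling: the `find_dormant_accounts` function, the 60-day threshold and the sample data are hypothetical, and a real version would read last-login times from each cloud provider's identity service.

```python
from datetime import datetime, timedelta


def find_dormant_accounts(last_login_by_user, now, dormant_after_days=60):
    """Return user names with no login inside the dormancy window, sorted alphabetically.

    A last-login value of None means the account has never been used.
    The 60-day default is an assumed threshold, not a stated policy.
    """
    threshold = now - timedelta(days=dormant_after_days)
    return sorted(
        user
        for user, last_login in last_login_by_user.items()
        if last_login is None or last_login < threshold
    )


# Made-up last-login data for illustration only.
logins = {
    "alice": datetime(2023, 8, 20),
    "bob": datetime(2023, 3, 2),
    "carol": None,
}

dormant = find_dormant_accounts(logins, now=datetime(2023, 8, 31))
print(dormant)  # prints ['bob', 'carol']
```

The resulting list could then feed a notification step, so that the repeatable part of the task (working out who to contact) no longer needs a person.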

Service operations framework

‘Document enough and iterate often’. We utilise tools such as Confluence, Jira, and SharePoint to build standard operating procedures, playbooks and runbooks for service operations and have developed a framework that allows us to capture the key ‘themes’ around a service to successfully operate it at scale. At time of writing, we have 27 separate cloud services on our catalogue; this approach enables us to efficiently operate multiple services in parallel based on the key information needed to do so.

FinOps and sustainability

The shift to cloud services has evolved the way we consume technical resources and the way we pay for them. Though a bit of an oversimplification, cloud service consumption can loosely be defined as a ‘pay as you go’ contract with our cloud suppliers. This shift has allowed ONS to look at more efficient ways of delivering the same outcome using cloud services. However, this has created a need to keep on top of costs and commitments, which is where FinOps (Financial Operations) comes in. Our team is responsible for centralising the billing views across the public clouds and helping teams to understand where cost can be saved through embedded culture (e.g., through switching off cloud resources when not in use and deleting old projects) and cloud service design. We are also working closely with the architecture teams on a ‘tagging’ strategy that will allow us to map and observe how the organisation is using cloud services, enabling us to advise and steer teams in optimum use of cloud. Crucially, each cloud vendor is also heavily committed to the use of renewable energy to power their data centres in the next 10 years, so we can also help teams understand what the carbon footprint of their project is and how to optimise it.
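As a rough sketch of why tagging matters for FinOps, the example below rolls up billing rows from several clouds by a single tag. Everything in it is assumed for illustration: the `cost_by_tag` function, the `division` tag key, the row layout and the figures are hypothetical, and real billing exports differ by provider.

```python
from collections import defaultdict


def cost_by_tag(billing_rows, tag_key, untagged_label="untagged"):
    """Sum cost per value of tag_key; rows missing the tag go under untagged_label."""
    totals = defaultdict(float)
    for row in billing_rows:
        label = row.get("tags", {}).get(tag_key, untagged_label)
        totals[label] += row["cost_gbp"]
    return dict(totals)


# Made-up billing rows for illustration only.
rows = [
    {"provider": "aws", "cost_gbp": 120.0, "tags": {"division": "surveys"}},
    {"provider": "gcp", "cost_gbp": 30.0, "tags": {"division": "surveys"}},
    {"provider": "azure", "cost_gbp": 45.5, "tags": {}},
]

totals = cost_by_tag(rows, "division")
print(totals)  # prints {'surveys': 150.0, 'untagged': 45.5}
```

The size of the ‘untagged’ bucket is itself a useful signal: the larger it is, the less of the estate can be mapped back to a team.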

If you would like to learn more or would be interested in sharing insight across Government on Cloud service operations please get in touch with us via css@ons.gov.uk.
