Data Center automation at NVIDIA using Pipeline as Code

Bringing CI/CD principles and practices into data center (DC) operations helps manage DC resources efficiently, from network device setup through the container platform to application deployment. It also helps scale a DC to thousands of servers reliably.

Have you ever thought of automating the end-to-end workflow for setting up a new data center with a single click? Have you ever thought of automating infrastructure setups so that work which generally takes months of effort is reduced to hours? Some examples of such setups are:

1. Setting up the DC network in a reliable and reproducible manner.
2. Automatic OS provisioning on blade servers.
3. A secure login mechanism for human and service accounts.
4. Configuring the DC components using idempotent automation workflows.
5. Setting up a highly available internal private cloud or container orchestration platform, such as Kubernetes, on auto-provisioned infrastructure.
6. A very complex inventory state life cycle management workflow.

To accomplish reliable, reproducible and idempotent automation for infrastructure setup, the NVIDIA DevOps team has been working on implementing *DC Automation Manager*, a framework developed using the CI/CD tools ecosystem.
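To make "idempotent" concrete, here is a minimal Python sketch of a reconcile step that compares a desired state with a node's actual state and applies only the difference. The names `Node` and `reconcile` and the package-set model are hypothetical, chosen for illustration; this is not how DC Automation Manager itself is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a server's actual state."""
    name: str
    packages: set = field(default_factory=set)


def reconcile(node, desired_packages):
    """Move the node toward the desired state and return the actions taken.

    Only the difference between desired and actual state is applied,
    so a second run against an already converged node does nothing.
    """
    actions = []
    # Install what is desired but missing.
    for pkg in sorted(desired_packages - node.packages):
        node.packages.add(pkg)
        actions.append(f"install {pkg}")
    # Remove what is present but no longer desired.
    for pkg in sorted(node.packages - desired_packages):
        node.packages.discard(pkg)
        actions.append(f"remove {pkg}")
    return actions


node = Node("gpu-node-01", {"ntp", "telnet"})
desired = {"ntp", "docker"}

first_run = reconcile(node, desired)   # applies the two pending changes
second_run = reconcile(node, desired)  # nothing left to do
print(first_run, second_run)
```

Because the second run returns an empty action list, the same pipeline job can be re-triggered safely, which is the property that makes single-click, repeatable setups practical.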

In this presentation we will talk about the design and automation used at NVIDIA GPU Cloud to set up a new DC of thousands of GPU and CPU blade servers from scratch, using Jenkins and GitOps for:

1. Streamlining the inventory life cycle
2. L2/L3 network setup
3. Secret management to secure human and automated interaction with all data center services
4. Node provisioning and OS configuration with dynamic inventory capabilities
5. Setting up container orchestration platforms on bare metal (BM) or cloud
6. Bridging the gap between application engineering and operations engineering

Gopi Vadlamudi

Principal Engineer, NVIDIA CORPORATION

Senior Engineering Manager, NVIDIA GPU Cloud DevOps Team.

