Finding the right balance of speed and security through just-in-time access to cloud resources

October 11, 2023

Since joining Ramp in 2021, I have seen the company grow from fewer than 100 employees to more than 600 today. This high-speed growth generated exciting engineering challenges, including managing backend engineers' access to cloud infrastructure resources.

In the early days, our engineers operated with only three AWS roles: Junior-Engineers for interns and entry-level engineers, Engineers for senior and established engineers, and Admins for system owners. Whenever a backend engineer needed access to a new or existing resource, the infrastructure team had to add a new IAM statement to the role policy. This added friction on both sides: the backend engineer had to wait for the request to be completed, and the infrastructure engineer had to perform a repetitive task. It also created organizational problems that worsened with time. After months of operating with this workflow, we noticed the following problems:

  • Senior engineers had persistent production access to more of our system than necessary. This raised the baseline risk of their accounts: a mistake or an account compromise could impact a large swath of infrastructure at once.
  • Junior engineers had limited access to our system, leading them to ask senior engineers to run queries on their behalf. This added friction, monopolizing engineering time.

Working closely with our engineering organization, my team and I set out to solve these problems in two steps: first by restructuring IAM permissions and roles, and then by building a just-in-time access provisioning solution with ConductorOne to handle privilege escalation in AWS.

The ideal system

In a perfect world, every engineer has standing access to non-production environments and limited to no access to the production environment.

  • When an engineer requires access to a resource in production they own, they submit a request that gets automatically approved. They can then access that resource in production for a limited period, after which they lose their elevated permissions.
  • When an engineer requires access to a resource in production they do not own, they submit a request that gets routed to the owner of that resource. Once the request gets approved, they obtain access for a limited period. If the resource owner is away and the request remains pending for too long, it gets routed to a backup approver until someone approves or denies it.

In the ideal system, resource and system ownership are clearly defined, access is provisioned and de-provisioned instantly without extra friction, and every action performed against production is fully audited and recorded. This points to a security practice called just-in-time (JIT) access, which grants users temporary access to specific systems and resources for a limited predefined period.
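
The two request paths described above can be sketched as a small routing function. This is a minimal illustration in Python, not a description of any real tool: the `AccessRequest` type, the `route_request` and `access_expires_at` functions, and the four-hour escalation timeout are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical value: how long a request may stay pending before it is
# escalated to the backup approver.
ESCALATION_TIMEOUT = timedelta(hours=4)


@dataclass
class AccessRequest:
    requester: str
    resource_owner: str
    backup_approver: str
    submitted_at: datetime
    duration: timedelta  # how long access lasts once granted


def route_request(req: AccessRequest, now: datetime) -> str:
    """Return who must act on a pending request at time `now`."""
    if req.requester == req.resource_owner:
        # Owners get access to their own resources without waiting.
        return "auto-approved"
    if now - req.submitted_at > ESCALATION_TIMEOUT:
        # The owner has not responded in time, so escalate.
        return req.backup_approver
    return req.resource_owner


def access_expires_at(req: AccessRequest, approved_at: datetime) -> datetime:
    """Access is always time-bound: it ends `duration` after approval."""
    return approved_at + req.duration
```

The point of the sketch is that neither path grants standing access: even an auto-approved request carries an expiry time.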

[Figure: flowchart]

Moving to fine-grained access control

Describing the ideal system showed us that it has two halves. The technical half is about defining what access engineers can obtain and implementing how that access is provisioned. The human half is about establishing clear system ownership for every engineering team, so that they all use the tool properly.

In our journey toward just-in-time access, we first moved to fine-grained access control by scoping AWS SSO groups per engineering team. By moving from 3 roles to more than 20, our objective was to create a clear social contract around the use of our future tool. The tech lead of a team would increasingly own their corner of the stack and would be accountable for their team's use of production access. As our teams scale and create more resources, it becomes easier to identify the owners of each part of the system. This embodies our core company value of ownership: each tech lead takes proactive responsibility for their part of the system, so accountability and quality are not compromised as we keep building.

After spending time with engineering teams to identify a baseline level of access, we established an access management framework built on AWS IAM Identity Center (the successor to AWS SSO). Each of the 20+ new groups was tailored to the needs of a specific team. Every group has a permission set attached that scopes its allowed actions to the team's responsibilities. Engineers now had access tightly bound to the systems and resources they owned.

Abstracting IAM complexity

The Ramp infrastructure team is always concerned with developer velocity. Instead of making our backend engineers open DevOps tickets, we invested in building self-service frameworks they can leverage to create infrastructure owned by specific teams.

Following this principle, we built a Terraform module that abstracts away the complexity of IAM to promote access visibility and ease of use. The module takes input data from a human-readable interface and creates the set of IAM statements needed to grant the desired access. For a team X, we have a Terraform file engineers_X.tf that sources our module. Whenever engineers ask about their permissions, we point them to the file for their team. That way, they have a direct view of what they own and can access. If they need access to other resources, they can open a pull request for the infrastructure team to review without dealing with the complexity of IAM statements.

  module "sso_group_engineers_X" {
    source      = "../../../modules/aws/sso/group"
    name        = "Engineers-X"
    description = "Default role for engineers in team X"

    managed_policy_arns = [

    ]

    pipelines = [

    ]

    lambdas = [

    ]

    prod_containers = [

    ]

    rw_bucket_paths = [
      
    ]

    secrets = [

    ]
  }

  • managed_policy_arns is a list of IAM policies shared between teams. It is mainly used to scope access to non-production environments and some read-only access to the production environment.
  • pipelines is a list of CI/CD pipelines the team can access.
  • lambdas is a list of Lambda functions the team can invoke.
  • prod_containers is a list of ECS Fargate tasks the team can access in the production environment.
  • rw_bucket_paths is a list of S3 bucket paths the team has read and write access to.
  • secrets is a list of AWS secrets where the team can fetch, add, and edit values.
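
To make the idea of turning human-readable inputs into IAM statements concrete, here is a rough sketch in Python rather than Terraform. It is not the actual module: the function names, the region and account placeholders, and the choice of IAM actions for each input are assumptions made for illustration, and only three of the inputs are covered.

```python
import json

# Hypothetical placeholders; a real module would receive these as variables.
REGION = "us-east-1"
ACCOUNT_ID = "123456789012"


def build_statements(lambdas: list[str],
                     rw_bucket_paths: list[str],
                     secrets: list[str]) -> list[dict]:
    """Expand simple name lists into IAM policy statements."""
    statements = []
    if lambdas:
        statements.append({
            "Effect": "Allow",
            "Action": ["lambda:InvokeFunction"],
            "Resource": [
                f"arn:aws:lambda:{REGION}:{ACCOUNT_ID}:function:{name}"
                for name in lambdas
            ],
        })
    if rw_bucket_paths:
        statements.append({
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{path}/*" for path in rw_bucket_paths],
        })
    if secrets:
        statements.append({
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:PutSecretValue",
            ],
            # Secrets Manager appends a random six-character suffix to ARNs.
            "Resource": [
                f"arn:aws:secretsmanager:{REGION}:{ACCOUNT_ID}:secret:{name}-??????"
                for name in secrets
            ],
        })
    return statements


def build_policy(lambdas: list[str],
                 rw_bucket_paths: list[str],
                 secrets: list[str]) -> dict:
    """Wrap the generated statements in an IAM policy document."""
    return {
        "Version": "2012-10-17",
        "Statement": build_statements(lambdas, rw_bucket_paths, secrets),
    }


if __name__ == "__main__":
    # Example inputs are made up for the demonstration.
    policy = build_policy(["report-builder"], ["team-x-data/exports"], [])
    print(json.dumps(policy, indent=2))
```

The benefit of this shape is that reviewers read short lists of resource names in a pull request, while the ARN formats and action names live in one place.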

Implementing JIT access with ConductorOne

Splitting the permissions per engineering team was a massive win for us, but a subset of engineers still had standing access to production parts of the system. The only way to resolve this problem was to move to just-in-time access. The plan was simple: for each of the 20+ engineering roles, create an associated production role with elevated permissions. That way, we could remove production access from the default non-production roles. In this new world, every production role is a superset of its non-production counterpart. No user has access to a production role by default, and only a specific set of eligible engineers can request and obtain one. We just needed a way to automate access provisioning.

In the fall of 2022, I started discussing my JIT project with the founder and CEO of ConductorOne, Alex Bovee. Ramp was already using ConductorOne to provision application access to new hires. Alex invested time to understand my use cases and scenarios and onboarded me to the product for AWS. We tried it with one engineering team, and after many iterations and a progressive deployment, we adopted ConductorOne globally for AWS.

ConductorOne provisioning works by adding users to and removing them from AWS SSO groups. We map every AWS role to a ConductorOne entitlement. Entitlements have policies attached that define how users obtain access, and catalogs that define who is eligible to request it. In our setup:

  • Eligible production engineers can request access to any production role.
    • If the requested production role is the elevated role for the requester's team, it is obtained immediately without requiring approval.
    • If the requested production role is the elevated role of another team, it creates a pending request routed to the tech lead of that team, who is responsible for approving or declining the request.
  • Owners of an entitlement can approve or deny incoming requests. They can also see who currently has access to production and revoke it if needed.
  • Engineers usually become eligible for production access once they meet specific tenure criteria and complete PCI-mandated secure code training.
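
The policy described above boils down to an eligibility check followed by a routing decision. The following Python sketch models that logic only. It does not use or represent the ConductorOne API, and the `Engineer` type, the `decide` function, and the 90-day tenure threshold are hypothetical (the actual tenure criteria are not specified here).

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical threshold standing in for the unspecified tenure criteria.
MIN_TENURE_DAYS = 90


@dataclass
class Engineer:
    name: str
    team: str
    start_date: date
    completed_secure_code_training: bool


def is_eligible(engineer: Engineer, today: date) -> bool:
    """Eligibility requires both enough tenure and completed training."""
    tenure_days = (today - engineer.start_date).days
    return (tenure_days >= MIN_TENURE_DAYS
            and engineer.completed_secure_code_training)


def decide(engineer: Engineer, role_team: str,
           tech_leads: dict[str, str], today: date) -> str:
    """Decide the outcome of a request for `role_team`'s production role."""
    if not is_eligible(engineer, today):
        return "not eligible"
    if role_team == engineer.team:
        # Requests for the requester's own team role need no approval.
        return "granted"
    # Requests for another team's role go to that team's tech lead.
    return f"pending approval from {tech_leads[role_team]}"
```

For example, with `tech_leads = {"payments": "dana"}`, an eligible engineer on another team who requests the payments production role gets back `"pending approval from dana"`.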

Improving workflows with cone CLI

To make things easier for our engineers and remove the friction of requesting access through the ConductorOne web application, we integrated cone, ConductorOne's CLI tool, into our own internal CLI tool. That way, our engineers can use a CLI tool they're already familiar with to access AWS resources. Based on the engineer's default role, the tool suggests requesting the associated production role and can obtain access in less than 10 seconds.

Here is the flow when an eligible engineer attempts to SSH into a production container.

  devtool ssh-to-container --env prd
  [SSO] Current SSO Context: account: account-X, role: Engineers-X
  [SSO] Initiating SSO session at 2023-08-24T22:12:03.218619Z
  [SSO] Client secret still valid until 2023-11-20T13:58:12Z
  [SSO] Device authorization still valid until 2023-08-24T23:11:52Z
  [JIT] You're attempting to access production. 
  [JIT] Detecting your default requestable production role is prod-eng-X. 
  Would you like to request it?: Yes
  > Yes
    I'd like to request another role

  [JIT] How long do you need production access for?: 1h
  > 1h
    12h
    1d

  [JIT] Why do you need production access? Please give a valid reason.
    Working on ticket INFRA-2441
  [JIT] Requesting access to prod-eng-X from ConductorOne... 
  [JIT] Production role prod-eng-X successfully obtained.
  [?] Choose a container to ssh to: container_X
  > container_X
  Starting session with SessionId 07ca85df7cb98f013 under role prod-eng-X. 
  # 
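
Two small pieces of the flow above are easy to sketch: suggesting a default production role from the current role, and turning a duration choice such as 1h or 1d into a time span. The Python below is a guess at how such helpers might look, not the actual devtool code. The function names are hypothetical, and the prefix-based mapping is inferred only from the role names Engineers-X and prod-eng-X shown in the transcript.

```python
from datetime import timedelta

# Assumed naming convention, inferred from the transcript above.
DEFAULT_ROLE_PREFIX = "Engineers-"
PROD_ROLE_PREFIX = "prod-eng-"

# Supported duration suffixes and the timedelta argument they map to.
_UNITS = {"h": "hours", "d": "days"}


def default_production_role(default_role: str) -> str:
    """Map a default role to the production role suggested by default."""
    if not default_role.startswith(DEFAULT_ROLE_PREFIX):
        raise ValueError(f"unexpected role name: {default_role!r}")
    team = default_role[len(DEFAULT_ROLE_PREFIX):]
    return PROD_ROLE_PREFIX + team


def parse_duration(text: str) -> timedelta:
    """Parse choices such as '1h', '12h', or '1d' into a timedelta."""
    amount, unit = text[:-1], text[-1:]
    if unit not in _UNITS or not amount.isdigit():
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{_UNITS[unit]: int(amount)})
```

Rejecting unknown role names and durations up front keeps a malformed request from ever reaching the access provider.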

Analyzing impact and benefits

Our implementation of just-in-time access to production brought clear benefits to our organization. It strengthens security by shrinking the window of exposure: engineers hold production access only when they need it, so a mistake or a compromised account has less standing access to exploit. This in turn improves data protection and supports user trust.

That impact extends to compliance frameworks like SOC 2 and ISO 27001, where controlled access is pivotal. Just-in-time access aligns with SOC 2 principles by demonstrating adequate controls for safeguarding customer data. It also reduces audit scope and enforces accountability through detailed access tracking.

As for engineering velocity, this security improvement came with only minimal friction. Our engineers can still access the resources they need quickly, without being blocked by pending access requests. The significant difference is that we can now correlate every production access with a customer issue or another valid justification, which helps us understand why our engineers need access to production.

Lessons learned

Establishing clear system ownership is crucial before building an access management framework.

Defining who is responsible for different system aspects ensures that access controls align with operational needs and security considerations. Clear ownership delineation prevents ambiguity, streamlines decision-making, and enables effective delegation of access rights. Without this foundational clarity, access management efforts can become fragmented, potentially leading to inefficiencies, security gaps, and conflicts in authorization. A well-defined system ownership framework provides the essential context for designing access controls that safeguard sensitive resources while facilitating smooth and secure collaboration within the organization.

Investing in the right product early on will save time and money for your company.

By choosing a suitable product that aligns with your needs and objectives, you can avoid the pitfalls of later adjustments, reconfigurations, and potential replacements. This proactive approach ensures that resources are channeled efficiently into a solution that addresses your requirements comprehensively. ConductorOne's capabilities to streamline access provisioning, enforce fine-grained permissions, and offer real-time visibility into access patterns empowered our organization to navigate the complexities of cloud environments effectively. Addressing access management comprehensively at an early stage prevents costly security breaches, minimizes the need for reactive measures, and optimizes development cycles. Moreover, as your company scales, the benefits of a well-integrated access management solution become increasingly pronounced, saving valuable resources and positioning your company for sustainable growth.

Onboarding hundreds of engineers to a new system can sometimes be more challenging than building that new system.

While building a new system involves technical expertise and design prowess, introducing it to a large workforce necessitates comprehensive planning, communication, and support mechanisms. Clear documentation and ongoing assistance ensure a smooth transition and rapid adoption. Failing to address the onboarding process adequately can result in productivity losses, resistance to change, and even system underutilization. Thus, recognizing the significance of seamless onboarding is critical in maximizing the value of the newly developed system.
