Software operations is the business.
Maintaining software operations at a high standard already requires significant effort to ensure availability, performance, and security.
But digital competition pushes organizations to continuously improve the business with software changes, which increase the risk of disrupting operations.
In such a context, teams need a way to support business operations by systematically delivering software that meets production requirements.
Production-Readiness Review is a method, part of the Quality Engineering framework, for driving continuous improvement up to that high standard.
What is a Production-Readiness Review?
A Production-Readiness Review, or PRR for short, is a methodology that assesses the operational readiness of services with upcoming software changes, relying on a systematic checklist defined collaboratively with the team.
A Production-Readiness Review is like a final safeguard before going live. In no case is it there to replace activities that should have been performed properly earlier, such as defining non-functional requirements or other types of peer reviews.
The key elements of Production-Readiness Review are to:
- Be collaborative rather than centrally owned
- Ensure the operational readiness of software changes
- Perform incremental reviews at different lifecycle stages
- Focus only on upcoming changes, not on previous changes
- Identify improvement points without necessarily blocking the flow.
It is equally important to state what PRRs are not:
- Production-readiness criteria are not needed for all services
- A PRR is not an external board acting as a validator
- PRRs are not necessarily for all applications or changes
- PRRs are not SLOs, metrics or indicators
- A PRR is not a retrospective.
The method examines how the application supports the identified operational requirements at different stages of the lifecycle. It differs from a QE Retrospective, which is usually broader and performed after changes have been delivered.
The decision coming out of a Production Readiness Review can be to:
- Validate the change for Full-Rate Production (FRP)
- Validate the change for Low-Rate Initial Production (LRIP)
- Block the change for lacking critical requirement(s).
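The three outcomes above can be sketched as a small decision function. This is a minimal illustration, not a prescribed rule: the `Finding` type, its `critical` flag and the `decide` function are hypothetical names, and mapping non-critical gaps to a low-rate rollout is an assumption, since the method only states that a missing critical requirement blocks the change.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    FULL_RATE = "Full-Rate Production (FRP)"
    LOW_RATE = "Low-Rate Initial Production (LRIP)"
    BLOCKED = "Blocked"


@dataclass(frozen=True)
class Finding:
    requirement: str  # e.g. "On-call policy configured"
    critical: bool    # hypothetical flag: would a gap here block the change?
    met: bool         # outcome of the review for this requirement


def decide(findings: list[Finding]) -> Decision:
    """Map review findings to one of the three PRR outcomes."""
    gaps = [f for f in findings if not f.met]
    if any(f.critical for f in gaps):
        return Decision.BLOCKED    # a critical requirement is missing
    if gaps:
        return Decision.LOW_RATE   # assumption: minor gaps allow a limited rollout
    return Decision.FULL_RATE      # no gaps found


findings = [
    Finding("On-call policy configured", critical=True, met=True),
    Finding("Load tests automated", critical=False, met=False),
]
print(decide(findings).value)
```

In this example the only gap is non-critical, so the sketch returns the low-rate outcome; a team could just as well decide that such gaps go to the backlog without limiting the rollout.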
The method can be implemented progressively to avoid blocking the entire flow at the beginning. With more maturity, the practice is to perform minimal incremental assessments at the key phases of design and implementation, prior to deployment.
The additional value of PRR is to drive continuous improvement within the team. Solutions can be built into the software lifecycle to systematically meet operational requirements, freeing time for upstream improvements or for scaling reviews.
For GitLab, this review is meant to facilitate collaboration between Service Owners, Application Security, and Site Reliability teams to share and help bridge any gaps about a new service. The review document will serve as a snapshot of what is being deployed and the discussions that surround it. It is not intended to be constantly updated.
—GitLab Production-Readiness Review, GitLab.
The methodology is largely inspired by Google's PRR and is used in a variety of organizations, such as GitLab, that leverage it for high-performing software operations and for driving continuous improvement within a Quality Engineering organization.
Why use Production-Readiness Review in Quality Engineering?
Companies under pressure to transform their business with software must master the software lifecycle to deliver valuable changes without disrupting their operations; users simply switch to competitors when a company is unable to meet that standard.
Software teams face the challenge of accelerating software delivery while still meeting quality requirements. Relying on a systematic methodology is a way to optimize processes and open the door to upstream improvements and automation.
The defined criteria support faster verification through a systematic checklist but are also the foundations for improving how software changes cope with the multiple operational requirements, reduce the cognitive load of teams, and allow future automation.
Production-Readiness Reviews directly contribute to a high-standard customer experience and business operations, fostering an ecosystem where only Quality at Speed software is continuously delivered to users.
How does Production-Readiness Review contribute to Quality?
Software is becoming simpler to use but relies on a much more complex assembly of technologies. Additionally, distributed software and infrastructure architectures increase the risk of service disruption.
Answering the requirements of one application is already a challenge that increases when multiple applications are required to support the user experience and business operations. Teams need efficient methodologies to help them.
Production-Readiness Review contributes to Quality by being:
- Result-driven, by assessing key production-readiness requirements
- Systematic, by defining the assessment criteria per application and stage
- Scalable to multiple applications and teams, capitalizing on shared practices.
How does Production-Readiness Review contribute to Speed?
Organizations that are unable to get changes right the first time suffer the repeated cost of rework: another implementation at a much higher cost, plus the opportunity cost of not being able to work on other business changes.
But companies seeking to deliver a continuous flow of valuable software must iterate with speed. One effective way to be faster is to improve the likelihood of delivering right the first time, performing minimal verifications to minimize rework.
Production-Readiness Review contributes to Speed with:
- Focus, by assessing key operational criteria on the applications that matter
- Rhythm, by being performed systematically along the lifecycle and for each change
- Asynchronicity, with the possibility of performing reviews decoupled in time
- Visibility, by formalizing the important criteria and their respective assessments.
How to start with Production-Readiness Review in QE?
Production-Readiness Reviews are part of the Quality Engineering Framework because the methodology meets the requirements of progressivity, scalability and deployability across many organizations.
Keep in mind the following principles for your PRR:
- Start with the most valuable operational-readiness criteria
- Use business drivers and previous incidents to define priorities
- Start on the most valuable scope with a single meeting
- Focus on collaboration to review and define the action plan.
You can then implement Production-Readiness Reviews in four steps:
- Identify review stages along the software lifecycle
- Define the production-readiness requirements
- Asynchronously prepare the review
- Review collaboratively and formalize action plans.
Identify review stages along the software lifecycle
Production-Readiness Reviews are not performed on the same day as, or a few hours before, the deployment; discovering structural issues at that stage could lead to costly rework and block the flow for the following changes.
You therefore have to identify the best timing according to the team velocity to maximize the value of the PRR exercise. Too early, the design is not mature enough to cascade operational requirements; too late, and the rework is not likely to happen or will frustrate the team.
The best timing to start with a single PRR instance is at the Design stage of the Quality Engineering lifecycle. Straight after the Specify and Build stages, this stage offers a good balance for adding operational requirements just in time.
Define the production-readiness requirements
Efficient PRR meetings have the operational criteria defined beforehand to focus the team on the requirements that matter, which also allows them to prepare the meeting and replicate the practice across other teams.
Keep in mind that the goal is to identify major gaps representing significant risks; it’s not about designing a bullet-proof checklist that would overlap with other types of reviews like architecture or code reviews.
The PRR list will evolve over time according to your maturity and can start as a highlight of the key production requirements to ensure for the selected applications. I recommend making each point explicit with concrete examples and an explicit list of checks to align everyone.
Your Production-Readiness checklist can rely on the following elements; the most important points to start with are marked with an asterisk (*).
General
- *Ownership: Service owners are identified with contacts
- *Onboarding: Integration instructions for APIs are documented
- *Service levels: indicators, objectives and agreements (SLIs, SLOs, SLAs) are defined
- Functional documentation like flows and data are available
- Technology standards compliance like providers, languages and frameworks
- Error management has been documented and reviewed with the business
- Reusable components or patterns are identified or have been shared.
Disaster Recovery
- *Disaster recovery (DR): DR plans have been documented and tested
- *Backups: Backups of data occur regularly
- *Redundancy: Services have deployment in multiple regions or locations.
Deployment
- *Continuous integration using the standard pipeline
- *Continuous delivery standard with quality gates, changelog and release notes
- Deployment strategy: defined, such as blue-green or canary
- Static code analysis: Code is automatically scanned and passing standards.
Operations
- *On-call policy: Confirmed and configured in pager solution
- *Runbooks written and reviewed by support teams with known failure scenarios
- *Logging with centralized logging and the logs can be accessed easily
- *Metrics with the Four Golden Signals available with automated alerting
- Incident management: escalation processes and resolution are defined
- Tracing: The application transactions can be traced using standard solutions.
Testing
- *Unit tests: Unit tests execute at every code push, automatically
- *End-to-end or acceptance tests: available, tested and automated if pertinent
- *Monitoring: tests running in production for customer journeys are identified
- Integration tests: automated integration tests execute and pass successfully.
Resiliency
- Performance requirements have been identified for nominal and exceptional loads
- Load tests are automated or occur on a regular cadence with shared results
- Stress tests: performed if required, with automated alerts
- Chaos engineering performed for critical business applications.
Security
- *Authentication/authorization with standard mechanisms and protocols
- *Secrets management used for all vaults and secrets of the application
- *Static application security testing (SAST) running and passing
- *Dependency scan for latest, stable and patched versions
- Dynamic application security testing (DAST) set up for regular tests
- Penetration (pen) testing for externally exposed applications.
Governance, Risk, and Compliance (GRC)
- *GRC documentation completed, if needed in central systems
- *Confidentiality, integrity, availability (CIA) rating documented
- *Data privacy has been assessed and validated with stakeholders.
The above list was compiled from various sources: GitLab, the OpsLevel Service Maturity Framework, and the PRR checklist of Grafana.
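To make such a checklist easier to share and later automate, it can be kept as data rather than prose. The sketch below is one possible shape under stated assumptions: the `Criterion` type and the helper functions are hypothetical names, and only a subset of the criteria listed above is included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    category: str
    name: str
    starred: bool  # True for the items marked with an asterisk (*) above


# A subset of the checklist above; extend it with your own criteria.
CHECKLIST = [
    Criterion("General", "Ownership", starred=True),
    Criterion("General", "Functional documentation", starred=False),
    Criterion("Disaster Recovery", "Backups", starred=True),
    Criterion("Deployment", "Deployment strategy", starred=False),
    Criterion("Operations", "On-call policy", starred=True),
    Criterion("Testing", "Unit tests", starred=True),
    Criterion("Resiliency", "Load testing", starred=False),
    Criterion("Security", "Secrets management", starred=True),
]


def starting_checklist(checklist):
    """Return only the starred criteria, grouped by category."""
    grouped = {}
    for criterion in checklist:
        if criterion.starred:
            grouped.setdefault(criterion.category, []).append(criterion.name)
    return grouped


def open_items(checklist, passed):
    """Return the starred criteria not yet validated by the review.

    `passed` is the set of criterion names confirmed during the review.
    """
    return [c.name for c in checklist if c.starred and c.name not in passed]


for category, names in starting_checklist(CHECKLIST).items():
    print(f"{category}: {', '.join(names)}")

print(open_items(CHECKLIST, passed={"Ownership", "Backups", "Unit tests"}))
```

Starting from the starred items only keeps the first reviews short; the non-starred criteria can be switched on as the team matures.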
Asynchronously prepare the review
The preparation of a PRR can require multiple iterations depending on the number of topics to review, the complexity of the application, and the maturity level of the organization and team members.
The best way to prepare for the review is through asynchronous sessions of 30 minutes, selecting the most important areas to review for a particular change. In that setup, only two main actors are required: the author and the reviewer.
The author is usually the main person driving the changes up to production, who has the means to change the application depending on the review results. The reviewer is ideally someone external to the team, with different interests.
The preparation can be organized as an epic for the PRR, with a task for each block of production-readiness requirements to review; the two people can document their review in a shared documentation space to ease the review and later sharing.
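The epic-and-tasks structure can be sketched independently of any tracking tool. The `Epic` and `Task` types, the `build_prr_epic` helper and the example names are all hypothetical; the blocks reuse the checklist categories from this article, and a real setup would create these items in your own issue tracker.

```python
from dataclasses import dataclass, field

# Blocks of requirements, named after the checklist categories in this article.
BLOCKS = [
    "General", "Disaster Recovery", "Deployment", "Operations",
    "Testing", "Resiliency", "Security", "Governance, Risk, and Compliance",
]


@dataclass
class Task:
    block: str         # block of production-readiness requirements to review
    author: str        # person driving the change to production
    reviewer: str      # ideally someone external to the team
    notes: str = ""    # link or summary in the shared documentation space
    done: bool = False


@dataclass
class Epic:
    title: str
    tasks: list[Task] = field(default_factory=list)

    def remaining(self):
        """Blocks whose asynchronous review is not finished yet."""
        return [task.block for task in self.tasks if not task.done]


def build_prr_epic(change, author, reviewer, blocks=BLOCKS):
    """Create one epic for the PRR with one task per selected block."""
    epic = Epic(title=f"PRR: {change}")
    for block in blocks:
        epic.tasks.append(Task(block, author, reviewer))
    return epic


# Select only the most important areas for this particular change.
epic = build_prr_epic("checkout-service v2", "alice", "bob",
                      blocks=["Operations", "Security"])
epic.tasks[0].done = True
print(epic.title, epic.remaining())
```

Passing a short `blocks` list reflects the advice to select only the most important areas for a particular change rather than reviewing every block each time.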
Review collaboratively and formalize action plans
Once the PRR preparation is ready, it is then time to review it with the remaining stakeholders impacted by the software changes to align everyone and use the collective intelligence to identify improvements and decide the next steps.
The good practice is to share the preparation document with all stakeholders ahead of the meeting, highlighting the key points they should look at. That way, the synchronous meeting can be more efficient, focusing only on the most relevant subjects.
The outputs of the session can be the following:
- The change can proceed, with full or gradual deployment
- The change cannot proceed due to blocking points to address.
It is also important to use the opportunity of reviewing software changes with the stakeholders to identify structural improvements for future changes. All improvements can be added to the product backlog with a specific tag, for instance for production or technical debt.
All decisions taken and action items must be documented to align all stakeholders on the outcome of the meeting and the next steps, and then shared across the organization to align on the level of quality expected and the existing gaps that are still to be resolved.
Production-Readiness Review within MAMOS
The practice of Production-Readiness Review is an efficient methodology to systematically review changes against expected operational requirements, and also foster a shared culture on software and business operations for all stakeholders.
Continuously reviewing changes supports a dynamic of continuous improvement where teams are regularly reminded of the need to include operational requirements upstream, to ensure they are met once the changes are delivered.
Teams applying Production-Readiness Reviews ensure a customer experience and business operations at the high standard of our digital ecosystem, supported by a continuous flow of Quality at Speed software.
The other methods that are part of MAMOS, such as Non-functional Requirements, Architecture Reviews, or Customer Feedback, are practices that together contribute to a Quality Engineering ecosystem for continuously delivering value, while supporting future practices like SRE.
When is your next PRR?
References
Thoughtworks, Definition of production readiness. Technology Radar.
Milan Plžík (2021), How we’re building a production readiness review process at Grafana Labs, Grafana.
Production Readiness in Depth: A Guide and Checklist, OpsLevel.
Production Readiness Review (PRR), DAU.
Infrastructure Production Readiness, GitLab.