Developing a Google SRE Culture

January 18, 2021

https://www.coursera.org/learn/developing-a-google-sre-culture

https://www.coursera.org/instructor/google-cloud-training

https://cloud.google.com/

Site Reliability Engineering (SRE)

Module 1: Welcome to Developing a Google SRE Culture

Module 2: DevOps, SRE, and Why They Exist

Module 3: SLOs with Consequences

Module 4: Make Tomorrow Better than Today

Module 5: Regulate Workload

Module 6: Apply SRE in Your Organization

M1: Welcome to Developing a Google SRE Culture (no content to share)

M2: DevOps, SRE, and Why They Exist:

Business and IT often lack shared communication, which can lead to IT burnout as teams develop new features while also fixing technical issues. Change how you think about, measure, and incentivize reliability, similar to a DevOps role.

Devs create code, work fast, and fail quickly.

Operators keep the system stable and work slower, focusing on reliability and consistency. The contention between the two groups was not sustainable, nor aligned with business.

DevOps: reduce organizational silos, accept failure as normal, implement gradual change, leverage tooling and automation, measure everything. It is a philosophy. Engineering teams had to do ops work, and thus the role was born.

SRE is both a practice and a role. You need a culture to maintain the technical aspects: share ownership to reduce organizational silos (priority, SLO, etc.), be blameless and accept failure as a normal state, reduce the cost of failure (implement gradual change), automate toil, measure toil and reliability.

M3: SLOs with Consequences:

Guide business decisions. SRE mission: protect, provide for, progress software and systems with consistent focus on availability, latency, performance, and capacity.

Be comfortable with failure, eliminate ambiguity with monitoring, establish and document processes.
Hold blameless postmortems with these components: details of the incident and its timeline, actions taken to mitigate or resolve the incident, the incident's impact, its trigger and root cause(s), follow-up actions to prevent its recurrence.

Low psychological safety robs the team of learning opportunities, which could spark a conversation or lead to a new idea. Frame work as a learning problem, not an execution problem. Acknowledge your own fallibility, model curiosity.

Lead time, deployment frequency, and time to restore improve in cultures where: bridging is encouraged, cooperation is high, messengers are not punished when delivering bad news, failure is treated as an opportunity for improvement, new ideas are welcomed.

Why people blame: hindsight bias is the tendency to overestimate one's ability to have predicted an unpredictable outcome (leads to blaming leadership); discomfort discharge is when people blame others to discharge discomfort and pain at a neurobiological level.

Assume good intentions, focus on systems and processes (not people), innovation requires some degree of risk taking.

Reduce organizational silos with SLOs and error budgets: the walls between business, development, and operations need to be knocked down to operate efficiently. Software engineering tends to focus on design and build, not operate and maintain, yet 40 to 90% of costs are incurred after launch. SREs help here and rely on error budgets and SLOs for shared responsibility.

Reliability (simple) is good time / total time = fraction of time the service is available and working.
Reliability (sophisticated) is good interactions / total interactions = fraction of real users who experience a service that is working and available.
Amount of unreliability you are willing to tolerate = error budget
Aiming for 100% uptime slows development; the error budget helps prioritize engineering work.
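
As a rough illustration of the arithmetic above, the sketch below computes reliability as good / total and the error budget implied by a target. The function names, the 30-day window, and the 20 minutes of downtime are assumptions made up for the example, not something from the course.

```python
def reliability(good: float, total: float) -> float:
    """Fraction of time (or interactions) that was good."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


def error_budget_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of unreliability tolerated in the window for a given target."""
    total_minutes = window_days * 24 * 60
    return (1.0 - target) * total_minutes


# Hypothetical 30-day window with 20 minutes of downtime.
total_minutes = 30 * 24 * 60                  # 43,200 minutes
downtime_minutes = 20
measured = reliability(total_minutes - downtime_minutes, total_minutes)
budget = error_budget_minutes(0.999)          # about 43.2 minutes
remaining = budget - downtime_minutes         # about 23.2 minutes left to spend
```

With a 99.9% target the budget is small but not zero, which is the point: the remaining minutes can be spent on releases and experiments.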

SLOs are precise numerical targets for system reliability. They are defined via SLIs (service level indicators): an SLI is a quantifiable measure of system health at any moment, expressed as the ratio of good events to valid events multiplied by 100, and mapped to user expectations. An SLO is the target for an SLI aggregated over time, set short of 100% (for example, 99.9%).

An SLA is a promise to the customer about system health.
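
A minimal sketch of the SLI ratio and an SLO check, assuming request counts are already available from monitoring. The function names and the numbers are hypothetical.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """SLI as the ratio of good events to valid events, multiplied by 100."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return good_events / valid_events * 100.0


def meets_slo(sli: float, slo_percent: float) -> bool:
    """True when the aggregated SLI is at or above the SLO target."""
    return sli >= slo_percent


# Hypothetical counts aggregated over the SLO window.
sli = sli_percent(good_events=999_500, valid_events=1_000_000)  # about 99.95
ok = meets_slo(sli, slo_percent=99.9)
```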

Unify vision, foster collaboration and communications, and share knowledge:
Create a unified vision: team vision statement, support the company vision, core values (response to others, commitment to goals, the way you spend your time, the way you operate as a team) [core values help build trust and psychological safety with each other, willingness to take risks, openness to learn and grow, inclusion and commitment] + purpose (why, life and work satisfaction, create stronger connections, reduce conflict) + mission (what the team wants to achieve) + strategy (single initiative, leveraged, requires change) [identify threats and opportunities, understand resources, capabilities and practices, consider strategies for addressing threats and opportunities, create alignment on communicating and coordinating work processes] + goals (OKRs to reach for a big goal, try new things, prioritize work, learn from success and failure)

Determine what collaboration looks like: high priority for SREs, common approaches to platforms, focus on problem-solving (ex. service-oriented meetings to: review state of service, weekly, with lead, compulsory, set agenda) [team comp: tech lead for direction, manager for perf management, project manager for comments on design docs and code]

Share knowledge among teams: cross-training (train the team to be flexible, help reduce cost, improve morale, reduce turnover, boost productivity, scheduling flexibility, increased job satisfaction), employee-to-employee network (g2g, volunteer teaching network, help peers learn new skills, create courses, mentor, design learning materials), job shadowing (expert knowledge and exposure to teammates, hands-on experience, opportunity to ask questions, intro to gradual change, cross-functional collaboration, understand nuances of the role, psychologically safe environment, pair team members to scale and retain knowledge).

Benefits of collaboration tech: real-time collaboration, open commenting system, email notifications

M4: Make Tomorrow Better than Today

Continuous integration, delivery and canarying: implementing gradual change reduces the cost of failure. CI is building, integrating, and testing code within the dev environment. CD is deploying to production, either frequently or on a business schedule.

Software process: code, build, integrate, test, release, deploy, operate. Agile is code to build, CI is code to test, CD is code to release/deploy, DevOps is code to operate.

Why CI/CD: helps overcome agile transformation challenges, minimizes code integration, reduces human error, promotes higher code quality, makes it easy to recover from errors, automates everything, provides visibility on project status/completion, shortens time to market, provides more metrics to review and act on.
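Canarying: large systems need canary tests, which deploy a change to a small subset of production for a limited time and compare it against the unchanged remainder (the control) to see how the change behaves before rolling it out fully.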

Canary requirements: canary population should be large enough to be representative subset of control, with the difference being the production change, the population should be small enough to not endanger the whole service if broken, canary should not be overly complicated for those who monitor it.
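
One way to picture the canary requirements is a comparison of canary and control error rates. This is only a sketch: the `Population` type, the `min_requests` floor, and the `max_extra_error_rate` tolerance are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Population:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_is_healthy(canary: Population, control: Population,
                      max_extra_error_rate: float = 0.001,
                      min_requests: int = 1000) -> bool:
    """Compare the canary against the control; the only intended difference
    between the two populations is the production change."""
    if canary.requests < min_requests:
        # Too little traffic to be a representative subset of the control.
        raise ValueError("canary population too small to be representative")
    return canary.error_rate - control.error_rate <= max_extra_error_rate


# Hypothetical traffic: the canary takes a small slice of production.
control = Population(requests=100_000, errors=100)
good_canary = Population(requests=5_000, errors=6)
bad_canary = Population(requests=5_000, errors=50)

promote = canary_is_healthy(good_canary, control)   # True
rollback = not canary_is_healthy(bad_canary, control)  # True
```

A single pass/fail comparison like this keeps the canary simple for whoever monitors it.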
Design thinking combines creativity and structure to solve complex problems: empathize, define, ideate, prototype, test. Without prototypes you have slow failures instead of fast ones, and thus fewer successes; prototyping allows more ideas to be tested.

Toil Automation: if a human operator needs to touch systems during normal operations, it is a bug. What counts as normal changes as you scale.

Toil: manual, repetitive, automatable, tactical, without enduring value, scales linearly as service grows
Toil leads to: career stagnation, low morale, confusion, slower progress, bad precedent, attrition, breach of faith.

Have thresholds for toil (20%-50%), project work (50%), automation (0%-30%).
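
A small sketch of checking tracked hours against such thresholds. The category names, the hours, and the 50% ceiling passed as a default are assumptions for the example.

```python
from typing import Dict


def work_shares(hours: Dict[str, float]) -> Dict[str, float]:
    """Percentage of total tracked time spent in each category."""
    total = sum(hours.values())
    if total <= 0:
        raise ValueError("no hours recorded")
    return {category: spent / total * 100.0 for category, spent in hours.items()}


def toil_over_threshold(hours: Dict[str, float],
                        max_toil_percent: float = 50.0) -> bool:
    """True when the toil share exceeds the chosen ceiling."""
    return work_shares(hours).get("toil", 0.0) > max_toil_percent


# Hypothetical week for one engineer (40 tracked hours).
week = {"toil": 12.0, "project": 20.0, "automation": 8.0}
shares = work_shares(week)          # toil ~30%, project ~50%, automation ~20%
needs_attention = toil_over_threshold(week)
```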

Automation: what and how, provides: consistency, a platform, quicker resolution, faster action, time saved.

Psychology of change and resistance to change:
Navigators: help you succeed (celebrate them as champions for change)
Critics: have passion and energy, valid fears (spend time with them, persuade them)
Victims: need to express emotions, take change personally (listen to and empathize with them)
Bystanders: difficult to understand, don't know what's going on, continue with normal routine (communicate with them, understand their feelings)

Brain emotional mapping and solutions:
Exclusion (anterior cingulate, physical pain) [involve people in change], realization (prefrontal cortex, deception, heightened anxiety) [set realistic expectations], problem-solving (rush of adrenaline, positivity) [identify opportunities for co-creation and provide coaching, not solutions], unfamiliar concepts (amygdala, anxiety, depression, fatigue, anger) [simplify messaging and focus on key concepts per user group], greater attention (adaptation) [ensure that communications are engaging and training is interactive], habits (comfort, hard-wired, basal ganglia) [allow people time to build new habits].

Emotional response to change: denial (unconscious incompetence), resistance (conscious incompetence), acceptance (conscious incompetence), exploration (conscious competence), commitment (unconscious competence), growth
Connect with people on three levels: head (rational-why), heart (emotional-them), feet (behavioural-support).

Handling resistance to change: are all leaders and managers modelling the new processes and behaviours, do people understand the reason for the change, do people care about the change being successful, do people have the knowledge and ability to be successful in your new world, are the right reinforcement and recognition programs in place.

Present change as an opportunity.

Prototyping: physical, paper and drawing, clickable, role play, video.

M5: Regulate Workload:

Measure everything by quantifying toil and reliability: IT and business can understand the current status of the service, IT can analyze the data and identify necessary actions to improve the status, IT can make better decisions and have impact across the organization.

Measure reliability, measure toil, monitor. Good metrics can easily have thresholds set because of low variance. Identify toil, select a unit of measure, and track toil measurements continuously; this triggers toil reduction efforts and empowers teams to think about toil. Monitoring gives visibility into the system: monitor symptoms rather than causes, and monitor error budget burn.
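
Error budget burn can be sketched as a burn rate: the observed error rate divided by the error rate the SLO allows. A value of 1.0 means the budget would be used up exactly at the end of the SLO window. The function name and the sample counts below are hypothetical.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate relative to the error rate the SLO allows."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


# Hypothetical last hour of traffic against a 99.9% SLO.
rate = burn_rate(bad_events=50, total_events=10_000, slo_target=0.999)  # about 5
burning_too_fast = rate > 1.0
```

This measures a symptom users actually feel (failed requests) rather than a cause such as CPU load.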

Four golden signals: latency, traffic, errors, saturation.

The org needs a culture of goal setting, transparency, and data-based decision-making.

Goal setting: who, what, KPIs for what to measure and how. A good OKR grade is 60-70%; OKRs are not a performance evaluation but rather reflect an individual's contributions and impact. Org OKRs are graded publicly, and frequent check-ins throughout the quarter help maintain progress.
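
A tiny sketch of OKR grading, assuming each key result is graded from 0.0 to 1.0 and the objective grade is the plain average of its key results. The averaging rule and the sample grades are assumptions for illustration.

```python
from statistics import mean
from typing import List


def objective_grade(key_result_grades: List[float]) -> float:
    """Average of key result grades, each between 0.0 and 1.0."""
    if not key_result_grades:
        raise ValueError("at least one key result is required")
    for grade in key_result_grades:
        if not 0.0 <= grade <= 1.0:
            raise ValueError("grades must be between 0.0 and 1.0")
    return mean(key_result_grades)


# Hypothetical quarter with three key results.
grade = objective_grade([0.5, 0.7, 0.75])   # about 0.65
in_good_range = 0.6 <= grade <= 0.7         # the 60-70% range noted above
```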

Transparency: share monitoring tools, share communications and feedback loops

Data-driven decision making: avoid affinity bias, confirmation bias, labeling bias, and selective attention bias. Remove bias by questioning first impressions, justifying decisions, and making decisions collectively.

M6: Apply SRE in Your Organization

Org maturity for SRE: Low (no adopted SRE principles, practices, or culture), High (well-established SRE team or widely embraced principles, practices and culture) [well documented and user-centric SLOs, error budgets, blameless postmortem culture, low tolerance for toil]. Needs technical and cultural buy-in. DORA DevOps Quick Check tool (https://www.devops-research.com/quickcheck.html)

Who to hire: engineers with SRE experience, systems administrators with operations and scripting experience.
What skills to train and hire: operations and software engineering, monitoring systems, production automation, system architecture, troubleshooting, culture of trust, incident management.

SRE team implementations:

Everything SRE team: its scope is unbounded, it is a good starting point for a first SRE team, it is recommended for orgs with few applications and user journeys, useful when a dedicated SRE team is needed.

Benefits of everything SRE: no coverage gaps, easy to spot patterns and similarities between services and projects, acts as glue between teams

Disadvantages of everything SRE: usually lacks a team charter, risks overloading the team, it can run the risk of shallow contributions, team issues can have negative impact on the business

Infrastructure team: helps make other teams' jobs easier, maintains shared services related to infra, recommended for orgs with multiple developer teams, defines common standards for the IT team

Benefits of infrastructure: allows developers to use DevOps practices without divergence across business, keeps its focus on highly reliable infrastructure, defines production standards

Disadvantages of infrastructure: possible negative impact to business following team issues, improvements the team makes may not be tied to customer experience, may require teams to be split which can lead to duplication or divergence of practices

Tools team: it focuses on building software to help devs with aspects of SRE work, recommended for orgs that need highly specialized reliability-related tooling

Benefits of tools: allows devs to use DevOps practices without divergence across business, keeps its focus on highly specialized reliability-related tooling, defines production standards

Disadvantages of tools: could unintentionally turn into an infrastructure team, risk of increased toil and overall workload on the team

Product/Application team: improves the reliability of a critical application, recommended for organizations that already have an everything, infrastructure, or tools SRE team and have applications with high reliability needs

Benefits of Product/applications: clear focus, creates clear link between business priorities and team effort expenditure

Disadvantages of Product/Application: it may require establishing new teams as the business and complexity grow, can lead to duplication of infrastructure and divergence of practices

Embedded team: SREs are embedded with developers, SREs and developers have a project or time-bounded relationship, hands-on changing of code and configuration of services, recommended for organizations starting a team or scaling another implementation, can augment the impact of tools or infrastructure teams

Benefits of Embedded: focused expertise directed to specific problems or teams, allows side-by-side demonstration of SRE practices

Disadvantages of Embedded: can cause a lack of standardization between teams, can lead to divergence in practice, leaves less time for mentoring

Consulting team: similar to an embedded team, less hands-on, SREs may write code and maintain tools for themselves and developers, not recommended until organizational complexity is large, Google recommends staffing one to two part-time consulting SREs before the first SRE team

Benefits of Consulting: can help with scaling an existing SRE team's positive impact, decoupled from directly changing code and configuration

Disadvantages of Consulting: may lack sufficient context to offer useful advice, can be perceived as hands-off.

You can request an SRE consultation with the professional services team.
