[The making of a thousand-person platform engineering organization, Part 3] After business decisions and engineering innovations drove a surge in online shopping week results, Zalando focused on optimizing the developer experience and embraced SRE once again

When a new feature launches just a few weeks before online shopping week, how can a company make sure it withstands the Black Friday crowds and keeps customers' trust in the website? That year Zalando took a bold approach: on an ordinary weekday, it recreated the explosive load of online shopping week ahead of time in the production environment to test the new feature.
As Zalando's technology development matured, technology decisions were no longer as unconstrained as when Radical Agility was first introduced. Instead, the senior engineering community took the lead and developed the Tech Radar to help hundreds of teams make technology decisions. Every team was required to consult this shared list of recommended technologies when choosing technologies for a new project, so teams did not have to run a technology evaluation from scratch each time. Because all teams chose from the same Tech Radar, Zalando could keep the technologies used across projects within this shared list and achieve a company-wide technology focus.

Zalando's technology development enters a mature stage, relying on the Tech Radar to focus the technical decisions of hundreds of teams

From 2009 to 2019, Zalando's organization went through many changes, and its technology grew into a large-scale distributed microservice architecture. According to figures Zalando presented at DevOpsCon Berlin in 2022, the number of microservices had reached four to five thousand by 2019. By then, Zalando's technology development had matured. Technology decisions were no longer as free as when Radical Agility was first promoted. Instead, a senior engineering community led them and developed a Tech Radar to help hundreds of teams make technical decisions. The design of this Tech Radar follows the practice of the consulting firm ThoughtWorks, but it has evolved into Zalando's own version.
The consulting firm sorts nearly XNUMX technical items across four categories (techniques, tools, platforms, and languages and frameworks) into four levels according to how strongly their adoption is recommended, and plots them on a circular radar chart divided into four quadrants. Rings on the chart represent the recommendation levels: the closer a ring is to the center, the more strongly the technology is recommended.

Zalando took stock of its own needs and focused on technologies related to software development, in four categories: data storage, data management, infrastructure, and development languages. Recommendations are divided into four levels that form four rings: Adopt (recommended for adoption), Trial (recommended for trial use), Assess (under assessment), and Hold (retained but not recommended).

Trial technologies are those that have already been used successfully in internal projects, at least to solve real problems rather than simulated ones, and breadth of adoption matters. Only technologies that senior management is willing to invest in for the long term are placed at this level. Assess technologies are those with clear potential value that are worth investing in. By automatically analyzing data on trial projects across all products, Zalando identifies technologies that have been tried out and deserve to move into the Trial ring. The last level, Hold, covers technologies that are not recommended but will continue to be maintained. They may not be used in new projects, and expanding their use in services is discouraged as well. The scope in which these technologies are used will be gradually reduced.

Each technology also comes with a technical description document listing its advantages, disadvantages, limitations, usage, and the lessons learned from using it, and all of these documents are collected into a technical knowledge base. Zalando also compiles adoption templates and guidelines for the recommended technologies on the Tech Radar. The guidelines explain common problems encountered in use, present use cases from teams that have already adopted the technology, and even compare alternative technologies.

Periodically, to adjust the recommendation levels, the chief engineers collect real usage data for each technology on the current Tech Radar, including usage volume, incident records, and adoption experience (for example, how many years the technology has been in use at Zalando), and then score each technology. The chief engineer designated as maintainer first creates a spreadsheet of the new technology scores, then opens it to the chief engineer community to vote on whether each technology should be "upgraded" or "downgraded".

Zalando requires every team to consult this shared technology list when selecting technologies for a new project. Teams no longer need to run a technical evaluation from scratch each time a project starts, because engineers choose directly from the list's recommendations. Since every team selects from the same Tech Radar, Zalando can ensure that the technologies used across projects stay within this shared list and keep its technical direction focused.
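The ring model can be pictured as a small lookup structure. The following is only a minimal sketch in Python, not Zalando's actual tooling: the entry names (`ExampleDB`, `NewStreamTool`, `LegacyQueue`) and the helper `can_use_in_new_project` are hypothetical, and the only rules encoded are the ones stated in the article, namely that teams choose from the shared list and that Hold technologies may not be used in new projects.

```python
from dataclasses import dataclass
from enum import Enum


class Ring(Enum):
    """The four recommendation levels, from most to least recommended."""
    ADOPT = 1
    TRIAL = 2
    ASSESS = 3
    HOLD = 4


@dataclass(frozen=True)
class RadarEntry:
    name: str
    quadrant: str  # one of the four categories, e.g. "data storage"
    ring: Ring


# Hypothetical entries, for illustration only.
EXAMPLE_RADAR = {
    entry.name: entry
    for entry in [
        RadarEntry("ExampleDB", "data storage", Ring.ADOPT),
        RadarEntry("NewStreamTool", "data management", Ring.ASSESS),
        RadarEntry("LegacyQueue", "infrastructure", Ring.HOLD),
    ]
}


def can_use_in_new_project(radar, name):
    """Return True if `name` is on the shared list and not in the Hold ring."""
    entry = radar.get(name)
    if entry is None:
        # Not on the shared list at all, so outside the agreed scope.
        return False
    return entry.ring is not Ring.HOLD
```

A team starting a new project could call `can_use_in_new_project(EXAMPLE_RADAR, "LegacyQueue")` and get `False`, while the existing services that already depend on that technology are unaffected by the check.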

Zalando renamed its former digital infrastructure department to the construction department (Build), which remains responsible for building and improving the developer platform in order to serve developers. The construction department began to study the developer's customer journey, that is, the developer's day-to-day work, and found that the development platforms developers used were quite scattered. Each team communicated with its members in its own way, and the company lacked a common language for communication.

Solve fragmented developer workflows by building a developer portal

To solve the problem of fragmented developer workflows, the construction department created the developer portal Sunrise (Sunrise Platform), intended to be the first website developers open when they start work each day. Its users include software engineers, data engineers, technical directors, data scientists, project managers, designers, and others.

The construction department built on Backstage, Spotify's open-source developer portal project, and integrated many of Zalando's internal technical tools, development components, implementation templates, and technical documents to design this self-service Internal Developer Platform for internal use. The interface is as smooth to operate as commercial enterprise collaboration platforms, and the UX details focus on guiding developers to serve themselves. Developers can even see common monitoring data for the applications they are responsible for directly on the Sunrise platform.

The first page developers see when they open the Sunrise Platform gathers the information they use most. They can easily search for the applications they are responsible for and for commonly used APIs, and quickly see who the dedicated owner of each application or API is. If necessary, they can submit a request (ticket) for help directly on this page, without having to apply in a separate system as in the past. The homepage also brings together all event information for the applications a developer is responsible for, as well as the reference documents the developer subscribes to.

Engineers or other users can check the progress or status of each stage of the product life cycle, monitor it in real time, and collaborate with teams and other individuals to troubleshoot issues in the CI/CD process. Zalando team members can even bootstrap and deploy new applications using Sunrise.

Zalando has publicly shared several keys to building this convenient, easy-to-use internal developer platform.

One is that they modified the K8s source code directly to solve problems, turning K8s into a system they control for building their own cloud native platform. For example, the Sunrise platform uses a self-developed, customized kubectl wrapper.

When an emergency occurs and someone needs to quickly obtain temporary access to a K8s cluster, this wrapper comes in handy. There is no need to go through the original standard procedure, which cuts deployment time considerably.

Another key is that Zalando quantifies the "developer experience", which means measuring the effect of the development platform on developer experience and productivity.

Zalando drew on the book "Accelerate: The Science of Lean Software and DevOps" (published in Taiwan in Chinese as "The Science Behind Lean Software & DevOps") to define four developer performance metrics.

They are lead time, release frequency, mean time to recovery (Time to Restore Service), and change failure rate (Change Fail Rate). These are the same four metrics used by DORA, the well-known measure of DevOps performance.

However, Zalando measures the four metrics in slightly different ways. Lead time runs from commit to go-live in the production environment. Release frequency is the number of deployments per developer per week. Mean time to recovery is counted from when the incident occurs to when service is restored (not from when the service crashes). Finally, change failure rate is the proportion of all deployments that result in a failure.
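To make these four definitions concrete, here is a minimal Python sketch of how they could be computed from deployment and incident records. The record types `Deployment` and `Incident` and all field names are assumptions made for illustration, and averaging lead time with a simple mean is also an assumption. The article does not describe Zalando's actual data model or aggregation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Deployment:
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when it went live in production
    failed: bool            # whether this deployment resulted in a failure


@dataclass
class Incident:
    started_at: datetime   # when the incident began (not when a service crashed)
    restored_at: datetime  # when service was restored


def lead_time(deployments):
    """Mean time from commit to go-live in production."""
    seconds = mean(
        (d.deployed_at - d.committed_at).total_seconds() for d in deployments
    )
    return timedelta(seconds=seconds)


def release_frequency(deployments, developers, weeks):
    """Deployments per developer per week."""
    return len(deployments) / developers / weeks


def mean_time_to_recovery(incidents):
    """Mean time from incident start to service restoration."""
    seconds = mean(
        (i.restored_at - i.started_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)


def change_failure_rate(deployments):
    """Share of all deployments that resulted in a failure."""
    return sum(d.failed for d in deployments) / len(deployments)
```

Each function expects a non-empty list. A real pipeline would also have to decide how to attribute a failure to a specific deployment, which this sketch leaves to whoever sets the `failed` flag.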

The biggest benefit of the Sunrise developer platform is that it keeps all developers on the same track. It is also flexible enough to meet the needs of different departments with different divisions of work. Finally, this single platform integrates Zalando's Tech Radar and all the practical technical experience it references, the verification team's test documents, and even templates for mature practices and processes. Through this one platform, Zalando can focus attention and recommend to development teams the technologies it particularly wants them to adopt.

The design goal of Zalando's Sunrise website is to "make developers happy and productive!" It aims to provide the best developer experience and to reduce the cognitive load on technical and development teams as much as possible, in order to increase development speed and productivity. Henning Jacobs, Zalando's senior chief engineer, emphasized this at last year's Platform Engineering Conference, where Zalando disclosed the development of the Sunrise Platform for the first time.

To solve the problem of fragmented developer workflows, Zalando's construction department created a developer portal called Sunrise, intended to be the first website developers open when they start work each day. / Zalando

Sunrise is built on Backstage, Spotify's open-source developer portal project, and integrates many of Zalando's internal technical tools, development components, implementation templates, and technical documents into a self-service Internal Developer Platform. On the Sunrise platform, Zalando developers can find information on the tools and services created by different departments and product teams across the company, and can get all support services in one place. / Zalando

 

Zalando developers can quickly view and manage the progress of their product projects on the Sunrise platform. / Zalando

Actively embrace SRE again and even set up a dedicated SRE department

Meanwhile, as mentioned earlier in the discussion of online shopping week, Zalando had once again set up an SRE support team. In 2019 it went further and established a dedicated SRE department, made up of a logging team, a metrics tracking team, an incident response team, and an onboarding coaching team. This structure lets the whole group focus on the same vision and goals through a shared set of KPIs.

Andrew Howden pointed out: "The goal of the SRE department is to establish an operating model for critical business operations, focusing on customer experience and solving cross-department alignment issues." He has been involved in Zalando's SRE development over the past four years.

Critical business operations are a kind of service level objective (SLO) focused on customer experience. By measuring how customers interact with the website, the perspectives of developers, managers, and customers can be brought into the same set of data, and that data can be used to improve reliability.

Establish an embedded SRE team to solve specific maintenance and operation problems

A dedicated SRE department was not enough. Zalando also set up a new SRE team, called Embedded SRE, to tackle the particular challenges of the checkout process. For example, frenzied buyers would suddenly target specific products and buy them in large quantities, causing system problems. Checkout problems of this kind involve communication and collaboration among more than a dozen applications, 4 or 5 departments, and hundreds of engineers. Andrew Howden led this team and its 2 engineers.

Andrew Howden first analyzed which related product systems were affected behind each kind of checkout exception, then found solutions one by one. One problem he dealt with was a flood of requests that overloaded a system until it stopped responding, which prompted the cluster management software to restart it automatically and ended up bringing the entire system down.

Because the checkout system is a large-scale distributed microservice architecture, it was designed with the circuit breaker pattern to avoid repeatedly calling a service that has already failed. However, the circuit breakers were tuned too sensitively: when one system failed, it began to affect the error-rate calculations of circuit breakers in other systems, producing cascading effects.
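For readers unfamiliar with the pattern, the following Python sketch shows an error-rate-based circuit breaker in its simplest form. It is not Zalando's implementation. The class name, the parameters `error_rate_threshold`, `min_calls`, and `cooldown_seconds`, and the simplified behavior (no half-open state, counters reset only after a cooldown) are all assumptions made for illustration. The sensitivity issue described above corresponds to choosing a threshold or a minimum call count that is too low.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    def __init__(self, error_rate_threshold=0.5, min_calls=10,
                 cooldown_seconds=30.0, clock=time.monotonic):
        self.error_rate_threshold = error_rate_threshold
        self.min_calls = min_calls  # do not judge the error rate on too few calls
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock          # injectable so tests can control time
        self.calls = 0
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("circuit open, skipping call")
            # Cooldown elapsed: close the circuit and start counting afresh.
            self._reset()
        self.calls += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            self.failures += 1
            self._maybe_open()
            raise

    def _maybe_open(self):
        error_rate = self.failures / self.calls
        if self.calls >= self.min_calls and error_rate >= self.error_rate_threshold:
            self.opened_at = self.clock()

    def _reset(self):
        self.calls = 0
        self.failures = 0
        self.opened_at = None
```

With a very low `min_calls` or `error_rate_threshold`, a handful of errors caused by one failing dependency is enough to open the breaker, and the rejected calls then show up as errors in the callers' own breakers. That is one way the cascading effect described above can arise.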

Another problem was that, to ensure reliability, the checkout system had many autoscaling mechanisms: as soon as responses to customers' checkout requests slowed down, it scaled out automatically. This caused cloud costs to rise sharply. It later turned out that a small number of customers were generating a very large number of requests through their shopping behavior, so responses were slow only for this small group, not for all customers. By capping the number of requests per customer at a level that covers 99.9% of ordinary customers, the impact of a specific customer's extreme behavior on the autoscaling mechanism could be reduced.
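The capping idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Zalando's mechanism: the function `request_cap`, the class `PerCustomerLimiter`, and the notion of counting requests per fixed time window are all hypothetical. The cap is chosen as the smallest value that at least 99.9% of observed customers stay at or below.

```python
import math
from collections import Counter


def request_cap(requests_per_customer, coverage=0.999):
    """Smallest cap such that at least `coverage` of customers are at or below it."""
    counts = sorted(requests_per_customer)
    if not counts:
        raise ValueError("need at least one customer")
    index = math.ceil(coverage * len(counts)) - 1
    return counts[index]


class PerCustomerLimiter:
    """Counts requests per customer within one window and enforces the cap."""

    def __init__(self, cap):
        self.cap = cap
        self.seen = Counter()

    def allow(self, customer_id):
        if self.seen[customer_id] >= self.cap:
            return False  # this customer has used up the window's budget
        self.seen[customer_id] += 1
        return True

    def reset(self):
        """Call at the start of each new window (assumed to be scheduled elsewhere)."""
        self.seen.clear()
```

Requests rejected by the limiter never reach the checkout services, so the latency signal that drives autoscaling reflects ordinary customers rather than the few who generate extreme traffic.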

Integrate experience in solving maintenance problems into daily maintenance

Solving a problem usually took only 3 weeks, but transferring the experience of handling it to the platform team and the various product teams involved took 3 months. The Embedded SRE team's final challenge was therefore how to turn the experience of solving these maintenance problems into part of daily maintenance.

Zalando holds a Weekly Operational Review Meeting (WORM). The chief engineer community uses this meeting to go over post-incident analysis reports and review maintenance issues. However, the quality of these reports varied greatly, and engineers spent a lot of effort preparing them.

The Embedded SRE team helped automate the production of these analysis reports and even added recommendations based on SRE practices. The reports can be sent automatically to this team, and also to the engineering management team for the weekly review.

In mid-2023, the Embedded SRE team finished resolving the issues it had been set up to solve, and the team was wound down. Andrew Howden also ended his journey at Zalando in August and became a consultant providing SRE training.

However, Zalando platform engineering has not stopped the pace of change and is still evolving.
