Platform EngineerとしてのPractice(2020年)

少し前にキャリアに関してインタビューを受けたが,振り返ってみると前職のCloud Foundryによる社内PaaSから現職におけるKubernetesによるMicroservices platformまでかれこれ5年にわたってPlatformの開発と運用に携わってきたことになる.

5年もやってるとPlatform engineer(Platfomer)として,特定の技術とはある程度中立的なところで習慣として当たり前にやっていることや日々の意思決定の基礎になる考え方みたいなものが出てくる.見ればわかるようにこれは何か特別な考え方とかではなくてどっかで見たことや聞いたことがあるものがほとんどだと思う.自分の中でゼロから生まれたものではなくて現在のチームと働いた経験や日々のインプット,これまでの失敗の総体でしかない.

現在自分がリードしているチームにおいて,今後チームがどんどん拡大していったとしても皆が同じ方向を向き,同じ意思決定をできるようにと,自分の考え方やチームでやってきたことをチームのPracticeとして明文化するということ1年くらい前にした(本当はもう少し抽象度を上げてPrincipleとして変わらないものにしたいというフィードバックをもらっていたが今の自分の能力ではまだそこまでやれてない…)

Practiceなので5-10年使えるものではなくて主流の技術や世の中の流れによって変わるべきものだが,2020年におけるスナップショットとしてここに残しておく(もともと英語で書いていたのでそのままで).

  • Immutable infrastructure: To provide a better product to customers faster, Continuous Delivery is one of the most important factors. A big part of making Continuous Delivery practice successful is to be able to push new changes without fear. The immutable infrastructure allows us to do deterministic deployment (safely rollouts and rollbacks). For successful continuous delivery and reliable delivery, all our components must be immutable. We don’t SSH to modify things in production. This practice must be also applied to configuration management or application source code as well.
  • Environmental parity: We must keep the development and production of environmental parity as much as possible. We must enable developers (even us) to reproduce production issues in the development environment. Always test on development before production and grow confidence in production. And we must treat the development environment as production. If we break the development environment, it blocks the developer in testing and stalls agility.
  • Declarative configuration: All infrastructure state or application state must be defined in “infrastructure as code” way. We should not do or depend on manual operations in production (and even in development). It leads to implicit knowledge and human errors. Infrastructure as code allows us to review the operation and automate it in CI. And the code must be declarative, not imperative. The declarative way enables us to build self-healing or autonomous components. It reduces our operation cost a lot. We should build tools based on “Infrastructure as data” way. We should not mix programmatic logic and data but clearly separate. The computation must be done by real programing language.
  • Documentation as a product: Documentation is the minimum variable product of the platform. We must handle documentation as one of our products. This means the documentation must be always reviewed and maintained when something changes. When we release something new, we must provide the documentation together.
  • Simple, consistent, and predictable: Developer experience (DX) we provide and operation workflows we prepare for our component should be simple, consistent, and predictable. How to work with component should be as simple to learn as possible and consistent across the multiple components (action to component A should be the same as component B). And action to the components should be always predictable. This means changes are always expected and visible before execution. The catastrophic changes should be always blocked or notified.
  • Smart default but configurable: The tool we provide should be well abstracted and hide complex configuration (and underline complex infrastructure) as possible. Instead of asking developers to configure all but provide smart default which covers 80% of usage. To cover the rest of 20% of power users, open well designed control knob as well. We should consider both productivity and flexibility but the former should come first.
  • Embrace failure: No software or infrastructure is perfect. Failures are always there. We should be embracing failures instead of focusing on preventing them. We must expect potential failures and actively be testing availability and reliability before going to production. Once we faced an incident, we must learn a lot from it and avoid the same failure again.
  • Having migration strategy, always: We don’t introduce new features without thinking about migration strategy from existing ecosystem. Keeping multiple way of doing the same things leads bad DX and becomes burden for the operation (and even more we need to have workaround for legacy way to make it work with new ecosystem). We don’t evaluate just introducing new features. Instead we evaluate how much it used by developers (adaption rate). We don’t create a seprated new platform for doing new thing. There should be only one base platform.
  • A platform for platforms: We can not support all workloads or requirements by a single platform. We must enable and help other teams to build their own specialized platform to support their own workflow top on our platform without involving the platform team.
  • Operation by science, not by art: As a platform, we need to operate and maintain our components. The operation workflow or tool we design should be strongly backed by science and engineering, not art or individual knowledge or experience. We must handle it as a science. And we should avoid using all the time to react to the incident or manual operations. We must proactively be fixing the issue and the workflow and building tools. The time for proactive and reactive should be balanced well.
  • Avoiding reinventing the wheel: Cloud providers provide great services (by using lots of costs and resources). The OSS ecosystem (especially cloud-native ecosystem) provides great software. We must utilize them first. We must be a good curator of them and utilize it for our platform. We must avoid reinventing someone else work (when reinventing we must balance the cost and benefit and resources we have). We must build something new, something special for our use cases, and something no one has worked before. And we must open to our jobs to the ecosystem as well.
  • Standing on the shoulders of giants: As a microservices platform or microservices world, we are far behind from the giants. We must learn aggressively learn from what they succeeded or failed and the latest trends or research results (from conferences, papers or meeting them). Our proposal and design must be based on these learnings. The originality must be on top of the existing knowledge. Then becoming the next giant and provide our knowledge to the world to support them.