AKS GA? Is it the right time to start?

TL;DR: Wait 1-2 more months. It is life on the bleeding edge, and some bugs are still lurking. If you want to start now, it is absolutely possible: for the last 2 months, we at SMACC.io have been running our production (API, SaaS, and training of Machine Learning models) fully on AKS. How was it? I got a few white hairs.

Consider yourself warned about:

  1. From time to time, a tunnelfront pod dies. When it breaks, you can neither open a console nor get the logs with kubectl. I no longer have this issue (after restarting nodes and recreating some of them), but new users still report it.
  2. If your pod uses a Persistent Volume and gets scheduled on a different node, you might wait up to 6-7 minutes before the volume is detached from the previous node and attached to the new one.
  3. Every node in Azure has a limit on the number of attached volumes, and it is very easy to hit this limit when you reduce the size of your cluster. So, it might happen that, e.g., your database pods wait indefinitely for volumes.
  4. AKS does not support adding node pools with different VM types to your Kubernetes cluster. So, you cannot optimize costs by using different node pools for different parts of your platform.
  5. Support :D. Often you are on your own. I do not blame the guys; it is a new product. I used to work in a startup delivering IaaS to clients, and it is freaking hard.
  6. You might (I hope not anymore) learn from the support that you should restart the cluster or recreate a node to solve an issue :/. [Update 4.07] Some configuration fixes are applied when you resize your cluster.
  7. A little annoyance: kubectl delete statefulsets sometimes does not work without --cascade=false. So, you might need to clean up pods by hand later.
  8. … the best place to find help is https://github.com/Azure/AKS/issues. I love the open approach of the Azure AKS team!
  9. [Update 5.07] Watch out for the memory-preserving updates: your node might vanish for 30 seconds without any warning or advance notice. If you have taken on tech debt (e.g., as in my case only today: MySQL without a persistent volume, Vault not HA), you might lose your data. There is a way to learn about these updates 15 minutes in advance; see this (very Microsoft-style) video from Channel 9.
    Other cloud providers spoiled us, so we forgot what the cloud is and that VMs might vanish at any moment.
  10. [Update 5.07] Currently, the AKS team pushes changes to your nodes without notifying you, so you might wake up to, e.g., RBAC activated on your k8s cluster or new tunnelfront pods.
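
For the memory-preserving updates from item 9, here is a minimal sketch of how a node could ask Azure for upcoming maintenance. It assumes that the advance notice comes from the Azure Scheduled Events metadata endpoint (my assumption; the post itself only points to a video) and that memory-preserving updates show up there with the event type "Freeze". The api-version value and both function names are illustrative choices, so check them against the current Azure documentation before relying on this.

```python
import json
import urllib.request

# Azure Instance Metadata Service endpoint for Scheduled Events.
# Assumption: this api-version is accepted; verify it in the Azure docs.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)


def fetch_scheduled_events(url=SCHEDULED_EVENTS_URL, timeout=5):
    """Fetch the Scheduled Events document. Only reachable from inside an Azure VM."""
    request = urllib.request.Request(url, headers={"Metadata": "true"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def pending_events(document, event_types=("Freeze", "Reboot", "Redeploy")):
    """Return (event_id, event_type, not_before) for events of the given types."""
    return [
        (event.get("EventId"), event.get("EventType"), event.get("NotBefore"))
        for event in document.get("Events", [])
        if event.get("EventType") in event_types
    ]
```

On a node, something like `pending_events(fetch_scheduled_events())` polled every minute or so could trigger a drain or a database flush before the node pauses. This is only a sketch; a real setup would likely run it as a DaemonSet and needs error handling for the HTTP call.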

If you have more detailed questions, feel free to contact me. We are fully on AKS, so it is feasible. My recommendation is simply to skip the first bumpy GA months. The AKS team gains experience with time and with the growing number of clients, so the service improves fast.
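
If you do start now, the volume limit from item 3 is easy to check with back-of-the-envelope arithmetic before you shrink a cluster. The sketch below is only that arithmetic; the function names are hypothetical, and the per-node disk limit is something you have to look up for your VM size (the value 4 in the usage note is just an example, not a real limit for any particular size).

```python
import math


def min_nodes_for_disks(attached_disks, max_disks_per_node):
    """Lower bound on the node count needed to attach all disks."""
    if max_disks_per_node <= 0:
        raise ValueError("max_disks_per_node must be positive")
    return math.ceil(attached_disks / max_disks_per_node)


def can_shrink_to(target_nodes, attached_disks, max_disks_per_node):
    """True if the target node count can, in the best case, hold all disks."""
    return target_nodes >= min_nodes_for_disks(attached_disks, max_disks_per_node)
```

With 10 attached volumes and an assumed limit of 4 disks per node, `can_shrink_to(3, 10, 4)` is true and `can_shrink_to(2, 10, 4)` is false. Note that this is a best-case bound: the scheduler does not pack pods by disk count, so in practice you want some headroom on top of it.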

Good luck! 🙂

ps. People asked how we set up our AKS:

  • Terraform for setting up everything around our AKS cluster for dev, staging, and production, e.g., DNS.
  • az aks for creating Kubernetes clusters for all our environments.
  • GitHub + Travis CI for Continuous Deployment.
  • Plain Kubernetes configuration files for our services and for most other components, e.g., Traefik. For a few components, we use operators, e.g., for Elasticsearch.

pps. We are hiring Golang Developers and System Engineers. Help us further develop our FinTech API and SaaS for automating finance processes with Machine Learning: https://short.sg/j/1566212

 

wb