1.3.3 Cluster Autoscaler

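The Kubernetes Cluster Autoscaler does the following:

- Adds Worker Nodes to a node pool when Pods that declare resource requests cannot be scheduled because there are not enough resources to allocate to them.
- Removes Worker Nodes from a node pool when they have been underutilized for an extended period and their Pods can be placed on other nodes.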

The Kubernetes Cluster Autoscaler can be installed manually or as a Cluster Add-On. This guide uses the manual installation.

Step 1: Set up an Instance Principal so that the Cluster Autoscaler can access the node pool

Grant the Cluster Autoscaler permission to manage the OCI resources it needs. You can use either an Instance Principal or a Workload Identity Principal. For convenience, this guide uses an Instance Principal, which is also available on Basic Clusters.

Log in to the OCI console.
From the hamburger menu at the top left, go to Identity & Security > Identity > Compartments.
Note the OCID of the Compartment that contains the OKE cluster.
Go to the Dynamic Group menu on the left and create a Dynamic Group with the rule below.
- Name: for example, oke-cluster-autoscaler-dyn-grp
```
instance.compartment.id = '<compartment-ocid>'
```
From the hamburger menu at the top left, go to Identity & Security > Identity > Policies.

Create a Policy with the rules below.

- Name: for example, oke-cluster-autoscaler-dyn-grp-policy
- dynamic-group-name: the name of the dynamic group created earlier, for example, oke-cluster-autoscaler-dyn-grp
- compartment-name: the name of the compartment where the target OKE Cluster is located

```
Allow dynamic-group <dynamic-group-name> to manage cluster-node-pools in compartment <compartment-name>
Allow dynamic-group <dynamic-group-name> to manage instance-family in compartment <compartment-name>
Allow dynamic-group <dynamic-group-name> to use subnets in compartment <compartment-name>
Allow dynamic-group <dynamic-group-name> to read virtual-network-family in compartment <compartment-name>
Allow dynamic-group <dynamic-group-name> to use vnics in compartment <compartment-name>
Allow dynamic-group <dynamic-group-name> to inspect compartments in compartment <compartment-name>
```
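
The six statements differ only in the dynamic group and compartment names, so they can be generated instead of typed by hand. The following is a minimal sketch: `print_ca_policies` is a hypothetical helper name, not part of any OCI tool, and its output is meant to be pasted into the policy editor in the OCI console.

```shell
# print_ca_policies is a hypothetical helper, not part of any OCI tool.
# $1: dynamic group name, $2: compartment name
print_ca_policies() {
  for rule in \
    "manage cluster-node-pools" \
    "manage instance-family" \
    "use subnets" \
    "read virtual-network-family" \
    "use vnics" \
    "inspect compartments"
  do
    printf 'Allow dynamic-group %s to %s in compartment %s\n' "$1" "$rule" "$2"
  done
}
```

For example, `print_ca_policies oke-cluster-autoscaler-dyn-grp <compartment-name>` prints the six statements with both names filled in.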

Step 2: Edit the Cluster Autoscaler configuration file

Download the sample configuration file.

```
wget https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/oci/examples/oci-nodepool-cluster-autoscaler-w-principals.yaml -O cluster-autoscaler.yaml
```

Edit the configuration file.

```
...
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  ...
  template:
    metadata:
      ...
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: iad.ocir.io/oracle/oci-cluster-autoscaler:{{ image tag }}
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=oci
            - --max-node-provision-time=25m
            - --nodes=1:5:{{ node pool ocid 1 }}
            - --nodes=1:5:{{ node pool ocid 2 }}
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --unremovable-node-recheck-timeout=5m
            - --balance-similar-node-groups
            - --balancing-ignore-label=displayName
            - --balancing-ignore-label=hostname
            - --balancing-ignore-label=internal_addr
            - --balancing-ignore-label=oci.oraclecloud.com/fault-domain
          imagePullPolicy: "Always"
          env:
          - name: OKE_USE_INSTANCE_PRINCIPAL
            value: "true"
          - name: OCI_SDK_APPEND_USER_AGENT
            value: "oci-oke-cluster-autoscaler"
```

- --cloud-provider=oci: change this value to oci-oke if the OKE cluster version is 1.26, 1.25, or 1.24 rather than 1.27.
```
- --cloud-provider=oci-oke
```
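
The version rule above can be written as a small lookup. This is only a sketch: `cloud_provider_for` is a hypothetical helper name, and it encodes just the versions this guide mentions (1.24 through 1.27). For any other version it reports an error rather than guessing, so check the Oracle documentation in that case.

```shell
# cloud_provider_for is a hypothetical helper. It encodes only the versions
# this guide mentions; anything else is reported as unknown.
# $1: Kubernetes version such as 1.26.7 or v1.26.7
cloud_provider_for() {
  case "${1#v}" in
    1.24|1.24.*|1.25|1.25.*|1.26|1.26.*) echo "oci-oke" ;;
    1.27|1.27.*) echo "oci" ;;
    *) echo "unknown version: $1" >&2; return 1 ;;
  esac
}
```

For the cluster used later in this guide, whose nodes report v1.26.7, `cloud_provider_for v1.26.7` prints `oci-oke`.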

- image: iad.ocir.io/oracle/oci-cluster-autoscaler:{{ image tag }}

Use the image from whichever of the Frankfurt, London, Ashburn, or Phoenix regions is closest to you, or push the image to your own tenancy's OCIR and use it from there. This guide uses the image in the Ashburn region.

| Image Location | Kubernetes Version | Image Path |
| --- | --- | --- |
| US East (Ashburn) | Kubernetes 1.25 | iad.ocir.io/oracle/oci-cluster-autoscaler:1.25.0-6 |
| US East (Ashburn) | Kubernetes 1.26 | iad.ocir.io/oracle/oci-cluster-autoscaler:1.26.2-11 |
| US East (Ashburn) | Kubernetes 1.27 | iad.ocir.io/oracle/oci-cluster-autoscaler:1.27.2-9 |
| US East (Ashburn) | Kubernetes 1.28 | iad.ocir.io/oracle/oci-cluster-autoscaler:1.28.0-5 |
- --nodes=1:5:{{ node pool ocid 1 }}
The format is shown below. For nodepool-ocid, enter the OCID of the Node Pool that the Cluster Autoscaler should manage.
```
--nodes=<min-nodes>:<max-nodes>:<nodepool-ocid>
```
- --nodes=1:5:{{ node pool ocid 2 }}
If there is a second Node Pool to manage, set it here; otherwise, delete this line. To manage three or more Node Pools, add further lines in the same format.
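
Because a mistyped `--nodes` value only shows up once the Pod is running, it can help to sanity-check the value first. The sketch below uses a hypothetical helper, `check_nodes_spec`, which assumes that a node pool OCID starts with `ocid1.nodepool.` (as in the event log later in this guide) and only checks the shape of the value, not whether the node pool exists.

```shell
# check_nodes_spec is a hypothetical helper for sanity-checking a value
# before pasting it after --nodes= in cluster-autoscaler.yaml.
# $1: a spec in the form <min-nodes>:<max-nodes>:<nodepool-ocid>
check_nodes_spec() {
  # Split on ':'; OCIDs use '.' as their separator, so the OCID stays intact.
  IFS=: read -r min max ocid <<EOF
$1
EOF
  case "$min" in
    ''|*[!0-9]*) echo "min-nodes must be a non-negative integer" >&2; return 1 ;;
  esac
  case "$max" in
    ''|*[!0-9]*) echo "max-nodes must be a non-negative integer" >&2; return 1 ;;
  esac
  if [ "$min" -gt "$max" ]; then
    echo "min-nodes ($min) is greater than max-nodes ($max)" >&2
    return 1
  fi
  case "$ocid" in
    ocid1.nodepool.*) ;;
    *) echo "expected a node pool OCID starting with ocid1.nodepool." >&2; return 1 ;;
  esac
  echo "min=$min max=$max nodepool=$ocid"
}
```

A call such as `check_nodes_spec 1:5:<nodepool-ocid>` prints the parsed fields on success and returns a non-zero status with a message otherwise.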

Save the configuration file.
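
As an alternative to editing by hand, the placeholders can be filled with `sed`. This sketch assumes the downloaded file contains the placeholders exactly as they appear in the excerpt above (`{{ image tag }}`, `{{ node pool ocid 1 }}`, `{{ node pool ocid 2 }}`) and that only one node pool is managed. `fill_placeholders` is a hypothetical helper name. The `--cloud-provider` value still has to be adjusted separately.

```shell
# fill_placeholders is a hypothetical helper. It writes the result to stdout
# and leaves the input file untouched.
# $1: template file, $2: image tag, $3: node pool OCID
# Assumes $2 and $3 contain no '|', '&' or '\' characters.
fill_placeholders() {
  sed -e "s|{{ image tag }}|$2|" \
      -e "s|{{ node pool ocid 1 }}|$3|" \
      -e "/{{ node pool ocid 2 }}/d" \
      "$1"
}
```

For example, `fill_placeholders cluster-autoscaler.yaml 1.26.2-11 <nodepool-ocid> > cluster-autoscaler-filled.yaml` writes a filled-in copy; the output file name is only an illustration.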

Step 3: Deploy the Cluster Autoscaler to the OKE cluster

Deploy the Kubernetes Cluster Autoscaler to the OKE cluster.
```
kubectl apply -f cluster-autoscaler.yaml
```

Check the logs to verify the deployment.

```
kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler
```

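If the deployment succeeds, you will see logs like the following. One of the three Pods becomes the leader and holds the lock. The log below comes from a Pod that is not the leader, so it keeps reporting that the lease is held by another Pod.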

```
$ kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler
...
Found 3 pods, using pod/cluster-autoscaler-544c49444b-45gdt
...
I0128 12:35:14.650472       1 main.go:474] Cluster Autoscaler 1.26.2
I0128 12:35:14.668348       1 leaderelection.go:248] attempting to acquire leader lease kube-system/cluster-autoscaler...
...
I0128 12:35:32.071336       1 leaderelection.go:352] lock is held by cluster-autoscaler-544c49444b-jjzk7 and has not yet expired
I0128 12:35:32.071357       1 leaderelection.go:253] failed to acquire lease kube-system/cluster-autoscaler
I0128 12:35:35.723678       1 leaderelection.go:352] lock is held by cluster-autoscaler-544c49444b-jjzk7 and has not yet expired
I0128 12:35:35.724233       1 leaderelection.go:253] failed to acquire lease kube-system/cluster-autoscaler
...
```

Check which of the three Kubernetes Cluster Autoscaler Pods is actually active.

```
$ kubectl get pod -l app=cluster-autoscaler -n kube-system
NAME                                  READY   STATUS    RESTARTS   AGE
cluster-autoscaler-544c49444b-45gdt   1/1     Running   0          7m55s
cluster-autoscaler-544c49444b-gzx64   1/1     Running   0          7m55s
cluster-autoscaler-544c49444b-jjzk7   1/1     Running   0          7m55s

$ kubectl -n kube-system get lease cluster-autoscaler
NAME                 HOLDER                                AGE
cluster-autoscaler   cluster-autoscaler-544c49444b-jjzk7   34m
```
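
Since `logs -f deployment.apps/cluster-autoscaler` picks one Pod that may not be the leader, it can be handy to follow the leader's log directly. The sketch below reads the holder from the Lease and assumes, as in the output above, that the holder identity is the Pod name. `ca_leader` and `ca_leader_logs` are hypothetical helper names.

```shell
# ca_leader is a hypothetical helper name. It prints only the holder of the
# cluster-autoscaler Lease, which is the Pod currently acting as leader.
ca_leader() {
  kubectl -n kube-system get lease cluster-autoscaler \
    -o jsonpath='{.spec.holderIdentity}'
}

# ca_leader_logs follows the logs of the leader Pod, assuming the holder
# identity equals the Pod name as in the output above.
ca_leader_logs() {
  kubectl -n kube-system logs -f "pod/$(ca_leader)"
}
```

Running `ca_leader_logs` should then show the scale-up and scale-down decisions rather than the repeated lease messages.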

Check the ConfigMap to see the status of the Kubernetes Cluster Autoscaler.
```
kubectl -n kube-system get cm cluster-autoscaler-status -oyaml
```
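
To print only the status report without the surrounding YAML, a JSONPath expression can be used. This sketch assumes the report is stored under the `status` key of the ConfigMap's `data`. `ca_status` is a hypothetical helper name.

```shell
# ca_status is a hypothetical helper name. It prints only the text stored
# under the "status" key of the ConfigMap, without the YAML wrapper.
ca_status() {
  kubectl -n kube-system get cm cluster-autoscaler-status \
    -o jsonpath='{.data.status}'
}
```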

Step 4: Verify that cluster autoscaling works

Check the current state of the Worker Nodes.

```
$ kubectl get nodes
NAME          STATUS   ROLES   AGE    VERSION
10.0.10.158   Ready    node    10d    v1.26.7
10.0.10.42    Ready    node    2d5h   v1.26.7
10.0.10.43    Ready    node    2d5h   v1.26.7
```

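The following is an example deployment file for a sample application.

requests.cpu is set to 200 millicores (0.2 cores), so the scheduler reserves that much CPU on a node for each Pod.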

```
# nginx.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m
```

Deploy the sample.
```
kubectl apply -f nginx.yaml
```

Increase the number of Pods.

```
kubectl scale deployment nginx-deployment --replicas=40
```
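
As a quick check on why this scale-out can exhaust the cluster, the total CPU request can be worked out from the two numbers above. Whether the existing three nodes can hold it depends on the node shape and on what is already running, neither of which is stated here, so the sketch only computes the request side.

```shell
# Total CPU requested by the Deployment after scaling, in millicores.
replicas=40        # from the kubectl scale command above
request_m=200      # requests.cpu of each Pod, from nginx.yaml
total_m=$((replicas * request_m))
echo "${total_m}m requested in total"   # prints "8000m requested in total"
```

That is 8 full cores of requests, which the scheduler must find as unreserved CPU across the nodes.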

Check the Deployment status. Pods are created until the available resources run out, and then the rollout stalls because no more Pods can be scheduled.

```
$ kubectl get deployment nginx-deployment --watch
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   20/40   40           20          60s
nginx-deployment   22/40   40           22          60s
nginx-deployment   23/40   40           23          61s
nginx-deployment   24/40   40           24          61s
```
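
To see which Pods are stuck, the Pods that the scheduler has not yet placed can be listed with a field selector on the Pod phase. `list_pending_pods` is a hypothetical helper name; the label `app=nginx` comes from nginx.yaml.

```shell
# list_pending_pods is a hypothetical helper name. It lists the sample
# application's Pods that have not yet been placed on a node (phase Pending).
list_pending_pods() {
  kubectl get pods -l app=nginx --field-selector=status.phase=Pending
}
```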

The event log shows that Pod scheduling failed due to insufficient CPU. This triggers a cluster scale-up event, and the number of nodes increases from 3 to 5.

```
$ kubectl get events --sort-by=.metadata.creationTimestamp
...
5m1s        Warning   FailedScheduling          pod/nginx-deployment-694bc9bdb8-bgqww                                           0/3 nodes are available: 3 Insufficient cpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod..
...
40s         Normal    TriggeredScaleUp    pod/nginx-deployment-694bc9bdb8-bgqww                                           pod triggered scale-up: [{ocid1.nodepool.oc1.ap-chuncheon-1.aaaaaaaabtavqjthmpeivjj5dj7i4yttl74y7ncnvoxgxn7kqna6xzvlolbq 3->5 (max: 5)}]

$ kubectl get nodes
NAME          STATUS     ROLES    AGE    VERSION
10.0.10.132   NotReady   <none>   5s     v1.26.7
10.0.10.158   Ready      node     10d    v1.26.7
10.0.10.167   NotReady   node     13s    v1.26.7
10.0.10.42    Ready      node     2d5h   v1.26.7
10.0.10.43    Ready      node     2d5h   v1.26.7
```

Once the added Worker Nodes become Ready, the remaining Pods are scheduled.

```
$ kubectl get nodes
NAME          STATUS   ROLES   AGE     VERSION
10.0.10.132   Ready    node    3m52s   v1.26.7
10.0.10.158   Ready    node    10d     v1.26.7
10.0.10.167   Ready    node    4m      v1.26.7
10.0.10.42    Ready    node    2d5h    v1.26.7
10.0.10.43    Ready    node    2d5h    v1.26.7

$ kubectl get deployment nginx-deployment --watch
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   24/40   40           24          61s
...
...
...
nginx-deployment   25/40   40           25          7m18s
...
nginx-deployment   40/40   40           40          7m51s
```

Step 5: Clean up and verify Scale In

Delete the sample application that was deployed.
```
kubectl delete deployment nginx-deployment
```

In line with the scale-down-unneeded-time=10m setting, you can see that ScaleDown events occur about 10 minutes later.

```
$ kubectl get events --sort-by=.metadata.creationTimestamp -A
...
default         12m         Normal    Killing                   pod/nginx-deployment-694bc9bdb8-t44jp                                           Stopping container nginx
default         105s        Normal    ScaleDown                 node/10.0.10.167                                                                marked the node as toBeDeleted/unschedulable
...
default         105s        Normal    ScaleDown                 node/10.0.10.132                                                                marked the node as toBeDeleted/unschedulable
...
default         89s         Normal    NodeNotSchedulable        node/10.0.10.167                                                                Node 10.0.10.167 status is now: NodeNotSchedulable
...
default         82s         Normal    ScaleDown                 node/10.0.10.132                                                                marked the node as toBeDeleted/unschedulable
```

Check the node status. Two nodes have been excluded from scheduling.

```
$ kubectl get nodes
NAME          STATUS                        ROLES   AGE    VERSION
10.0.10.132   Ready,SchedulingDisabled      node    22m    v1.26.7
10.0.10.158   Ready                         node    10d    v1.26.7
10.0.10.167   NotReady,SchedulingDisabled   node    22m    v1.26.7
10.0.10.42    Ready                         node    2d5h   v1.26.7
10.0.10.43    Ready                         node    2d5h   v1.26.7
```

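The two nodes were then deleted, leaving the original 3 nodes.
Even if <min-nodes> is set to 1 when deploying the Cluster Autoscaler, it appears that the node count specified when the Node Pool was created is kept if that count is larger.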
```
$ kubectl get nodes
NAME          STATUS   ROLES   AGE    VERSION
10.0.10.158   Ready    node    10d    v1.26.7
10.0.10.42    Ready    node    2d6h   v1.26.7
10.0.10.43    Ready    node    2d6h   v1.26.7
```

This post was written by an individual in their own personal time. The content may contain errors, and the opinions expressed are the author's own.

Last updated on 28 Jan 2024