TheKoguryo's Tech Blog

 Version 2025-11-14

1.3.5 Testing the Cluster Autoscaler

This section tests the autoscaling feature. For convenience, the sample application is the Agones-based sample game server.

Environment Setup

Test Cluster

  • Kubernetes Version: 1.32.1
  • Network type: Flannel overlay
  • Node Shape
    • VM.Standard.E4.Flex
    • 4 OCPU(8vCPU), 64GB
    • Network bandwidth: 4 Gbps
  • Node image: Oracle-Linux-8.10-2025.07.21-0-OKE-1.32.1-967
  • Boot volume: 47 GB
  • Volume Performance Units (VPUs) : 10 VPU
  • IOPS: 2,820 IOPS (60 IOPS/GB)
  • Throughput: 22.56 MB/s (480 KB/s/GB)
  • Node Count: 3
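
The IOPS and throughput figures in the list above follow from the boot volume size and the per-GB rates quoted for 10 VPU. A minimal sketch of that arithmetic, using the 60 IOPS/GB and 480 KB/s/GB rates from the list (the function names are illustrative and not part of any OCI tool):

```shell
#!/usr/bin/env bash
# Estimate block volume performance from its size and per-GB rates.
# The rates used below are the ones quoted in the list above for 10 VPU.

# bv_iops SIZE_GB IOPS_PER_GB -> total IOPS
bv_iops() {
  awk -v size="$1" -v rate="$2" 'BEGIN { printf "%d\n", size * rate }'
}

# bv_throughput_mbs SIZE_GB KB_PER_SEC_PER_GB -> throughput in MB/s (1 MB = 1000 KB)
bv_throughput_mbs() {
  awk -v size="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", size * rate / 1000 }'
}

echo "IOPS:       $(bv_iops 47 60)"
echo "Throughput: $(bv_throughput_mbs 47 480) MB/s"
```

Because both values scale with the volume size, the same functions can be used to check what a larger boot volume would provide.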

Installing the Cluster Autoscaler

  • Create the oke-cluster-autoscaler-grp-policy policy

  • Install the Cluster Autoscaler add-on

    export OCI_CLI_REGION=ap-tokyo-1
    
    COMPARTMENT_ID=ocid1.compartment.oc1..aaaaa...
    CLUSTER_NAME=oke-cluster-ca
    CLUSTER_ID=`oci ce cluster list --compartment-id $COMPARTMENT_ID --name $CLUSTER_NAME --lifecycle-state=ACTIVE --query "data[0].id" --raw-output`
    NODE_POOL_NAME=pool1
    NODE_POOL_ID=`oci ce node-pool list --compartment-id $COMPARTMENT_ID --cluster-id $CLUSTER_ID --name $NODE_POOL_NAME --query "data[0].id" --raw-output`
    
    # Remove any previously installed add-on before reinstalling it.
    oci ce cluster disable-addon \
      --cluster-id $CLUSTER_ID \
      --addon-name ClusterAutoscaler \
      --is-remove-existing-add-on true \
      --wait-for-state SUCCEEDED \
      --max-wait-seconds 60
    
    oci ce cluster install-addon \
      --cluster-id $CLUSTER_ID \
      --addon-name ClusterAutoscaler \
      --wait-for-state SUCCEEDED \
      --max-wait-seconds 60 \
      --configurations "[
          {
            \"key\": \"nodes\",
            \"value\": \"2:20:${NODE_POOL_ID}\"
          },
          {
            \"key\": \"authType\",
            \"value\": \"workload\"
          },
          {
            \"key\": \"numOfReplicas\",
            \"value\": \"1\"
          },
          {
            \"key\": \"v\",
            \"value\": \"4\"
          },
          {
            \"key\": \"scanInterval\",
            \"value\": \"5s\"
          },
          {
            \"key\": \"skipNodesWithSystemPods\",
            \"value\": \"false\"
          },
          {
            \"key\": \"maxNodeProvisionTime\",
            \"value\": \"15m\"
          },
          {
            \"key\": \"scaleDownDelayAfterAdd\",
            \"value\": \"2m\"
          },
          {
            \"key\": \"scaleDownUnneededTime\",
            \"value\": \"2m\"
          }
        ]"
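
The `nodes` value has the form `<min>:<max>:<node pool OCID>`, as in `2:20:${NODE_POOL_ID}` above. Since `NODE_POOL_ID` comes from a CLI query that can return nothing, it may help to validate the value before installing the add-on. A small helper, sketched here under a hypothetical name:

```shell
#!/usr/bin/env bash
# ca_nodes_value MIN MAX NODE_POOL_ID
# Prints "MIN:MAX:NODE_POOL_ID"; returns non-zero on invalid input.
ca_nodes_value() {
  local min="$1" max="$2" pool_id="$3"
  local n
  for n in "$min" "$max"; do
    case "$n" in
      ''|*[!0-9]*)
        echo "min and max must be non-negative integers" >&2
        return 1
        ;;
    esac
  done
  if [ "$min" -gt "$max" ]; then
    echo "min ($min) must not exceed max ($max)" >&2
    return 1
  fi
  if [ -z "$pool_id" ]; then
    echo "node pool OCID is empty" >&2
    return 1
  fi
  printf '%s:%s:%s\n' "$min" "$max" "$pool_id"
}

# Example with a placeholder OCID (not a real node pool).
ca_nodes_value 2 20 "ocid1.nodepool.oc1..example"
```

In the install command, the result would replace the hand-written string, for example `NODES=$(ca_nodes_value 2 20 "$NODE_POOL_ID") || exit 1`.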
    

Installing Agones and the Sample Game Server

  1. Prepare the images

    oci artifacts container repository create --compartment-id $COMPARTMENT_ID --display-name sandbox/agones-images/release/agones-sdk --is-public TRUE
    oci artifacts container repository create --compartment-id $COMPARTMENT_ID --display-name sandbox/agones-images/release/agones-allocator --is-public TRUE
    oci artifacts container repository create --compartment-id $COMPARTMENT_ID --display-name sandbox/agones-images/release/agones-controller --is-public TRUE
    oci artifacts container repository create --compartment-id $COMPARTMENT_ID --display-name sandbox/agones-images/release/agones-extensions --is-public TRUE
    oci artifacts container repository create --compartment-id $COMPARTMENT_ID --display-name sandbox/agones-images/release/agones-ping --is-public TRUE
    oci artifacts container repository create --compartment-id $COMPARTMENT_ID --display-name sandbox/agones-images/examples/supertuxkart-example --is-public TRUE
    
    NAMESPACE=`oci os ns get --query data --raw-output`
    
    # Tag and push with the same registry host; docker push fails if the reference was never tagged locally.
    docker pull us-docker.pkg.dev/agones-images/release/agones-sdk:1.53.0
    docker tag us-docker.pkg.dev/agones-images/release/agones-sdk:1.53.0 ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/agones-images/release/agones-sdk:1.53.0
    docker push ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/agones-images/release/agones-sdk:1.53.0
    
    docker pull us-docker.pkg.dev/agones-images/release/agones-allocator:1.53.0
    docker tag us-docker.pkg.dev/agones-images/release/agones-allocator:1.53.0 ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/agones-images/release/agones-allocator:1.53.0
    docker push ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/agones-images/release/agones-allocator:1.53.0
    
    docker pull us-docker.pkg.dev/agones-images/release/agones-controller:1.53.0
    docker tag us-docker.pkg.dev/agones-images/release/agones-controller:1.53.0 ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/agones-images/release/agones-controller:1.53.0
    docker push ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/agones-images/release/agones-controller:1.53.0
    
    docker pull us-docker.pkg.dev/agones-images/release/agones-extensions:1.53.0
    docker tag us-docker.pkg.dev/agones-images/release/agones-extensions:1.53.0 ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/agones-images/release/agones-extensions:1.53.0
    docker push ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/agones-images/release/agones-extensions:1.53.0
    
    docker pull us-docker.pkg.dev/agones-images/release/agones-ping:1.53.0
    docker tag us-docker.pkg.dev/agones-images/release/agones-ping:1.53.0 ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/agones-images/release/agones-ping:1.53.0
    docker push ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/agones-images/release/agones-ping:1.53.0
    
    docker pull us-docker.pkg.dev/agones-images/examples/supertuxkart-example:0.19
    docker tag us-docker.pkg.dev/agones-images/examples/supertuxkart-example:0.19 ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19
    docker push ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19
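
Each image goes through the same pull, tag and push sequence, and the tag target and the push reference have to be identical for the push to succeed. A sketch that generates the commands from one list, so the two references cannot drift apart (the `mirror_cmds` helper and the `mytenancy` fallback are hypothetical; it only prints the commands):

```shell
#!/usr/bin/env bash
# mirror_cmds SRC_REGISTRY DST_REGISTRY IMAGE:TAG
# Prints the pull/tag/push commands needed to mirror one image.
mirror_cmds() {
  local src="$1" dst="$2" image="$3"
  printf 'docker pull %s/%s\n' "$src" "$image"
  printf 'docker tag %s/%s %s/%s\n' "$src" "$image" "$dst" "$image"
  printf 'docker push %s/%s\n' "$dst" "$image"
}

SRC=us-docker.pkg.dev/agones-images
# NAMESPACE is the tenancy namespace from `oci os ns get`; "mytenancy" is a placeholder.
DST="ap-tokyo-1.ocir.io/${NAMESPACE:-mytenancy}/sandbox/agones-images"

for image in \
  release/agones-sdk:1.53.0 \
  release/agones-allocator:1.53.0 \
  release/agones-controller:1.53.0 \
  release/agones-extensions:1.53.0 \
  release/agones-ping:1.53.0 \
  examples/supertuxkart-example:0.19
do
  mirror_cmds "$SRC" "$DST" "$image"
done
```

After reviewing the printed commands, the output can be piped to `bash` to run them.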
    
  2. Install Agones

    helm repo add agones https://agones.dev/chart/stable
    helm repo update
    helm uninstall my-agones --namespace agones-system
    helm install my-agones --namespace agones-system \
    --create-namespace agones/agones \
    --set agones.image.registry=ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/agones-images/release \
    --set agones.ping.udp.expose=false
    
    
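    # Optional alternative (not applied above): an annotation that requests an OCI Network Load Balancer for the ping UDP service
    #   --set agones.ping.udp.annotations."oci\.oraclecloud\.com/load-balancer-type"="nlb"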
    
  3. Install the game server

    # Reference: https://agones.dev/site/docs/examples/supertuxkart/
    wget https://raw.githubusercontent.com/googleforgames/agones/release-1.53.0/examples/supertuxkart/fleet.yaml
    
  4. Edit fleet.yaml

    ...
              containers:
                - name: supertuxkart
                  image: ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19
                  resources:
                    limits:
                      cpu: 1000m
                      memory: 4Gi
                    requests:
                      cpu: 1000m
                      memory: 4Gi
    
  5. Deploy

    kubectl apply -f fleet.yaml
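
`kubectl apply` does not expand shell variables inside a manifest, so the `$NAMESPACE` placeholder in the image path has to be replaced with the real tenancy namespace before deploying. One way to do that, as a sketch (the `render_manifest` helper name is hypothetical):

```shell
#!/usr/bin/env bash
# render_manifest NAMESPACE_VALUE
# Reads a manifest on stdin and replaces the literal text $NAMESPACE
# with the given value, writing the result to stdout.
# Assumes the value contains no "|" or "&" characters.
render_manifest() {
  local ns="$1"
  sed 's|\$NAMESPACE|'"$ns"'|g'
}

# Usage against a cluster (not executed here):
#   render_manifest "$NAMESPACE" < fleet.yaml | kubectl apply -f -

# Self-contained demonstration with a one-line manifest fragment.
printf 'image: ap-tokyo-1.ocir.io/$NAMESPACE/sandbox/app:1\n' | render_manifest mytenancy
```

Editing the file by hand works just as well; the pipe form simply keeps the placeholder in the saved manifest.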
    
Test #1. Basic configuration test
  1. Prepare the game servers

    kubectl scale fleet supertuxkart --replicas=21
    
  2. Scale the nodes from 3 to 4

    kubectl scale fleet supertuxkart --replicas=28
    kubectl get pod --watch
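
The replica counts are chosen so that the fleet fills the existing nodes before the scale-up. In this test 21 replicas fill 3 nodes, which suggests about 7 game servers per node at 1 CPU and 4 Gi each; that figure is an observation from this test and depends on what system pods reserve. Under that assumption, a ceiling division gives the node count for a given replica count:

```shell
#!/usr/bin/env bash
# nodes_needed REPLICAS PODS_PER_NODE -> minimum node count (ceiling division)
nodes_needed() {
  local replicas="$1" per_node="$2"
  echo $(( (replicas + per_node - 1) / per_node ))
}

PODS_PER_NODE=7   # assumption taken from this test, not a general rule
echo "21 replicas -> $(nodes_needed 21 "$PODS_PER_NODE") nodes"
echo "28 replicas -> $(nodes_needed 28 "$PODS_PER_NODE") nodes"
```

Going from 21 to 28 replicas therefore leaves 7 pods unschedulable, which is what triggers the Cluster Autoscaler to add a fourth node.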
    
  3. Check the events

    $ kubectl get events -o custom-columns=Node:.source.host,kind:.involvedObject.kind,name:.involvedObject.name,timestamp:.metadata.creationTimestamp,Count:.count,From:.source.component,Reason:.reason,Message:.message \
    --sort-by=.metadata.creationTimestamp \
    --watch | grep -E "PortAllocation|RequestReady|TriggeredScaleUp|NodeReady|Pulling|Pulled"
    
    <none>        GameServer            supertuxkart-wk72n-ss4ng                2025-10-29T15:51:19Z   1        gameserver-controller             PortAllocation                    Port allocated
    ...
    <none>        Pod                   supertuxkart-wk72n-rmkgw                2025-10-29T15:51:23Z   1        cluster-autoscaler                TriggeredScaleUp                  pod triggered scale-up: [{ocid1.nodepool.oc1.ap-tokyo-1.aaaaaaaacqam3j4kqhazjofg5oo6ysfblm5zaxtbnofnybtwtn25xbi7anva 3->4 (max: 15)}]
    ...
    10.0.10.128   Node                  10.0.10.128                             2025-10-29T15:53:14Z   1        kubelet                           NodeReady                         Node 10.0.10.128 status is now: NodeReady
    ...
    10.0.10.128   Pod                   supertuxkart-wk72n-7l5jg                2025-10-29T15:54:46Z   1        kubelet                           Pulled                            Successfully pulled image "nrt.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19" in 1m17.536s (1m17.536s including waiting). Image size: 1107512625 bytes.
    10.0.10.128   Pod                   supertuxkart-wk72n-l5fr6                2025-10-29T15:54:46Z   1        kubelet                           Pulled                            Successfully pulled image "nrt.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19" in 1m16.105s (1m16.106s including waiting). Image size: 1107512625 bytes.
    10.0.10.128   Pod                   supertuxkart-wk72n-rmkgw                2025-10-29T15:54:47Z   1        kubelet                           Pulled                            Successfully pulled image "nrt.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19" in 1m17.562s (1m17.562s including waiting). Image size: 1107512625 bytes.
    10.0.10.128   Pod                   supertuxkart-wk72n-ss4ng                2025-10-29T15:54:47Z   1        kubelet                           Pulled                            Successfully pulled image "nrt.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19" in 1m18.435s (1m18.435s including waiting). Image size: 1107512625 bytes.
    10.0.10.128   Pod                   supertuxkart-wk72n-h5ff7                2025-10-29T15:54:47Z   1        kubelet                           Pulled                            Successfully pulled image "nrt.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19" in 1m18.458s (1m18.458s including waiting). Image size: 1107512625 bytes.
    10.0.10.128   Pod                   supertuxkart-wk72n-2676m                2025-10-29T15:54:47Z   1        kubelet                           Pulled                            Successfully pulled image "nrt.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19" in 1m16.487s (1m16.487s including waiting). Image size: 1107512625 bytes.
    10.0.10.128   Pod                   supertuxkart-wk72n-f648f                2025-10-29T15:54:47Z   1        kubelet                           Pulled                            Successfully pulled image "nrt.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19" in 1m18.68s (1m18.68s including waiting). Image size: 1107512625 bytes.
    ...
    <none>        GameServer            supertuxkart-wk72n-f648f                2025-10-29T15:54:49Z   1        gameserver-sidecar                RequestReady                      SDK state change
    
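
The event timestamps show where the time goes between the scale-up trigger and the game server becoming ready. A sketch that turns the timestamps from the log above into phase durations (helper names are illustrative; it assumes all events fall on the same day):

```shell
#!/usr/bin/env bash
# to_seconds HH:MM:SS -> seconds since midnight
to_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  # 10# forces base 10 so values like "08" are not parsed as octal.
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# elapsed START END -> seconds between two times on the same day
elapsed() {
  echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

# Times copied from the Test #1 event log (2025-10-29, UTC).
SCALE_UP=15:51:23    # TriggeredScaleUp
NODE_READY=15:53:14  # NodeReady
LAST_PULLED=15:54:47 # last Pulled event
READY=15:54:49       # RequestReady

echo "TriggeredScaleUp -> NodeReady:    $(elapsed "$SCALE_UP" "$NODE_READY")s"
echo "NodeReady -> last Pulled:         $(elapsed "$NODE_READY" "$LAST_PULLED")s"
echo "TriggeredScaleUp -> RequestReady: $(elapsed "$SCALE_UP" "$READY")s"
```

The second phase includes pod scheduling as well as the image pull itself, so it is an upper bound on the pull time rather than an exact measurement.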
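  4. Pulling one supertuxkart-example:0.19 image took up to 1m18.68s. Assuming 7 images were pulled concurrently at that rate, the requirements work out as follows:

    Item                                     Value
    Single image size                        1,107,512,625 bytes (≈ 1,056.21 MiB)
    Average transfer rate per image          ≈ 13.42 MiB/s (≈ 107.39 Mibit/s)
    Total data for 7 images                  ≈ 7,393.44 MiB (≈ 7.22 GiB)
    Total network bandwidth required         ≈ 93.97 MiB/s ≈ 751.75 Mibit/s (≈ 788 Mb/s ≈ 0.79 Gbps)
    Aggregate sequential disk write          ≈ 93.97 MiB/s
    Theoretical IOPS (random 4 KiB writes)   ≈ 24,056 IOPS

    • With the current shape, the 4 Gbps network bandwidth is sufficient, but 2,820 IOPS is not.
    • 24,056 IOPS / 7 => about 3,436 IOPS needed for a single pull
    • Increasing the disk to 100 GB (at 10 VPU) gives 6,000 IOPS; the image also needs to be pulled once instead of 7 times concurrently.
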
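
The estimate above can be reproduced from three measured values: the image size, the slowest pull time, and the number of concurrent pulls. A sketch of the calculation (the function and output key names are illustrative). Note that 751.75 is 93.97 MiB/s multiplied by 8, so it is in binary megabits; the script prints the decimal megabit figure alongside:

```shell
#!/usr/bin/env bash
# pull_estimate IMAGE_BYTES PULL_SECONDS PARALLEL_PULLS
# Prints per-pull and aggregate throughput and the equivalent 4 KiB IOPS.
pull_estimate() {
  awk -v bytes="$1" -v secs="$2" -v n="$3" 'BEGIN {
    mib       = bytes / 1048576        # size of one image in MiB
    per_pull  = mib / secs             # MiB/s seen by one pull
    aggregate = per_pull * n           # MiB/s across all pulls
    iops      = aggregate * 1024 / 4   # 4 KiB writes per second
    printf "image_mib=%.2f\n", mib
    printf "per_pull_mib_s=%.2f\n", per_pull
    printf "aggregate_mib_s=%.2f\n", aggregate
    printf "aggregate_mbit_s=%.0f\n", aggregate * 1048576 * 8 / 1000000
    printf "iops_4k=%.0f\n", iops
  }'
}

# Values from the Pulled events: 1107512625 bytes, 1m18.68s, 7 pods on the new node.
pull_estimate 1107512625 78.68 7
```

The IOPS figure treats every write as a random 4 KiB operation, which is a worst-case simplification; image layers are largely written sequentially.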
Test #2. 100 GB disk with --serialize-image-pulls=true
  1. Update the node pool

    1. Update cloud-init
    #!/bin/bash
    curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_init_script | base64 --decode >/var/run/oke-init.sh
    bash /var/run/oke-init.sh --kubelet-extra-args "--serialize-image-pulls=true"
    sudo /usr/libexec/oci-growfs -y
    
    2. Increase the boot volume to 100 GB
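
Once a node created from the updated node pool has joined, it is worth confirming that the kubelet actually picked up the setting. The kubelet exposes its effective configuration at the `configz` endpoint, reachable through the API server proxy. A sketch that extracts the relevant field (the helper name is hypothetical, and the sample payload is a trimmed illustration rather than real output):

```shell
#!/usr/bin/env bash
# serialize_pulls_setting
# Reads kubelet configz JSON on stdin and prints the serializeImagePulls value.
serialize_pulls_setting() {
  sed -n -E 's/.*"serializeImagePulls"[[:space:]]*:[[:space:]]*(true|false).*/\1/p'
}

# Usage against a cluster (not executed here; NODE is a node name such as 10.0.10.245):
#   kubectl get --raw "/api/v1/nodes/${NODE}/proxy/configz" | serialize_pulls_setting

# Self-contained check with a trimmed sample payload.
echo '{"kubeletconfig":{"serializeImagePulls":true}}' | serialize_pulls_setting
```

If `jq` is available, `jq .kubeletconfig.serializeImagePulls` is a more robust way to read the same field.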
  2. Prepare the game servers

    kubectl scale fleet supertuxkart --replicas=21
    
  3. Scale the nodes from 3 to 4

    kubectl scale fleet supertuxkart --replicas=28
    kubectl get pod --watch
    
  4. Check the events

    $ kubectl get events -o custom-columns=Node:.source.host,kind:.involvedObject.kind,name:.involvedObject.name,timestamp:.metadata.creationTimestamp,Count:.count,From:.source.component,Reason:.reason,Message:.message \
    --sort-by=.metadata.creationTimestamp \
    --watch | grep -E "PortAllocation|RequestReady|TriggeredScaleUp|NodeReady|Pulling|Pulled"
    
    <none>        GameServer            supertuxkart-wk72n-g57c7                2025-10-29T17:06:13Z   1        gameserver-controller             PortAllocation                    Port allocated
    ...
    <none>        Pod                   supertuxkart-wk72n-jskj6                2025-10-29T17:06:16Z   1        cluster-autoscaler                TriggeredScaleUp                  pod triggered scale-up: [{ocid1.nodepool.oc1.ap-tokyo-1.aaaaaaaacqam3j4kqhazjofg5oo6ysfblm5zaxtbnofnybtwtn25xbi7anva 3->4 (max: 15)}]
    ...
    10.0.10.245   Node                  10.0.10.245                             2025-10-29T17:08:01Z   1        kubelet                           NodeReady                         Node 10.0.10.245 status is now: NodeReady
    ...
    10.0.10.245   Pod                   supertuxkart-wk72n-v9j5k                2025-10-29T17:09:04Z   1        kubelet                           Pulled                            Successfully pulled image "nrt.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19" in 10.447s (11.109s including waiting). Image size: 1107512625 bytes.
    10.0.10.245   Pod                   supertuxkart-wk72n-g57c7                2025-10-29T17:09:04Z   1        kubelet                           Pulled                            Successfully pulled image "nrt.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19" in 78ms (11.123s including waiting). Image size: 1107512625 bytes.
    10.0.10.245   Pod                   supertuxkart-wk72n-jskj6                2025-10-29T17:09:04Z   1        kubelet                           Pulled                            Successfully pulled image "nrt.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19" in 69ms (10.772s including waiting). Image size: 1107512625 bytes.
    10.0.10.245   Pod                   supertuxkart-wk72n-slwcj                2025-10-29T17:09:04Z   1        kubelet                           Pulled                            Successfully pulled image "nrt.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19" in 72ms (10.841s including waiting). Image size: 1107512625 bytes.
    10.0.10.245   Pod                   supertuxkart-wk72n-qv4kz                2025-10-29T17:09:05Z   1        kubelet                           Pulled                            Successfully pulled image "nrt.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19" in 72ms (10.842s including waiting). Image size: 1107512625 bytes.
    10.0.10.245   Pod                   supertuxkart-wk72n-xzf6z                2025-10-29T17:09:05Z   1        kubelet                           Pulled                            Successfully pulled image "nrt.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19" in 86ms (10.923s including waiting). Image size: 1107512625 bytes.
    10.0.10.245   Pod                   supertuxkart-wk72n-gkgj2                2025-10-29T17:09:05Z   1        kubelet                           Pulled                            Successfully pulled image "nrt.ocir.io/$NAMESPACE/sandbox/agones-images/examples/supertuxkart-example:0.19" in 74ms (10.752s including waiting). Image size: 1107512625 bytes.
    ...
    <none>        GameServer            supertuxkart-wk72n-gkgj2                2025-10-29T17:09:06Z   1        gameserver-sidecar                RequestReady                      SDK state change
    
  5. Pulling one supertuxkart-example:0.19 image took 10.447s; the remaining 6 pods waited and then used the already-pulled image.
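
To put the two tests side by side, the pull times from the Pulled events can be converted into throughput and a speed-up ratio. A sketch using the measured values (function names are illustrative):

```shell
#!/usr/bin/env bash
# throughput_mib_s BYTES SECONDS -> MiB/s with two decimals
throughput_mib_s() {
  awk -v b="$1" -v s="$2" 'BEGIN { printf "%.2f\n", b / 1048576 / s }'
}

# ratio A B -> A / B with two decimals
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

IMAGE_BYTES=1107512625
echo "Test #1, slowest of 7 parallel pulls: $(throughput_mib_s "$IMAGE_BYTES" 78.68) MiB/s"
echo "Test #2, single serialized pull:      $(throughput_mib_s "$IMAGE_BYTES" 10.447) MiB/s"
echo "Pull time ratio (Test #1 / Test #2):  $(ratio 78.68 10.447)x"
```

The single serialized pull runs at roughly the same rate as the seven parallel pulls combined (about 94 MiB/s in the Test #1 estimate), which is consistent with a node-level limit being shared among the parallel pulls. Test #2 also changed the boot volume size, so the comparison is indicative rather than a controlled measurement.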

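Using a Custom Image

To shorten the time it takes for a node created dynamically by autoscaling to become Ready, you can apply 1.10.3 Using a Custom Image When Creating Worker Nodes. A custom image that already contains the required container images shortens the time a new node spends pulling large images.

For applications that are updated regularly through CI/CD, consider whether to build a new custom image for every release or to include only rarely changing images in the custom image.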

Key Settings
  1. Settings at add-on deployment

    Key                            Default     Description
    scanInterval                   10s         How often cluster is re-evaluated for scale up or down.
    unremovableNodeRecheckTimeout  5m          The timeout before we check again a node that couldn't be removed before.
    scaleDownDelayAfterAdd         10m         How long after scale up that scale down evaluation resumes.
    scaleDownUnneededTime          10m         How long a node should be unneeded before it is eligible for scale down.
    scaleDownCandidatesPoolRatio   0.1 (10%)   A ratio of nodes that are considered as additional non-empty candidates for scale down when some candidates from previous iteration are no longer valid.
    maxTotalUnreadyPercentage      45          Maximum percentage of unready nodes in the cluster. After this is exceeded, CA halts operations.
    okTotalUnreadyCount            3           Number of allowed unready nodes, irrespective of maxTotalUnreadyPercentage.

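
These keys are passed to the add-on as a JSON array of key/value objects, which is tedious to escape by hand inside a shell string. A sketch of a small generator that builds the array from `KEY=VALUE` arguments (the `addon_config_json` name is hypothetical, and it assumes keys and values contain no quotes or backslashes):

```shell
#!/usr/bin/env bash
# addon_config_json KEY=VALUE...
# Prints a JSON array of {"key": ..., "value": ...} objects.
# Assumes keys and values contain no double quotes or backslashes.
addon_config_json() {
  local sep="" pair
  printf '['
  for pair in "$@"; do
    # Split on the first "=" only, so values may contain "=" or ":".
    printf '%s{"key":"%s","value":"%s"}' "$sep" "${pair%%=*}" "${pair#*=}"
    sep=","
  done
  printf ']\n'
}

# Example: the defaults from the table for three of the settings.
addon_config_json scanInterval=10s scaleDownDelayAfterAdd=10m scaleDownUnneededTime=10m
```

The output can then be passed directly, for example `--configurations "$(addon_config_json nodes="2:20:${NODE_POOL_ID}" scanInterval=5s)"`.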
Karpenter

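Kubernetes node autoscaling has two main options: Cluster Autoscaler and Karpenter. On OKE, Karpenter is available through the open-source provider released by Zoom, which relies on self-managed nodes. Refer to that project's documentation for details.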



This post was written in a personal capacity, on the author's own time. It may contain errors, and the opinions expressed are the author's own.

Last updated on 14 Nov 2025