kubernetes - kube-dns keeps restarting with kubernetes on coreos

Tags: kubernetes coreos kubectl kube-dns kubelet

I installed Kubernetes on Container Linux by CoreOS alpha (1353.1.0)
using hyperkube v1.5.5_coreos.0, with my fork of the coreos-kubernetes install scripts at https://github.com/kfirufk/coreos-kubernetes.
I have two Container Linux machines:

  • coreos-2.tux-in.com, which resolves to 192.168.1.2, as the controller
  • coreos-3.tux-in.com, which resolves to 192.168.1.3, as the worker
  • kubectl get pods --all-namespaces returns
    NAMESPACE       NAME                                       READY     STATUS    RESTARTS   AGE
    ceph            ceph-mds-2743106415-rkww4                  0/1       Pending   0          1d
    ceph            ceph-mon-check-3856521781-bd6k5            1/1       Running   0          1d
    kube-lego       kube-lego-3323932148-g2tf4                 1/1       Running   0          1d
    kube-system     calico-node-xq6j7                          2/2       Running   0          1d
    kube-system     calico-node-xzpp2                          2/2       Running   4560       1d
    kube-system     calico-policy-controller-610849172-b7xjr   1/1       Running   0          1d
    kube-system     heapster-v1.3.0-beta.0-2754576759-v1f50    2/2       Running   0          1d
    kube-system     kube-apiserver-192.168.1.2                 1/1       Running   0          1d
    kube-system     kube-controller-manager-192.168.1.2        1/1       Running   1          1d
    kube-system     kube-dns-3675956729-r7hhf                  3/4       Running   3924       1d
    kube-system     kube-dns-autoscaler-505723555-l2pph        1/1       Running   0          1d
    kube-system     kube-proxy-192.168.1.2                     1/1       Running   0          1d
    kube-system     kube-proxy-192.168.1.3                     1/1       Running   0          1d
    kube-system     kube-scheduler-192.168.1.2                 1/1       Running   1          1d
    kube-system     kubernetes-dashboard-3697905830-vdz23      1/1       Running   1246       1d
    kube-system     monitoring-grafana-4013973156-m2r2v        1/1       Running   0          1d
    kube-system     monitoring-influxdb-651061958-2mdtf        1/1       Running   0          1d
    nginx-ingress   default-http-backend-150165654-s4z04       1/1       Running   2          1d
    
    So I can see that kube-dns-3675956729-r7hhf keeps restarting. kubectl describe pod kube-dns-3675956729-r7hhf --namespace=kube-system returns:
    Name:       kube-dns-3675956729-r7hhf
    Namespace:  kube-system
    Node:       192.168.1.2/192.168.1.2
    Start Time: Sat, 11 Mar 2017 17:54:14 +0000
    Labels:     k8s-app=kube-dns
            pod-template-hash=3675956729
    Status:     Running
    IP:     10.2.67.243
    Controllers:    ReplicaSet/kube-dns-3675956729
    Containers:
      kubedns:
        Container ID:   rkt://f6480fe7-4316-4e0e-9483-0944feb85ea3:kubedns
        Image:      gcr.io/google_containers/kubedns-amd64:1.9
        Image ID:       rkt://sha512-c7b7c9c4393bea5f9dc5bcbe1acf1036c2aca36ac14b5e17fd3c675a396c4219
        Ports:      10053/UDP, 10053/TCP, 10055/TCP
        Args:
          --domain=cluster.local.
          --dns-port=10053
          --config-map=kube-dns
          --v=2
        Limits:
          memory:   170Mi
        Requests:
          cpu:      100m
          memory:       70Mi
        State:      Running
          Started:      Sun, 12 Mar 2017 17:47:41 +0000
        Last State:     Terminated
          Reason:       Completed
          Exit Code:    0
          Started:      Sun, 12 Mar 2017 17:46:28 +0000
          Finished:     Sun, 12 Mar 2017 17:47:02 +0000
        Ready:      False
        Restart Count:  981
        Liveness:       http-get http://:8080/healthz-kubedns delay=60s timeout=5s period=10s #success=1 #failure=5
        Readiness:      http-get http://:8081/readiness delay=3s timeout=5s period=10s #success=1 #failure=3
        Volume Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from default-token-zqbdp (ro)
        Environment Variables:
          PROMETHEUS_PORT:  10055
      dnsmasq:
        Container ID:   rkt://f6480fe7-4316-4e0e-9483-0944feb85ea3:dnsmasq
        Image:      gcr.io/google_containers/kube-dnsmasq-amd64:1.4.1
        Image ID:       rkt://sha512-8c5f8b40f6813bb676ce04cd545c55add0dc8af5a3be642320244b74ea03f872
        Ports:      53/UDP, 53/TCP
        Args:
          --cache-size=1000
          --no-resolv
          --server=127.0.0.1#10053
          --log-facility=-
        Requests:
          cpu:      150m
          memory:       10Mi
        State:      Running
          Started:      Sun, 12 Mar 2017 17:47:41 +0000
        Last State:     Terminated
          Reason:       Completed
          Exit Code:    0
          Started:      Sun, 12 Mar 2017 17:46:28 +0000
          Finished:     Sun, 12 Mar 2017 17:47:02 +0000
        Ready:      True
        Restart Count:  981
        Liveness:       http-get http://:8080/healthz-dnsmasq delay=60s timeout=5s period=10s #success=1 #failure=5
        Volume Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from default-token-zqbdp (ro)
        Environment Variables:  <none>
      dnsmasq-metrics:
        Container ID:   rkt://f6480fe7-4316-4e0e-9483-0944feb85ea3:dnsmasq-metrics
        Image:      gcr.io/google_containers/dnsmasq-metrics-amd64:1.0.1
        Image ID:       rkt://sha512-ceb3b6af1cd67389358be14af36b5e8fb6925e78ca137b28b93e0d8af134585b
        Port:       10054/TCP
        Args:
          --v=2
          --logtostderr
        Requests:
          memory:       10Mi
        State:      Running
          Started:      Sun, 12 Mar 2017 17:47:41 +0000
        Last State:     Terminated
          Reason:       Completed
          Exit Code:    0
          Started:      Sun, 12 Mar 2017 17:46:28 +0000
          Finished:     Sun, 12 Mar 2017 17:47:02 +0000
        Ready:      True
        Restart Count:  981
        Liveness:       http-get http://:10054/metrics delay=60s timeout=5s period=10s #success=1 #failure=5
        Volume Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from default-token-zqbdp (ro)
        Environment Variables:  <none>
      healthz:
        Container ID:   rkt://f6480fe7-4316-4e0e-9483-0944feb85ea3:healthz
        Image:      gcr.io/google_containers/exechealthz-amd64:v1.2.0
        Image ID:       rkt://sha512-3a85b0533dfba81b5083a93c7e091377123dac0942f46883a4c10c25cf0ad177
        Port:       8080/TCP
        Args:
          --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
          --url=/healthz-dnsmasq
          --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1:10053 >/dev/null
          --url=/healthz-kubedns
          --port=8080
          --quiet
        Limits:
          memory:   50Mi
        Requests:
          cpu:      10m
          memory:       50Mi
        State:      Running
          Started:      Sun, 12 Mar 2017 17:47:41 +0000
        Last State:     Terminated
          Reason:       Completed
          Exit Code:    0
          Started:      Sun, 12 Mar 2017 17:46:28 +0000
          Finished:     Sun, 12 Mar 2017 17:47:02 +0000
        Ready:      True
        Restart Count:  981
        Volume Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from default-token-zqbdp (ro)
        Environment Variables:  <none>
    Conditions:
      Type      Status
      Initialized   True 
      Ready     False 
      PodScheduled  True 
    Volumes:
      default-token-zqbdp:
        Type:   Secret (a volume populated by a Secret)
        SecretName: default-token-zqbdp
    QoS Class:  Burstable
    Tolerations:    CriticalAddonsOnly=:Exists
    No events.
    
    This shows that the kubedns-amd64:1.9 container is the one with Ready: False.
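    (As a first debugging step, a sketch of how to pull the failing container's logs; the pod and container names are the ones from the describe output above, and -p asks kubectl for the output of the previous, crashed instance:)
    # kubectl logs kube-dns-3675956729-r7hhf --namespace=kube-system -c kubedns
    # kubectl logs kube-dns-3675956729-r7hhf --namespace=kube-system -c kubedns -p
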
    Here is my kube-dns-de.yaml file:
    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: kube-dns
      namespace: kube-system
      labels:
        k8s-app: kube-dns
        kubernetes.io/cluster-service: "true"
    spec:
      strategy:
        rollingUpdate:
          maxSurge: 10%
          maxUnavailable: 0
      selector:
        matchLabels:
          k8s-app: kube-dns
      template:
        metadata:
          labels:
            k8s-app: kube-dns
          annotations:
            scheduler.alpha.kubernetes.io/critical-pod: ''
            scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly", "operator":"Exists"}]'
        spec:
          containers:
          - name: kubedns
            image: gcr.io/google_containers/kubedns-amd64:1.9
            resources:
              limits:
                memory: 170Mi
              requests:
                cpu: 100m
                memory: 70Mi
            livenessProbe:
              httpGet:
                path: /healthz-kubedns
                port: 8080
                scheme: HTTP
              initialDelaySeconds: 60
              timeoutSeconds: 5
              successThreshold: 1
              failureThreshold: 5
            readinessProbe:
              httpGet:
                path: /readiness
                port: 8081
                scheme: HTTP
              initialDelaySeconds: 3
              timeoutSeconds: 5
            args:
            - --domain=cluster.local.
            - --dns-port=10053
            - --config-map=kube-dns
            # This should be set to v=2 only after the new image (cut from 1.5) has
            # been released, otherwise we will flood the logs.
            - --v=2
            env:
            - name: PROMETHEUS_PORT
              value: "10055"
            ports:
            - containerPort: 10053
              name: dns-local
              protocol: UDP
            - containerPort: 10053
              name: dns-tcp-local
              protocol: TCP
            - containerPort: 10055
              name: metrics
              protocol: TCP
          - name: dnsmasq
            image: gcr.io/google_containers/kube-dnsmasq-amd64:1.4.1
            livenessProbe:
              httpGet:
                path: /healthz-dnsmasq
                port: 8080
                scheme: HTTP
              initialDelaySeconds: 60
              timeoutSeconds: 5
              successThreshold: 1
              failureThreshold: 5
            args:
            - --cache-size=1000
            - --no-resolv
            - --server=127.0.0.1#10053
            - --log-facility=-
            ports:
            - containerPort: 53
              name: dns
              protocol: UDP
            - containerPort: 53
              name: dns-tcp
              protocol: TCP
            # see: https://github.com/kubernetes/kubernetes/issues/29055 for details
            resources:
              requests:
                cpu: 150m
                memory: 10Mi
          - name: dnsmasq-metrics
            image: gcr.io/google_containers/dnsmasq-metrics-amd64:1.0.1
            livenessProbe:
              httpGet:
                path: /metrics
                port: 10054
                scheme: HTTP
              initialDelaySeconds: 60
              timeoutSeconds: 5
              successThreshold: 1
              failureThreshold: 5
            args:
            - --v=2
            - --logtostderr
            ports:
            - containerPort: 10054
              name: metrics
              protocol: TCP
            resources:
              requests:
                memory: 10Mi
          - name: healthz
            image: gcr.io/google_containers/exechealthz-amd64:v1.2.0
            resources:
              limits:
                memory: 50Mi
              requests:
                cpu: 10m
                memory: 50Mi
            args:
            - --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
            - --url=/healthz-dnsmasq
            - --cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1:10053 >/dev/null
            - --url=/healthz-kubedns
            - --port=8080
            - --quiet
            ports:
            - containerPort: 8080
              protocol: TCP
          dnsPolicy: Default
    
    And here is my kube-dns-svc.yaml:
    apiVersion: v1
    kind: Service
    metadata:
      name: kube-dns
      namespace: kube-system
      labels:
        k8s-app: kube-dns
        kubernetes.io/cluster-service: "true"
        kubernetes.io/name: "KubeDNS"
    spec:
      selector:
        k8s-app: kube-dns
      clusterIP: 10.3.0.10
      ports:
      - name: dns
        port: 53
        protocol: UDP
      - name: dns-tcp
        port: 53
        protocol: TCP
    
    Any information about this issue would be greatly appreciated!
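    (For what it's worth, a quick sanity check for cluster DNS as a whole is to resolve a service name from a throwaway pod; a minimal sketch, assuming a busybox image can be pulled on the nodes:)
    # kubectl run -i -t dns-test --image=busybox --restart=Never -- nslookup kubernetes.default
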
    Update: rkt list --full 2> /dev/null | grep kubedns shows:
    744a4579-0849-4fae-b1f5-cb05d40f3734    kubedns             gcr.io/google_containers/kubedns-amd64:1.9      sha512-c7b7c9c4393b running 2017-03-22 22:14:55.801 +0000 UTC   2017-03-22 22:14:56.814 +0000 UTC
    
    journalctl -m _MACHINE_ID=744a45790849b1f5cb05d40f3734 gives:
    Mar 22 22:17:58 kube-dns-3675956729-sthcv kubedns[8]: E0322 22:17:58.619254       8 reflector.go:199] pkg/dns/dns.go:145: Failed to list *api.Endpoints: Get https://10.3.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 10.3.0.1:443: connect: network is unreachable
    
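    ('network is unreachable' is a routing error rather than a timeout, which suggests the pod has no route to the service CIDR at all. One way to confirm from inside the pod, sketched here assuming the kubedns image ships a shell and a busybox/iproute2 ip applet, is to inspect its routing table; the UUID is the one from rkt list above:)
    # rkt enter --app=kubedns 744a4579-0849-4fae-b1f5-cb05d40f3734
    # ip route
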
    I tried adding - --proxy-mode=userspace to /etc/kubernetes/manifests/kube-proxy.yaml, but the result was the same. kubectl get svc --all-namespaces gives:
    NAMESPACE       NAME                   CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
    ceph            ceph-mon               None         <none>        6789/TCP        1h
    default         kubernetes             10.3.0.1     <none>        443/TCP         1h
    kube-system     heapster               10.3.0.2     <none>        80/TCP          1h
    kube-system     kube-dns               10.3.0.10    <none>        53/UDP,53/TCP   1h
    kube-system     kubernetes-dashboard   10.3.0.116   <none>        80/TCP          1h
    kube-system     monitoring-grafana     10.3.0.187   <none>        80/TCP          1h
    kube-system     monitoring-influxdb    10.3.0.214   <none>        8086/TCP        1h
    nginx-ingress   default-http-backend   10.3.0.233   <none>        80/TCP          1h
    
    kubectl get cs gives:
    NAME                 STATUS    MESSAGE              ERROR
    controller-manager   Healthy   ok
    scheduler            Healthy   ok
    etcd-0               Healthy   {"health": "true"}
    
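    (kubectl get cs does not cover kube-proxy, and kube-proxy is what actually programs the service IPs, so its logs are worth checking too; a sketch using the pod names from the earlier listing:)
    # kubectl logs kube-proxy-192.168.1.2 --namespace=kube-system
    # kubectl logs kube-proxy-192.168.1.3 --namespace=kube-system
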
    My kube-proxy.yaml has the following content:
    apiVersion: v1
    kind: Pod
    metadata:
      name: kube-proxy
      namespace: kube-system
      annotations:
        rkt.alpha.kubernetes.io/stage1-name-override: coreos.com/rkt/stage1-fly
    spec:
      hostNetwork: true
      containers:
      - name: kube-proxy
        image: quay.io/coreos/hyperkube:v1.5.5_coreos.0
        command:
        - /hyperkube
        - proxy
        - --cluster-cidr=10.2.0.0/16
        - --kubeconfig=/etc/kubernetes/controller-kubeconfig.yaml
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: "ssl-certs"
        - mountPath: /etc/kubernetes/controller-kubeconfig.yaml
          name: "kubeconfig"
          readOnly: true
        - mountPath: /etc/kubernetes/ssl
          name: "etc-kube-ssl"
          readOnly: true
        - mountPath: /var/run/dbus
          name: dbus
          readOnly: false
      volumes:
      - hostPath:
          path: "/usr/share/ca-certificates"
        name: "ssl-certs"
      - hostPath:
          path: "/etc/kubernetes/controller-kubeconfig.yaml"
        name: "kubeconfig"
      - hostPath:
          path: "/etc/kubernetes/ssl"
        name: "etc-kube-ssl"
      - hostPath:
          path: /var/run/dbus
        name: dbus
    
    That's all the valuable information I could find. Any ideas? :)
    Update 2
    http://pastebin.com/2GApCj0n is the iptables-save output on the controller Container Linux machine.
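    (What to look for in that dump, as a sketch: the NAT rules that kube-proxy should have installed for the API server's service IP, assuming it runs in iptables mode. If nothing matches, kube-proxy is not programming the service network:)
    # iptables-save -t nat | grep '10.3.0.1/32'
    # iptables -t nat -L KUBE-SERVICES -n | grep 10.3.0.1
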
    Update 3
    I ran curl on the controller node:
    # curl https://10.3.0.1 --insecure
    Unauthorized
    
    which means it can be reached properly. Does the Unauthorized response just mean that I didn't pass enough parameters for the request to be authorized?
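    (The Unauthorized response is actually a good sign: TCP and TLS to the service IP work from the host network, and only client authentication is missing. A sketch of an authenticated request; the certificate paths follow the usual CoreOS layout under /etc/kubernetes/ssl, but the exact file names are a guess:)
    # curl --cacert /etc/kubernetes/ssl/ca.pem --cert /etc/kubernetes/ssl/admin.pem --key /etc/kubernetes/ssl/admin-key.pem https://10.3.0.1/version
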
    Update 4
    Thanks to @jaxxstorm, I removed the calico manifests, updated their quay/cni and quay/node versions, and reinstalled them.
    Now kubedns keeps restarting, but I think calico works now, because this time it tried to install kubedns on the worker node instead of the controller node, and when I rkt enter the kubedns pod and try wget https://10.3.0.1, I get:
    # wget https://10.3.0.1
    Connecting to 10.3.0.1 (10.3.0.1:443)
    wget: can't execute 'ssl_helper': No such file or directory
    wget: error getting response: Connection reset by peer
    
    which clearly shows that there is some kind of response. Which is good, right?
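    (busybox's wget needs a separate ssl_helper binary for HTTPS, so that error is about missing TLS support in the image, not about the network; the TCP connection itself was accepted. A plain TCP check sidesteps that, sketched here assuming the image's busybox build includes nc:)
    # nc -w 2 10.3.0.1 443 </dev/null && echo 'tcp connect ok'
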
    Now kubectl get pods --all-namespaces shows:
    kube-system     kube-dns-3675956729-ljz2w                  4/4       Running             88         42m
    
    So... 4/4 ready, but it keeps restarting.
    http://pastebin.com/Z70U331G is the output of kubectl describe pod kube-dns-3675956729-ljz2w --namespace=kube-system.
    So it cannot connect to http://10.2.47.19:8081/readiness, which I guess is the IP of the kubedns pod since it uses port 8081. I'm not sure how to investigate this issue further.
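    (One way to narrow this down, as a sketch: hit the readiness endpoint directly, first from the node the pod runs on and then from the other node. 10.2.47.19 is the pod IP from the describe output; if it answers locally but not across nodes, the pod network is the problem rather than kubedns itself:)
    # curl -sv http://10.2.47.19:8081/readiness
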
    Thanks for everything!

    Best answer

    kube-dns comes with a readiness probe that performs a DNS resolution via the service IP of kube-dns. Might there be a problem with your service network?
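    A direct way to test what that probe depends on (a sketch, run from any pod or node that can reach the service network, assuming nslookup is available there) is to query the kube-dns service IP itself:
    # nslookup kubernetes.default.svc.cluster.local 10.3.0.10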

    See the answer and solution here:
    kubernetes service IPs not reachable

    The original question, "kube-dns keeps restarting with kubernetes on coreos", can be found on Stack Overflow: https://stackoverflow.com/questions/42637493/
