Kubernetes
riberk, 2019-07-14 09:59:33

Why can the Kubernetes scheduler ignore nodeAffinity?

Hello.
There is a k8s cluster, version 1.12, deployed on AWS using kops.
A number of nodes in the cluster are marked with the 'example.com/wtf' label, which takes the values a, b, c, d.
For example, something like this:

Node name     example.com/wtf
instance1     a
instance2     b
instance3     c
instance4     d
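
For reference, this is roughly how I check which node carries which value (the label key is the real one; the node names above are just an illustration):

# list nodes together with their example.com/wtf label value
kubectl get nodes -L example.com/wtf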

And there is a test deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-scheduler
spec:
  replicas: 6
  selector:
    matchLabels:
      app: test-scheduler
  template:
    metadata:
      labels:
        app: test-scheduler
    spec:
      tolerations:
        - key: spot
          operator: Exists
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              - key: example.com/wtf
                operator: In
                values:
                - a
            weight: 40
          - preference:
              matchExpressions:
              - key: example.com/wtf
                operator: In
                values:
                - b
            weight: 35
          - preference:
              matchExpressions:
              - key: example.com/wtf
                operator: In
                values:
                - c
            weight: 30
          - preference:
              matchExpressions:
              - key: example.com/wtf
                operator: In
                values:
                - d
            weight: 25
      containers:
      - name: a
        resources:
          requests:
            cpu: "100m"
            memory: "50Mi"
          limits:
            cpu: "100m"
            memory: "50Mi"
        image: busybox
        command:
          - 'sleep'
          - '99999'
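
To see where the replicas land, I simply look at the pods with wide output, roughly like this (run in the namespace the deployment lives in):

# show which node each replica was scheduled to
kubectl get pods -l app=test-scheduler -o wide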

Judging by the documentation, the weights of all matching nodeAffinity preferences should be summed for each node the Pod can be scheduled to, and the node with the highest total wins (see my arithmetic after the table below). But in my case the nodes appear to be picked almost at random.
For example, 6 pods from the deployment are scheduled onto these 5 nodes:
NODE    LABEL
wtf1    NONE
node1   a
node2   b
node3   c
wtf2    NONE
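
For clarity, this is the per-node scoring I expected from the documentation (my own arithmetic based on the weights in the manifest above, not actual scheduler output):

example.com/wtf = a  ->  40
example.com/wtf = b  ->  35
example.com/wtf = c  ->  30
example.com/wtf = d  ->  25
label missing        ->   0

So I would expect nodes without the label to always lose to the labelled ones.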

At the same time, the nodes wtf1 and wtf2 do not carry my label at all (and there is another node in the cluster whose label value is 'd').
All nodes have free capacity, are Ready and can run pods.
So, two questions:
1. Why does this happen?
2. Does the scheduler log anywhere how it chose the node for a pod? There is no such information in the events, and the scheduler logs on the masters are empty as well (the only thing I can think of is raising its verbosity, see the sketch right after this list).
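
My only idea so far is to bump the scheduler's log verbosity and hope it prints per-node scores; in a kops cluster I would try something along these lines in the cluster spec and then roll the masters (logLevel: 10 is my assumption, I have not verified that it actually exposes the scoring):

spec:
  kubeScheduler:
    logLevel: 10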
The whole point of this exercise is that I want to split the labelled nodes by size and fill the small, cheap virtual machines with my applications first, and only then the large, expensive ones (without hurting fault tolerance, i.e. podAntiAffinity carries more weight than the size-based nodeAffinity).
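
Roughly, the weighting I am aiming for looks like this (the values are illustrative and only the first size preference is shown):

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                  # spreading replicas across nodes matters most
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: test-scheduler
        topologyKey: kubernetes.io/hostname
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 40                   # then prefer the smallest/cheapest node size
      preference:
        matchExpressions:
        - key: example.com/wtf
          operator: In
          values:
          - a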
