Malfunctioning Eddie's Ramblings

Trying to make sense through rambling?

AWX and execution/hop nodes, how hard can it be? - Part 1

2022-05-03 MalfuncEdddie

Intro

With the migration of AWX to k8s (with the awx-operator) I kind of got stuck because I lost some connectivity to my test environment (AKS in Azure and Ansible “clients” in my homelab). Yes, I could fix the network, but with the release of ansible-receptor I should be able to just deploy an execution node in my homelab and execute everything from there. How hard can it be?

Disclaimer

Disclaimer 1:

THESE RAMBLINGS ARE PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Disclaimer 2:

DO NOT USE IN PRODUCTION! Please buy Red Hat Ansible Automation Platform if you want a supported version. This is more of a “Can I run Doom on a printer” project.

Still here? Let’s go on a technical adventure!

Environment

We have a running AWX instance in k8s, installed with the awx-operator. In my home lab I have a Linux server that will (hopefully) act as an execution node.

AKS                                                        Home lab
┌────────────────────────────────────────────┐
│                                            │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌───────┐ │
│ │awx-task│ │awx-web │ │awx-ee  │ │redis  │ │
│ │        │ │        │ │        │ │       │ │
│ │        │ │        │ │        │ │       │ │
│ │        │ │        │ │        │ │       │ │
│ └────────┘ └────────┘ └────────┘ └───────┘ │
│                                            ├────────────────────┐
│                                            │                    ▼
│                                            │             ┌──────────────┐
│                                            │             │ Linux        │
│                                            │             │              │
│                                            │             │execution     │
└────────────────────────────────────────────┘             │              │
                                                           │ node         │
                                                           └──────────────┘

Here is my AWX manifest file (censored and secrets missing):

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  postgres_configuration_secret: awx-postgres-configuration
  secret_key_secret: awx-secret-key
  service_type: LoadBalancer
  loadbalancer_protocol: http
  loadbalancer_port: 80
  service_labels: |
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  service_annotations: |
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  admin_user: admin
  admin_email: ****

How I think it should work

The backing project of the “automation mesh”, as Red Hat calls it, is a project called ansible-receptor (receptor). It allows you to build a mesh network that can execute certain “work commands” on the receptor nodes in that mesh.

AWX workflow:

  1. Press launch.
  2. AWX launches an internal k8s pod that executes the playbook.

AWX workflow with receptor:

  1. Press launch.
  2. AWX knows the job should not be executed on k8s but on an execution node.
  3. It sends an ansible-runner command to the node over the receptor network.
  4. The node launches a container that executes the playbook.

Translating “How I think it works” into “Let’s make this work”

The first thing we need to do is get a working understanding of how receptor works. If we can set up a receptor network as a test, we know we have a working connection.

How does receptor work

Receptor is a service that can be a listener or a connector (or both). It has multiple security options like TLS, signed work and firewall rules. Receptor has a “work-command” section that describes what you can execute on a node; in our case this will be the ansible-runner command. More info on receptor.

To keep it easy we will set up a receptor network without any of the security stuff.

┌───────────┐                        ┌──────────┐
│vm1        │                        │ vm2      │
│           │                        │          │
│listener   │                        │ listener │
│2222       │                        │ 2223     │
│           ├────────────────────────►          │
│           │                        │          │
│           │                        │          │
└───────────┘                        └───┬──────┘
                                         │
                                         │
                                         │
                                         │
                                    ┌────▼─────┐
                                    │ vm3      │
                                    │          │
                                    │ connect  │
                                    │ vm2 2223 │
                                    │          │
                                    │          │
                                    │          │
                                    └──────────┘

VM1 listens on port 2222 and connects to VM2 on port 2223. VM2 listens on port 2223. VM3 connects to VM2 on port 2223 (it has no direct link to VM1).

receptor.conf of VM1

---
- node:
    id: VM1

- log-level: info

- control-service:
    service: control
    filename: /var/run/receptor/receptor.sock
    permissions: 0660

- tcp-listener:
    port: 2222

- tcp-peer:
    address: VM2:2223
    redial: true

receptor.conf of VM2

---
- node:
    id: VM2

- log-level: info

- control-service:
    service: control
    filename: /var/run/receptor/receptor.sock
    permissions: 0660

- tcp-listener:
    port: 2223

receptor.conf of VM3

---
- node:
    id: VM3

- log-level: info

- control-service:
    service: control
    filename: /var/run/receptor/receptor.sock
    permissions: 0660

- tcp-peer:
    address: VM2:2223
    redial: true
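With the three configs in place, receptor just needs to be started on each VM against its config file. A minimal sketch, assuming receptor is installed on each VM and the config lives at /etc/receptor/receptor.conf (a packaged install may instead ship a systemd unit you can enable):

# run on each VM (or enable the package's systemd unit if it provides one)
receptor --config /etc/receptor/receptor.conf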

Start receptor on the VMs

VM1

INFO 2022/05/09 11:17:40 Running control service control
INFO 2022/05/09 11:17:40 Initialization complete
WARNING 2022/05/09 11:17:40 Backend connection failed (will retry): dial tcp 172.31.222.229:2223: connect: connection refused
INFO 2022/05/09 11:17:45 Connection established with VM2
INFO 2022/05/09 11:17:45 Known Connections:
INFO 2022/05/09 11:17:45    VM1: VM2(1.00)
INFO 2022/05/09 11:17:45    VM2: VM1(1.00) VM3(1.00)
INFO 2022/05/09 11:17:45 Routing Table:
INFO 2022/05/09 11:17:45    VM2 via VM2
INFO 2022/05/09 11:17:45 Known Connections:
INFO 2022/05/09 11:17:45    VM1: VM2(1.00)
INFO 2022/05/09 11:17:45    VM2: VM1(1.00) VM3(1.00)
INFO 2022/05/09 11:17:45    VM3: VM2(1.00)
INFO 2022/05/09 11:17:45 Routing Table:
INFO 2022/05/09 11:17:45    VM3 via VM2
INFO 2022/05/09 11:17:45    VM2 via VM2

VM2

INFO 2022/05/09 11:17:40 Running control service control
INFO 2022/05/09 11:17:40 Initialization complete
INFO 2022/05/09 11:17:45 Connection established with VM1
INFO 2022/05/09 11:17:45 Connection established with VM3
INFO 2022/05/09 11:17:45 Known Connections:
INFO 2022/05/09 11:17:45    VM2: VM1(1.00) VM3(1.00)
INFO 2022/05/09 11:17:45    VM1: VM2(1.00)
INFO 2022/05/09 11:17:45    VM3: VM2(1.00)
INFO 2022/05/09 11:17:45 Routing Table:
INFO 2022/05/09 11:17:45    VM1 via VM1
INFO 2022/05/09 11:17:45    VM3 via VM3

VM3

INFO 2022/05/09 11:17:40 Running control service control
INFO 2022/05/09 11:17:40 Initialization complete
WARNING 2022/05/09 11:17:40 Backend connection failed (will retry): dial tcp 172.31.222.229:2223: connect: connection refused
INFO 2022/05/09 11:17:45 Connection established with VM2
INFO 2022/05/09 11:17:45 Known Connections:
INFO 2022/05/09 11:17:45    VM3: VM2(1.00)
INFO 2022/05/09 11:17:45    VM2: VM1(1.00) VM3(1.00)
INFO 2022/05/09 11:17:45    VM1: VM2(1.00)
INFO 2022/05/09 11:17:45 Routing Table:
INFO 2022/05/09 11:17:45    VM1 via VM2
INFO 2022/05/09 11:17:45    VM2 via VM2

Use receptorctl on VM1 to reach the other nodes
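receptorctl is the command-line client for receptor’s control socket. If it did not come along with your receptor install, it can be installed separately (it is published on PyPI):

pip install receptorctl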

# receptorctl --socket /var/run/receptor/receptor.sock ping VM3
Warning: receptorctl and receptor are different versions, they may not be compatible
Reply from VM3 in 1.233213ms
Reply from VM3 in 1.253816ms
^C
# receptorctl --socket /var/run/receptor/receptor.sock ping VM2
Warning: receptorctl and receptor are different versions, they may not be compatible
Reply from VM2 in 487.912µs
Reply from VM2 in 406.554µs
^C
# receptorctl --socket /var/run/receptor/receptor.sock ping VM1
Warning: receptorctl and receptor are different versions, they may not be compatible
Reply from VM1 in 32.88µs
Reply from VM1 in 45.395µs
# receptorctl --socket /var/run/receptor/receptor.sock status
Warning: receptorctl and receptor are different versions, they may not be compatible
Node ID: VM1
Version: 1.1.1
System CPU Count: 1
System Memory MiB: 3935

Connection   Cost
VM2          1

Known Node   Known Connections
VM1          VM2: 1
VM2          VM1: 1 VM3: 1
VM3          VM2: 1

Route        Via
VM2          VM2
VM3          VM2

Node         Service   Type       Last Seen             Tags
VM1          control   Stream     2022-05-09 11:20:24   {'type': 'Control Service'}
VM2          control   Stream     2022-05-09 11:19:45   {'type': 'Control Service'}
VM3          control   Stream     2022-05-09 11:19:45   {'type': 'Control Service'}

Yay, we have a connection from VM1 to VM3!
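To double-check that traffic from VM1 to VM3 really hops through VM2, receptorctl also has a traceroute subcommand; a quick sketch from VM1:

# receptorctl --socket /var/run/receptor/receptor.sock traceroute VM3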

Changing the AWX install on k8s and taking a peek at the development docker version of AWX

Now that we know how receptor works we need to translate this configuration to awx/k8s.

The first thing that has to be done is to define an extra service that allows our execution node to make a connection to AWX on a given port; the receptor service on the execution node needs this access. It should look something like:

apiVersion: v1
kind: Service
metadata:
  name: receptor
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  selector:
    app.kubernetes.io/component: awx
  ports:
    - port: 6996
      targetPort: 6996
  type: LoadBalancer

This creates a service that you can connect to on port 6996.
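Apply it like the other manifests (the filename here is just an example) and read back the internal load balancer IP that the execution node will dial later on:

kubectl apply -f receptor-service.yaml -n awx
kubectl get svc receptor -n awx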

But what should the config look like? To answer this question we will look at the docker-compose dev version of AWX. docker-compose version.

In the template directory we can see three receptor configs. After fumbling with them a bit, I think it should look like this (some security stuff is disabled as this is a proof of concept):

AWX node

---
- node:
    id: awx-ee
    #firewallrules:
    #  - action: "reject"
    #    tonode: awx_{{ item }}
    #    toservice: "control"

- log-level: info

- tcp-listener:
    port: 6996

#- work-signing:
#    privatekey: /etc/receptor/work_private_key.pem
#    tokenexpiration: 1m

#- work-verification:
#    publickey: /etc/receptor/work_public_key.pem


#- tls-server:
#    name: mutual-tls
#    cert: /etc/receptor/certs/awx.crt
#    key: /etc/receptor/certs/awx.key
#    requireclientcert: true
#    clientcas: /etc/receptor/certs/ca.crt

- control-service:
    service: control
    filename: /var/run/receptor/receptor.sock

- work-command:
    worktype: local
    command: ansible-runner
    params: worker
    allowruntimeparams: true
    verifysignature: true

- work-kubernetes:
    worktype: kubernetes-runtime-auth
    authmethod: runtime
    allowruntimeauth: true
    allowruntimepod: true
    allowruntimeparams: true
    verifysignature: true

- work-kubernetes:
    worktype: kubernetes-incluster-auth
    authmethod: incluster
    allowruntimeauth: true
    allowruntimepod: true
    allowruntimeparams: true
    verifysignature: true

Work node (hostname receptor)

---
- node:
    id: receptor

- log-level: info

- tcp-peer:
    address: <IP from azure LB>:6996
    redial: true

#- work-verification:
#    publickey: /etc/receptor/work_public_key.pem

- work-command:
    worktype: ansible-runner
    command: ansible-runner
    params: worker
    allowruntimeparams: true
    verifysignature: true

- control-service:
    service: control
    filename: /var/run/receptor/receptor.sock

On the AWX pod there are 4 containers. If I look at the mounts of each container, there are 2 containers that have receptor mounts:

  1. awx-task has

     /etc/receptor/receptor.conf from awx-receptor-config (ro)
     /var/run/receptor from receptor-socket (rw)

  2. awx-ee has

     /etc/receptor/receptor.conf from awx-receptor-config (ro)
     /var/run/receptor from receptor-socket (rw)
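These mounts can be checked from the pod description; a sketch using the same component label the service selector above uses:

kubectl describe pod -n awx -l app.kubernetes.io/component=awx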

When opening a console session and reviewing the supervisord config file, I found out that receptor is only running in the awx-ee container. I thought this would be in the task container but … okay :)

The awx-operator allows you to have custom volume mounts; see the template directory.

We add a config map:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: receptor-ee-extra-config
  namespace: awx
data:

  receptor.conf: |
    ---
    - node:
        id: axw-ee

    - log-level: info

    - control-service:
        service: control
        filename: /var/run/receptor/receptor.sock
        permissions: 0660
        #tls: tls_server


    # Listener
    - tcp-listener:
        port: 6996

    - local-only:

    - work-command:
        worktype: local
        command: ansible-runner
        params: worker
        allowruntimeparams: true

    - work-kubernetes:
        worktype: kubernetes-runtime-auth
        authmethod: runtime
        allowruntimeauth: true
        allowruntimepod: true
        allowruntimeparams: true

    - work-kubernetes:
        worktype: kubernetes-incluster-auth
        authmethod: incluster
        allowruntimeauth: true
        allowruntimepod: true
        allowruntimeparams: true

kubectl apply -f …. -n awx

We edit the “main” awx.yml:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  postgres_configuration_secret: awx-postgres-configuration
  secret_key_secret: awx-secret-key
  service_type: LoadBalancer
  loadbalancer_protocol: http
  loadbalancer_port: 80
  service_labels: |
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  service_annotations: |
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  admin_user: admin
  admin_email: ****
  extra_volumes: |
    - name: receptor-cfg-ee
      configMap:
        defaultMode: 420
        items:
          - key: receptor.conf
            path: receptor.conf
        name: receptor-ee-extra-config
  ee_extra_volume_mounts: |
    - name: receptor-cfg-ee
      mountPath: /etc/receptor/receptor.conf
      subPath: receptor.conf

kubectl apply -f …. -n awx

Do not forget to have the receptor service running on the home lab server.

And behold!

# receptorctl  --socket /var/run/receptor/receptor.sock status
Warning: receptorctl and receptor are different versions, they may not be compatible
Node ID: receptor
Version: +g
System CPU Count: 1
System Memory MiB: 3935

Connection   Cost
axw-ee       1

Known Node   Known Connections
axw-ee       receptor: 1
receptor     axw-ee: 1

Route        Via
axw-ee       axw-ee

Node         Service   Type       Last Seen             Tags
receptor     control   Stream     2022-05-09 12:26:03   {'type': 'Control Service'}
axw-ee       control   Stream     2022-05-09 10:25:57   {'type': 'Control Service'}

Node         Work Types
receptor     ansible-runner
axw-ee       local, kubernetes-runtime-auth, kubernetes-incluster-auth

Are we there yet?

Short and long answer: NO!

Now we have a working receptor network, but we still have to let AWX know the peers exist. Revisiting the docker-compose version, we find out that some awx-manage commands do this.

We log in to our awx-task container
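Getting a shell in the awx-task container can be done with kubectl exec; a sketch, assuming the operator created a single deployment called awx (mine did, it is the pod with the 4 containers from earlier):

kubectl exec -it deployment/awx -c awx-task -n awx -- /bin/bash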

awx-manage register_queue --queuename=remote --instance_percent=100
awx-manage provision_instance --hostname="receptor" --node_type="execution"
awx-manage register_peers axw-UID --peers "receptor"

The first command creates a new instance group. The second command provisions the instance. The third command generates a link between the AWX pod and my execution node. (I still have to find out how to give the AWX pod in receptor a fixed name.)

(I do think I also needed to associate the instance with the instance group.)
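If that association is indeed needed, register_queue can also take the hostnames directly; a hedged sketch (I have not verified the exact flags against every AWX version):

awx-manage register_queue --queuename=remote --hostnames=receptor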

And behold!

(Screenshots: the AWX GUI showing the registered execution instance and the mesh topology.)

However….

Failure

At this point I thought I had it all figured out. The AWX GUI showed a good topology and everything was registered. However, when testing this out by changing an inventory to use the remote instance group -> PENDING. I got a job that stays in the pending state forever. Analysing the logs I get the following lines:

2022-05-09 11:26:56,350 DEBUG [d9b720fce0da479f841a103cb94a3aef] awx.main.scheduler ad_hoc_command 14282 (pending) couldn't be scheduled on graph, waiting for next cycle
2022-05-09 11:26:56,350 DEBUG [d9b720fce0da479f841a103cb94a3aef] awx.main.scheduler Skipping task ad_hoc_command 14282 (pending) in pending, not enough capacity left on controlplane to control new tasks
2022-05-09 11:26:56,350 DEBUG [d9b720fce0da479f841a103cb94a3aef] awx.main.scheduler Finishing Scheduler

It looks like the job is trying to run but cannot be scheduled on the control plane?
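One thing worth checking from the awx-task container is what capacity AWX thinks each instance has; awx-manage can list every registered instance together with its capacity and heartbeat:

awx-manage list_instances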

Help !?

This is the part where I ask for help. I have no idea what I am missing. On the mailing list I did get a hint that “You may need to look into the work-signing field in receptor configs.”, so that will be the next thing I look into. If I ever find out, I will write a part 2 article where I solve this and also try to add some security options.

If you want to contribute, you can always comment below.

PS: You can also always Buy me a coffee