AWX and execution/hop nodes, how hard can it be? - Part 1
Intro
With the migration of AWX to k8s (with the awx-operator) I kind of got stuck because I lost some connectivity to my test environment (AKS in Azure and Ansible “clients” in my homelab). Yes, I could fix the network, but with the release of ansible-receptor I should be able to just deploy an execution node in my homelab and execute everything from there. How hard can it be?
Disclaimer
Disclaimer 1:
THESE RAMBLINGS ARE PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Disclaimer 2:
DO NOT USE IN PRODUCTION! Please buy Red Hat Ansible Automation Platform if you want a supported version. This is more of a “Can I run Doom on a printer” project.
Still here? Let's go on a technical adventure!
Environment
We have a running AWX instance in k8s, installed with the awx-operator. In my home lab I have a Linux server that will (hopefully) act as an execution node.
AKS Home lab
┌────────────────────────────────────────────┐
│ │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌───────┐ │
│ │awx-task│ │awx-web │ │awx-ee │ │redis │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │
│ └────────┘ └────────┘ └────────┘ └───────┘ │
│ ├────────────────────┐
│ │ ▼
│ │ ┌──────────────┐
│ │ │ Linux │
│ │ │ │
│ │ │execution │
└────────────────────────────────────────────┘ │ │
│ node │
└──────────────┘
Here is my AWX manifest file (censored and secrets missing):
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  postgres_configuration_secret: awx-postgres-configuration
  secret_key_secret: awx-secret-key
  service_type: LoadBalancer
  loadbalancer_protocol: http
  loadbalancer_port: 80
  service_labels: |
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  service_annotations: |
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  admin_user: admin
  admin_email: ****
How I think it should work
The backing project of the “automation mesh”, as Red Hat calls it, is a project called ansible-receptor (receptor). It allows you to build a mesh network that can execute certain “work commands” on the receptor nodes in that mesh.
AWX workflow:
- Press launch.
- AWX launches an internal k8s pod that executes the playbook.
AWX workflow with receptor:
- Press launch.
- AWX knows the job should be executed not on k8s but on an execution node.
- It sends an ansible-runner command to the node over the receptor network.
- The node launches a container that executes the playbook.
Translating “How I think it works” to “Let's make this work”
The first thing we need to do is get a working understanding of how receptor works. If we can set up a receptor network as a test, we know we have a working connection.
How does receptor work
Receptor is a service that can be a listener or a connector (or both). It has multiple security options like TLS, signed work and firewall rules. Receptor has a “work-command” section that describes what you can execute on a node; in our case this will be the ansible-runner command. More info on receptor.
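To make that concrete, here is a tiny hypothetical work-command stanza (the echo-demo work type is made up for illustration and is not part of the AWX setup); any peer with access to this node's control service could then ask it to run echo:
- work-command:
    # hypothetical example: lets peers run "echo" on this node
    worktype: echo-demo
    command: echo
    params: "hello from the mesh"
    allowruntimeparams: false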
To keep it easy we will set up a receptor network without any of the security stuff.
┌───────────┐ ┌──────────┐
│vm1 │ │ vm2 │
│ │ │ │
│listener │ │ listener │
│2222 │ │ 2223 │
│ ├────────────────────────► │
│ │ │ │
│ │ │ │
└───────────┘ └───┬──────┘
│
│
│
│
┌────▼─────┐
│ vm3 │
│ │
│ connect │
│ vm2 2223 │
│ │
│ │
│ │
└──────────┘
VM1 listens on 2222 and connects to VM2 on port 2223. VM2 listens on 2223. VM3 connects to VM2 on port 2223 (and has no direct link to VM1).
receptor.conf of VM1
---
- node:
    id: VM1
- log-level: info
- control-service:
    service: control
    filename: /var/run/receptor/receptor.sock
    permissions: 0660
- tcp-listener:
    port: 2222
- tcp-peer:
    address: VM2:2223
    redial: true
receptor.conf of VM2
---
- node:
    id: VM2
- log-level: info
- control-service:
    service: control
    filename: /var/run/receptor/receptor.sock
    permissions: 0660
- tcp-listener:
    port: 2223
receptor.conf of VM3
---
- node:
    id: VM3
- log-level: info
- control-service:
    service: control
    filename: /var/run/receptor/receptor.sock
    permissions: 0660
- tcp-peer:
    address: VM2:2223
    redial: true
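For reference, a minimal sketch of how I start each node (assuming the config above is saved as /etc/receptor/receptor.conf and the receptor binary is on the PATH; a systemd unit would be the nicer way to do this):
# run receptor in the foreground with this node's config (same command on VM1, VM2 and VM3)
receptor --config /etc/receptor/receptor.conf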
Start receptor on the VMs
VM1
INFO 2022/05/09 11:17:40 Running control service control
INFO 2022/05/09 11:17:40 Initialization complete
WARNING 2022/05/09 11:17:40 Backend connection failed (will retry): dial tcp 172.31.222.229:2223: connect: connection refused
INFO 2022/05/09 11:17:45 Connection established with VM2
INFO 2022/05/09 11:17:45 Known Connections:
INFO 2022/05/09 11:17:45 VM1: VM2(1.00)
INFO 2022/05/09 11:17:45 VM2: VM1(1.00) VM3(1.00)
INFO 2022/05/09 11:17:45 Routing Table:
INFO 2022/05/09 11:17:45 VM2 via VM2
INFO 2022/05/09 11:17:45 Known Connections:
INFO 2022/05/09 11:17:45 VM1: VM2(1.00)
INFO 2022/05/09 11:17:45 VM2: VM1(1.00) VM3(1.00)
INFO 2022/05/09 11:17:45 VM3: VM2(1.00)
INFO 2022/05/09 11:17:45 Routing Table:
INFO 2022/05/09 11:17:45 VM3 via VM2
INFO 2022/05/09 11:17:45 VM2 via VM2
VM2
INFO 2022/05/09 11:17:40 Running control service control
INFO 2022/05/09 11:17:40 Initialization complete
INFO 2022/05/09 11:17:45 Connection established with VM1
INFO 2022/05/09 11:17:45 Connection established with VM3
INFO 2022/05/09 11:17:45 Known Connections:
INFO 2022/05/09 11:17:45 VM2: VM1(1.00) VM3(1.00)
INFO 2022/05/09 11:17:45 VM1: VM2(1.00)
INFO 2022/05/09 11:17:45 VM3: VM2(1.00)
INFO 2022/05/09 11:17:45 Routing Table:
INFO 2022/05/09 11:17:45 VM1 via VM1
INFO 2022/05/09 11:17:45 VM3 via VM3
VM3
INFO 2022/05/09 11:17:40 Running control service control
INFO 2022/05/09 11:17:40 Initialization complete
WARNING 2022/05/09 11:17:40 Backend connection failed (will retry): dial tcp 172.31.222.229:2223: connect: connection refused
INFO 2022/05/09 11:17:45 Connection established with VM2
INFO 2022/05/09 11:17:45 Known Connections:
INFO 2022/05/09 11:17:45 VM3: VM2(1.00)
INFO 2022/05/09 11:17:45 VM2: VM1(1.00) VM3(1.00)
INFO 2022/05/09 11:17:45 VM1: VM2(1.00)
INFO 2022/05/09 11:17:45 Routing Table:
INFO 2022/05/09 11:17:45 VM1 via VM2
INFO 2022/05/09 11:17:45 VM2 via VM2
Use receptorctl on VM1 to ping the others
#receptorctl --socket /var/run/receptor/receptor.sock ping VM3
Warning: receptorctl and receptor are different versions, they may not be compatible
Reply from VM3 in 1.233213ms
Reply from VM3 in 1.253816ms
^C
# receptorctl --socket /var/run/receptor/receptor.sock ping VM2
Warning: receptorctl and receptor are different versions, they may not be compatible
Reply from VM2 in 487.912µs
Reply from VM2 in 406.554µs
^C
# receptorctl --socket /var/run/receptor/receptor.sock ping VM1
Warning: receptorctl and receptor are different versions, they may not be compatible
Reply from VM1 in 32.88µs
Reply from VM1 in 45.395µs
# receptorctl --socket /var/run/receptor/receptor.sock status
Warning: receptorctl and receptor are different versions, they may not be compatible
Node ID: VM1
Version: 1.1.1
System CPU Count: 1
System Memory MiB: 3935
Connection   Cost
VM2          1

Known Node   Known Connections
VM1          VM2: 1
VM2          VM1: 1 VM3: 1
VM3          VM2: 1

Route   Via
VM2     VM2
VM3     VM2

Node   Service   Type     Last Seen             Tags
VM1    control   Stream   2022-05-09 11:20:24   {'type': 'Control Service'}
VM2    control   Stream   2022-05-09 11:19:45   {'type': 'Control Service'}
VM3    control   Stream   2022-05-09 11:19:45   {'type': 'Control Service'}
Yeey, we got a connection from VM1 to VM3!
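As an extra check (not part of the run above), receptorctl also has a traceroute subcommand that shows the path hop by hop; from VM1 it should list VM2 and then VM3:
# trace the route from VM1 to VM3 through the mesh
receptorctl --socket /var/run/receptor/receptor.sock traceroute VM3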
Changing the AWX install on k8s and taking a peek at the development docker version of AWX
Now that we know how receptor works we need to translate this configuration to awx/k8s.
The first thing that has to be done is to define an extra service that allows our execution node to connect to AWX on a given port; the receptor service on the execution node needs this access. It should look something like:
apiVersion: v1
kind: Service
metadata:
  name: receptor
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  selector:
    app.kubernetes.io/component: awx
  ports:
    - port: 6996
      targetPort: 6996
  type: LoadBalancer
This creates a service that you can connect to on port 6996.
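The internal load balancer IP that Azure assigns to this service is what the work node will later point its tcp-peer at; it can be looked up with something like:
# show the IP assigned to the receptor service
kubectl get service receptor -n awx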
But what should the receptor config look like? To answer this question we will look at the docker-compose dev version of AWX: docker-compose version.
In the template directory we can see three receptor configs. After fumbling with them for a bit, I think ours should look like this (some security stuff is disabled since this is a proof of concept):
AWX node
---
- node:
    id: awx-ee
    #firewallrules:
    #  - action: "reject"
    #    tonode: awx_{{ item }}
    #    toservice: "control"
- log-level: info
- tcp-listener:
    port: 6996
#- work-signing:
#    privatekey: /etc/receptor/work_private_key.pem
#    tokenexpiration: 1m
#- work-verification:
#    publickey: /etc/receptor/work_public_key.pem
#- tls-server:
#    name: mutual-tls
#    cert: /etc/receptor/certs/awx.crt
#    key: /etc/receptor/certs/awx.key
#    requireclientcert: true
#    clientcas: /etc/receptor/certs/ca.crt
- control-service:
    service: control
    filename: /var/run/receptor/receptor.sock
- work-command:
    worktype: local
    command: ansible-runner
    params: worker
    allowruntimeparams: true
    verifysignature: true
- work-kubernetes:
    worktype: kubernetes-runtime-auth
    authmethod: runtime
    allowruntimeauth: true
    allowruntimepod: true
    allowruntimeparams: true
    verifysignature: true
- work-kubernetes:
    worktype: kubernetes-incluster-auth
    authmethod: incluster
    allowruntimeauth: true
    allowruntimepod: true
    allowruntimeparams: true
    verifysignature: true
Work node (hostname receptor)
---
- node:
    id: receptor
- log-level: info
- tcp-peer:
    address: <IP from azure LB>:6996
    redial: true
#- work-verification:
#    publickey: /etc/receptor/work_public_key.pem
- work-command:
    worktype: ansible-runner
    command: ansible-runner
    params: worker
    allowruntimeparams: true
    verifysignature: true
- control-service:
    service: control
    filename: /var/run/receptor/receptor.sock
On the AWX pod there are 4 containers. If I look at the mounts of each container (one way to do that is sketched below this list), there are 2 containers that have receptor mounts:
- awx-task has
  /etc/receptor/receptor.conf from awx-receptor-config (ro)
  /var/run/receptor from receptor-socket (rw)
- awx-ee has
  /etc/receptor/receptor.conf from awx-receptor-config (ro)
  /var/run/receptor from receptor-socket (rw)
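One way to inspect those mounts (a sketch, assuming the pod carries the same app.kubernetes.io/component: awx label the receptor service selects on):
# describe the AWX pod and look at the Mounts section of each container
kubectl describe pod -l app.kubernetes.io/component=awx -n awx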
When opening a console session and reviewing the supervisord config file I found out that receptor is only running in the awx-ee container. I thought this would be in the task container but … okay :)
The awx-operator allows you to add custom volume mounts: template directory
We add a config map:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: receptor-ee-extra-config
  namespace: awx
data:
  receptor.conf: |
    ---
    - node:
        id: axw-ee
    - log-level: info
    - control-service:
        service: control
        filename: /var/run/receptor/receptor.sock
        permissions: 0660
        #tls: tls_server
    # Listener
    - tcp-listener:
        port: 6996
    - local-only:
    - work-command:
        worktype: local
        command: ansible-runner
        params: worker
        allowruntimeparams: true
    - work-kubernetes:
        worktype: kubernetes-runtime-auth
        authmethod: runtime
        allowruntimeauth: true
        allowruntimepod: true
        allowruntimeparams: true
    - work-kubernetes:
        worktype: kubernetes-incluster-auth
        authmethod: incluster
        allowruntimeauth: true
        allowruntimepod: true
        allowruntimeparams: true
kubectl apply -f …. -n awx
We edit the “main awx.yml”
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  postgres_configuration_secret: awx-postgres-configuration
  secret_key_secret: awx-secret-key
  service_type: LoadBalancer
  loadbalancer_protocol: http
  loadbalancer_port: 80
  service_labels: |
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  service_annotations: |
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  admin_user: admin
  admin_email: ****
  extra_volumes: |
    - name: receptor-cfg-ee
      configMap:
        defaultMode: 420
        items:
          - key: receptor.conf
            path: receptor.conf
        name: receptor-ee-extra-config
  ee_extra_volume_mounts: |
    - name: receptor-cfg-ee
      mountPath: /etc/receptor/receptor.conf
      subPath: receptor.conf
kubectl apply -f …. -n awx
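After the operator has reconciled the change, it does not hurt to check that the custom config really ended up in the awx-ee container; a sketch, assuming the deployment is simply named awx:
# confirm the mounted receptor.conf inside the awx-ee container
kubectl exec -n awx deployment/awx -c awx-ee -- cat /etc/receptor/receptor.conf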
Do not forget to have the receptor service running on the home lab server.
And behold!
# receptorctl --socket /var/run/receptor/receptor.sock status
Warning: receptorctl and receptor are different versions, they may not be compatible
Node ID: receptor
Version: +g
System CPU Count: 1
System Memory MiB: 3935
Connection   Cost
axw-ee       1

Known Node   Known Connections
axw-ee       receptor: 1
receptor     axw-ee: 1

Route    Via
axw-ee   axw-ee

Node       Service   Type     Last Seen             Tags
receptor   control   Stream   2022-05-09 12:26:03   {'type': 'Control Service'}
axw-ee     control   Stream   2022-05-09 10:25:57   {'type': 'Control Service'}

Node       Work Types
receptor   ansible-runner
axw-ee     local, kubernetes-runtime-auth, kubernetes-incluster-auth
Are we there yet?
Short and long answer: NO!
Now we have a working receptor network, but we still have to let AWX know the peers exist. Revisiting the docker-compose version we find out that some awx-manage commands do this.
We log in to our awx-task container
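A sketch of how to get a shell there (assuming, as before, a deployment named awx in the awx namespace):
# open a shell in the awx-task container
kubectl exec -it -n awx deployment/awx -c awx-task -- /bin/bash
Once inside, these are the commands: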
awx-manage register_queue --queuename=remote --instance_percent=100
awx-manage provision_instance --hostname="receptor" --node_type="execution"
awx-manage register_peers axw-UID --peers "receptor"
The first command creates a new instance group. The second command provisions the instance. The third command generates a link between the AWX pod and my execution node. (I still have to find out how to give the AWX pod in receptor a fixed name.)
(I do think I also needed to associate the instance with the instance group.)
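If that association is indeed needed, my guess is it can be done with the --hostnames option of the same register_queue command (an untested assumption on my part):
# associate the "receptor" execution node with the "remote" instance group (untested)
awx-manage register_queue --queuename=remote --instance_percent=100 --hostnames=receptor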
And behold!
However….
Failure
At this point I thought I had it all figured out. The AWX GUI showed a good topology and everything was registered. However, when testing this out by changing an inventory to use the remote instance group -> PENDING. I got a job that stays in the pending state forever. Analysing the logs I get the following lines.
2022-05-09 11:26:56,350 DEBUG [d9b720fce0da479f841a103cb94a3aef] awx.main.scheduler ad_hoc_command 14282 (pending) couldn't be scheduled on graph, waiting for next cycle
2022-05-09 11:26:56,350 DEBUG [d9b720fce0da479f841a103cb94a3aef] awx.main.scheduler Skipping task ad_hoc_command 14282 (pending) in pending, not enough capacity left on controlplane to control new tasks
2022-05-09 11:26:56,350 DEBUG [d9b720fce0da479f841a103cb94a3aef] awx.main.scheduler Finishing Scheduler
It looks like the job is trying to run but cannot be scheduled on the control plane?
Help !?
This is the part where I ask for help. I have no idea what I am missing. On the mailing list I did get a hint that “You may need to look into the work-signing field in receptor configs.” So that will be the next thing I look into. If I ever figure it out I will write a part 2 article where I solve this and also try to add some security options.
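For reference, the commented-out stanzas in the dev templates (shown earlier) hint at what that would look like: signing on the AWX side and verification on the execution node. This is only a sketch copied from those templates, not something I have working yet:
# AWX/control side: sign outgoing work
- work-signing:
    privatekey: /etc/receptor/work_private_key.pem
    tokenexpiration: 1m
# execution node side: verify signed work before running it
- work-verification:
    publickey: /etc/receptor/work_public_key.pem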
If someone wants to contribute you can always comment below.
PS: You can also always Buy me a coffee