To develop my Ansible playbooks and roles and keep them in good shape, I first relied on virtual machines (VMs) running on VirtualBox and managed by Vagrantfiles. Before applying changes to production servers, I ran vagrant up to launch a VM and executed the playbooks against it.

Later on, I wanted to automate the testing of my playbooks and roles and run the tests more regularly using GitLab CI/CD. I therefore switched from VirtualBox virtual machines to lightweight Docker containers, which allowed me to always start from a clean slate and to run more tests in parallel without much overhead.

To test roles using GitLab CI/CD, I used the following .gitlab-ci.yml.

.test_role_template: &test_role
  stage: test
  tags:
  - ansible
  - docker
  script:
  - pushd roles/${CI_JOB_NAME:5}/tests; ansible-playbook -i inventory test.yml; popd;

# ...

test-auditbeat:
  <<: *test_role
  rules:
  - changes:
    - roles/auditbeat/**/*

# ...

test-vault-server:
  <<: *test_role
  rules:
  - changes:
    - roles/consul/**/*
    - roles/consul-agent/**/*
    - roles/vault/**/*
    - roles/vault-server/**/*

The first block defines a YAML anchor test_role, which is reused in all following blocks to test roles like auditbeat or vault-server. The rules dictate that the test.yml playbook inside the tests folder of a role runs whenever a file in that role changes, e.g. after updating the auditbeat_version in defaults/main.yml or modifying task or template files. In the case of the vault-server role, the tests also run whenever a role it depends on changes: whenever either the consul(-agent) roles or the vault(-server) roles change, the test.yml playbook of the vault-server role is executed.
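The script itself relies on ${CI_JOB_NAME:5}, plain Bash substring expansion, to strip the five-character test- prefix from the job name and find the matching role directory. A quick illustration (the job name here is just an example):

CI_JOB_NAME=test-auditbeat
echo "${CI_JOB_NAME:5}"   # prints: auditbeat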

Each tests folder contains an inventory file. For simple roles that only need a single instance, the inventory file is generated from the following template.

localhost ansible_connection=local
ansible-{{ role_name }} ansible_connection=docker

[all:vars]
ansible_python_interpreter='/usr/bin/env python3'

If more instances are required to reliably test a role, a more elaborate inventory file is used. For example, the inventory file of the vault-server role looks like this.

localhost ansible_connection=local

[ansible_vault_servers]
ansible-vault-server-1 ansible_connection=docker
ansible-vault-server-2 ansible_connection=docker
ansible-vault-server-3 ansible_connection=docker

[all:vars]
ansible_python_interpreter='/usr/bin/env python3'

For tasks executed on localhost, i.e. starting and stopping the Docker instance(s), we specify ansible_connection=local, which runs the tasks directly instead of going through the default SSH connection. Likewise, ansible_connection=docker executes tasks directly inside the containers under test.
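As a quick sanity check, the same inventory can be exercised manually with ad-hoc commands once a container is up; ansible-auditbeat is simply the rendered container name for the auditbeat role, and the second command assumes Python 3 is available inside the container:

ansible -i inventory localhost -m ping            # local connection on the control host
ansible -i inventory ansible-auditbeat -m ping    # docker connection into the container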

The test.yml playbook template reads as follows:

---
- hosts: localhost
  tasks:
  - name: start container
    docker_container:
      name: ansible-{{ role_name }}
      image: ubuntu:xenial
      command: /sbin/init
      state: started

- hosts: ansible-{{ role_name }}
  pre_tasks:
  - name: update all packages to the latest version
    apt:
      update_cache: yes
      upgrade: dist
      force_apt_get: yes
  roles:
  - role: {{ role_name }} 
  post_tasks: []

- hosts: localhost
  tasks:
  - name: remove container
    docker_container:
      name: ansible-{{ role_name }}
      state: absent

The first and last blocks are executed on localhost to handle the start and removal of the Docker container. The middle play, which runs on the Docker instance(s), first updates all packages to the latest version and then executes the role.

The post_tasks section is where the actual testing happens.

For the auditbeat role, the test succeeds when the configuration is valid.

  post_tasks:
  - name: check auditbeat installation
    command: auditbeat test config
    register: result
    changed_when: False
  - name: assert auditbeat configuration is valid
    assert:
      that:
      - "'Config OK' in result.stdout"

For the vault-server role, the test should succeed if all Vault servers are reachable.

  post_tasks:
  - name: check all vault servers are reachable
    wait_for:
      port: '{{ item }}'
    loop:
    - 8200

Having this test infrastructure in place allowed me to easily test roles whenever a new software version was released, and it proved very helpful whenever a new Ubuntu LTS came out. For the former, a small modification like bumping auditbeat_version is enough to trigger the test of a role. For the latter, changing image: ubuntu:xenial to image: ubuntu:bionic in bulk triggers the tests of various roles.
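For example, the bulk image change could be done with a one-liner along these lines (a sketch assuming GNU sed and the repository layout described above); the changed test.yml files then trigger the matching CI jobs:

grep -l 'ubuntu:xenial' roles/*/tests/test.yml | xargs sed -i 's/ubuntu:xenial/ubuntu:bionic/'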

However, the story isn’t actually that nice. In reality, the first test.yml file looked like this:

---
- hosts: localhost
  tasks:
  - name: start container
    docker_container:
      name: ansible-{{ role_name }}
      image: ubuntu:xenial
      command: /sbin/init
      state: started
+     capabilities:
+     - SYS_ADMIN
+     volumes:
+     - /sys/fs/cgroup:/sys/fs/cgroup:ro
+     tmpfs:
+     - /run
+     - /run/lock
+     - /tmp

The capabilities, volumes and tmpfs blocks were necessary to start systemd on Ubuntu 16.04 LTS (Xenial Xerus). However, as of Ubuntu 18.04 LTS (Bionic Beaver), systemd was no longer present in the base image. The workaround was to start the container with interactive: yes to prevent the shell process from exiting.

   - name: start container
     docker_container:
       name: ansible-{{ role_name }}
-      image: ubuntu:xenial
-      command: /sbin/init
+      image: ubuntu:bionic
       state: started
-      capabilities:
-      - SYS_ADMIN
-      volumes:
-      - /sys/fs/cgroup:/sys/fs/cgroup:ro
-      tmpfs:  # necessary on Ubuntu 16.04 LTS host to start systemd
-      - /run
-      - /run/lock
-      - /tmp
+      interactive: yes

Moreover, I stumbled upon a bug (or a feature): the service module takes neither the use option nor the ansible_service_mgr override into account. So I had to clutter my task files with blocks like this:

- name: start vault server
  service: name=vault state=started
  when: ansible_connection != 'docker'

- name: start vault server
  command: service vault start
  when: ansible_connection == 'docker'
  args:
    warn: no

I realized that Docker containers were clearly the wrong tool for the job.

Later on I discovered Ansible Molecule and gave it a try, but I found it too bloated and kept my original test setup. However, from this experiment I learnt that Linux system containers (LXC) managed by LXD could be used as an alternative driver to Docker, without the need to fall back on heavier virtual machines. Eureka!

These Linux system containers have the benefit of being as lightweight as Docker containers while providing the full-OS experience of virtual machines. This container abstraction more closely resembles the production environment: the same cloud images are available from the major cloud providers, and an init system like systemd is launched whenever the container starts. Another benefit is that the base images can be configured to refresh automatically, so that most packages are up to date.
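As an illustration of that last point, a locally cached copy of the Ubuntu cloud image can be kept fresh by enabling auto-update when copying it from the public image server (a sketch of the LXD CLI; the alias is my own choice):

lxc image copy ubuntu:20.04 local: --alias focal/amd64 --auto-update   # refreshes the cached image when upstream changes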

The transition to LXD was super easy.

After removing all tasks guarded by ansible_connection == 'docker' clauses, there was again a single execution path, and running multiple instances remained possible with minimal overhead. The test setup stayed almost the same: instead of starting Docker containers, the setup launched Linux containers, and the teardown stopped and deleted them. The tests themselves remained unaltered.

The .gitlab-ci.yml file didn’t change, except for the tag that now demands lxd instead of docker to be present.

.test_role_template: &test_role
  stage: test
  tags:
  - ansible
  - lxd
  script:
  - pushd roles/${CI_JOB_NAME:5}/tests; ansible-playbook -i inventory test.yml; popd;

In the inventory skeleton the ansible_connection=docker line was replaced by ansible_connection=lxd.

localhost ansible_connection=local
ansible-{{ role_name }} ansible_connection=lxd

[all:vars]
ansible_python_interpreter='/usr/bin/env python3'

The test.yml playbook now reads as follows:

---
- hosts: localhost
  tasks:
  - name: start container
    lxd_container:
      name: ansible-{{ role_name }}
      source:
        type: image
        mode: pull
        server: https://cloud-images.ubuntu.com/releases
        protocol: simplestreams
        alias: focal/amd64
      state: started
      wait_for_ipv4_addresses: true
      timeout: 600
      url: "{% raw %}{{ lxd_container_url | default('unix:/var/lib/lxd/unix.socket') }}{% endraw %}"

- hosts: ansible-{{ role_name }}
  pre_tasks:
  - name: update all packages to the latest version
    apt:
      update_cache: yes
      upgrade: dist
      force_apt_get: yes
  roles:
  - role: {{ role_name }} 
  post_tasks: []

- hosts: localhost
  tasks:
  - name: remove container
    lxd_container:
      name: ansible-{{ role_name }}
      state: absent
      url: "{% raw %}{{ lxd_container_url | default('unix:/var/lib/lxd/unix.socket') }}{% endraw %}"

So the tasks that start and remove the container changed to use the lxd_container module, with a somewhat more elaborate form for specifying which image to use. The wait_for_ipv4_addresses option was necessary for roles that rely on an IPv4 stack being present and ready for action. The url line was added to be able to run the tests manually on a macOS machine against an Ubuntu virtual machine controlled with Canonical Multipass.
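For example, when LXD runs inside such a Multipass VM with its HTTPS API exposed, the override can be passed on the command line; the address below is a placeholder for illustration, and 8443 is the default LXD API port:

ansible-playbook -i inventory test.yml -e lxd_container_url=https://192.168.64.2:8443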

All my roles, except for the auditbeat role, are now properly tested with minimal overhead. Auditbeat, however, needs special kernel capabilities to run. So far I have not found an alternative to Run Auditbeat on Docker. Adding additional privileges as suggested in the Unable to start Auditbeat on LXC container thread did not do the trick; I still got the following error:

Exiting: 1 error: failed to create audit client: failed to get audit status: operation not permitted

Ideas to work around this single remaining issue are welcome! For now I settle for the fact that LXD can also manage virtual machines: in the tasks that start and stop the lxd_container, a single line specifying type: virtual-machine is all that is required.

diff --git a/roles/auditbeat/tests/test.yml b/roles/auditbeat/tests/test.yml
index f3db10b5..e403213f 100644
--- a/roles/auditbeat/tests/test.yml
+++ b/roles/auditbeat/tests/test.yml
@@ -10,8 +10,7 @@
         server: https://cloud-images.ubuntu.com/releases
         protocol: simplestreams
         alias: focal/amd64
-      config:
-        security.privileged: 'true'
+      type: virtual-machine
       state: started
       wait_for_ipv4_addresses: true
       timeout: 600
@@ -47,5 +46,6 @@
   - name: remove container
     lxd_container:
       name: ansible-auditbeat
+      type: virtual-machine
       state: absent
       url: "{{ lxd_container_url | default('unix:/var/lib/lxd/unix.socket') }}

Keep on developing and (start) testing your Ansible playbooks and roles!