Pipelining

This is part of a series of posts on ideas for an ansible-like provisioning system, implemented in Transilience.

Running actions on a server is nice, but a network round trip for each action is not very efficient. If I need to run a linear sequence of actions, I can stream them all to the server, and then read replies streamed from the server as they get executed.

This technique is called pipelining, and one can see it used, for example, in Redis or in Mitogen.
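
The idea, in a toy sketch: instead of paying a round trip per action, the client streams everything down the pipe first, then reads the results as they come back. Here two queues stand in for the network connection, and none of the names are Transilience's actual protocol:

import queue
import threading

requests: queue.Queue = queue.Queue()
replies: queue.Queue = queue.Queue()


def server():
    # The server executes actions in order, as they arrive
    while (action := requests.get()) is not None:
        replies.put(f"done: {action}")


threading.Thread(target=server, daemon=True).start()

actions = ["install fail2ban", "write jail.local", "restart fail2ban"]
for a in actions:
    requests.put(a)  # stream all the actions without waiting...
requests.put(None)
for _ in actions:
    print(replies.get())  # ...then collect the results as they arrive

This way a whole batch of actions costs one round trip instead of one round trip per action.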

Roles

Ansible has the concept of "Roles" as a series of related tasks: I'll play with that. Here's an example role to install and set up fail2ban:

class Role(role.Role):
    def main(self):
        self.add(builtin.apt(
            name=["fail2ban"],
            state="present",
        ))

        self.add(builtin.copy(
            content=inline("""
                [postfix]
                enabled = true
                [dovecot]
                enabled = true
            """),
            dest="/etc/fail2ban/jail.local",
            owner="root",
            group="root",
            mode=0o644,
        ), name="configure fail2ban")

I prototyped roles as classes, with methods that push actions down the pipeline. If an action fails, all further actions for the same role won't be executed, and will be marked as skipped.
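
One way to implement that per-role skipping, as an illustrative sketch rather than the actual Transilience code: keep a per-role failed flag, and turn every later action of a failed role into a skip:

def execute(pipeline):
    # pipeline is a list of (role name, action name, action callable)
    failed = set()  # roles that have had a failure
    results = []
    for role_name, name, action in pipeline:
        if role_name in failed:
            results.append((name, "skipped"))
            continue
        try:
            action()
            results.append((name, "ok"))
        except Exception:
            failed.add(role_name)
            results.append((name, "failed"))
    return results


pipeline = [
    ("fail2ban", "install", lambda: None),
    ("fail2ban", "configure", lambda: 1 / 0),  # this one fails...
    ("fail2ban", "restart", lambda: None),     # ...so this is skipped
    ("prosody", "install", lambda: None),      # but other roles still run
]
print(execute(pipeline))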

Since skipping is applied per-role, I can blissfully stream actions for multiple roles to the server down the same pipe, and errors in one role will stop execution of that role but not of the others. Potentially I can get multiple roles going with a single network round trip:

#!/usr/bin/python3

import sys
from transilience.system import Mitogen
from transilience.runner import Runner


@Runner.cli
def main():
    system = Mitogen("my server", "ssh", hostname="server.example.org", username="root")

    runner = Runner(system)

    # Send roles to the server
    runner.add_role("general")
    runner.add_role("fail2ban")
    runner.add_role("prosody")

    # Run until all roles are done
    runner.main()

if __name__ == "__main__":
    sys.exit(main())

That looks like a playbook, using Python as glue rather than YAML.

Decision making in roles

Besides filing a series of actions, a role may need to take decisions based on the results of previous actions, or on facts discovered from the server. In that case, we need to wait until the results we need come back from the server, and then decide if we're done or if we want to send more actions down the pipe.

Here's an example role that installs and configures Prosody:

from transilience import actions, role
from transilience.actions import builtin
from .handlers import RestartProsody


class Role(role.Role):
    """
    Set up prosody XMPP server
    """
    def main(self):
        self.add(actions.facts.Platform(), then=self.have_facts)

        self.add(builtin.apt(
            name=["certbot", "python-certbot-apache"],
            state="present",
        ), name="install support packages")

        self.add(builtin.apt(
            name=["prosody", "prosody-modules", "lua-sec", "lua-event", "lua-dbi-sqlite3"],
            state="present",
        ), name="install prosody packages")

    def have_facts(self, facts):
        facts = facts.facts  # Malkovich Malkovich Malkovich!

        domain = facts["domain"]
        ctx = {
            "ansible_domain": domain
        }

        self.add(builtin.command(
            argv=["certbot", "certonly", "-d", f"chat.{domain}", "-n", "--apache"],
            creates=f"/etc/letsencrypt/live/chat.{domain}/fullchain.pem"
        ), name="obtain chat certificate")

        with self.notify(RestartProsody):
            self.add(builtin.copy(
                content=self.template_engine.render_file("roles/prosody/templates/prosody.cfg.lua", ctx),
                dest="/etc/prosody/prosody.cfg.lua",
            ), name="write prosody configuration")

            self.add(builtin.copy(
                src="roles/prosody/templates/firewall-ruleset.pfw",
                dest="/etc/prosody/firewall-ruleset.pfw",
            ), name="write prosody firewall")

    # ...

This files some general actions down the pipe, with a hook on the facts action that says: when its results come back, run self.have_facts().

At that point, the role can use the results to build certbot command lines, render prosody's configuration from Jinja2 templates, and file further actions down the pipe.

Note that this way, while the server is potentially still busy installing prosody, we're already streaming prosody's configuration to it.

If anything goes wrong with the installation of prosody's packages, the role will be marked as failed, and all further actions of the same role, even those filed by have_facts(), will be skipped.
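
Roughly, the runner can implement then= hooks by remembering which callback goes with which action, and dispatching each result as it comes back from the server. A sketch with made-up names, not Transilience's actual internals:

from typing import Callable, Optional


class MiniRunner:
    def __init__(self):
        self.pipe: list = []        # stands in for the network connection
        self.callbacks: dict = {}   # action id -> callback
        self.last_id = 0

    def add(self, action, then: Optional[Callable] = None) -> int:
        self.last_id += 1
        if then is not None:
            self.callbacks[self.last_id] = then
        self.pipe.append((self.last_id, action))
        return self.last_id

    def on_result(self, action_id: int, result) -> None:
        callback = self.callbacks.pop(action_id, None)
        if callback is not None:
            # e.g. Role.have_facts(facts), which may call add() again
            callback(result)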

Notify and handlers

In the previous example self.notify() also appears: that's my attempt to model the equivalent of Ansible's handlers. If any of the actions inside the with block produces changes, then the RestartProsody role will be executed, potentially filing more actions at the end of the playbook.

The runner will take care of collecting all the triggered role classes in a set, which discards duplicates, and then running the main() method of all resulting roles, which will cause more actions to be filed down the pipe.
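
Something along these lines could implement the mechanism; the names and plumbing here are mine, not the real code. Actions added inside the with block get tagged with the handler classes to notify, and since triggered classes accumulate in a set, each handler runs at most once no matter how many actions notified it:

import contextlib


class SketchRole:
    def __init__(self, runner):
        self.runner = runner
        self._notify: tuple = ()

    @contextlib.contextmanager
    def notify(self, *handlers):
        saved, self._notify = self._notify, handlers
        try:
            yield
        finally:
            self._notify = saved

    def add(self, action, **kw):
        action.notify = self._notify  # tag the action with its handlers
        self.runner.send(action)


class SketchRunner:
    def __init__(self):
        self.triggered = set()  # the set discards duplicate handlers

    def send(self, action): ...

    def on_result(self, action, result):
        if result.changed:
            self.triggered.update(action.notify)

    def run_handlers(self):
        for handler_cls in self.triggered:
            handler_cls(self).main()  # files more actions down the pipe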

Action conditions

Some actions are only meaningful as consequences of other actions. Let's take, for example, enabling buster-backports as an extra apt source:

        a = self.add(builtin.copy(
            owner="root",
            group="root",
            mode=0o644,
            dest="/etc/apt/sources.list.d/debian-buster-backports.list",
            content="deb [arch=amd64] https://mirrors.gandi.net/debian/ buster-backports main contrib",
        ), name="enable backports")

        self.add(builtin.apt(
            update_cache=True
        ), name="update after enabling backports",
           # Run only if the previous copy changed anything
           when={a: ResultState.CHANGED},
        )

Here we want to update Apt's cache, which is a slow operation, only after we actually write /etc/apt/sources.list.d/debian-buster-backports.list. If the file was already there from a previous run, we can skip downloading the new package lists.

The when= attribute adds an annotation to the action that is sent down the pipeline, saying that it should only be run if the state of a previous action matches the given one.

In this case, when "update after enabling backports" comes up on the remote end, it gets skipped unless the state of the previous "enable backports" action is CHANGED.
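
On the remote end, evaluating such an annotation can be as simple as keeping the final state of every action run so far and comparing. A sketch with made-up plumbing (in the real API the when= keys are action objects; here they are plain ids):

import enum


class ResultState(enum.Enum):
    CHANGED = "changed"
    NOOP = "noop"
    FAILED = "failed"
    SKIPPED = "skipped"


def should_run(when: dict, states: dict) -> bool:
    # Run only if every referenced action ended in the required state
    return all(states.get(action_id) == wanted
               for action_id, wanted in when.items())


states = {"enable backports": ResultState.NOOP}  # file was already there
print(should_run({"enable backports": ResultState.CHANGED}, states))  # False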

Effects of pipelining

I ported enough of Ansible's modules to be able to run the provisioning scripts of my VPS entirely via Transilience.

This is the playbook run as plain Ansible:

$ time ansible-playbook vps.yaml
[...]
servername       : ok=55   changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

real    2m10.072s
user    0m33.149s
sys 0m10.379s

This is the same playbook run with Ansible speeded up via the Mitogen backend, which makes Ansible more bearable:

$ export ANSIBLE_STRATEGY=mitogen_linear
$ time ansible-playbook vps.yaml
[...]
servername       : ok=55   changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

real    0m24.428s
user    0m8.479s
sys 0m1.894s

This is the same playbook ported to Transilience:

$ time ./provision
[...]
real    0m2.585s
user    0m0.659s
sys 0m0.034s

Doing nothing went from over 2 minutes down to under 3 seconds!

That's the kind of running time that finally makes me comfortable with maintaining my VPS by editing the playbook only, and never logging in to mess with the system configuration by hand!

Next steps

I'm quite happy with what I have: I can now maintain my VPS with a simple script with quick iterative cycles.

I might use it to develop new playbooks, and port them to Ansible only when they're tested and need to be shared with infrastructure that has to rely on something more solid and battle-tested than a prototype provisioning system.

I might also keep working on it as I have more interesting ideas that I'd like to try. I feel like Ansible has reached some architectural limits that are hard to overcome without a major redesign, and that are in many ways hardcoded in its playbook configuration. It's nice to be able to try out new designs without that baggage.

I'd love it if even just the library of Transilience actions could grow and gain widespread use. Ansible modules standardized a set of management operations that, I think, became the way people think about system management, and that should really be broadly available outside of Ansible.

If you are interested in playing with Transilience, do get in touch or send a pull request! :)

Next step: Reimagining Ansible variables.