Automated OS Image Testing

tl;dr:

The Automated OS Image Testing pipeline uses Temporal workflows to build images according to the instructions in packer-maas, test them with our system-tests, and report the results of those tests in the results repo, which are duplicated on the results page. You can add more tests by extending test_full_circle and adding the corresponding reporting details to parse_test_results. Additional images can be added by creating the correct packer-maas instructions and adding their details to image_mapping.yaml.

A note on runtime: tests are slow. A rough figure is 20-30 minutes to test each configuration of each feature, for every machine, for every image in a test. For example, three architectures with one machine each, testing two features with three configurations each, is 3 × 2 × 3 = 18 runs, i.e. between 6 and 9 hours per image tested, excluding setup overhead and assuming all tests pass.

Execution

The end-to-end pipeline is fully implemented as a single Temporal workflow that should be called when running image tests. It handles requesting the correct jobs from Jenkins with the correct parameters, and passing the correct information to the different child workflows to get the desired set of results into our results repo.

How to start an image test

temporal

If you already have a Temporal server you can request workflows on, great: skip this step.
If not, you can launch a local instance using temporal-server.
Quickly summarized:

By default we use the image-testing namespace, so that we can coexist with other users on a shared Temporal server.
The server can be started with:
temporal server start-dev --namespace image-testing

worker(s)

Start the monolithic worker to execute all tasks in the image tests:
python3 monolithic_worker localhost:7233
You can additionally pass a namespace argument if not using the default image-testing:

python3 monolithic_worker localhost:7233 --namespace <namespace>

Note:

If, instead, you would like fine-grained control over how many resources go to each workflow, you can start separate workers for each: e2e_worker, image_building_worker, image_testing_worker, image_reporting_worker. This additionally requires passing "use_seperate_queues": true when requesting image tests.

workflow

Requesting image(s) to be tested requires only a single call to the Python script test_images.py. It contains help text describing each of the parameters (see below). A typical invocation looks like:

test_images.py centos7 --snap-channel 3.3/stable --url $jenkins_url --user $jenkins_user --pass $jenkins_pass

Help text is available in the standard way:

$ python3 test_images.py -h
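Under the hood, test_images.py presumably boils down to a call through the Temporal Python client. Here is a minimal sketch of an equivalent programmatic call, assuming the defaults listed below (localhost:7233, namespace image-testing, the mono_queue task queue); the real script's argument handling may differ:

import asyncio

from temporalio.client import Client


async def main() -> None:
    # Connect to the Temporal server used for image testing.
    client = await Client.connect("localhost:7233", namespace="image-testing")
    handle = await client.start_workflow(
        "e2e_workflow",                      # workflow type, as in the CLI example below
        {
            "image_name": ["centos7"],
            "maas_snap_channel": "3.3/stable",
            "jenkins_url": "<jenkins url>",
            "jenkins_user": "<jenkins user>",
            "jenkins_pass": "<jenkins pass>",
        },
        id="centos_tests",
        task_queue="mono_queue",             # e2e_tests if use_seperate_queues is true
    )
    print(f"Started workflow {handle.id}")


if __name__ == "__main__":
    asyncio.run(main())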

Older method

Calling a workflow directly with the temporal CLI still works, i.e. calling e2e_workflow, such as:

temporal workflow start -t e2e_tests --type e2e_workflow -w 'centos_tests' -i '{"image_name": ["centos7", "centos8"], "maas_snap_channel": "3.3/stable", "jenkins_url": $jenkins_url, "jenkins_user": $jenkins_user, "jenkins_pass": $jenkins_pass}'

where -t defines the task queue, --type defines the workflow being executed, -w defines the workflow ID, and -i defines the input parameters to the workflow. -t must be mono_queue, as that is the task queue the worker listens on by default.
(If use_seperate_queues was set to true when requesting the image test, -t should instead be set to e2e_tests.)

The workflow will then run to completion, calling its child workflows to build, test, and report on the image. Navigating to the UI of your Temporal server (if it has one) will show you the status as the image tests progress.
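If you don't have the web UI handy, the same status can be fetched with the Temporal Python client. A minimal sketch, assuming the defaults described below (localhost:7233, namespace image-testing) and the workflow ID used in the examples:

import asyncio

from temporalio.client import Client


async def check_status() -> None:
    client = await Client.connect("localhost:7233", namespace="image-testing")
    handle = client.get_workflow_handle("centos_tests")  # workflow id from the examples
    description = await handle.describe()
    print(description.status)  # e.g. WorkflowExecutionStatus.RUNNING


asyncio.run(check_status())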

There are a number of parameters, both required and optional, that can be given (a combined example follows the list):

Required
  • image_name - The name, or list of names, of images to test.

  • Jenkins details

    • jenkins_url - The url of the Jenkins server where image tests are located.

    • jenkins_user - The username to use to login to the Jenkins server.

    • jenkins_pass - The password to use to login to the Jenkins server.

Optional
  • Filepaths

    • image_mapping - The filepath of the image mapping YAML distributed as part of MAAS-Integration-CI, defaults as image_mapping.yaml in the current working directory.

    • repo_location - The filepath of the location where the image results repo is to be cloned.

  • Test instances

    • maas_snap_channel - The snap channel to use when installing MAAS in image tests, defaults as latest/edge.

    • system_test_repo - The url of the system-tests repo to use for building and testing images, defaults as https://git.launchpad.net/~maas-committers/maas-ci/+git/system-tests.

    • system_test_branch - The branch in the system-test repo to use for building and testing images, defaults as master.

    • packer_maas_repo - The url of the PackerMAAS repo to use for building images, defaults as https://github.com/canonical/packer-maas.git.

    • packer_maas_branch - The branch in the PackerMAAS repo to use for building images, defaults as main.

    • parallel_tests - A flag to request a single image test build for all images, rather than a test build per image, defaults as False.

    • overwite_results - A flag to request that new results overwrite old results rather than being combined with them, defaults as False.

    • use_seperate_queues - A flag to determine whether we are using the mono queue or separate queues for each workflow, defaults as False.

  • Retries

    • max_retry_attempts - How many times workflow activities should retry before throwing an exception, defaults as 10

    • heartbeat_delay - How many seconds between heartbeats for long running workflow activities, defaults as 15

  • Timeouts

    • Timeouts given are in seconds, and are passed to temporal as start_to_close, which defines the maximum execution time of a single invocation.

    • default_timeout - How long a workflow activity can run before being timed out, defaults as 300. This is used in place of any timeouts below that are not set.

    • jenkins_login_timeout - How long we wait to log into the Jenkins server.

    • return_status_timeout - How long we wait for an activity to fetch the status of a Jenkins build.

    • get_results_timeout - How long we wait for the results of a Jenkins build to be available.

    • fetch_results_timeout - How long we wait for an activity to fetch the results of a Jenkins build, and perform some operation on them.

    • log_details_timeout - How long we wait for an activity to fetch logs from a Jenkins build, and perform some operation on them.

    • request_build_timeout - How long we wait for an activity to request a Jenkins build.

    • build_complete_timeout - How long we wait for a Jenkins build to complete, defaults as 7200.

Specific to the python call script:

  • ip - The ip address of the temporal server, defaults as localhost.

  • port - The port used to communicate with the temporal server, defaults as 7233.

  • namespace - The namespace used on the temporal server, defaults as image-testing.
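Putting the optional parameters together, a hedged example of a fuller workflow input follows. This is the dictionary passed as -i in the older method (or assembled internally by test_images.py); the key names are as listed above, and the values are purely illustrative:

workflow_input = {
    "image_name": ["centos7", "centos8"],
    "maas_snap_channel": "3.3/stable",
    "jenkins_url": "<jenkins url>",
    "jenkins_user": "<jenkins user>",
    "jenkins_pass": "<jenkins pass>",
    # Optional knobs, shown with non-default values for illustration:
    "parallel_tests": True,           # one test build covering all images
    "max_retry_attempts": 5,
    "default_timeout": 600,           # seconds, passed as start_to_close
    "build_complete_timeout": 10800,  # allow three hours for the Jenkins build
}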

During the test

After calling the test, you should see the following things happen:

  1. Image building workflow starts.
    This will call the Jenkins job to build the images supplied.
  2. Image building Jenkins job starts
    This will build the images supplied according to the packer-maas instructions.
  3. Image building Jenkins job completes
    This will store all of the generated images either on the Jenkins node or in a Swift bucket.
  4. Image testing workflow starts
    This will call the Jenkins job to test all of the images that were built.
  5. Image testing Jenkins job starts
    This will test the images supplied according to the system-tests repo.
  6. Image testing Jenkins job completes
    This will output logs and build artifacts containing test results.
  7. Image reporting workflow starts
    This will take the supplied logs and artifacts to determine which images were usable, and in what ways they can be used in MAAS.
  8. Results appear in the repo
    See below:

Results

You’ll see a new commit appear in the results repo’s results YAML that summarises the test results and feature statuses.
That can be a little hard to parse, however, as it contains a combined view of every image result ever recorded, so per-image results are additionally committed to their own named branch. This should help you find the exact test result for the exact image if you’re looking for something specific.
An example results output may look like:

centos7:
  image_results:
    architectures:
    - amd64
    - arm64
    - ppc64el
    maas_version:
      '3.4': 'deployment: failed ppc64el; passed amd64
        network: failed amd64-bond, amd64-bridge
        storage: failed amd64-flat, amd64-lvm, amd64-bcache'
    packer_versions:
    - 1.9.4
    prerequisites: []
    summary: 'deployment_state: part
      network_configuration: fail
      storage_configuration: fail'
  test_results:
    deployment_state:
      info: Partial success, passed for amd64
      summary:
        FAIL:
          ppc64el: 'systemtests.utils.UnexpectedMachineStatus: Machine ppc64le didn''t
            get to Deployed after 2002.2 seconds.

            Debug information:
            status: Failed deployment
            - Node changed status: From ''Ready'' to ''Allocated'' (to admin)
            - User starting deployment: (admin)
            - Node changed status: From ''Allocated'' to ''Deploying''
            - Deploying
            - Power cycling
            - Node powered on
            - TFTP Request: /ppc64el/pxelinux.cfg/01-98-be-94-02-6a-cb
            - PXE Request: installation
            - Performing PXE boot
            - TFTP Request: /ppc64el/pxelinux.cfg/01-98-be-94-02-6a-cb
            - PXE Request: installation
            - Performing PXE boot
            - HTTP Request: /images/ubuntu/ppc64el/ga-20.04/focal/stable/boot-initrd
            - HTTP Request: /images/ubuntu/ppc64el/ga-20.04/focal/stable/boot-kernel
            - Loading ephemeral
            - HTTP Request: /images/ubuntu/ppc64el/ga-20.04/focal/stable/squashfs
            - Node installation: ''cloudinit'' searching for network data from DataSourceMAAS
            - Node installation: ''cloudinit'' attempting to read from cache [trust]
            - Marking node failed: Node operation ''Deploying'' timed out after 30
            minutes.
            - Node changed status: From ''Deploying'' to ''Failed deployment'''
        PASS:
        - amd64
        UNKNOWN:
          arm64: None
    network_configuration:
      info: Failed
      summary:
        FAIL:
          amd64-bond: Machine fun-skink didn't get to Deployed after 1841.5 seconds.
          amd64-bridge: Machine fun-skink didn't get to Deployed after 1838.0 seconds.
    storage_configuration:
      info: Failed
      summary:
        FAIL:
          amd64-bcache: Machine fun-skink didn't get to Deployed after 1843.8 seconds.
          amd64-flat: Machine fun-skink didn't get to Deployed after 1877.6 seconds.
          amd64-lvm: Unknown Error

The numbers Jack, What do they mean?

Let’s break down the example YAML key by key:

centos7:

The image that was tested. This will match the image name in image_mapping.yaml

overall results

  image_results:
    architectures:
    - amd64
    - arm64
    - ppc64el

A list of architectures that have been tested

    maas_version:
      '3.4': 'deployment: failed ppc64el; passed amd64
        network: failed amd64-bond, amd64-bridge
        storage: failed amd64-flat, amd64-lvm, amd64-bcache'

A mapping of tested MAAS versions, along with the summary of their most recent test

    packer_versions:
    - 1.9.4
    prerequisites: []

The packer version and any prerequisites that were required to build this image

    summary: 'deployment_state: part
      network_configuration: fail
      storage_configuration: fail'

A summary of the most recent test.

feature results

  test_results:

Each of the keys in this section refers to an individual feature; they have the form:

$feature_name:
  info: A short written summary of the feature tests
  summary:
    PASS:
      - a list of all
      - the arches that successfully
      - passed this feature
    FAIL:
      $arch: $debug-message-of the failure
    UNKNOWN:
      $arch: $debug message of the non-failure state
    deployment_state:
      info: Partial success, passed for amd64
      summary:
        FAIL:
          ppc64el: 'systemtests.utils.UnexpectedMachineStatus: Machine ppc64le didn''t
            get to Deployed after 2002.2 seconds.

            Debug information:
            status: Failed deployment
            - Node changed status: From ''Ready'' to ''Allocated'' (to admin)
            - User starting deployment: (admin)
            - Node changed status: From ''Allocated'' to ''Deploying''
            - Deploying
            - Power cycling
            - Node powered on
            - TFTP Request: /ppc64el/pxelinux.cfg/01-98-be-94-02-6a-cb
            - PXE Request: installation
            - Performing PXE boot
            - TFTP Request: /ppc64el/pxelinux.cfg/01-98-be-94-02-6a-cb
            - PXE Request: installation
            - Performing PXE boot
            - HTTP Request: /images/ubuntu/ppc64el/ga-20.04/focal/stable/boot-initrd
            - HTTP Request: /images/ubuntu/ppc64el/ga-20.04/focal/stable/boot-kernel
            - Loading ephemeral
            - HTTP Request: /images/ubuntu/ppc64el/ga-20.04/focal/stable/squashfs
            - Node installation: ''cloudinit'' searching for network data from DataSourceMAAS
            - Node installation: ''cloudinit'' attempting to read from cache [trust]
            - Marking node failed: Node operation ''Deploying'' timed out after 30
            minutes.
            - Node changed status: From ''Deploying'' to ''Failed deployment'''
        PASS:
        - amd64
        UNKNOWN:
          arm64: None

Could this image deploy on each architecture? amd64 could, ppc64el failed with a debug trace, and arm64 experienced an unknown error.

    network_configuration:
      info: Failed
      summary:
        FAIL:
          amd64-bond: Machine fun-skink didn't get to Deployed after 1841.5 seconds.
          amd64-bridge: Machine fun-skink didn't get to Deployed after 1838.0 seconds.

Could this image configure networking? Both bond and bridge for amd64 failed because the machine failed to deploy. As ppc64el and arm64 could not deploy, they were not tested.

    storage_configuration:
      info: Failed
      summary:
        FAIL:
          amd64-bcache: Machine fun-skink didn't get to Deployed after 1843.8 seconds.
          amd64-flat: Machine fun-skink didn't get to Deployed after 1877.6 seconds.
          amd64-lvm: Unknown Error

Could this image configure storage? Flat and bcache for amd64 failed, and lvm had an unknown error. ppc64el and arm64 were not tested because they could not deploy.

Explanations of jobs and workflows

Image Building

This is the domain of the image_building_workflow, whose process is as follows (a simplified sketch appears after the list):

  1. check Jenkins is reachable
  2. request the image builder job from Jenkins
  3. wait until the job starts
  4. wait until the job is complete
  5. fetch the job status (complete, failed)
  6. if the job wasn’t aborted:
    a. fetch the build results
    b. fetch the image mapping from the build
    c. determine the build results for each image built
  7. return the build status and the build results
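As a rough illustration of how those steps map onto Temporal primitives, here is a heavily simplified sketch; the activity names and argument shapes below are assumptions rather than the real implementation:

from datetime import timedelta

from temporalio import workflow


@workflow.defn
class ImageBuildingWorkflowSketch:
    """Illustrative only: the activity names are hypothetical stand-ins."""

    @workflow.run
    async def run(self, params: dict) -> dict:
        timeout = timedelta(seconds=params.get("default_timeout", 300))

        # 1. check Jenkins is reachable
        await workflow.execute_activity(
            "jenkins_login", params, start_to_close_timeout=timeout
        )
        # 2.-4. request the image builder job and wait for it to finish
        build_num = await workflow.execute_activity(
            "request_build", params, start_to_close_timeout=timeout
        )
        # 5. fetch the job status (complete, failed, aborted)
        status = await workflow.execute_activity(
            "wait_for_build",
            build_num,
            start_to_close_timeout=timedelta(
                seconds=params.get("build_complete_timeout", 7200)
            ),
        )
        # 6. if the job wasn't aborted, fetch the per-image build results
        results: dict = {}
        if status != "ABORTED":
            results = await workflow.execute_activity(
                "fetch_results", build_num, start_to_close_timeout=timeout
            )
        # 7. return the build status and the build results
        return {"status": status, "results": results}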

So what does the image builder job do?
Well, that’s defined at https://git.launchpad.net/~maas-committers/maas-ci/+git/maas-ci-config/tree/jenkins/jobs/image_builder.groovy, and is as follows:

  1. clone the system-tests
  2. write the base labmaas config and image mapping to a local file directory
  3. generate the testing config (see gen-config for details)
  4. run the image building environment
  5. tear down safely and store results.

Manually running

While this occurs as a part of the e2e_workflow, it can sometimes be helpful to manually build images without calling the entire testing stack.
There are two ways to do this: manually execute the Temporal workflow, or manually trigger the Jenkins job.

Jenkins

The Jenkins job for image building lives on our integration CI. Provided you are logged in to your user account, you should be able to trigger builds.
There are a few parameters, many of which will often be left at their defaults (a scripted-trigger sketch follows the list):

  • SYSTEMTESTS_GIT_REPO - The url of the systemtest repo to pull the image building test from.
  • SYSTEMTESTS_GIT_BRANCH - As above, except the specific branch in the repo.
  • IMAGE_NAMES - A comma-separated list of images that should be built during this job.
  • PACKER_MAAS_GIT_REPO - The url of the packer maas repo to pull instructions from.
  • PACKER_MAAS_GIT_BRANCH - As above, except the branch in that repo.
  • IMAGE_FILESTORE - The location where built images will be stored.
  • CONTAINERS_IMAGE - The base image to use for the packer-maas container that builds images.
  • PYTEST_ARGS - Anything extra that should be passed to pytest
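If you would rather trigger the job from a script than through the web UI, Jenkins' standard buildWithParameters endpoint works too. A hedged sketch (the job name, host, and credentials here are placeholders, and the real job may expect additional parameters):

import requests

JENKINS_URL = "https://<your-jenkins-host>"    # placeholder
JOB_NAME = "image_builder"                     # assumed job name

response = requests.post(
    f"{JENKINS_URL}/job/{JOB_NAME}/buildWithParameters",
    auth=("<jenkins user>", "<jenkins api token>"),
    params={
        "IMAGE_NAMES": "centos7,centos8",
        "PACKER_MAAS_GIT_BRANCH": "main",
    },
    timeout=30,
)
response.raise_for_status()
# Jenkins returns the queue item URL of the newly queued build.
print("Queued:", response.headers.get("Location"))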

Temporal

Running this as a standalone Temporal workflow is less recommended, as the workflow executes a lot of extra steps required by the full pipeline that aren’t necessarily useful if you’re just looking to build an image.

If, however, you would still like to, you can call the image_building_workflow:

temporal workflow start -t image_building --type image_building_workflow -w 'centos_builds' -i '{"image_name": ["centos7", "centos8"], "maas_snap_channel": "3.3/stable", "jenkins_url": $jenkins_url, "jenkins_user": $jenkins_user, "jenkins_pass": $jenkins_pass}'

Ensure that either the monolithic worker or the image_building_worker is active.
This workflow has a few parameters, explained above:

  • image_name
  • image_mapping
  • system_test_repo
  • system_test_branch
  • packer_maas_repo
  • packer_maas_branch

The remaining common parameters can also be provided:
jenkins_url, jenkins_user, jenkins_pass, job_name, max_retry_attempts, heartbeat_delay, default_timeout, jenkins_login_timeout, return_status_timeout, fetch_results_timeout, log_details_timeout, request_build_timeout, build_complete_timeout, get_results_timeout

Image Testing

This is where []

Image Reporting

WIP

Miscellaneous

When executing the e2e workflow, there are some additional activities that occur between workflows to transform and fetch data. The overall flow is as follows:

  • Execute the Image building workflow as a child workflow
  • Determine, from the workflow output:
    • The image builder job number
    • The images that failed to build
    • The images that succeeded in building
  • Fetch the packer version in use by scanning the Image builder job logs
  • For each of the images that were built successfully:
    • Execute the Image testing workflow as a child workflow
  • Generate a dictionary of image details for every image; images that failed to build are populated with default failure details. These details are (see the example after this list):
    • built - Did this image succeed in the building stage
    • tested - Did this image succeed in the testing stage
    • build_num - Which Image builder job built this image
    • test_num - Which Image tester job tested this image
    • packer_version - Which version of packer was used to build this image
    • prerequisites - Which prerequisites (if any) were required to build this image
  • Execute the Image reporting workflow as a child workflow
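For illustration, the per-image details dictionary handed to the reporting workflow might look roughly like this; the exact structure is an assumption, but the keys are the ones listed above:

image_details = {
    "centos7": {
        "built": True,
        "tested": True,
        "build_num": 123,            # Image builder job that built this image
        "test_num": 456,             # Image tester job that tested it
        "packer_version": "1.9.4",
        "prerequisites": [],
    },
    "centos8": {
        # An image that failed to build gets default failure details.
        "built": False,
        "tested": False,
        "build_num": 123,
        "test_num": None,
        "packer_version": "1.9.4",
        "prerequisites": [],
    },
}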

Adding new tests

Testing a feature

The tests for each image live in systemtests.tests_per_machine:test_full_circle, under the image_to_test conditional block that encapsulates the optional test_images step; image tests have long execution times and only produce meaningful information in the image testing pipeline.

There are two context managers specific to image tests that will help in your journey here:

  • report_feature_tests, creates a logger for this feature and reports a failure/pass status depending on whether exceptions are thrown by execution within its with block.
  • release_and_redeploy_machine, releases a machine and executes code in the with block, ensuring the machine is redeployed afterwards, even if an exception is thrown during the with block execution.

We’ll use the storage configuration as an example of how to set up a new feature test and have it report in the Temporal workflow.

From above, the steps to apply a new storage configuration to a machine in MAAS are:

  • release the machine if it is already deployed
  • set the storage layout, passing in the required configuration parameters
  • (re)deploy the machine.
  • check the machine deployed correctly.

And of course, we want to repeat this for each storage layout we wish to test, i.e. flat, lvm, and bcache:

testable_layouts = ["flat", "lvm", "bcache"]
for storage_layout in testable_layouts:
    with report_feature_tests(
        testlog, f"storage layout {storage_layout}"
    ), release_and_redeploy_machine(
        maas_api_client,
        machine,
        osystem=deploy_osystem,
        oseries=deploy_oseries,
        timeout=TIMEOUT,
    ):
        maas_api_client.create_storage_layout(
            machine, layout_type=storage_layout, options={}
        )

deploy_osystem, deploy_oseries, and TIMEOUT are variables set when initially deploying the machine used in the test. For the machine under test, create_storage_layout executes:

machine set-storage-layout machine["system_id"] storage_layout={layout_type}

This would give log outputs similar to the following, depending on execution state:

  • [<test name>].storage layout flat: Starting test
  • [<test name>].storage layout flat: <error message>
  • [<test name>].storage layout flat: FAILED
  • [<test name>].storage layout flat: PASSED

Reporting the feature

Okay, great, we have a feature we can test. We still need to let Temporal know this is a feature we should report.
This requires extending the image_reporting_workflow, specifically the parse_test_results activity, as that is where we parse the test results into a format understood by the following activities.

We’ll make use of a few functions and classes here:

  • get_step_from_results, this returns only the desired log segment corresponding to a test step, as test_full_circle has multiple test steps (enlist, metadata, commission, deploy, test_image, rescue), many of which are not necessary for the feature being parsed.
  • determine_feature_state, this searches the supplied log for the feature name, followed by the regex :?\s(\w+):?\s(?:\-\s)?([A-Z]{4,}).
    This matches <feature_name> <feature_type>: <feature_state> (i.e. storage layout flat: PASSED), returning the feature_type and feature_state, as well as any errors if they are present. See the standalone illustration after this list.
  • FeatureStatus, a dataclass that neatly wraps the reported state of a tested feature.

All of the above additionally scan, and compile a combined result set for, all the machine architectures used in the test.
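To make the regex concrete, here is a small standalone illustration of what determine_feature_state extracts from a log line (assumed usage; the real helper also collates results and errors per architecture):

import re

STATE_RE = r":?\s(\w+):?\s(?:\-\s)?([A-Z]{4,})"

# One of the log lines produced by report_feature_tests, as shown earlier.
line = "[test_full_circle].storage layout flat: PASSED"
match = re.search("storage layout" + STATE_RE, line)
if match:
    feature_type, feature_state = match.groups()
    print(feature_type, feature_state)  # -> flat PASSED

With that in hand, the storage configuration parsing in parse_test_results looks like: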
if image_tests := get_step_from_results(this_image_result, "test_image"):
    if storage_state := determine_feature_state("storage layout", image_tests):
        info, summary = storage_state
        storage_conf = FeatureStatus(
            "Storage Configuration",
            info=info,
            summary=summary,
        )
        image_results.storage_conf = storage_conf

The image_tests conditional block is shared between all features that need to parse the test_image test step.
Additionally, image_results is a variable holding the ImageTestResults dataclass, which wraps the entire testing results of a specific image, including all of its tested features. This dataclass contains the operations necessary to convert itself to a dictionary for committing to the results repo; a rough sketch of its shape follows.
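A minimal sketch of the shape implied here (field and method names other than storage_conf, info, and summary are assumptions; the real classes carry considerably more machinery):

from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class FeatureStatus:
    name: str                                     # e.g. "Storage Configuration"
    info: str = ""                                # short human-readable summary
    summary: dict = field(default_factory=dict)   # PASS/FAIL/UNKNOWN per arch


@dataclass
class ImageTestResults:
    image_name: str
    storage_conf: Optional[FeatureStatus] = None  # one field per reported feature

    def to_dict(self) -> dict:
        # Dump to plain dicts/values so the results can be committed as YAML.
        return {key: value for key, value in asdict(self).items() if value is not None}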

Executing new code

If workers contain a build_id, then congratulations, your workflow now utilizes worker versioning. If they don’t, ignore this section until they do.

There are two locations that need to be updated:

  • worker_versions.yaml - This YAML file tells the workers which build versions are current and compatible with each other
  • workers - Each worker should contain a build_id with the most up-to-date version required for its workflow.

The versioning in this YAML uses a trimmed-down equivalent of semver, that is:
1.x is compatible with any other 1.x. If a worker has version 1.2 but the workflow is 1.3, the worker will continue to work unless a 1.3 worker comes along.
2.x is not compatible with 1.x. If a worker has version 1.2 but the workflow is 2.3, the worker will terminate itself after completing its next workflow, even if no other workers exist.
tl;dr: deprecated workers terminate immediately, compatible workers terminate only if other compatible workers exist.

Say we added extra features to the image_reporting workflow that we really want all workers to pick up; in other words, it’s effectively a breaking change. We only need to update the image_reporting workflow, however, so we can update just the image_reporting section of the versions dictionary:

---
queues:
  - e2e_tests
  - image_building
  - image_testing
  - image_reporting
  - mono_queue

versions:
+ image_reporting:
+   - 2.0
  default:
    - 1.0

We also need to modify the worker to accept this new version:

from common_tasks import start_worker
from image_reporting_workflow import activities as image_reporting_activities
from image_reporting_workflow import workflows as image_reporting_workflows

if __name__ == "__main__":
    start_worker(
        task_queue="image_reporting",
        workflows=image_reporting_workflows,
        activities=image_reporting_activities,
-       build_id=1.0,
+       build_id=2.0,
    )

TL;DR

Extend tests_per_machine:test_full_circle to include something like the following in the image_to_test conditional block:

for $feature_type in $list_of_supported_feature_types:
    with report_feature_tests(
        testlog, f"$feature_name {$feature_type}"
    ):
        $Test_the_feature_here

And extend image_reporting_workflow:parse_test_results to include something like the following:

if image_tests := get_step_from_results(this_image_result, "test_image"):
    if $feature_state := determine_feature_state("$feature_name", image_tests):
        info, summary = $feature_state
        $feature_results = FeatureStatus(
            "$feature_name",
            info=info,
            summary=summary,
        )
        image_results.$feature_name = $feature_results

(Where everything preceded by a $ should be named appropriately, of course)

Modify worker_versions.yaml and the affected worker to reflect the changes in the workflow and force running workers to switch to the new code.

Adding new images

  • Update packer-maas with the build instructions
  • Update image_mapping.yaml with the image details
  • Update the CI with the new image_mapping.yaml
  • Run the tests.

Temporal VM

Automated OS Image Testing can be executed from a dedicated all-in-one Temporal VM. This VM hosts a Temporal server running as a systemd service, plus the monolithic worker. The Temporal VM is accessible through the bastion, and port forwarding must be set up to reach the Temporal API and web interface.

Connection to Temporal VM

IP Address: 10.131.211.145

The following snippet should be added to .ssh/config:

Host temporal
    HostName maas-bastion-ps5.internal
    RemoteCommand sudo -u prod-maas-ui-design-lab ssh -L {random_port_a}:127.0.0.1:8233 -L {random_port_b}:127.0.0.1:7233 ubuntu@10.131.211.145
    RequestTTY yes
    LocalForward 8080 127.0.0.1:{random_port_a}
    LocalForward 8081 127.0.0.1:{random_port_b}

NOTE: Choose two random ports to be used on the bastion; if two SSH sessions try to bind to the same ports, the port forwarding will fail.

After adding the above snippet, the Temporal VM can be accessed by keeping a live SSH session open to it (ssh temporal).

Web Interface

With the open SSH session, the Temporal Server can be accessed at the following address from the browser: http://127.0.0.1:8080

Temporal CLI

With the open SSH session, the Temporal Server can be accessed from the CLI in the following way:

temporal workflow start --address :8081 -t e2e_tests --type e2e_workflow -w 'centos_tests' -i '{"image_name": ["centos7", "centos8"], "maas_snap_channel": "3.3/stable", "jenkins_url": "http://maas-integration-ci.internal:8080/", "jenkins_user": "skatsaounis", "jenkins_pass": "redacted"}'
temporal workflow list --address :8081
temporal workflow terminate --address :8081 -w centos_tests

Promoting images to stable

This replaces the older process found here: Promoting images - #2

The expected flow is:

  • determine list of candidate images
  • test images from candidate stream at CI
  • determine images that passed testing
    • only test images on hardware they are supported on.
  • add stable to maas-images and push new state