Continuous Integration at Tuenti

Published on 13/8/2012 by Jose Mucientes, Software Test Engineer


Tuenti is a fast-paced, growing company and website. Our site is constantly growing, with more and more users registering, new features being added, and new issues being addressed to ensure safe and steady growth of our system. It is therefore part of our company methodology to develop and release code as quickly as possible with certain quality guarantees.
At the time of writing we have code releases on a tri-weekly basis, and we are aiming to be able to safely release code on a daily basis, minimizing the QA team's effort. In the next few lines I'll describe what our test framework engineers have been working on to achieve the current system and what steps are being taken to get to the next level.

The Release Process

Typically the workflow for anyone to check code into Tuenti is the following:

So the cycle begins with a feature request and its implementation. That is a complex process on its own, but we will not get into it now. When a development team starts the actual coding of a feature, it forks a branch from the last stable version of the code and includes its project in our Continuous Integration system: Jenkins. Jenkins is an open source tool that, among many other things, can trigger code builds when code is checked into the repositories. During the development process, Jenkins gives developers feedback so that they know whether they broke the build or any of the tests.
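In essence, the check-in hook boils down to "build, run the tests, tell the developer what happened". The sketch below illustrates that flow; the function names, statuses, and callbacks are illustrative, not Tuenti's actual Jenkins configuration.

```python
# Hypothetical sketch of what the CI trigger does on each check-in.
# `run_build`, `run_tests`, and `notify` stand in for Jenkins' real machinery.

def on_checkin(changeset, run_build, run_tests, notify):
    """Triggered by the SCM hook: build, test, and report back."""
    if not run_build(changeset):
        notify(changeset["author"], "build broken")
        return "FAILED"
    failed = run_tests(changeset)
    if failed:
        notify(changeset["author"], f"{len(failed)} test(s) failed")
        return "UNSTABLE"
    notify(changeset["author"], "build blue")  # Jenkins shows success as a blue icon
    return "SUCCESS"
```

The important property is that the developer hears back on every check-in, whichever of the three outcomes occurs.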

Typically the developer will write their own unit tests for the classes they are modifying or coding, plus some integration tests, and finally the QA engineers assigned to the team will write some UI browser tests to ensure the feature works and feels right in the UI. Developers have test sandboxes in their environments, which are very useful for quickly checking that their code works; nonetheless, they can only run a reduced set of tests there, most likely the ones related to the feature being developed. This is where Jenkins comes into play.

The Jenkins CI environment automatically runs the whole test suite on the developer's project, so they can be sure their feature neither breaks any existing features nor causes any disruption on the site.
Once Jenkins gives the developer the thumbs up (represented by a blue icon), the developer can automatically merge their code into the integration branch. Each time a branch is merged, a build is triggered in the integration branch, so we ensure that what was already integrated by other developers works with the code that has just been merged.

At some point the release operations team selects an error-free changeset, which will be the one going live. The set of error-free changesets is moved to a different branch, called release, so that developers can continue integrating code in the integration branch without affecting the code release. The set of changes in the release branch is then manually tested by the QA team, and bugs are corrected as soon as they are reported. In the meantime, automatic regressions keep running to ensure that the fixes don't introduce new bugs into the code.

Once the branch is stable and all the bugs are under control, the code is rolled out to live during off-peak traffic hours to avoid disrupting our users. Our QA team ensures that everything works as expected and reports any bugs found, which are then fixed. When the devops (Development Operations) team considers the site stable, the release is closed and the release branch is considered safe and stable. At that point the code is merged into the stable branch, so the stable branch always holds the latest version of the code in live.

If bugs are spotted after the release is closed, depending on how important and compromising they are, they will either be fixed immediately and committed to the Hotfix repository, where the fix is tested and later pushed to live outside the planned workflow, or, if they are not urgent, the fix will be pushed to live in the next scheduled code release.

Continuous Integration Framework

Now that we have given you an insight into our release workflow, let's take a look at how our test framework is implemented and what it is capable of.

The Tuenti test framework is under continuous development. It is a pretty complex tool which at the same time provides a lot of value to the development teams, and we constantly face new challenges. We keep developing features to help developers get feedback as soon as possible and in the clearest way, and we keep adding tweaks to the system to maximize speed and throughput.

Currently we have a test suite with over 10,000 tests. Some of these tests are slower than others: browser UI tests take an average of 16 seconds each, integration tests 2 seconds, and unit tests just a few milliseconds. The number of tests is constantly growing, at a pace of around 400 tests per month, making full regressions slower and slower. As if that wasn't enough, the rapid growth of the development team has skyrocketed the amount of code to be tested.
It currently takes our CI system around 110 minutes to run all those tests, and about 25 minutes when we use the pipeline strategy. For some that might be good enough, but there are certain scenarios in which we need very quick feedback to react ASAP, and developers are always happy to get feedback as soon as possible.
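A quick back-of-the-envelope check shows why parallelism is unavoidable at this scale. The per-test averages and the ~110-minute wall time come from the figures above; the exact mix of test types is an assumption for illustration.

```python
# Rough sanity check of the numbers above. The mix of test types is assumed;
# only the per-test averages and the ~110-minute wall time are from the text.

AVG_SECONDS = {"browser": 16, "integration": 2, "unit": 0.01}
assumed_mix = {"browser": 2_000, "integration": 4_000, "unit": 4_000}  # ~10,000 tests

serial = sum(assumed_mix[k] * AVG_SECONDS[k] for k in assumed_mix)
print(f"serial: {serial / 60:.0f} min")            # ~667 minutes run one by one

# With six isolated environments running tests in parallel on one machine:
parallel = serial / 6
print(f"6-way parallel: {parallel / 60:.0f} min")  # ~111 minutes, close to the observed 110
```

Under this assumed mix, a single-threaded run would take over ten hours, while six-way parallelism lands near the observed 110 minutes.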

How do we achieve all this? What challenges did we face and what challenges arose?

To achieve this we use 21 fairly powerful machines (8 cores @ 2.50 GHz, with 12 or 24 GB of RAM). Each job runs on a single machine, except for pipelined jobs for extra quick feedback. On each machine we execute tests in parallel in 6 isolated environments (as you can see in the figure below), which are fed by a test queue. Each environment has its own DB with the required fixtures and, for the browser tests, a separate VNC environment, as we had problems with tests failing when they lost browser focus to other browsers running tests in parallel.
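The queue-based distribution above can be sketched as a shared work queue drained by six worker environments. This is a minimal illustration, not our actual framework code; the environment setup is reduced to a comment.

```python
import queue
import threading

# Sketch of queue-based test distribution: one shared queue of tests,
# six worker "environments" pulling from it in parallel.

def run_suite(tests, n_envs=6):
    q = queue.Queue()
    for t in tests:
        q.put(t)
    results = {}
    lock = threading.Lock()

    def worker(env_id):
        # a real environment would set up its own DB fixtures / VNC display here
        while True:
            try:
                test = q.get_nowait()
            except queue.Empty:
                return
            outcome = test()  # run the test in this isolated environment
            with lock:
                results[test.__name__] = outcome

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_envs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The point of the queue, as opposed to a static split, is that fast environments keep pulling work while a slow browser test occupies another, so no environment sits idle.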

Optimizing the browser test code decreased the regression time by about 20 minutes. We sped up builds by removing unconditional sleeps in parts of the test framework where they weren't strictly necessary.
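The usual replacement for an unconditional sleep is a condition poll that returns as soon as the page (or fixture) is ready, so fast runs don't pay the worst-case wait. A minimal sketch of such a helper, with illustrative names and timings:

```python
import time

# Instead of `sleep(10)` before every assertion, poll the condition and
# return as soon as it holds. Helper name and defaults are illustrative.

def wait_until(condition, timeout=10.0, interval=0.1):
    """Poll `condition` until it is truthy or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return bool(condition())  # one last check at the deadline
```

A test that previously slept a fixed 10 seconds now finishes in milliseconds when the UI is already ready, which is where the ~20-minute saving across thousands of tests comes from.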

On the issues side, we have a few non-deterministic tests. These tests produce a different outcome each time they are run, in a way that doesn't seem to depend on code changes. If you have worked in test automation, you probably know what I'm talking about: those tests that seem to fail from time to time, marking your build as failed for no apparent reason, and then pass when you run them again. Unstable tests are indeed a big issue. Some of the core benefits and goals of automation are lost when your suite has unstable tests: your suite is no longer self-checking or fully automatic (it requires manual debugging), and the test results are not repeatable. Unstable tests waste a lot of test engineers' time debugging to figure out what went wrong, and they make the suite look flaky and unreliable, and you don't want that happening. If your test regression becomes unreliable, developers will get angry and will always blame the framework when tests fail, no matter whether it was their fault or not.

To fight off this enemy, we took two strategies:
● Short-term strategy: we keep track of all the tests which have failed and, after the complete suite is finished, if the number of failed tests is below a given threshold, the build process automatically retries these tests in isolation, without parallel execution. If the tests pass, we modify the logs and mark them as passed. We implemented this approach to filter out false positives and save time analyzing reports. It is pretty effective and does a good job of cleaning the reports, but it has some drawbacks: the retry task adds some extra 15 minutes to the whole build; the strategy doesn't address the root of the problem, as the number of unstable tests keeps increasing; and finally, some tests fail even after being retried, requiring manual debugging.
● Long-term strategy: when a test fails and then passes, it is added to an unstable-tests report produced by Jenkins after every build. After the reports are produced, the Quality Assurance or DevOps team temporarily removes the test from the regression suite and examines it to try to determine the root cause of the non-determinism. Once the test is fixed and stable, it is brought back into the suite. Thanks to this approach the number of unstable tests has dropped by 80%, making our regressions reliable and completely automatic in most cases.
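The two strategies meet in the post-build step: retry failures in isolation, mark the ones that then pass as merely unstable (feeding the long-term report), and refuse to retry at all when failures are so numerous that a real breakage is likely. A sketch, with illustrative names and an illustrative threshold:

```python
# Sketch of the post-build retry step described above. `run_isolated`
# re-runs a single test with no parallel execution; the threshold of 20
# is an illustrative value, not Tuenti's actual setting.

def postprocess(failed_tests, run_isolated, threshold=20):
    """Retry failures in isolation; separate real failures from unstable ones."""
    if len(failed_tests) > threshold:
        # too many failures: likely a real breakage, don't mask it with retries
        return {"failed": list(failed_tests), "unstable": []}
    still_failing, unstable = [], []
    for test in failed_tests:
        if run_isolated(test):        # passes alone: flaky; mark as passed in the logs...
            unstable.append(test)     # ...and feed it to the unstable-tests report
        else:
            still_failing.append(test)
    return {"failed": still_failing, "unstable": unstable}
```

Everything in the `unstable` bucket becomes input to the long-term strategy: quarantine, diagnose, fix, and only then return the test to the suite.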

When it comes to providing feedback ASAP, we can't parallelize limitlessly, as we don't have infinite machines: we need to find the optimal number that provides quick feedback while minimizing the loss of total build throughput that comes as a trade-off when we use more than one machine per job execution. Also, running tests in more environments means more setup time (the built code has to be deployed to each environment, the databases have to be prepared, etc.). So far we have applied the following strategies:
● Early feedback: we provide per-test results in real time as tests are executed, and we run the tests which failed during the last build first, so that developers can check ASAP whether their tests have been fixed.
● Execution of selected tests: developers can customize their regressions to choose which tests they want to run. They can choose to run only unit and integration tests, or only browser tests. In early development stages we advise running unit and integration tests, leaving acceptance tests for when the code is merged into the integration branch. Developers can also specify the tests to run by using PHPUnit's @group tag annotation, giving them fine-grained selection of the tests they need.
● Build pipeline: for special jobs in which it is critical to get very quick feedback, or for which we need reports of nearly every changeset. For the pipeline we use 6 testing nodes. The pipeline is divided into three main steps: build and unit tests, main test regression, and aggregation of the test reports. We trigger 5 jobs to run the main regression in parallel. The pipeline produces results in about 25-35 minutes, compared to 110-120 minutes for normal builds. The drawback is that build pipelines require many machines and reduce the total build throughput per hour at peak times.
● On-demand cloud resources: we used Amazon's web services to get testing resources in the past as a proof of concept, but we decided to invest in our own infrastructure, as it proved to be more convenient economy-wise. Nowadays we are thinking of going back to on-demand virtualization, as we need to produce quicker feedback for full regressions. Quicker feedback on complete regressions with a constantly growing number of tests can only be translated into more machines in our testing farm. However, as most builds are needed during peak hours, it would make sense to have on-demand resources rather than acquiring new machines which would be under-used at off-peak hours.
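The first two strategies above (failed-tests-first ordering and group-based selection) amount to a filter plus a stable sort over the test list. A minimal sketch, with an illustrative data model:

```python
# Sketch of "early feedback" ordering plus @group-style selection.
# The dict-based test records are illustrative, not the framework's model.

def order_and_select(tests, last_failed, groups=None):
    """Filter tests by group, then put last build's failures first."""
    if groups is not None:
        tests = [t for t in tests if t["group"] in groups]
    failed_set = set(last_failed)
    # sorted() is stable: failures keep their relative order at the front
    return sorted(tests, key=lambda t: t["name"] not in failed_set)
```

Running yesterday's failures first means a developer whose fix worked sees green within the first minutes of the build rather than at the end of a 110-minute run.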