Asynchronous Testing and TUSchedulers

Published on 02/10/2014 by César Estébanez, Mobile Apps Engineer

Asynchronous code is intrinsically hard to test. Most of the async testing solutions out there use some kind of waiting mechanism that stops the test until the async logic returns. Solutions of that kind are natural and easy to use, but they introduce non-determinism into your tests. Non-deterministic tests can be extremely dangerous in Continuous Integration environments, since they can affect the deployment pipeline, slowing down the development team and eventually the whole company. Furthermore, unstable tests reduce developers' confidence in the suite, and can ruin it entirely if they are not taken care of (see Fowler 2011).

Our iOS team at Tuenti recently released TUScheduler: a very simple Objective-C library that deals with the async testing problem in an elegant and robust way, without introducing any non-determinism.

In this blog post we explore the problem of async testing and how it affects test quality and the speed of development teams. We review the different solutions available, exploring their strengths and weaknesses. Finally, we explain the approach followed in TUScheduler and talk a little bit about this simple library. Although TUScheduler and all the code snippets in this post are written in Objective-C, all the concepts and ideas can be directly applied to any language or framework that requires testing asynchronous code.

The Problem of Async Testing

Asynchronous code is hard to test because in order to validate it, a test has to stop and wait for the async code to finish. Only then can it verify that its behaviour was as expected.

Inactive waiting techniques (e.g. using sleep()) are the worst possible approach in this case, since they stop the test for a fixed amount of time, which is usually several orders of magnitude longer than the real expected duration of the async task. This is why solutions based on active waits are strongly preferred.

Active Waiting Techniques

The most common approach in the async testing scene is to use some kind of polling mechanism to periodically check if the async code finished executing:


  // Requires #import <libkern/OSAtomic.h> for OSMemoryBarrier().
  NSTimeInterval pollingTime = 0.01;
  NSTimeInterval timeout = 2.0;
  NSDate *expiryDate = [NSDate dateWithTimeIntervalSinceNow:timeout];

  // Poll until the async code sets testFinished, or until the timeout expires.
  while (!testFinished && ([[NSDate date] compare:expiryDate] != NSOrderedDescending)) {
    // Spin the run loop briefly so that callbacks scheduled on it can fire.
    [[NSRunLoop currentRunLoop] runUntilDate:[NSDate dateWithTimeIntervalSinceNow:pollingTime]];
    // Make sure we read the latest value of testFinished written by another thread.
    OSMemoryBarrier();
  }

The test checks to see if the async code has finished (here represented by the testFinished flag). If it hasn't, the test sleeps for a very short amount of time, and then checks again. This process is repeated until the expected condition is met, or the total waiting time surpasses the defined timeout.

The advantage of this approach over inactive waits is that we waste at most pollingTime seconds. That is why most iOS async testing frameworks use this technique (e.g. Kiwi, GHUnit, SenTestingKitAsync). Apple also introduced a native mechanism for async testing in Xcode 6, which lets you define expectations for your tests using XCTestExpectation.
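
As a rough illustration of that XCTest mechanism, a test waiting on a hypothetical async API might look like the sketch below (the service property and its fetchDataWithCompletion: method are invented for the example; the expectation and wait calls are the real XCTest API):

  - (void)testFetchDataCallsBackWithData {
    // Declare what we expect to happen before the timeout expires.
    XCTestExpectation *expectation = [self expectationWithDescription:@"completion called"];

    // fetchDataWithCompletion: is a hypothetical async API used for illustration.
    [self.service fetchDataWithCompletion:^(NSData *data) {
      XCTAssertNotNil(data);
      [expectation fulfill];
    }];

    // Note that a timeout still has to be chosen by hand.
    [self waitForExpectationsWithTimeout:2.0 handler:nil];
  }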

A very common variation on polling is to design the async calls with customisable callbacks that get called on completion. Tests can listen for those callbacks to know when to proceed to the verification phase. Once more, we need some kind of active wait based on a handcrafted timeout value.

Timeouts and Continuous Integration

Polling techniques are usually enough in simple environments: they are simple, they read naturally, and, as long as timeouts are correctly tuned, they work fine most of the time.

However, there is a critical problem associated with polling: how to choose the timeout value. If the value is too tight, the test suite can easily fail on overloaded machines or under other special circumstances, introducing uncertainty into the tests. On the other hand, very high timeout values can make the whole suite take a huge amount of time when several tests time out in the same run.

This problem can significantly slow down teams that rely on a Continuous Integration environment, in which branches are integrated very frequently by different members of the team (or by different teams) using an automated deployment pipeline.

A Trivial Example

Take a look at the following code snippet:


- (void)loadContactsWithCompletion:(void(^)(NSArray *contacts))completion {
  dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
    NSArray *contacts = [self incrediblySlowMethodToLoadContacts];
    completion(contacts);
  });
}

How can we choose an appropriate timeout for testing this method? There is no way to know how long incrediblySlowMethodToLoadContacts will take to execute. The only option in this case is to estimate a reasonable timeout by trial and error, one that gives the test enough time to get back from incrediblySlowMethodToLoadContacts and pass.
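
A polling-based test of this method might look like the following sketch (the repository property and the 2-second timeout are arbitrary choices made for illustration):

  - (void)testLoadContactsReturnsContacts {
    __block NSArray *loadedContacts = nil;
    __block BOOL testFinished = NO;

    [self.repository loadContactsWithCompletion:^(NSArray *contacts) {
      loadedContacts = contacts;
      testFinished = YES;
    }];

    // Active wait with a guessed timeout: 2 seconds may be plenty on a
    // developer machine and far too little on an overloaded build server.
    NSDate *expiryDate = [NSDate dateWithTimeIntervalSinceNow:2.0];
    while (!testFinished && [[NSDate date] compare:expiryDate] != NSOrderedDescending) {
      [[NSRunLoop currentRunLoop] runUntilDate:[NSDate dateWithTimeIntervalSinceNow:0.01]];
    }

    XCTAssertTrue(testFinished, @"Timed out waiting for contacts");
    XCTAssertNotNil(loadedContacts);
  }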

This estimated timeout is typically adjusted to work fine during normal deployment cycles, but there is no way to ensure that it will be enough under special circumstances, for example during crunch time, when many teams are trying to integrate their branches and all the systems are overloaded. In those circumstances, the timeout might not be high enough, producing apparently random test failures.

This kind of non-determinism in tests is very dangerous and can bite you in many ways (see Fowler 2011 for a very detailed analysis). But in this blog post we are especially interested in how non-determinism can affect development teams that use an automated integration pipeline.

Tuenti Case Study

Tuenti's iOS team changes constantly as the company needs to move fast and adapt quickly to new challenges. However, a very typical configuration involves around 8 to 10 developers, distributed across an iOS Core Team and 3 to 5 product teams. Our codebase has more than 300k lines of code, and that is without counting more than 20 external libraries (some of them our own forks of other Open Source libraries), over 400 branches, and over 18,000 changesets.

Each team works independently on its own branches, which are integrated very frequently. We use Flow, a state-of-the-art automated deployment system created by our devops engineers to take full control of the integration process. When a branch is ready to be integrated, we send a pull request to Flow directly from Jira. Flow creates a temporary branch by merging the current integration branch with the pull request branch, and enqueues that temporary branch to wait for a free slot in the build servers. When a slot is allocated, Flow creates a job in Jenkins for the branch. If the job gets a blue ball, the branch is automatically merged into integration, the Jira ticket is updated, and developers are notified. Every changeset of integration is a Release Candidate, so we had better make sure this whole process is robust.

On the iOS platform, a complete build (including the creation of a fresh server instance, static analysis, build, testing, and report generation) takes approximately 40 minutes. On an extraordinarily busy day, when many features are scheduled to be integrated, a branch can sit in the queue for several hours. If, after all that waiting, the build fails because of an unstable test, the process has to be restarted from the beginning. The owner of the affected branch has to merge all the changes from the branches that were originally queued after it, which are now being integrated before it, and the branch goes back to the last position in the queue. As the servers get busier and busier, more tests time out, and more branches are erroneously rejected. The problem grows exponentially, affecting more developers and effectively blocking the integration pipeline of the whole team.

So it is not just a matter of running the tests again. In our everyday work at Tuenti, unstable tests can significantly slow our teams down and reduce our ability to move fast and deliver on time.

TUScheduler Approach and the Humble Object pattern

TUScheduler is an alternative to the polling/callback approach. It is inspired by the Humble Object pattern. The idea is to extract all the hard-to-test code (the async behavior) from your business logic into a Humble Object (TUScheduler), so you can test your business logic separately, just as if it were fully synchronous. The Humble Object must be a very thin concurrency layer, so simple that it doesn't even need to be tested.

When a class needs to perform any asynchronous job, instead of posting that job directly with one of the available concurrency APIs (GCD, NSOperationQueue, etc.), it delegates the responsibility to a TUScheduler. The scheduler can be provided as a parameter or injected into the class during initialization.
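
As a minimal sketch of that shape (using an illustrative scheduler protocol; the real <TUScheduler> protocol may declare different method names), such a class might look like this:

  // Illustrative protocol only: the real <TUScheduler> API may differ.
  @protocol AsyncScheduler <NSObject>
  - (void)scheduleAsync:(dispatch_block_t)job;
  @end

  @interface ContactsRepository : NSObject
  - (instancetype)initWithScheduler:(id<AsyncScheduler>)scheduler;
  - (void)loadContactsWithCompletion:(void(^)(NSArray *contacts))completion;
  @end

  @implementation ContactsRepository {
    id<AsyncScheduler> _scheduler;
  }

  - (instancetype)initWithScheduler:(id<AsyncScheduler>)scheduler {
    if ((self = [super init])) {
      _scheduler = scheduler;
    }
    return self;
  }

  - (void)loadContactsWithCompletion:(void(^)(NSArray *contacts))completion {
    // The repository no longer calls dispatch_async directly; it delegates
    // the "when" to whatever scheduler was injected.
    [_scheduler scheduleAsync:^{
      NSArray *contacts = [self incrediblySlowMethodToLoadContacts];
      completion(contacts);
    }];
  }

  - (NSArray *)incrediblySlowMethodToLoadContacts {
    return @[]; // stands in for the slow loading work from the earlier snippet
  }

  @end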

When running tests, the class is injected with a test-specific scheduler called TUFullySynchronousScheduler. This scheduler is not concurrent at all: it immediately executes every job sent to it, in a purely synchronous way. This way, tests can focus on the business logic without having to care about timeouts or concurrency management.
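
Continuing the sketch above, a test double playing the role of TUFullySynchronousScheduler, and a test using it, could look roughly like this:

  // Test double mirroring the idea behind TUFullySynchronousScheduler:
  // every job runs inline, on the calling thread.
  @interface FullySynchronousScheduler : NSObject <AsyncScheduler>
  @end

  @implementation FullySynchronousScheduler
  - (void)scheduleAsync:(dispatch_block_t)job {
    job();
  }
  @end

  // Inside an XCTestCase subclass:
  - (void)testLoadContactsReturnsContactsSynchronously {
    ContactsRepository *repository =
        [[ContactsRepository alloc] initWithScheduler:[FullySynchronousScheduler new]];

    __block NSArray *loadedContacts = nil;
    [repository loadContactsWithCompletion:^(NSArray *contacts) {
      loadedContacts = contacts;
    }];

    // No timeout and no polling: the completion block has already run by now.
    XCTAssertNotNil(loadedContacts);
  }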

Does this affect the quality of our tests in any way? The answer is no. This little scheduling trick should be completely safe: the essence of concurrent programming is to separate what has to be done from when it is done, so the class should not make any assumptions about when the scheduled jobs are executed, and the associated tests are just as valid whether the jobs run synchronously or asynchronously. In fact, if a piece of code depends in some way on when the async jobs are done, then it is probably exposed to race conditions and other synchronization problems, and should be refactored ASAP.

Furthermore, using TUScheduler brings a very nice side effect: classes are no longer coupled to a particular concurrency API, so you can swap APIs at any time by simply changing the construction of your async classes. If you use Dependency Injection, as we do at Tuenti, the change is less than one line of code, probably just one word!

TUSchedulerFactory

The concurrency mechanisms that a class uses for its internal operation are of no interest to the component that creates the class. Each class should be responsible for managing its own schedulers privately, since they are just implementation details. It would be a very bad idea for the component that creates an object to also be responsible for configuring and injecting a scheduler: it would know too much about the object and its internals. Furthermore, some classes might need several schedulers to operate (e.g. a serial one to keep an internal collection synchronized and a concurrent one to post heavy background tasks).

This is why the recommended way of using TUScheduler is injecting the abstract factory <TUSchedulerFactory> into the class, and using it internally to produce and configure all the <TUScheduler> instances that the class might need.

For testing, TUScheduler provides a special concrete factory called <TUFullySynchronousSchedulerFactory>, which always returns fully synchronous schedulers. This way, concurrency remains isolated inside the Humble Object, making the async logic testable, while each class keeps its implementation details private.
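
As a rough sketch of that factory pattern (reusing the illustrative scheduler protocol from above; the factory method names here are invented, so check the real <TUSchedulerFactory> protocol for the actual API), a class might pull its schedulers from an injected factory like this:

  // Hypothetical factory protocol used for illustration; the real
  // <TUSchedulerFactory> API may differ.
  @protocol AsyncSchedulerFactory <NSObject>
  - (id<AsyncScheduler>)newSerialScheduler;
  - (id<AsyncScheduler>)newConcurrentScheduler;
  @end

  @interface SyncEngine : NSObject
  - (instancetype)initWithSchedulerFactory:(id<AsyncSchedulerFactory>)factory;
  @end

  @implementation SyncEngine {
    id<AsyncScheduler> _stateScheduler; // serial: guards internal state
    id<AsyncScheduler> _workScheduler;  // concurrent: heavy background jobs
  }

  - (instancetype)initWithSchedulerFactory:(id<AsyncSchedulerFactory>)factory {
    if ((self = [super init])) {
      // The creator only hands in the factory; which schedulers exist, and how
      // they are configured, stays a private implementation detail of SyncEngine.
      _stateScheduler = [factory newSerialScheduler];
      _workScheduler = [factory newConcurrentScheduler];
    }
    return self;
  }

  @end

In tests, injecting a fully synchronous factory turns every scheduler the class creates into an inline one, with no further changes to the class itself.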

Test Coverage

TUScheduler is designed to be extremely simple (only six classes, two protocols, and very few lines of code). That is the spirit behind the Humble Object pattern: the extracted logic must be so simple that it does not even need to be tested.
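
To give an idea of how thin such a Humble Object can be, a GCD-backed scheduler need not be much more than the following sketch (an illustration of the idea, not TUGCDScheduler's actual code):

  // Sketch of a GCD-backed scheduler; the real TUGCDScheduler may differ.
  @interface GCDScheduler : NSObject <AsyncScheduler>
  - (instancetype)initWithQueue:(dispatch_queue_t)queue;
  @end

  @implementation GCDScheduler {
    dispatch_queue_t _queue;
  }

  - (instancetype)initWithQueue:(dispatch_queue_t)queue {
    if ((self = [super init])) {
      _queue = queue;
    }
    return self;
  }

  - (void)scheduleAsync:(dispatch_block_t)job {
    dispatch_async(_queue, job); // the whole hard-to-test part lives here
  }

  @end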

Even so, TUScheduler comes with a suite of unit tests that enforces the contract of the <TUScheduler> protocol in both the TUGCDScheduler and TUOperationQueueScheduler concrete implementations. This way, users of TUScheduler can focus on testing their business logic, knowing that the concurrency management is tested elsewhere, and more importantly, only once.

These tests use a special helper class, TUAsyncTestSentinel, which implements a polling algorithm to test the concurrent behaviour of the schedulers. Yes, timeouts. Why not? In this case there is no way to avoid non-determinism, and there is no problem with unstable tests as long as they are kept isolated and out of the deployment pipeline, where they cannot bite us. At Tuenti we use TUScheduler as an external library, with its own tests that run separately on each push to ensure they still pass. That is all we need to know.

In any case, TUScheduler is so simple that you could even ignore the tests completely. Visual inspection should be more than enough to spot a bug in such a simple piece of code.

Conclusion

Testing async code is intrinsically hard. Most solutions for writing async tests involve some kind of timeout system, which introduces non-determinism into the test suite. While many development teams can live with this small amount of uncertainty, some of us just can't.

When a team relies on an automated build system that aspires to keep an evergreen integration branch, and in which branches need to be integrated frequently and independently, unstable tests can become a serious problem, slowing down individual developers and, eventually, the whole team.

We recently released TUScheduler, an elegant solution to async testing based on the Humble Object pattern, which decouples business logic from concurrent behaviour and lets your team focus on what is really important. You can find TUScheduler in our GitHub repo.

If you find it useful, please, contribute!

Swizzling in Objective-C for Fun & Profit

Published on 24/2/2014 by Rafael López, Mobile Apps Engineer

https://github.com/tuenti/TMInstanceMethodSwizzler

Code injection is a powerful procedure which allows you to modify the behavior of a given method by adding code to be executed before or after its implementation, or even by replacing it altogether.

There are several scenarios in which code injection might be useful, such as adding profiling, logging, or statistics to existing methods; creating partial mocks of objects to make testing easier; or implementing objects using an aspect-oriented approach.

The way this is normally done in Objective-C is to take a method from a class and change its implementation at runtime, via class method swizzling: you swap the original implementation of the method for a new one, while keeping the original implementation somewhere so you can restore it later. This procedure is hipster-approved, and a great way to learn about the inner workings of the Objective-C runtime.
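
For reference, classic method swizzling boils down to a few Objective-C runtime calls, as in this minimal sketch (the Logger class and its selectors are hypothetical names used purely for illustration):

  #import <objc/runtime.h>

  // Swap the implementations of two instance methods on a class.
  // Every instance of the class is affected, which is exactly the
  // limitation discussed below.
  static void SwizzleInstanceMethod(Class cls, SEL original, SEL replacement) {
    Method originalMethod = class_getInstanceMethod(cls, original);
    Method replacementMethod = class_getInstanceMethod(cls, replacement);
    method_exchangeImplementations(originalMethod, replacementMethod);
  }

  // Usage: after this call, -[Logger logEvent:] runs the code of
  // -[Logger tm_logEvent:] and vice versa.
  // SwizzleInstanceMethod([Logger class], @selector(logEvent:), @selector(tm_logEvent:));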

That said, this approach has one main disadvantage: each and every instance of a class with a swizzled method inherits the new behavior. But you might want two different instances of the same class to behave differently at runtime, so that you can, for instance, log the events generated by two objects using different logging services.

In order to make this possible, we have created TMInstanceMethodSwizzler, a utility class which allows you to specify an object and one of its methods, providing a block of code to be called at a specified moment: before, after, or in lieu of the original implementation.

A sample use case is provided: TMTimeoutManager. This is a class which allows you to expect a method to be called within a given amount of time, invoking one block of code when this happens, or another if the provided timeout is reached.

Both classes are the result of a Hack me up, an internal contest in which Tuenti engineers are given 24 hours to develop whatever they think might be useful, funny, or worth making. You can watch a presentation of the project on the YouTube channel of its author, Rafael López Diez.

The Test Advisor

Published on 02/9/2013 by Rafa de la Torre & Miguel Angel García, Test FW Engineers

The Test Advisor is a tool that lets you know what set of tests you should execute given a set of changes in your source code. It is automatic and does not require you to tag all of your test cases.

A Bit of Background

The story of CI has been a tough one here at Tuenti. At one point, our CI system had to handle a huge load of tests of different types for every branch. We are talking about something around 20K test cases, a 2.4h build time, and Jenkins acting as a bottleneck for the development process. From a developer's perspective, we track a measure of their suffering, called FPTR (short for "from push to test results"). It peaked at 8h.


Therefore, something had to be done with the tests. There were multiple options: add more hardware, optimize the tests, optimize the platform, delete tests... or a combination of these. When considering solutions, you need to keep in mind that products evolve, which can produce more tests and worsen the problem. In the long run, the best solution will always somehow involve test case management and process control, but that's another story for a future post...

After adding more hardware, implementing several performance improvements, and managing slow and unstable tests, we concluded that we had to improve the situation in a sustainable way.

One solution we thought of was to identify the tests relevant to the changes being tested and execute only those, instead of executing the full regression. The idea is not novel and may sound pretty obvious, but implementing it is not obvious at all.

In short, the idea is to identify the changes in the system, the pieces that depend upon those changes and then execute all of the related tests. Google implemented it on top of their build system. However, we don’t have that great of a build system in which dependencies and tests are explicitly specified, and our starting point was a rather monolithic system.

The first approach we took in that direction was to identify the components of the system and annotate the tests to indicate which component they aim to test. The theory was: “If I modify component X, then run all the tests annotated with @group X”. It didn’t work well: the list of components is live and evolves with the system, so it requires maintenance; tests needed to be annotated manually; keeping annotations in sync required a lot of effort; and there was no obvious way to check the accuracy of the annotations.

A different approach is to gather coverage information and exploit it in order to relate changes in source files to the tests covering those changes. Getting coverage globally is not easy with our setup; we still use a custom solution based on phpunit+xdebug. There are some problems with that approach, though, mostly affecting end-to-end and integration tests: it is hard to relate the source files to the files finally deployed to the test servers, partly due to the way our build script works. Yes, it is easier for unit tests, but they are not really a problem since they are really fast. Additionally, we did not want to restrict our solution strictly to PHP code.

What it Is

The Test Advisor is a service that gathers “pseudo-coverage” information to be used later on, essentially to determine what the relevant test cases for a given changeset are.

When it was proposed, some of the advantages that we envisaged were:

  • Reduced regression times in a sustainable way, no longer limited by the size of the full regression.
  • Improved feedback cycles
  • No need for manual maintenance (as opposed to annotating tests)

Regarding feedback cycles, a couple of benefits we foresaw were related to some pretty common problems in CI setups: the robustness and speed of the tests. A common complaint was about robustness and false positives: “My code doesn’t touch anything related to that test, but it failed in Jenkins”. If we had the Test Advisor, who would suffer from bad-quality tests in the first place? The people modifying code related to those tests, who want to get their changes into production. No more discussion about ownership. No more quick and dirty workarounds for the flakiness of the tests. The same applies to their speed. It would be in your best interest to develop and maintain high quality tests :)

How It Works

Most of our product stack runs on Debian Linux. We decided to use the inotify kernel subsystem to track filesystem events. This way, we can get that pseudo-coverage at file level, independently of the language, the test framework used, or even the test runner.

We developed it using Python as the main language, which is a good language for building a quick prototype and putting all the pieces together. We also used pyinotify.

The Test Advisor is conceptually composed of three main parts:

  • A service for information storage
  • A service for collecting coverage information. It is notified by the test runner when a test starts and ends, and is responsible for setting up and retrieving information about the files accessed to complete the test scenario.
  • A client to retrieve and exploit information from the TestAdvisor.

This is how it looks at a high level:


As you can see from the figure, the TestAdvisor can take advantage of any test type being executed (unit, integration or E2E browser tests). It is also independent of the language used to implement the tests and the SUT.

Problems Along the Way

So we were easily able to get the files covered by a particular test case, but those files are not the actual source files. They go through a rather complex build process in which files are not only moved around, but also versioned, stripped, compressed, etc., and some of them are even generated dynamically to fulfil HTTP requests in production. We tried to hook into the build process to get a mapping between source and deployed files, but it turned out to be too complex. We finally decided to use the “development mode” of the build, which basically links files and directories.

Another problem was file caching. The HipHop interpreter (among others) caches the PHP files being processed and decides whether a file has changed based on its access time, and inotify does not provide a way of monitoring stat operations on files. We researched a few possibilities for getting around this:

  • Override system calls to stat and its variants by means of LD_PRELOAD. Not an easy one to get fully working (and it also made us feel dirty messing around in there).
  • Instrument the kernel with SystemTap, kprobes or similar. A bit cleaner, but one mistake and your precious machine may freeze. We also got the feeling that we were trying to re-implement inotify.

Finally, we picked the KISS solution: just perform a touch on the files accessed between tests. This way, the regression executed to grab information will be slower, but we don’t care much about the duration of a nightly build, do we? :)

Work that Lies Ahead

Right now, the TestAdvisor is used as a developer tool. We’ve achieved great improvements in the CI by other means (most of them process-related changes). However, we are still eager to integrate the TestAdvisor into our development and release cycle within our CI setup.

Developers will probably use a pipeline for their branches, consisting of 3 distinct phases: Unit Testing, Advised Testing and (optionally) Pre-Integration. An “advised build” in green can be considered trustworthy enough to qualify for integration into the mainline.

This approach has been applied to the core of our products, the server-side code, and the desktop and mobile browser clients. We expect it to also be applied in the CI of the mobile apps at some point.

JavaScript Continuous Integration

Published on 22/8/2013 by Miguel Ángel García, Test FW Engineer; Juan Ramírez, Software Engineer & Alberto Gragera, Software Engineer

Here at Tuenti, we believe that Continuous Integration is the way to go in order to release in a fast, reliable fashion, and we apply it to every single level of our stack. Talking about CI is the same as talking about testing, which is critical for the health of any serious project. The bigger the project is, the more important automatic testing becomes. There are three requirements for testing something automatically:

  • You should be able to run your tests in a single step.
  • Results must be collected in a way that is computer-understandable (JSON, XML, YAML, it doesn’t matter which).
  • And of course, you need to have tests. Tests are written by programmers, and it is very important to give them an easy way to write them and a functioning environment in which to execute them.

This is how we tackled the problem.

Our Environment

We use YUI as the main JS library in our frontend, so our JS CI needs to be YUI-compliant. There are very few tools that enable JS CI; we went with JSTestDriver at the beginning, but quickly found two main problems:

  • There were a lot of problems between the YUI event system and the way that JSTestDriver retrieves the results.
  • The way tests are executed is not particularly human-friendly. We use a browser to work (and several to test our code in), so it would be nice to use the same interface to run JS tests. This ruled out other tools like PhantomJS and other smoke runners.

Therefore, we had to build our own framework to execute tests, keeping all of the requirements in mind. Our efforts were focused on keeping it as simple as possible, and we decided to use the YUI testing libraries (with Sinon as the mocking framework) because they allowed us to keep each test properly isolated, while providing proper reporting in XML or JSON and human-readable formats at the same time.

The Approach

The approach consisted of solving the simplest use case first (executing one test in a browser) and then iterating from there. A very simple PHP framework was put together to go through the JS module directories, identify the modules with tests (thanks to a basic naming convention), and execute them. The results were shown both in a human-readable way and as embedded XML.
PHP is required because we want to reuse a lot of the YUI-related tools we have already implemented in PHP (server-side dependency map generation, etc.).


After that, we wrote a client using Selenium/Webdriver so that the computer-readable results could be easily gathered, combined, and shown in a report.

At this point in the project:

  • Developers could use the browser to run the tests just by going to a predictable URL,
  • CI could use a command-line program to execute them automatically. Indeed, the results format is supported by Jenkins out of the box, so we only needed to properly configure the jobs, telling our test framework to put the results in the correct folder and store them in the Jenkins job.

That was enough to allow us to execute them hundreds of times a day.

Coverage

Another important aspect of testing is knowing the coverage of your codebase. Due to the nature of our codebase and how we divide everything into modules that work together, each module may have dependencies on others, despite the fact that our tests are meant to be designed per module.

Most of the dependencies are mocked with SinonJS, but there are also tests that don’t mock them. Since a test in one module can exercise another, we had to decide whether to compute coverage “as integration tests” or with a per-module approach.

The first alternative would instrument all of the code (with YUI Test) before running the tests, so if a module is exercised, it may be due to a test that is meant to exercise another module. An approach like that may encourage developers to create acceptance tests instead of unit tests.

Since we wanted to encourage developers to create unit tests for every module, the decision was finally taken to compute the coverage with a per-module approach. This way, we instrument only the code of the module being tested and ignore the others.

To make this happen, we moved the run operation from the PHP framework to a JS runner, because that allowed us to perform some of the operations required to get coverage, such as instrumenting and de-instrumenting the code before and after running the tests.

The final coverage report is generated as soon as the tests are executed, so there was no need to integrate with our current Jenkins infrastructure and set up all the jobs to generate the report. Coverage generation requires about 50% more time than regular test execution, but it’s absolutely worth it.

To be Done

There is more work that can be done at this point, such as improving performance by executing tests in parallel, or enforcing a minimum coverage for each module, but we haven't done this yet because we think that coverage is only a measure of untested code and doesn't ensure the quality of the tests.

It is much more important to have a testing culture than to just collect code metrics.

Video Recordings of Browser Tests

Published on 18/6/2013 by David Santiago, Senior Test Engineer

As part of the automated test suite we run for every release candidate here at Tuenti, there are over 300 webdriver tests. They are used to verify that the site behaves as expected from the user’s point of view.

If you have ever dealt with these kinds of end-to-end tests, you know they are more prone to nondeterministic behavior than smaller, more focused integration tests. That’s expected given the higher number of moving parts involved. Our Jenkins continuous integration setup deals with these kinds of issues by retrying failed tests before considering them actual errors, helping to reduce the noise they cause.

Retrying a failed test that later passes wastes resources in our build pipelines, yet there is an even larger cost associated with these failures: the search for the root cause of the wrong behavior.

We already store screenshots for failed tests alongside their stack trace to help diagnose problems, but with some corner cases it’s not enough and it’s not uncommon for those tests to work perfectly fine when debugged locally. In those cases, the developer / test engineer working on it would have to be in front of the failed test when it ran in our CI environment, which is obviously not possible.

Therefore, we extended our webdriver grid infrastructure in order to record and store videos for the execution of such tests. Here is a diagram of the overall architecture, with additions to the previous one:

We implemented a video recording web service that runs on each server (actually virtual machines) where the webdriver nodes for our grid run. This service allows the video recording in a VM to be started and stopped remotely, as well as stored under a given name. For the video recording itself we used the Monte Media Java library, extending it a bit to fit our needs.

By taking advantage of the hooks available in the webdriver grid code, we transparently start recording a video whenever a new browser session is requested. From that moment on, the video is recorded while the test runs. If the test fails, our webdriver client API lets us request that the video be stored under a proper name, which is afterwards used to access it. Another extension, in the form of a servlet registered in the webdriver grid server, handles such requests. The servlet knows where the node for the given test is located and asks the video service running there to stop and save the video.

Now, whenever a browser test fails in a build, a link to the video that was recorded during its execution is provided as part of the error information for the test, besides the failed assertion message and the stack trace.

We’d like to thank Kevin Menard for his support in the Selenium users group and for being available to help when someone needed it.
