The Tuenti Release and Development Process: Deploying Configuration

Posted on 11/11/2013 by Víctor García, DevOps Engineer

Blog post series: part 5
You can read the previous post here.

Normally, website development differentiates between regular code and configuration code.

Keeping configuration in separate code allows you to make quick, basic changes without touching the logic. These changes are safer because they should be just an integer value change, a boolean swap, a string substitution, etc., and they don’t involve a full release process.
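
For illustration, such a configuration file might look like the following sketch (the keys and values here are hypothetical, not Tuenti’s actual configuration):

```python
# Hypothetical feature configuration: each entry is the kind of value
# that can be changed and deployed without a full code release.
FEATURE_CONFIG = {
    "chat_enabled_percentage": 100,          # integer tweak: throttle a feature
    "registration_form_b_percentage": 0,     # integer tweak: A/B rollout
    "maintenance_mode": False,               # boolean swap
    "support_email": "support@example.com",  # string substitution
}
```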

Some good practices have been described recently on the Internet about how to write your code in a flexible way to avoid unnecessary releases, how to do A/B testing, how to make your database changes backwards compatible, etc., and all of these good practices rely on a good configuration system.

Following the DevOps Culture

Here at Tuenti, we are very fond of the DevOps culture and try to apply it as much as possible. We consider it the way to go and the way to do things efficiently, and there is proof that it has helped us improve quite a lot.

In our company, configuration deployment is a clear example of DevOps culture. There is no manual intervention, and there isn’t a dedicated person who does deployments, such as a release manager or operations guy. Every developer pushes his/her configuration changes to production on his/her own. Therefore, Devs are doing Ops tasks.

We use ConfigCop for that.

ConfigCop

ConfigCop is a tool for deploying configuration to preproduction and production servers. Any developer can and must use it: first to test their changes on a preproduction server, and then to deploy them to production.

Deploy to Preproduction

The preproduction deployment logic is fully done on the client side, and the basic options a developer can use are a configuration initialization and a configuration update, the latter being the one that uploads any configuration to be tested.

ConfigCop pulls the latest code from Mercurial, gathers all configuration files, and stages them, applying the overrides necessary to make them work on preproduction servers and also generating some .json files readable by HipHop.

Then, it just deploys them using Rsync, and the developer is ready to test.
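
A minimal sketch of this client-side flow, assuming hypothetical helper names (gather_config_files, apply_preproduction_overrides) since ConfigCop’s internals are not public:

```python
import json
import subprocess
from pathlib import Path

def deploy_to_preproduction(staging_dir: Path, preprod_host: str) -> None:
    # Pull the latest code from Mercurial.
    subprocess.run(["hg", "pull", "-u"], check=True)

    # Hypothetical helpers: collect all configuration files and apply
    # the overrides needed to make them work on preproduction servers.
    configs = apply_preproduction_overrides(gather_config_files())

    # Stage .json files readable by HipHop.
    staging_dir.mkdir(parents=True, exist_ok=True)
    for name, values in configs.items():
        (staging_dir / f"{name}.json").write_text(json.dumps(values, indent=2))

    # Deploy the staged files with Rsync; the developer is then ready to test.
    subprocess.run(
        ["rsync", "-az", "--delete",
         f"{staging_dir}/", f"{preprod_host}:/etc/app-config/"],
        check=True,
    )
```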

Deploy to Production

A production deployment must be sequential and cannot be done in parallel because conflicts may arise.

Therefore, ConfigCop uses its server side for this. It’s basically client-server communication over RPC that establishes a locking mechanism to enforce sequential deployments. The lock is held by the developer currently deploying a configuration and can’t be stolen.

Until the developer has finished a configuration change, another one can’t start.
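
As a rough illustration of these semantics, here is a toy in-process version (the real ConfigCop implements this between client and server over RPC):

```python
import threading

class DeployLock:
    """Toy sketch of ConfigCop-style sequential deployment locking."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def acquire(self, developer: str) -> bool:
        # Non-blocking: a second developer is rejected, not silently queued.
        if self._lock.acquire(blocking=False):
            self._owner = developer
            return True
        return False

    def release(self, developer: str) -> None:
        # Only the owner can release: the lock can't be stolen.
        if self._owner != developer:
            raise PermissionError(f"lock is held by {self._owner}")
        self._owner = None
        self._lock.release()
```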

The workflow is pretty simple:

  • Start a configuration change by just typing a command line.
  • ConfigCop checks whether it’s unlocked. If it is, it deploys the configuration change to a preproduction testing server that always has the same code as production.

    • This is important: it assures that the configuration change will work properly in production.
  • ConfigCop notifies the developer that s/he can start testing.
  • Once it’s tested, the developer pushes it to production with a single command.
  • The deployment takes place and finishes in less than 1 minute.

Results

ConfigCop freed the release managers from doing these operational tasks, and now, with just 2 command lines, developers are able to test and deploy configuration to production in an easy, reliable, and fast way.

Real Examples

  • Due to a network problem, some chat servers are down and the load on those that are still alive is increasing very quickly; we will probably suffer an outage.

    • A developer can just make a config change to temporarily disable the chat for 10% of the users.
    • The problem is fixed in less than a minute.
  • A release requires changes in the database schema, so the developer wrote code that inserts into both the old and new tables to keep the change backwards compatible. The release is successfully deployed, so the old schema can be deprecated.

    • A developer makes a config change so that the code inserts only into the new tables.
    • The developer does it on his/her own; nobody else is bothered with such a task.
  • The product team wants to test two different layouts (A/B testing) for the registration form and wants some users to use the old one and others the new one, so they can measure stats and choose the best one.

    • The developer will play with the percentages of users getting the A version or the B version and, when the final decision is taken, set 100% to the chosen one (one possible bucketing scheme is sketched below).
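
The posts don’t describe how users are bucketed, but a common way to implement such percentage-based gating is stable hashing of the user ID, so each user consistently sees the same variant. This sketch is an assumption, not Tuenti’s confirmed scheme:

```python
import hashlib

def bucket(user_id: int, salt: str = "registration_form") -> int:
    # Stable 0-99 bucket per user and experiment; the salt keeps different
    # experiments from reusing the same buckets.
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def sees_variant_b(user_id: int, percentage_b: int) -> bool:
    # percentage_b comes from the deployed configuration, e.g. 10 for 10%.
    return bucket(user_id) < percentage_b
```

Setting the percentage to 100 sends everyone to the chosen version, and the same mechanism serves to disable chat for a slice of users in the first example.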

The Tuenti Release and Development Process: Release Time!

Posted on 9/16/2013 by Víctor García, DevOps Engineer

Blog post series: part 4
You can read the previous post here.

Release Candidate Selection

So, we have made sure that the integration branch is good enough and that every changeset is a potential release candidate. Therefore, the release branch selection is trivial: just pick the latest changeset.

The release manager is in charge of this. How does s/he do it? Using Flow and Jira.

Like the pull request system we talked about in the previous post, Jira orchestrates the release workflow, and Flow takes care of the logic and operations. So:

  1. The release manager creates a Jira “release” type ticket.
  2. To start the release, s/he just transitions the ticket to “Start”.

    1. When this operation begins, Jira notifies Flow and the release start process begins.
  3. This is what Flow does in the background (the whole process takes ~1 minute; a sketch follows the list):

    1. It creates a new branch from the latest integration branch changeset.
    2. It analyzes the release contents and gathers the involved tickets, linking them to the release ticket (Jira supports ticket linking).
    3. It configures Jenkins to test the new branch.
    4. It adds everyone involved in the release as watchers of the release ticket (a Jira watcher is notified of every ticket change), so that all of them are aware of anything related to it.
    5. It sends an email notification with the release contents to everyone on the company’s tech side.
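
A rough sketch of such a release-start routine, driving Mercurial from the command line; the Jira, Jenkins, and email steps are reduced to hypothetical helpers (extract_ticket_ids, configure_jenkins_job, etc.) since Flow’s internals aren’t public:

```python
import subprocess

def hg(*args: str) -> str:
    return subprocess.run(["hg", *args], check=True,
                          capture_output=True, text=True).stdout

def start_release(release_name: str) -> None:
    # Create a new branch from the latest integration branch changeset.
    hg("update", "integration")
    hg("branch", release_name)
    hg("commit", "-m", f"Open release branch {release_name}")

    # Hypothetical helpers standing in for Flow's Jira/Jenkins/email steps.
    descriptions = hg("log", "-b", "integration", "--template", "{desc}\n")
    tickets = extract_ticket_ids(descriptions)      # tickets in the release
    link_to_release_ticket(tickets, release_name)   # Jira ticket linking
    configure_jenkins_job(release_name)             # test the new branch
    add_watchers_and_notify(tickets, release_name)  # watchers + email
```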

Building, Compiling and Testing the Release Branch

Once the release branch is selected, it is time to start testing it, because there might be a last-minute bug that escaped all of the previous tests and eluded our awesome QA team’s eyes.

Flow detects new commits made on the release branch (these commits almost never occur) and builds, compiles, and updates an alpha server dedicated to the release branch.

Build

Why do we build the code? PHP is an interpreted language that doesn’t need to be built! Yes, it’s true, but we need to build other things:

  1. JavaScript and HTML code minification
  2. Fetch libraries
  3. Static files versioning (an example follows this list)
  4. Generate translation binaries
  5. Build YUI libraries
  6. etc.
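
As an example of one of these steps, static file versioning is commonly implemented by embedding a content hash in the file name, so browsers can cache aggressively and still pick up new versions immediately (an assumption about the general technique, not Tuenti’s exact build script):

```python
import hashlib
from pathlib import Path

def version_static_file(path: Path) -> Path:
    # app.js -> app.3f4a9c12.js: any content change produces a new URL,
    # which busts browser and CDN caches on the next release.
    digest = hashlib.sha1(path.read_bytes()).hexdigest()[:8]
    versioned = path.with_name(f"{path.stem}.{digest}{path.suffix}")
    path.rename(versioned)
    return versioned
```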

Compilation

We also use HipHop, so we do have to compile the PHP code.
The HipHop compilation for a huge code base like ours is a quite heavy operation, so a farm of 6 servers is dedicated just to this purpose; with it, the full compilation takes about 5-6 minutes.

Testing

The built and compiled code is deployed to an alpha server for the release branch, and QA tests the code there. The testing is fast and not extensive; it’s basically a sanity test of the main features, since Jenkins and the previous testing already assure its quality. Furthermore, the error log is checked just in case anything fails silently but leaves an error trace.
This testing phase usually takes a few minutes, and bugs are rarely found.
In addition, Jenkins runs all of the automated tests, so we have even more assurance that no tests have been broken.

Staging Phase, the Users Test for Us

Staging is the last step before the final production deployment. It consists of a handful of dedicated servers where the release branch code is deployed and thousands of real users transparently “test” it. We just need to keep an eye on the error log, the performance stats, and the server monitors to see if any issue arises.

This step is quite important. New bugs are almost never found here, but the ones that are found are very hard to detect elsewhere, so anything found here is more than welcome, especially because those bugs are usually caused by a large number of users browsing the site, a situation that we can’t easily reproduce in an alpha or development environment.

Updating the Website!

We are now sure that the release branch code is correct and bug free. We are ready to deploy the release code to hundreds of frontend servers. The same built code we used for deploying to the release branch’s alpha server will be used for production.

The Deployment: TuentiDeployer

The deployment is performed with a tool called TuentiDeployer. Every time we’ve mentioned a “deploy” within these blog posts, that deploy was done using this tool.

It is used across the entire company: any type of deployment, for any service or to any server, goes through it. It’s basically a smart wrapper over Rsync and WebDAV that parallelizes operations and supports multiple, flexible configurations, letting you deploy almost anything you want wherever you want.

The production deployment, of course, is also done with TuentiDeployer, and pushing code to hundreds of servers only takes 1 - 2 minutes (mostly depending on the network latency).
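
The post doesn’t show TuentiDeployer’s internals, but the core idea it describes, parallelizing Rsync across many hosts, can be sketched like this:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def rsync_host(host: str, src: str, dest: str) -> tuple:
    result = subprocess.run(["rsync", "-az", "--delete", src, f"{host}:{dest}"],
                            capture_output=True)
    return host, result.returncode == 0

def deploy(hosts: list, src: str, dest: str, workers: int = 32) -> list:
    # Push to many frontends concurrently; return the hosts that failed.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda h: rsync_host(h, src, dest), hosts)
    return [host for host, ok in results if not ok]
```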

It performs different types of deployments:

  1. PHP code to some servers that do not support HipHop yet.
  2. Static files to the static servers.
  3. An alpha deployment to keep at least one alpha server with live code.
  4. HipHop code to most of the servers (see the sketch after this list):

    1. Not fully true: we can’t push a huge binary file to hundreds of servers.
    2. Instead, it only deploys a text file with the new HipHop binary version.
    3. The frontend servers have a daemon that detects when this file has changed.
    4. If it changed, all servers get the binary file from the artifact server.

      • The binary is put there in a previous step, pushed right after its compilation.
    5. Obviously, hundreds of servers getting a big file from a single artifact server would bring it down, so there are a bunch of caching artifact servers to fetch from and relieve the master one.
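
A hedged sketch of such a daemon, with assumed paths and a generic HTTP fetch; the post only says it watches the version file and fetches the binary from a (caching) artifact server:

```python
import time
import urllib.request
from pathlib import Path

VERSION_FILE = Path("/etc/app/hiphop.version")        # assumed location
ARTIFACT_URL = "http://artifacts.example.com/hiphop"  # assumed cache server
BINARY = Path("/usr/local/bin/app.bin")               # assumed install path

def watch_and_fetch(poll_seconds: int = 10) -> None:
    current = None
    while True:
        version = VERSION_FILE.read_text().strip()
        if version != current:
            # The version file changed: fetch the matching binary from a
            # caching artifact server, then swap it into place atomically.
            tmp = BINARY.with_suffix(".new")
            urllib.request.urlretrieve(f"{ARTIFACT_URL}/{version}/app.bin",
                                       str(tmp))
            tmp.rename(BINARY)
            current = version
        time.sleep(poll_seconds)
```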

Finishing the Release

Release done! It can be safely merged into the live branch so that every developer can pull it and work with the latest code. This is when Jira and Flow take part again. The Jira ticket triggers the whole automated process with just the click of a button.

The Jira release ticket is transitioned to the “Deployed” status and a process in Flow starts. This process:

  1. Merges the release branch to the live branch.
  2. Closes the release branch in Mercurial.
  3. Disables the Jenkins job that tested the release branch to avoid wasting resources.
  4. Notifies everyone by email that the release is already in production.
  5. Updates the Flow dashboards to reflect that the release is over.

Oh No!! We Have to Revert the Release!!

After deploying the release to production, we might detect that something important is really broken and we have no other choice but to revert the release, because that feature cannot be disabled by config and the fix looks complex and won’t be ready in the short term.

No problem!! We store compressed builds of the last releases, so we just need to decompress one and do the production deployment.
The revert process takes only about 4 minutes and almost never takes place.
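
In sketch form, assuming the builds are archived as tarballs (the archive format is an assumption) and reusing the deploy() helper sketched in the TuentiDeployer section:

```python
import tarfile

def revert_to(release: str, hosts: list) -> None:
    # Decompress the stored build of a previous release...
    with tarfile.open(f"/var/builds/{release}.tar.gz") as archive:
        archive.extractall(f"/var/builds/{release}")
    # ...and run a normal production deployment with it.
    failed = deploy(hosts, f"/var/builds/{release}/", "/srv/app/")
    assert not failed, f"revert incomplete on: {failed}"
```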

You can keep reading the next post here.

The Test Advisor

Posted on 9/02/2013 by Rafa de la Torre & Miguel Angel García, Test FW Engineers

The Test Advisor is a tool that lets you know what set of tests you should execute given a set of changes in your source code. It is automatic and does not require you to tag all of your test cases.

A Bit of Background

The story of CI has been a tough one here at Tuenti. At one point, our CI system had to handle a huge load of tests of different types for every branch. We are talking about something around 20K test cases, a 2.4h build time, and Jenkins acting as a bottleneck for the development process. From a developer’s perspective, we track a measure of their suffering, called FPTR (short for “from push to test results”). It peaked at 8h.


Therefore, something had to be done with the tests. There were multiple options: add more hardware, optimize the tests, optimize the platform, delete the tests... or a combination of these. When considering solutions, you need to keep in mind that products evolve, which can produce more tests and worsen the problem. In the long run, the best solution will always somehow involve test case management and process control, but that’s another story for a future post...

After adding more hardware, implementing several performance improvements, and managing slow and unstable tests, we concluded that we had to improve the situation in a sustainable way.

One solution we thought of was to identify the tests relevant to the changes being tested and execute only those, instead of executing the full regression. The idea is not novel and may sound pretty obvious, but implementing it is not obvious at all.

In short, the idea is to identify the changes in the system and the pieces that depend upon those changes, and then execute all of the related tests. Google implemented this on top of their build system. However, we don’t have such a great build system in which dependencies and tests are explicitly specified, and our starting point was a rather monolithic system.

The first approach we took in that direction was to identify the components of the system and annotate the tests to indicate which component they aim to test. The theory was: “If I modify component X, then run all the tests annotated with @group X”. It didn’t work well: the list of components is live and evolves with the system, so it requires maintenance; tests needed to be annotated manually; keeping the annotations in sync required a lot of effort; and there was no obvious way to check their accuracy.

A different approach is to gather coverage information and exploit it in order to relate changes in source files to the tests covering those changes. Getting coverage globally is not easy with our setup; we use a custom solution based on phpunit+xdebug. There are still some problems with that approach, though, and they mostly affect end-to-end and integration tests: it is hard to relate the source files to the files finally deployed to the test servers, partly due to the way our build script works. Yes, it is easier for unit tests, but they are not really a problem since they are really fast. Additionally, we did not want to restrict our solution strictly to PHP code.

What it Is

The Test Advisor is a service that gathers “pseudo-coverage” information to be used later on, essentially to determine what the relevant test cases for a given changeset are.
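
Conceptually, that pseudo-coverage boils down to a mapping from files to the tests that touched them, which a client can then invert over a changeset. A toy model of the idea (the names and structures are illustrative, not the service’s real API):

```python
from collections import defaultdict

class AdvisorIndex:
    """Toy model: file path -> set of tests that touched it."""

    def __init__(self):
        self._tests_by_file = defaultdict(set)

    def record(self, test_name, touched_files):
        # Called once per test with the files observed during its run.
        for path in touched_files:
            self._tests_by_file[path].add(test_name)

    def advise(self, changed_files):
        # The advised set: every test that touched any changed file.
        advised = set()
        for path in changed_files:
            advised |= self._tests_by_file[path]
        return advised
```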

When it was proposed, some of the advantages that we envisaged were:

  • Reduced regression times in a sustainable way. It would no longer be limited by the size of the full regression.
  • Improved feedback cycles
  • No need for manual maintenance (as opposed to annotating tests)

Regarding feedback cycles, a couple of benefits we foresaw were related to some pretty common problems in CI setups: robustness and speed of the tests. A common complaint was about robustness and false positives: “My code doesn’t touch anything related to that test but it failed in Jenkins”. If we had the Test Advisor, who would suffer from bad-quality tests in the first place? The ones modifying code related to those tests, who want to get their changes into production. No more discussion about ownership. No more quick and dirty workarounds for the flakiness of the tests. The same applies to their speed. It would be in your best interest to develop and maintain high quality tests :)

How It Works

Most of our product stack runs on Debian Linux. We decided to use the inotify kernel subsystem to track filesystem events. This way, we can get that pseudo-coverage at the file level, independently of the language, the test framework, or even the test runner used.

We developed it using Python as the main language, which is a good language for building a quick prototype and putting all the pieces together. We also used pyinotify.

The Test Advisor is conceptually composed of three main parts:

  • A service for information storage
  • A service for collecting coverage information. It is notified by the test runner when a test starts and ends, and it is responsible for setting up monitoring and retrieving information about the files being accessed to complete the test scenario (sketched after this list).
  • A client to retrieve and exploit information from the TestAdvisor.
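
A minimal sketch of the collection side using pyinotify (paths and event mask simplified; in the real service, the test runner’s start/end notifications delimit each test):

```python
import pyinotify

class AccessCollector(pyinotify.ProcessEvent):
    def my_init(self, accessed=None):
        self.accessed = accessed if accessed is not None else set()

    def process_IN_OPEN(self, event):
        # Record every file opened while the current test runs.
        self.accessed.add(event.pathname)

def files_touched_by(run_test, watched_dir="/srv/deployed-code"):
    accessed = set()
    wm = pyinotify.WatchManager()
    notifier = pyinotify.ThreadedNotifier(wm, AccessCollector(accessed=accessed))
    notifier.start()
    wm.add_watch(watched_dir, pyinotify.IN_OPEN, rec=True, auto_add=True)
    try:
        run_test()  # in the real service, start/end arrive over the network
    finally:
        notifier.stop()
    return accessed
```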

This is how it looks at a high level:

[Figure: high-level architecture of the TestAdvisor]

As you can see from the figure, the TestAdvisor can take advantage of any test type being executed (unit, integration, or E2E browser tests). It is also independent of the language used to implement the tests and the SUT.

Problems Along the Way

So we were easily able to get the files covered by a particular test case, but those files are not actual source files. They go through a rather complex build process in which files are not only moved around, but also versioned, stripped, compressed, etc., and some of them are even generated dynamically to fulfill HTTP requests in production. We tried to hook into the build process to get a mapping between source and deployed files, but it turned out to be too complex. We finally decided to use the “development mode” of the build, which basically links files and directories.

Another problem was file caching. The HipHop interpreter (among others) caches the PHP files being processed and decides whether a file has changed based on its access time, and inotify does not provide a way of monitoring stat operations on files. We researched a few possibilities for getting around this:

  • Override the stat system call and its variants by means of LD_PRELOAD; not an easy one to get fully working (and it also made us feel dirty messing around in there).
  • Instrument the kernel with systemtap, kprobes, or similar. A bit cleaner, but one mistake and your precious machine may freeze. We also got the feeling that we were trying to re-implement inotify.

Finally, we picked the KISS solution: just perform a touch on the files accessed between tests. This way, the regression executed to grab information will be slower, but we don’t care much about the duration of a nightly build, do we? :)
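
In Python, that touch is a one-liner per file, applied to the set collected for the previous test:

```python
import os

def touch_accessed_files(paths):
    # Updating the timestamps invalidates the interpreter's file cache,
    # so the next test re-opens the files it needs and inotify sees
    # those accesses again.
    for path in paths:
        os.utime(path, None)
```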

Work that Lies Ahead

Right now, the TestAdvisor is used as a developer tool. We’ve achieved great improvements in the CI by other means (most of them process-related changes). However, we are still eager to integrate the TestAdvisor into our development and release cycle within our CI setup.

Developers will probably use a pipeline for their branches, consisting of 3 distinct phases: Unit Testing, Advised Testing, and (optionally) Pre-Integration. An “advised build” in green can be considered sufficiently trustworthy to qualify for integration into the mainline.

This approach has been applied to the core of our products, the server-side code, and the desktop and mobile browser clients. We expect it to also be applied in the CI of the mobile apps at some point.

The Tuenti Release and Development Process: Development Branch Integration

Posted on 8/26/2013 by Víctor García, DevOps Engineer

Blog post series: part 3
You can read the previous post here.

In the previous blog post, we mentioned that one of the requirements a development branch must fulfill is a pull request. This is the only way to merge code into the integration branch at Tuenti.
We really want an evergreen integration branch that is always in a good state, with no tests failing. To accomplish that, we created a pull request system managed by a tool called Flow.

The Flow Pull Request System

Flow is a tool that fully orchestrates the whole integration, release, and deployment process, and it offers dashboards that show everyone the release status, the integration branch, pull request results, etc. A full blog post will explain this later, so here we’re focusing on branch integration.

Although Flow has the logic and performs the operations, every action is triggered through Jira.

[Figure: the Jira-driven pull request workflow]

Following the above diagram, here are the steps performed:

  • The developer creates an “integration” ticket in Jira. This ticket is considered a unique entity that represents a developer’s intention to integrate code, so the ticket must contain some information regarding the code that will be merged:

    • Branch contents
    • Author
    • QA member representative
    • Risks
    • Branch name
    • Mercurial revision to merge
  • Then, the ticket must be transitioned to the “Accepted” status by a QA member, so we can keep track and be assured that at least one of them has reviewed this branch.
  • To integrate the branch, the ticket must be transitioned to the “Pull request” status. This action will launch a pull request in Flow.

    • Flow and Jira are integrated and they “talk” in both directions. In this case, Jira sends a notification to Flow informing of a new pull request.
    • Additionally, in the background, Flow will gather the tickets involved in the branch and will link all of these tickets with the integration ticket using the Jira ticket relationships.
    • This provides a good overview of what is included in the branch to be merged.
  • Flow has a pull request queue, so they are processed sequentially.
  • Flow starts processing the pull request (a sketch of this loop follows the list):

    • It creates a temporary branch by merging the pull request branch into the current integration branch.
    • It configures Jenkins and triggers a build for that temporary branch to run all tests.
    • It waits until Jenkins has finished; when it’s done, Jenkins notifies Flow.

      • Jenkins executes all tests in 40 minutes, using a special and more parallelized configuration.
    • Then, it checks the Jenkins results and decides whether or not the branch can actually be merged to the integration branch.

      • If successful, it performs the merge, transitions the Jira ticket to “Integrated,” and starts with the next pull request.
      • Otherwise, the pull request is marked as failed and the Jira ticket is transitioned to “Rejected”.
    • After every operation, Flow sends an email to the branch owner notifying him or her of the status of the pull request and adds a comment to the Jira ticket.
  • If the pull request fails, Flow shows the reason in its dashboard and in the ticket (failed tests, merge conflicts), so the developer must:

    • Fix the problems.
    • Change the Mercurial revision in the Jira ticket, because the fix requires a new commit.
    • Transition the ticket to the “Pull request” status again to launch a new pull request.
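
A condensed sketch of that processing loop, with the Mercurial steps as commands and the Jenkins/Jira interactions reduced to hypothetical helpers (trigger_jenkins_build, transition_ticket, notify_owner), since Flow itself is internal:

```python
import subprocess

def hg(*args: str) -> None:
    subprocess.run(["hg", *args], check=True)

def process_pull_request(pr) -> None:
    # pr is illustrative: an object with branch, revision, and ticket fields.
    temp_branch = f"merge-{pr.branch}"

    # Temporary branch: merge the pull request into current integration.
    hg("update", "integration")
    hg("branch", temp_branch)
    hg("merge", pr.revision)
    hg("commit", "-m", f"Test merge of {pr.branch}")

    # Run the full suite on the temporary branch and wait for Jenkins.
    if trigger_jenkins_build(temp_branch).succeeded:
        hg("update", "integration")
        hg("merge", temp_branch)
        hg("commit", "-m", f"Integrate {pr.branch}")
        transition_ticket(pr.ticket, "Integrated")
    else:
        transition_ticket(pr.ticket, "Rejected")
    notify_owner(pr)  # email + Jira comment after every operation
```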


As you can see, this process requires minimal manual intervention; clicking a few Jira buttons is enough. It’s quite stable and, more importantly, it lets developers work on other projects while Flow is working, so they don’t have to worry about merges anymore. They just launch the pull request from their Jira ticket and can forget about everything else.

So, this is how we ensure that the integration branch is always safe and that tests don’t fail. Therefore, every integration branch changeset is a potential release candidate.

This model has proven to improve the integration workflow, and developer throughput has increased, so it can be considered a total success.

You can keep reading the next post here.
