Yesterday, we had the people from PHPMad user group back in our Madrid offices to hold the 2nd monthly meetup. This time, the speaker was our backend engineer Daniel Pañeda, and he talked about his experience migrating to HipHop, the HipHop virtual machine, and the general state of the project. You can read the epic tale of the migration process at Tuenti in this post. The talk wasn’t recorded this time, but you can see the slides here and if you are interested, you can also browse our patches to HipHop in our Github account.
You can find upcoming events related to this group on their meetup page.
Blog post series: part 5
You can read the previous post here.
Normally, website development differentiates between regular code and configuration code.
Keeping configuration separate from the code allows you to make quick, basic changes without touching the logic. These changes are safer because they should be just an integer value change, a boolean swap, a string substitution, etc., and they don't involve a full release process.
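As a minimal sketch of the idea (the config keys and values here are made up for illustration), a toggle read from a separately deployed config file lets you flip behaviour without touching the logic:

```python
import json

# config.json is deployed independently of the application code, so
# flipping a boolean or tweaking a limit needs no full release.
CONFIG = json.loads('{"new_search_enabled": false, "max_results": 20}')

def search(query):
    # The logic only consults the config value; the change itself
    # is a one-line edit in the configuration file.
    if CONFIG["new_search_enabled"]:
        return ["new-engine result for %s" % query][:CONFIG["max_results"]]
    return ["legacy result for %s" % query][:CONFIG["max_results"]]
```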
Some good practices have been described recently on the Internet about how to write your code in a flexible way to avoid useless releases, how to do A/B testing, how to make your database changes backwards compatible, etc., and all of these good practices involve a good configuration system.
Here at Tuenti, we are very fond of the DevOps culture and try to apply it as much as possible. We consider it the way to go and the way to do things efficiently, and there is proof that it has helped us improve quite a lot.
In our company, the configuration deployment is a clear example of DevOps culture. There is no manual intervention and there isn’t anyone who does deployments such as a release manager or operations guy. Every developer pushes his/her configuration changes to production on his/her own. Therefore, Devs are doing Ops tasks.
We use ConfigCop for that.
ConfigCop is a tool to deploy configuration to preproduction and production servers. Any developer can and must use it to, first, test their changes in a preproduction server and then, deploy them to production.
Preproduction deployment logic is handled entirely on the client side, and the basic options a developer can use are a configuration initialization and a configuration update, the latter being the one that uploads any configuration to be tested.
ConfigCop pulls the latest code from Mercurial, gathers all configuration files, and stages them, applying the overrides necessary to make them work on preproduction servers and generating the .json files readable by HipHop.
Then it just deploys them using Rsync and the developer is ready to test.
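The staging step above can be sketched roughly as follows (a simplified illustration, not ConfigCop's actual code; the override keys and file names are hypothetical):

```python
import json

# Hypothetical sketch of the client-side preproduction flow:
# pull -> gather -> apply overrides -> emit HipHop-readable JSON -> rsync.
PREPROD_OVERRIDES = {"db_host": "preprod-db.internal"}  # assumed example

def stage_config(files, overrides):
    """Merge preproduction overrides into each gathered config file and
    serialize the result as JSON, ready to be rsynced to preproduction."""
    staged = {}
    for name, values in files.items():
        merged = dict(values)
        merged.update({k: v for k, v in overrides.items() if k in values})
        staged[name + ".json"] = json.dumps(merged, sort_keys=True)
    return staged

# gathered = pull_from_mercurial()   # hg pull/update step (not shown)
staged = stage_config({"db": {"db_host": "prod-db", "port": 3306}},
                      PREPROD_OVERRIDES)
# ...then rsync the staged files to the preproduction server (not shown)
```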
A production deployment must be sequential and cannot be done in parallel because conflicts may arise.
Therefore, ConfigCop uses its server side for this. It's basically server-client communication over RPC that establishes a locking mechanism to enforce sequential deployments. The lock is held by the developer currently deploying a configuration and can't be stolen.
Until the developer has finished a configuration change, another one can’t start.
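A minimal sketch of such a non-stealable lock (the real mechanism lives behind RPC on the server; class and method names here are invented for illustration):

```python
import threading

class DeployLock:
    """One deployer at a time: the lock is granted to the developer
    currently deploying and cannot be stolen by anyone else."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._owner = None

    def acquire(self, developer):
        with self._mutex:
            if self._owner is None:
                self._owner = developer
                return True
            # Only the current owner may "re-acquire"; everyone else waits.
            return self._owner == developer

    def release(self, developer):
        with self._mutex:
            if self._owner == developer:
                self._owner = None
                return True
            return False  # nobody else may free somebody's lock
```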
The workflow is pretty simple:
ConfigCop freed the release managers from these operational tasks, and now developers, with just two commands, are able to test and deploy configuration to production in an easy, reliable, and fast way.
The team in charge of the Mobile Virtual Network Operator (hereafter MVNO) at Tuenti was asked to build a new checkout for the store. For legal and business reasons, a checkout for an MVNO has to comprise steps such as shipping information, billing information, customer ID check, etc. Our main goal was to find the best checkout that minimizes users abandoning the process.
The checkout had to be extensible and configurable enough to fulfill the following requirements:
There are two main entities in the back-end to model our checkout framework: the Order and the StateMachine.
The Order is a simple, persistent container into which we place the input the user has chosen through the checkout. In the Order, we keep a record of the data plan chosen, the initial top-up, shipping information, etc. An Order is identified by its OrderId.
The StateMachine is a state machine that drives the checkout process from the beginning to end. Its responsibility is to define a flow of steps, so that the front-end can request the next step of the checkout in a generic way, which the StateMachine will provide. A StateMachine is identified by the FlowId.
When we have a step that depends on those that come before it, the StateMachine is able to resolve it through guard conditions. For example, if the user chooses neither an initial top-up amount nor a data bundle, the machine won't provide the payment step.
The relation between the two is that an Order references a StateMachine, so an order can only be processed by a single StateMachine.
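The two entities can be sketched like this (field and class shapes are illustrative, not Tuenti's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Order:
    """Persistent container for the user's checkout input."""
    order_id: str
    flow_id: str            # references the StateMachine driving this order
    data: dict = field(default_factory=dict)  # plan, top-up, shipping, ...

@dataclass
class StateMachine:
    """Drives the checkout: a declaration of steps and transitions."""
    flow_id: str
    transitions: dict       # {(current_step, event): next_step}

    def next_step(self, state, event):
        # The front-end asks for the next step generically; the machine
        # resolves it from the declared transitions.
        return self.transitions.get((state, event))

flow = StateMachine("default_flow", {("shipping", "submit"): "billing",
                                     ("billing", "submit"): "payment"})
order = Order("o-1", flow.flow_id)  # an Order references one StateMachine
```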
The design described above allows us to perform A/B testing on the order of the steps simply by defining another StateMachine, which is no more than a declaration of steps and transitions.
For example, if we're doing an A/B/C test, we define 3 different state machines.
Any transition of the StateMachine has two attributes:
Although the checkout happy case is to go forward until the end, we had to make it possible to go back to edit steps. Having implemented a state machine, the natural move was to just define edit transitions between steps. These edit transitions go backwards in the StateMachine.
Implementing step editing as a transition not only removes code complexity and matches the natural behaviour of the state machine, but it also makes clear from which states users can edit other steps (for example, if the user has already gone through the payment step, we shouldn't allow him/her to change the initial top-up).
Since a main goal is to make finishing the checkout as fast as possible, we didn't want our customers to repeat the steps they had already completed, even if that only means clicking and submitting the pre-filled data again. This applies both when editing a previous step and after a full reload.
Thus, when the user submits the step s/he is editing, we make him/her jump to where s/he was before editing (that is, forward in the state machine). The initial implementation was to navigate automatically through the original transitions, calculating them from the previously entered data (automatically resolving branching, like postpay/prepay), until we reached a point where we could not transition further for lack of data: that was the user's previous state, so we would stop the machine there.
This initial implementation of these auto transitions was quickly shown to be incorrect when we needed to add callbacks to the transitions, such as sending a request to our third party ID validation system, sending an email of a certain event to the user, etc. These actions took place every time the transition was traversed, which was an undesirable behaviour.
In the end, we coded the auto transitions as independent transitions, without any triggers. As with the edit transitions, this gave us the power to choose which transitions could be automatic and which couldn't.
In summary, every StateMachine is defined by its transitions, which are specified in 3 groups:
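Assuming the three groups are the normal forward, backward edit, and trigger-free auto transitions discussed above, a purely declarative flow definition with a guard condition could be sketched like this (all names illustrative):

```python
def has_paid_item(order):
    # Guard: the payment step is only reachable if the user chose an
    # initial top-up or a data bundle (see the guard-condition example).
    return bool(order.get("top_up") or order.get("data_bundle"))

# A StateMachine as a declaration of (from_step, to_step, guard) triples,
# split into the three groups: normal, edit (backwards), auto (no triggers).
FLOW_A = {
    "normal": [("shipping", "billing", None),
               ("billing", "payment", has_paid_item),
               ("billing", "confirm", None)],
    "edit":   [("payment", "shipping", None)],   # go back to edit a step
    "auto":   [("shipping", "billing", None)],   # replayed without callbacks
}

def next_states(flow, state, order):
    """Resolve reachable next states, honouring guard conditions."""
    return [dst for src, dst, guard in flow["normal"]
            if src == state and (guard is None or guard(order))]
```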
The front-end side of the project is based on our in-house framework. The typical architecture of a product based on this framework consists of an Agent that receives the http request parameters and returns a response with rendered content and data to the client-side.
We added a new layer to decouple the rendering. With it, we are able to control the kind of views we are using for this flow; that is, it allows us to perform A/B testing on the look and feel of the checkout.
With the architecture presented above, testing a step is as easy as creating a new renderer that inherits from the original and overrides the method in charge of rendering the step we want to change.
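In sketch form (class and method names are hypothetical, not our framework's actual API):

```python
class CheckoutRenderer:
    """Base renderer used by the default flow."""
    def render_shipping(self, order):
        return "<form class='shipping-v1'>...</form>"

    def render_billing(self, order):
        return "<form class='billing-v1'>...</form>"

class ShippingExperimentRenderer(CheckoutRenderer):
    """A/B variant: only the step under test is overridden; every other
    step inherits the original look and feel from the base renderer."""
    def render_shipping(self, order):
        return "<form class='shipping-v2'>...</form>"
```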
The entry point of the checkout (usually an agent) is what defines the flow (StateMachine) and the renderer of the process for this user. Our A/B testing API chooses both of them, and they are stored in the Order (which is recovered by session), so if the user reloads the page or revisits it after several days, s/he still sees and continues the same process.
The whole design and implementation has been a challenge for us. Making something as sensitive as an MVNO checkout that was also configurable was a very ambitious project.
We feel as though we developed a checkout framework more than the product itself, something that exceeded our original estimations. However, we are confident that it will pay off, because we now have total flexibility to change any aspect of the process in an easy way, as well as the power to do A/B testing on any side of the checkout and, of course, to build more specific flows.
Last Wednesday, we hosted the first meetup of the new PHP users group, PHPMad, formerly the Symfony Madrid group, in our nice Madrid office. In addition to a short presentation by the group, we had the chance to hear a few words from invited speaker David Buchman. He gave a talk and presented a demo on the Symfony2 Content Management Framework, of which he is one of the core developers. The talk was recorded, so you can watch it here.
You can find upcoming events related to this group on their meetup page.
Blog post series: part 4
You can read the previous post here.
So, we have made sure that the integration branch is good enough and every changeset is a potential release candidate. Therefore, the release branch selection is trivial: just pick the latest changeset.
The release manager is in charge of this. How does s/he do it? Using Flow and Jira.
Like the pull request system we talked about in the previous post, Jira orchestrates the release workflow, and Flow takes care of the logic and operations. So:
Once the release branch is selected, it is time to start testing it because there might be a last minute bug that escaped all of the previous tests and eluded our awesome QA team’s eyes.
Flow detects new commits made on the release branch (these commits almost never occur) and builds, compiles, and updates an alpha server dedicated to that branch.
Why do we build the code? PHP is an interpreted language that doesn’t need to be built! Yes, it’s true, but we need to build other things:
We also use HipHop, so we do have to compile PHP code.
The HipHop compilation of a huge code base like ours is quite a heavy operation. Using a farm of 6 servers dedicated just to this purpose, we get the full code compiled in about 5-6 minutes.
The code built and compiled is deployed to an alpha server for the release branch, and QA tests the code there. The testing is fast and not extensive. It’s basically a sanity test over the main features since Jenkins and the previous testings assure its quality. Furthermore, the error log is checked just in case anything fails silently but leaves an error trace.
This testing phase usually takes a few minutes and bugs are rarely found.
Furthermore, Jenkins also runs all of the automated tests, so we have even more assurance that no tests have been broken.
Staging is the last step before the final production deployment. It consists of a handful of dedicated servers where the release branch code is deployed and thousands of real users transparently "test" it. We just need to keep an eye on the error log, the performance stats, and the server monitors to see if any issue arises.
This step is quite important. New bugs are almost never found here, but the ones that are found are very hard to detect, so anything found here is more than welcome, especially because those bugs are usually caused by a large number of users browsing the site, something we can't easily reproduce in an alpha or development environment.
We are now sure that the release branch code is correct and bug free. We are ready to deploy the release code to hundreds of frontend servers. The same built code we deployed to the release branch's alpha server will be used for production.
The deployment is performed with a tool called TuentiDeployer. Every time we’ve mentioned a “deploy” within these blog posts, that deploy was done using this tool.
It is used across the entire company: any type of deployment, for any service, to any server, goes through it. It's basically a smart wrapper over Rsync and WebDAV that parallelizes operations and supports multiple, flexible configurations, letting you deploy almost anything you want wherever you want.
The production deployment, of course, is also done with TuentiDeployer, and pushing code to hundreds of servers only takes 1 - 2 minutes (mostly depending on the network latency).
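A rough sketch of the parallel-rsync idea behind such a deployer (this is not TuentiDeployer's actual code; paths, flags, and server names are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def rsync_command(build_dir, server):
    # One rsync invocation per frontend server (flags are illustrative).
    return ["rsync", "-az", "--delete", build_dir + "/", server + ":/var/www/"]

def deploy(build_dir, servers, run, workers=32):
    """Run one rsync per server concurrently, so pushing to hundreds of
    servers finishes in a couple of minutes rather than sequentially.
    `run` executes one command (e.g. subprocess.check_call in real use)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run, (rsync_command(build_dir, s) for s in servers)))

executed = []  # capture commands instead of executing them, for the demo
deploy("/builds/release-42", ["web001", "web002"], executed.append)
```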
It performs different types of deployments:
Release done! It can be safely merged into the live branch so that every developer can work with the latest code. This is when Jira and Flow take part again. The Jira ticket triggers the whole automated process with just the click of a button.
The Jira release ticket is transitioned to the “Deployed” status and a process in Flow starts. This process:
After deploying the release to production, we might detect that something important is really broken and we have no choice but to revert the release, because that feature cannot be disabled by config and the fix seems complex and won't be ready in the short term.
No problem! We store compressed builds of the last releases, so we just need to decompress one and do the production deployment.
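The revert path boils down to "unpack the last good build and redeploy it"; an in-memory sketch of the store/restore round trip (function names are made up, and the real builds live on disk, not in memory):

```python
import io
import tarfile

def store_build(archive, name, content):
    """Keep a release build as a gzipped tarball."""
    data = content.encode()
    info = tarfile.TarInfo(name)
    info.size = len(data)
    with tarfile.open(fileobj=archive, mode="w:gz") as tar:
        tar.addfile(info, io.BytesIO(data))

def restore_build(archive, name):
    """Decompress the stored build so it can be redeployed as-is,
    with no rebuild or recompilation needed."""
    archive.seek(0)
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        return tar.extractfile(name).read().decode()

buf = io.BytesIO()
store_build(buf, "index.php", "release-41 code")
# ...later: release-42 turns out to be broken, so revert:
reverted = restore_build(buf, "index.php")
```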
The revert process takes only about 4 minutes and almost never takes place.
You can keep reading the next post here.