Tuenti Group Chat: Simple, yet complex

Published on 28/9/2012 by Diego Muñoz, Senior Engineer

We have recently released the #1 most requested feature at Tuenti: group chat.
It has been a titanic effort: months of work on the server code, the client-side code, and new systems infrastructure to support this highly anticipated feature. But was the feature really so big that it had to take this long?

Scope

Since 2010 we have been improving the chat server code (Ejabberd, written in Erlang), achieving important performance gains and lowering server resource consumption.
We had approximately 3x the performance of a vanilla Ejabberd setup which, considering that we currently handle more than 400M chat messages per day, is not bad at all.

We also had 20 chat server machines, each running 6 Ejabberd instances on average, all of them well below their capacity, so resharding the machines and setting up a load balancer was appealing.

Chat history was almost done, but we had to add support for group chats. It is one of the first projects we have built on HBase instead of MySQL as the storage layer.

The message delivery system (aka message receipts) was also quite advanced in its development, but not yet finished. It uses a simple flow of Sent -> Delivered -> Read states.
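
As an illustration, here is a minimal sketch (with invented names, not our actual implementation) of how a forward-only receipt flow can be enforced:

// Receipts only move forward: Sent -> Delivered -> Read, never backwards.
var RECEIPT_STATES = ['sent', 'delivered', 'read'];

function advanceReceipt(current, next) {
    var from = RECEIPT_STATES.indexOf(current);
    var to = RECEIPT_STATES.indexOf(next);
    if (from === -1 || to === -1) {
        throw new Error('Unknown receipt state: ' + current + ' / ' + next);
    }
    return to > from ? next : current;
}

// advanceReceipt('sent', 'delivered') -> 'delivered'
// advanceReceipt('read', 'delivered') -> 'read' (a read message stays read)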

Multi-presence means being able to open multiple browser windows and/or multiple mobile devices without losing the chat connection in any of them (up to a maximum). To achieve this, the server-side logic needs to handle not only Jabber IDs but also resources, so that the same JID can be connected from multiple sources at the same time.
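
As a rough illustration (in JavaScript with invented names; the real logic lives in our Erlang server code), multi-presence boils down to keeping one connection per resource and fanning out anything addressed to the bare JID:

// One bare JID (the user) may have several connected resources,
// e.g. 'user@tuenti.com/web-1' and 'user@tuenti.com/mobile-2'.
var sessions = {};  // bare JID -> { resource name: connection }

function addSession(bareJid, resource, connection) {
    sessions[bareJid] = sessions[bareJid] || {};
    sessions[bareJid][resource] = connection;
}

// Deliver a stanza addressed to the bare JID to every connected resource.
function routeToAllResources(bareJid, stanza) {
    var resources = sessions[bareJid] || {};
    for (var resource in resources) {
        resources[resource].send(stanza);
    }
}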

The “new Tuenti”: this new version of the main website required a great part of the company's technical resources. Since the team in charge of the chat has other responsibilities as well, we had to dedicate some of its engineers to building parts of the new website.
And as the redesign implied a completely new visual look, the chat had to change its appearance too.

And of course, the group chat.

  • Being able to chat with multiple people at once
  • Roles of room owner (the “administrator” of that chat group/room), members and banned members
  • Persisting rooms even if you close the window (until you explicitly close the room)
  • Supporting both default group room avatars (a pretty mosaic) and custom ones (choose any of your photos or upload a new one)
  • Supporting custom room titles
  • Room mute

The Old Chat Web Client

The web chat is a full JavaScript client, using Flash only for video chat. We use a modified open-source JavaScript XMPP library, JSJaC, tailored to our needs.
A rough schematic of the chat client's architecture:

  • An HTML receiver file that performs long-polling connections to the chat servers to simulate a persistent socket.
  • A request controller that processes incoming XML chat messages (stanzas, IQs and the like) using the JSJaC library and converts them into JavaScript objects.
  • A chat UI controller that drives the chat windows, the buddy list, and other UI components.
  • Buddylist, User and other classes, each implemented twice: once with a UI prefix and once with a Data prefix. We separate UI behaviour from data handling, and all the components communicate with each other (think of linked widgets rather than a traditional desktop chat client application); see the sketch after this list.
  • A User class that performs two tasks: it represents a buddy-list contact, but it also represents a conversation room (it stores the conversation, etc.).
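
To give an idea of what this UI/Data split looks like, here is a hypothetical sketch (the class and method names are made up for illustration; they are not our actual code):

// Hypothetical sketch of the UI/Data pairing; not the actual Tuenti classes.
function DataBuddylist() {
    this.presences = {};                       // contact data only, no DOM access
}
DataBuddylist.prototype.setPresence = function (userId, presence) {
    this.presences[userId] = presence;
    this.onPresenceChanged(userId, presence);  // notify the linked UI widget
};
DataBuddylist.prototype.onPresenceChanged = function () {};  // overridable hook

function UIBuddylist(dataBuddylist) {
    var self = this;
    // The UI half only renders; it reacts to changes in its Data counterpart.
    dataBuddylist.onPresenceChanged = function (userId, presence) {
        self.render(userId, presence);
    };
}
UIBuddylist.prototype.render = function (userId, presence) {
    // ... update the DOM node of that contact in the buddy list ...
};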

The code has been working flawlessly, with almost no client-side maintenance, since it was launched in 2009; we have only added new features and made visual style changes.

What went right

New cluster: it works really well. Not only do we now have load balancing, we can also perform upgrades on one leaf while keeping the chat up with the other leaf's nodes.
Each leaf now has 10 machines running up to 4 instances per machine, so we actually do more with less hardware.

Cleaner, up-to-date code: the chat client code now uses inheritance, which lets us avoid repeating code by having a base chat room from which one-to-one and group rooms derive. Data-related classes are now better separated from UI-related ones, much of the code is thoroughly commented, and we have private and public fields (by convention, not enforced by any JavaScript framework).
Many events are now handled by YUI, and the dozens of JavaScript files are still bundled into one when we deploy the code live, which eases development a lot.
Overall, the client will now support future enhancements and additions much faster.
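
As a rough sketch of what this hierarchy enables (the names are illustrative, not our actual classes):

// Illustrative room hierarchy: shared logic in a base room, with
// one-to-one and group variants extending it.
function ChatRoom(roomId) {
    this.roomId = roomId;
    this.messages = [];
}
ChatRoom.prototype.addMessage = function (message) {
    this.messages.push(message);
};

function OneToOneRoom(roomId, buddy) {
    ChatRoom.call(this, roomId);
    this.buddy = buddy;          // exactly one other participant
}
OneToOneRoom.prototype = Object.create(ChatRoom.prototype);

function GroupRoom(roomId, ownerId) {
    ChatRoom.call(this, roomId);
    this.ownerId = ownerId;      // the room "administrator"
    this.members = [];
    this.banned = [];
    this.title = null;           // optional custom room title
    this.muted = false;          // room mute
}
GroupRoom.prototype = Object.create(ChatRoom.prototype);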

Fast, very fast: the server-side code is even faster - more optimized, more adapted to our needs, and able to handle up to 13 times more messages at once! Custom XMPP stanzas have been built to allow fast but correct delivery.

Everything works as expected: we didn't have to make any tradeoffs due to technical limitations. We have kept the same browser support (including IE7), and all features work as the original requirements defined.

Two UIs co-exist happily: both versions of www.tuenti.com, each with its own distinct UI, share all the inner code and are easy to extend.

What went wrong

Running two projects in parallel: along with housekeeping tasks, the team had other high-priority projects to work on, which took resources and time away from group chat. Bad timing meant half of the client-side team was dedicated to building the “new Tuenti”, so group chat didn't get the team's full attention until the last stages of development.

One step at a time: Tuenti has migrated almost all client-side code to Yahoo's YUI library. We had to migrate the chat client, plus do a huge code refactor to add support for group chats, plus make the visual changes for the new website, plus build new features (chat history, receipts...). This generated a lot of overhead and a first phase of code instability in which we couldn't quickly tell whether a bug was due to the refactor, to YUI, or to a new feature not yet finished.
It would probably have been much better to first migrate to the new framework, then refactor, and only then apply the visual changes and implement or finish the new features.

Single responsibility principle: a class should have only one responsibility. By far the biggest and hardest part of the refactor was separating the original User class into ChatUser and ChatRoom. We couldn't have anticipated group chat back in 2009, but we were too optimistic when estimating the impact of this change while planning the group chat project.

Lack of client-side tests: the old chat client had no tests, so QA had to manually test everything, which also generated a lot of overhead.
We are now preparing a client-side testing environment and framework to keep the new chat codebase bug-free.

CSS 3 selector performance: with multi-presence and the new social application, all users now have many more friends online or reachable via mobile devices at once. Rendering hundreds of chat friends, combined with some CSS 3 selectors that are dangerous performance-wise, hit us in the late stages of development.
We hurried out some fixes and we are still improving performance, as some browsers still suffer a bit from the number of DOM nodes combined with the CSS matching rules.

Architectural description of the new Frontend Framework

Published on 04/6/2010 by Andrzej Tuchołka, Lead Code Architect, and Prem Gurbani, Frontend Architect

Abstract

Web applications are becoming more feature-rich and ever more ubiquitous, while trying to address the needs of millions of users. Server-side scalability and performance are a serious matter, as farms are unceasingly expanding for high-growth sites. User interface designers are looking for novel approaches to stimulate user engagement, which results in more advanced and feature-rich clients. Remaining competitive requires companies to constantly push and release new features along with overall UI redesigns. The evolution of the web paradigm requires architectural changes that increase flexibility and address scaling problems in terms of performance, development process, and demanding product requirements. This paper presents a web architectural design that decouples several layers of a web application while delegating all presentation-related routines to the client. To address organizational concerns, the client-side layers have been highly decoupled, resulting in well-defined, natural responsibilities for each of them: structure - HTML, layout - CSS, behavior - JavaScript. The server exclusively produces data that is sent to the client and then mapped to the HTML templates, taking advantage of the structure of the data (presence of items, detection of data sets, manual operations) to construct the view presented to the user. The data produced by the server is client-independent, enabling reuse of the server API by several clients. Overall, the strong responsibilities of the identified layers allow parallelizing the development process and reduce operational friction between teams. The server-side part of the frontend framework is designed around the novel Printer-Controller-Abstraction (PCA), a variation of the Presentation-Abstraction-Controller (PAC) architectural pattern. The design keeps the high flexibility of the graph of controllers, introduces additional concepts such as response caching and reuse, and allows easy changes of input and output formats.

State of the Art

The current system runs a rich JavaScript client with HTML/CSS generated by the server. Responses are generated using an in-house template engine built on an MVC-like (Model-View-Controller) framework. From an organizational perspective, frontend engineers build the controllers and views in PHP, and the framework then populates the templates (a mixture of PHP, HTML and JavaScript) to produce the output. Since each part of the response can be generated with PHP, engineers take many shortcuts that result in tight coupling of every possible piece of the code. From a design perspective, the existing front controller is an ad-hoc control flow that routes calls to the MVC framework. The standard request protocol uses URL-visible GET requests complemented with data sent via POST. The existing template engine is tightly coupled with the View and delivers final HTML to every user.

The existing system has several problems that need to be addressed at an architectural level:

  • minimize inter-team dependencies that force the work organization to be sequential,
  • avoid duplicating work when introducing additional client applications,
  • maximize the optimization and caching solutions that can be implemented at several levels of the system,
  • reduce TCO (Total Cost of Ownership) by reducing bandwidth and CPU load on the in-house infrastructure,
  • maximize flexibility in terms of changing the UI (User Interface) while reducing the time required to release UI changes,
  • implement an easily adoptable communication protocol to increase opportunities for external usage of the system,
  • maximize the reuse of the system's common data (list of friends, partitioning schema) and minimize the cost of bootstrapping the system.

An analysis of existing web applications such as Facebook, Gmail, Flickr, MySpace and Twitter shows that none of these sites produce AJAX (Asynchronous JavaScript And XML) responses that decouple data from presentation. Usually these responses are a pre-built stream mixing JavaScript, CSS, HTML and data, which is then inserted into specific containers in the DOM (Document Object Model) or simply evaluated in the JavaScript engine. In outline, the suggested solution defines a communication protocol built on JSON-RPC (JavaScript Object Notation - Remote Procedure Call), with well-defined scopes of responsibility for the technological components of the client and a highly customizable structure of server-side controllers.

Several frameworks exist that introduce similar solutions to decouple HTML from data and perform rendering in the client browser. Ext JS [1] allows placing templates in separate files that can be fetched and cached by the browser; AJAX is then used to fetch data from the server to populate the templates. This leads to bandwidth savings, as no redundant HTML is served on every pageview, and the cost of producing rendered HTML is moved to the client browser. However, the drawback of the Ext JS approach is that it introduces a new language in the templates, and therefore increases the cost of any UI- and design-related operations. Consequently, there is an increased need for interaction between designers and client-side developers. Also, the HTML templates produced by designers must be converted into Ext templates, increasing the possible points of failure and the complexity of the development process, as well as making maintenance cumbersome. Many other JavaScript libraries work with essentially the same concept, including Mjt [2], UIZE [3], TrimPath [4] and EmbeddedJS [5].

An implementation worth mentioning is PureJS [6], which tackles this issue by creating HTML templates that do not require conditional or looping statements. The templates remain in HTML, and the data is matched with injection points identified by a specific attribute in the structure. Conditional statements are triggered by simply hiding or showing an element; loops are inferred automatically by detecting whether the data is an array. However, PureJS does not effectively decouple the data from the structure: more complex (real) usages of the PureJS framework require constructing statements named "directives", which define the process of inserting the data into the HTML along with other elements such as additional data structures or user interaction.

The existing system produces server-side Views as part of the MVC framework. In a multi-client environment this leads to added load on the server infrastructure, along with increased complexity of implementing any changes to the presentation. Currently the system uses seven different interfaces, each with a separate set of views, controllers and dynamic templates; the new solution will reduce these to much simpler, static templates. An additional limitation of the current MVC system is that it can produce only one view per URL request; in contrast, a request with a chained JSON-RPC call can perform several operations and return data that can be used for display, caching, or configuration of the client. The new approach also opens several optimization opportunities in terms of reusing the bootstrapped instance and in-memory cache of the system, and reducing the number of client connections.

Overall front-end architecture and strategy

The front-end framework project is part of the overall architectural redesign of Tuenti.com. The front-end layer is responsible for rendering views, communicating with the server, and UI interaction. The second part of the project involves the back-end redesign, which is outside the scope of this document. The main concerns identified by the stakeholders and addressed by the front-end design are flexibility, cost and schedule, and integrability.

The frontend framework's guiding principle is to produce a highly decoupled system by introducing a natural separation of concerns in the source code: structure (HTML), layout and styles (CSS), behavior and control (JavaScript), and data (JSON-RPC). Everything except the data can be cached on the client and in content-caching solutions, reducing the load on the in-house infrastructure. Furthermore, since some requests return mostly unchanged data, it is possible to cache and reuse them; this opportunity is supported by a dramatic reduction (45% to 91% according to our tests) in the size of the response produced by the server.

The decoupling mentioned above also supports the organizational aspects of development projects based on the framework. Work can easily be parallelized between the teams participating in a project; the only required interaction between them takes place at the analysis stage, when the interface and the model are defined to satisfy the product requirements. Projects that involve a redesign of the user interface can remove (or minimize) the need for developer involvement, since no transformations are applied to the templates, which are now pure HTML.

JSON was picked as the data transport format because it is technology-independent (easily parsed) and also native to the presentation layer of the main client, which is written in JavaScript. In accordance with the PCA design of the server side, it can easily be replaced by XML or other formats on demand. The communication protocol itself has been designed to address the need for semantic identification of the data, human readability, and server-side optimizations. The performance concern is addressed by the possibility of chaining multiple calls in one request; this technique not only reduces the need to bootstrap the system but can also drastically reduce the number of connections between the client and the server (especially significant for mobile applications).

[Figure: a more detailed view of the execution flow.]

Client-side framework

The approach followed in the framework uses a concept similar to PureJS for templates. However, all behaviour and logic is provided by client-side JavaScript only, and the mapping between the data and the template structure is performed automatically, based on the semantic information contained in both. Since this mapping information is contained within the standard id and class attributes of the HTML tags, it can be naturally reused by JavaScript and CSS without introducing any new meta-language.

Effectively, the framework resides in the user's browser and is responsible for interacting with the servers to fetch static and dynamic data, for template rendering, and for user interaction. Specifically, the server interaction involves sending requests to download all statics from the content caches, pooling and parallelizing requests for optimal bandwidth usage and total request-handling time. Upon receiving the response from the servers, the framework executes hooks for client-side data transformations (e.g. date formatting) and renders the page.

The Main Controller is the core of the client-side framework: it manages the communication with the servers and handles several hooks that allow code execution within the process of handling a user action. The data is retrieved by sending JSON-RPC requests to the server, which responds to the called procedures but can also include additional information. This will usually be data corresponding to actions that took place within the user's functional scope of interest (e.g. a new message has arrived), but it can also contain data used to configure the client, such as re-routing the client to a different farm, throttling automatic chat status updates, etc.
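
As an illustration of the hook mechanism (the API below is invented for the example; it is not our actual Main Controller interface):

// Hooks let arbitrary client-side transformations run on the response
// data before the page is rendered.
var MainController = {
    hooks: [],
    addHook: function (fn) { this.hooks.push(fn); },
    handleResponse: function (response) {
        for (var i = 0; i < this.hooks.length; i++) {
            this.hooks[i](response);     // e.g. date transformation
        }
        // ... hand the transformed data over to the Template Engine ...
    }
};

// Example hook: turn raw server timestamps into Date objects for display.
MainController.addHook(function (response) {
    var items = response.output || [];
    for (var i = 0; i < items.length; i++) {
        if (items[i].timestamp) {        // hypothetical field
            items[i].date = new Date(items[i].timestamp * 1000);
        }
    }
});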

Example JSON-RPC call:

{
    "Friends": ["getOnline", {"maxCount": "10", "order": "recentVisit"}],
    "Messages": ["getThread", {"id": "12670096710000000066", "msgCount": "5"}]
}

The server provides a response in JSON with a flat structure; it does not imply or suggest any hierarchy for the final display, and is purely data-centric.

{
    "output" : [
        {"friends" : [
            {
                "userId" : 23432,
                "avatar" : {
                    "offsetX": 31,
                    "offsetY": 0,
                    "url": "v1m222apwer124SaaE.jpg"
                },
                "friendAlias" : "Nick"
            },
            {
                "userId" : 63832,
                "avatar" : {
                    "offsetX": 32,
                    "offsetY": 50,
                    "url": "MLuv22apwer114SaaE.jpg"
                },
                "friendAlias" : "John"
            }
        ]},
        {
            "threadId": "12670096710000000066",
            "canReply": true,
            "totalMessageNumber": 3,
            "messages": [
                {
                    "isUnread": true,
                    "senderId": 32,
                    "senderFullName": "Daniel Martín",
                    "body": "But I must explain to you...",
                    "validBody": true
                },
                {
                    "isUnread": false,
                    "senderId": 66,
                    "senderFullName": "Carlos De Miguel Izquierdo",
                    "body": "Sed ut perspiciatis unde omnis...",
                    "validBody": true
                }
            ]
        }
    ]
}

The page structure is defined in pure HTML. The templates themselves do not require any new meta-language; instead, they rely on the framework to show or repeat pieces depending on its interpretation of the response data. It is possible, though, to extend the handling of a user action with arbitrary routines that add extra logic; even so, this does not influence the way the templates are created, and all routines for handling templates are implemented in JavaScript.

Here is an example of a piece of HTML code which serves as a template to display user avatars. The class attributes shown here are illustrative: they are what allows the engine to match this markup against the avatar objects in the response data.

<span class="avatar">
 <img class="url" src="" alt="" />
</span>

With this code sample, the Template Engine can implicitly perform a very flexible conditional: it can show the element based on the presence of the avatar in the data, repeat it if the avatar is an array, and inject data into the DOM element based on the information contained in the params attribute. If no match with the data can be found, the DOM element is left unprocessed.

The data-centric approach of the framework means that the Template Engine identifies elements by iterating through the data and matching it against the DOM structure, which is dynamically scoped: when iterating into nested structures, the Template Engine searches only within the corresponding context in the DOM. DOM elements not identified by the Template Engine are left unprocessed and their default appearance applies. To match elements to data, the DOM elements only need to refer to the data through values set in two distinct element attributes: the class attribute for data to be injected into the page structure, and the params attribute for data to be made available to UI interaction scripts.

Specific user-interaction actions can be added to DOM elements by specifying the action. All actions are implemented in an external static JavaScript file and, as a good practice and internal coding convention, no other JavaScript code is allowed inside the HTML templates. This is a natural decoupling of behavior from structure, similar to decoupling page structure from style by not setting inline CSS via the style attribute.
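
To make the matching more concrete, here is a heavily simplified sketch of such a data-driven engine (illustrative only - it is not our Template Engine, and it ignores the params attribute and user actions):

// Walk the response data; fill DOM nodes whose class attribute names a
// key, repeat nodes for arrays, and recurse into nested objects so the
// search is scoped to the corresponding DOM context.
function renderInto(container, data) {
    for (var key in data) {
        var node = container.querySelector('.' + key);
        if (!node) continue;                      // unmatched nodes stay as-is
        var value = data[key];
        if (Array.isArray(value)) {
            for (var i = 0; i < value.length; i++) {
                var clone = node.cloneNode(true); // repeat the element per item
                renderInto(clone, value[i]);
                node.parentNode.insertBefore(clone, node);
            }
            node.parentNode.removeChild(node);    // drop the template node
        } else if (value && typeof value === 'object') {
            renderInto(node, value);              // nested scope in the DOM
        } else if (node.tagName === 'IMG') {
            node.src = value;
        } else {
            node.textContent = value;             // inject plain data
        }
    }
}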

Server-side framework

The main architectural element of the server-side part of the front-end framework is inspired by the architectural pattern known as Presentation-Abstraction-Controller (PAC) [7][8]. The design (named Printer-Controller-Abstraction) is based on identifying data-centric controllers, and allows free interactions between them. The graph of controllers receives data from the abstraction layer, whose abstractions are instantiated by the controllers. The abstraction layer communicates with the domain layer, which manages and identifies domain entities. All controllers that participate in processing a request populate a central response buffer, which is later printed (currently there is only a JSON printer) and output from the system. The Abstraction layer plays a relatively small role in the PCA structure and only interacts with the Domain layer, but its presence is very important from the perspective of the back-end framework.

General model: none of the PCA agents produces the presentation as such. Instead, each agent caches the data it produces in a Response Buffer. The Response Printer is a lazy component that produces a representation of the data only when the full request has been performed (when the Printer asks for it). The Response component allows greater control over the reusability of responses across the hierarchy, allowing an agent's response to be reused throughout the lifetime of a request. At the end of processing, the Printer component can accept any printing strategy to produce output in the desired format. The complexity of each agent is dramatically reduced because the framework does not produce any views. The Abstraction layer of an agent accesses the Domain layer, which is part of the backend framework, to fetch the requested data. The Controller contains all the actions an agent can perform, and it may instantiate multiple Abstraction objects to fetch the data needed to build its output. The graph structure gives the controllers a lot of flexibility: a Controller can delegate tasks to other agents, or fetch their responses and then perform the requested action. Additionally, controllers can access the responses of other agents through the response buffer.
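
A condensed sketch of this flow (written in JavaScript for brevity - our server side is PHP - and with invented names):

// Each agent caches what it produces in the shared Response Buffer;
// the Printer lazily serializes the buffer once the request is done.
function ResponseBuffer() {
    this.entries = {};
}
ResponseBuffer.prototype.put = function (key, data) {
    this.entries[key] = data;
};
ResponseBuffer.prototype.get = function (key) {
    return this.entries[key];       // other controllers may reuse this
};

function FriendsController(buffer, abstraction) {
    this.getOnline = function (params) {
        var cached = buffer.get('Friends.getOnline');
        if (cached) return cached;  // reuse within the request lifetime
        var result = abstraction.fetchOnlineFriends(params);  // Domain access
        buffer.put('Friends.getOnline', result);
        return result;
    };
}

// Any printing strategy can be plugged in; currently only JSON.
function printJson(buffer) {
    return JSON.stringify({ output: buffer.entries });
}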

Conclusion and Future Work

The new design improves project organization by reducing inter-team dependencies and confining cross-team communication to the initial analysis stage. It also reduces the amount of work required to prepare front-end code for rich UI clients. Teams are able to focus more on their core activities and technologies, reducing friction and optimizing communication paths. Upcoming visual redesign projects can be carried out primarily by a design team, rather than involving significant work from client-side developers to support iterative changes to views, templates and controllers. Simple, pure HTML templates improve the throughput of designers, who can now work in a WYSIWYG way and use their tools directly on the templates.

Key benefits of this architecture are:

  • minimized overhead of maintaining multiple interfaces,
  • team focus shifted to match their responsibilities,
  • no need for server-side changes when changing page structure, layout, design or UI scripts,
  • parallelized development efforts,
  • minimized bandwidth consumption (savings of 45%-92% depending on the page type),
  • minimized server-side CPU use (savings of 65%-73% depending on the page type),
  • improved developer performance through tools with clearly defined responsibilities and scope.

A working prototype has been built, proving that the above concepts are viable and functional. A subset of an existing feature of Tuenti.com, the Private Messages module, was used to test the framework. This initial proof of concept (PoC) has been preliminarily evaluated on Firefox 3.5 and Chromium. The successful PoC shows that highly complex, feature-rich applications can be developed with the proposed framework. Server-side code complexity is greatly reduced through the use of the PCA design. Preliminary results show a response time of almost a third of that of the existing MVC framework. Templates can now be visualized directly in the browser, raw template sizes are at least 30% smaller, and there are no conditional or iterative flows. The cost of producing rendered HTML is now moved to the client browser, which might become a challenge to face before rolling the system out to the live environment. However, the overall response time observed by the user is still lower than in the current implementation, and will be the subject of further optimizations.

References

[1] Ext JS. Palo Alto, CA (USA), 2009 [Online].
[2] Mjt, "Template-Driven Web Applications in JavaScript" [Online].
[3] UIZE, "JavaScript Framework", 2010 [Online].
[4] TrimPath, 2008 [Online].
[5] Jupiter Consulting, "EmbeddedJS: An Open Source JavaScript Template Library", Libertyville, IL (USA), 2010 [Online].
[6] BeeBole, "PureJS: Templating Tool to Generate HTML from JSON Data", 2010 [Online].
[7] J. Coutaz, "PAC: an Implementation Model for Dialog Design", in H-J. Bullinger and B. Shackel (eds.), Proceedings of the Interact'87 Conference, September 1-4, 1987, Stuttgart, Germany. North-Holland, pp. 431-436.
[8] J. Cai, R. Kapila and G. Pal, "HMVC: The layered pattern for developing strong client tiers", JavaWorld, 2000 [Online].

Scalability Talk at International Week of Technological Innovation

Published on 19/4/2010 by Erik Schultink, Chief Technical Officer

On Wednesday (21.04.2010), I'm giving a talk about scalability at Tuenti at the "International Week of Technological Innovation", hosted by Universidad Europea de Madrid. While prepping the talk over the weekend, I put together some very interesting data about the work our team has done over the last 6 months here at Tuenti. This data shows the hard-won gains from months of applying our approach: partition, archive, optimize - then profile, monitor, and repeat. I'm pretty proud of that work and want to highlight some of it.

As I'll discuss in the talk, I define scaling as maintaining acceptable performance under an increasing amount of load. I think of the performance of the system as a graph like the one shown below:

The x-axis is the request rate (e.g. requests/sec); the y-axis is the response time (ms). We care about the total throughput of the system - the total number of requests that can be served in a unit of time - while ensuring that every response is generated faster than some upper bound on response time (the dashed red line). You can think of the "capacity" of the system as the point beyond which response time exceeds this threshold (i.e. performance is unacceptable) - the intersection of the red and blue lines. Scaling means moving this point farther and farther to the right, through actions such as optimizing, re-architecting, and adding infrastructure. All of these actions shift and re-shape the performance curve of the system - hopefully for the better.
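
To make the capacity definition concrete, here is a toy calculation (the data points and the 200 ms threshold are invented):

// Sampled performance curve: [requests/sec, average response time in ms].
var samples = [
    [100, 80], [200, 95], [300, 120], [400, 160], [500, 240]
];
var THRESHOLD_MS = 200;   // the dashed red line

// Capacity: the highest measured request rate whose response time is
// still below the threshold.
function capacity(points, thresholdMs) {
    var best = 0;
    for (var i = 0; i < points.length; i++) {
        if (points[i][1] < thresholdMs && points[i][0] > best) {
            best = points[i][0];
        }
    }
    return best;
}

// capacity(samples, THRESHOLD_MS) -> 400 (the curve crosses 200 ms
// somewhere between 400 and 500 requests/sec)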

What does that look like in practice? At Tuenti, we profile a portion of the requests to our system, and from that data I produced the following performance curves:

Each curve in that graph comes from a sample dataset; the datasets were taken about two months apart. As in the theoretical graph, the x-axis is the request rate and the y-axis is the average response time for those requests. For full disclosure: I scaled the data and excluded some outliers to get curves that overlay nicely on each other - but they remain quite representative of the performance profiles of our system and, despite some extrapolation, don't mask any bottlenecks lurking within the range of the x-axis.

These curves tell a very interesting story. In October, our system was clearly much inferior to what it is today. Although that dataset is quite noisy, it is clear that performance degraded rapidly at a much lower range of request rates than in later months. I don't recall a particular bottleneck we were facing at the time, but it is most likely explained by bumping into CPU and DB contention.

Two months later, in December, we had flattened this curve substantially. Although one could complain that some outliers at the left extreme are forcing a very generously fitted trendline, it's pretty clear that we had better performance at high request rates in December than in October. Interestingly, response time at lower load levels was actually worse in December than in October - we had traded about 10 ms of best-case performance for increased scalability, but that's a trade I'll take any day. Overall throughput of the system is more important than the response time of any single request.

After another two months of work, in February, we had reclaimed those 10 ms while flattening the curve further. The dataset also looks much more stable, with less noise. In April, hard work brought response times down another 10 ms while maintaining a very healthy-looking curve and a stable dataset.

Overall, I think this graph gives a fantastic picture of six months of work scaling a Web 2.0 system - maintaining and improving performance in the face of significant growth and new feature launches. Those response time figures are totals, including the CPU time spent rendering the page as well as cache accesses and DB queries. Such work involves a lot of different teams: our backend scalability team of course, but also our backend framework and systems teams. And whatever optimizations those teams make, we still count on our product development teams to write new features in ways that don't abuse our frameworks, DBs, or CPUs.

Interested in pushing this curve farther? Check out jobs.tuenti.com.
