Sunday, February 17, 2013

Optimization Magic

(a slightly circular reference: link to forum discussion)

One of the hardest things for a small team like ours to fit in to the product cycle on a regular basis is optimization.

Optimization is a little bit like making sure we are eating our vegetables every week. Every time we add or change a feature in the game, in theory we should also refactor everything around the code that just changed. Since that's usually not practical, large scale optimization really tends to happen in chunks when the team is finally given some time to focus on the task.

Now that we are about to release newly optimized code, I wanted to take some time to lift the hood on the work being done, and to share some insight into issues we have to contend with - from the software engine to server hardware.

We are really proud that APB Reloaded has remained a consistent top-5 title (out of 100+) in Steam's Free2Play category since its December 2011 launch. (As an aside, the game actually gets the vast majority of its traffic directly rather than through Steam, but Steam provides a convenient benchmark to compare against other games.) We are planning for this game to provide many more years of entertainment for all its fans. But this tension between short-term and long-term goals (survive today, but plan for a 5+ year lifespan) means that every day we struggle with what we should focus on next: features, maps (Asylum anyone?), game content, security or optimization? All of these are pressing needs for one reason or another.

Given that turning off the game and entering an optimization-only cycle is not possible, we instead attempt the next best thing: we optimize while running at full speed. It's a little bit like changing the oil in your car while travelling at 65 MPH down the highway. What could possibly go wrong?

APB - a server resource hog?

Few Unreal engine based games have attempted to throw 100 players (how many FPS games ever give you 50 v 50 in a single map?) into a twitch FPS/TPS setting in a single area, where each player can have a fully customized character, car and skinned weapon. That means at least 200 fully custom player items (cars and characters), plus up to 850 autonomous NPC pedestrians and 350 autonomous NPC cars driving around in a district, in addition to everything that's movable and destructible (traffic lights, dumpsters, billboards etc.).

This means roughly 18,000 dynamic actors for the server to track while running in a single shard, and in fact on a single server core (read more on the 'single-core' issue in the hardware section below).

Granted, newer games using other engines, like Planetside 2 on Forgelight, have used a very different 'continent', 'distance' and 'mission' optimization system to allow much larger factions on a continent (though not necessarily in the same firefight), and even Fallen Earth uses a system of dynamic shards to allow 10,000 players in an area. But neither is an Unreal game.

Technically speaking, customization has some impact on server-side performance (mostly due to large amounts of asset streaming), but it has a larger negative impact on the game client, and tends to drive client-side frame rate below what someone's gaming rig would deliver if customization were not such a central part of the game.

While there are clearly other FPS/TPS games that perform amazing graphical feats on older hardware (CoD-MW series, Crysis series, Gears of War series, Far Cry etc.), they rarely allow this many complex human and AI actors in a single battle area. When they do, the participants are streamlined and unified and do not permit nearly APB's level of insane customization (or city-wide destruction), or the games behave more like RPG or RTS games and generally have much lower requirements for hit registration, server tick rate and movement prediction. The amazing feat in APB is that this game still actually works on a lot of pre-2009-era hardware given the extreme computational complexity of the game.

Server FPS vs Client FPS

As a general rule, we want the server to perform a full pass of computations for all 100 players and 18,000+ district actors 30 times per second (giving each CPU core at most 33ms to complete all computations in that one frame). If we achieve 30 FPS on the server, then connected game clients can easily run at 2X-3X the server tick rate (60FPS - 90FPS) without any noticeable loss in accuracy. At a 1:2 or 1:3 server-to-client ratio, movement prediction and frame interpolation provide a very smooth game experience.
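For readers who like to see the arithmetic, here is a small Python sketch of how a tick rate translates into a per-frame time budget. The function names are made up for this illustration and are not part of our server code.

```python
def frame_budget_ms(tick_rate_fps: float) -> float:
    """Milliseconds the server has to finish one full frame at a given tick rate."""
    return 1000.0 / tick_rate_fps


def client_frames_per_server_tick(client_fps: float, server_fps: float) -> float:
    """How many frames the client renders for each server update."""
    return client_fps / server_fps


print(round(frame_budget_ms(30), 1))            # 33.3 (the "33ms" budget)
print(round(frame_budget_ms(25), 1))            # 40.0 (the temporarily lowered tick rate)
print(client_frames_per_server_tick(60, 30))    # 2.0
print(client_frames_per_server_tick(90, 30))    # 3.0
```

The point is simply that dropping from 30 to 25 ticks per second buys the server roughly 7ms of extra time per frame.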

Unfortunately during the last few updates we have had to temporarily lower the server tick rate to 25 FPS and reduce the max CCU per core, so it's high time to perform another full optimization pass.

Software Optimizations and Server Side Computation times.

Below is a graph of what version 1.10.1 server-side computations look like under ideal test circumstances AND using our new test hardware (more details on this new hardware at the bottom of this post).

In the current 1.10.1 build the server completes 1 full frame (moving those thousands of actors around) on 1 core in 1 full district using the new hardware type in 19.2ms. In 'theory' this means the server on the new hardware is capable of running at 52FPS tick rate (!).

This is to be compared with the 'current gen' hardware, where we have only been able to run a 'safe' server tick rate of 25FPS in the current 1.10.1 build.

The lower part of the graph shows version 1.10.2 with the new software optimizations.

From the synthetic test it appears the team has been able to squeeze out a 16% performance improvement in software alone (which amounts to about a 10 FPS improvement on the server). This improvement drops the per-frame processing time from 19.2ms to 16.1ms, which means a theoretical 62FPS server tick rate (again on the new hardware).

This 'should' mean that software optimization alone (the 16% improvement) will let us go from 25FPS back to the original 30FPS serverside tickrate on the current hardware as part of the 1.10.2 update (to be determined after the game is live).
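The numbers above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the 16% saving measured on the new hardware carries over proportionally to the current hardware, which is exactly the part that still has to be confirmed live. All names are invented for the example.

```python
def tick_rate_fps(frame_ms: float) -> float:
    """Theoretical tick rate if every frame takes frame_ms milliseconds."""
    return 1000.0 / frame_ms


OLD_FRAME_MS = 19.2   # version 1.10.1 on the new hardware
NEW_FRAME_MS = 16.1   # version 1.10.2 on the new hardware

# Fraction of per-frame time saved by the software optimizations.
improvement = (OLD_FRAME_MS - NEW_FRAME_MS) / OLD_FRAME_MS

# Assumption: the current hardware (25 FPS, i.e. 40ms per frame) benefits
# by the same proportion.
current_hw_frame_ms = 1000.0 / 25
projected_frame_ms = current_hw_frame_ms * (1 - improvement)
projected_fps = tick_rate_fps(projected_frame_ms)

print(f"{improvement:.1%}")                                                    # 16.1%
print(f"{tick_rate_fps(OLD_FRAME_MS):.1f} -> {tick_rate_fps(NEW_FRAME_MS):.1f}")  # 52.1 -> 62.1
print(f"{projected_fps:.1f}")                                                  # 29.8
```

Under that assumption the current hardware lands just shy of 30 FPS, which is why we say 'should'.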

You can read these graphs from the bottom up, starting with receiving network packets from all connected actors, updating game elements and physics, updating cameras and streaming, and ending with sending data back to all clients. What's rather surprising is that almost 50% of the entire server processing time consists entirely of receiving/parsing and serializing/sending network traffic. The actual game updates (players, objects, physics etc.) take only about 50% of the processing time.

From the above graphics you can see that the team has been able to really squeeze and optimize the "Receive Network Traffic" and the "Update Game Objects" steps. We expect to continue optimizing all the steps in the system, but presuming QA signs off on the upcoming patch, we will measure the real-world impact of these improvements in the coming week.

The Single-Core Engine Conundrum

First a disclaimer. The Unreal Engine has served us (and thousands of other games and companies) incredibly well. It's a great engine and a fantastic rendering system. Now the engine has certain design choices that create certain hard-to-overcome limits (as all engines do).

The biggest one for large-scale games is Unreal's monolithic and (almost) single-threaded server-client-response system. The philosophy behind Epic making that design choice back in the era of Unreal Tournament / Gears of War makes perfect sense, given the engine's focus on small-scale lobby based FPS/TPS games or even single player or co-op games. Some Unreal based RPGs (for example Blade and Soul) have clearly adopted the engine as a renderer, and then created an entirely proprietary server system to handle RPG style updates and connection loads (which usually requires 2000-3000 players per shard but only a server tick-rate of 10FPS or less in RPG mode).

APB Reloaded uses a hybrid of standard Unreal server code (originally we used Unreal version 2008, so the engine is getting a little aged at this point) and its own proprietary TCP message stack coordinating the communications between worlds and districts, as well as a very proprietary customization system. But the general actor-to-actor interaction relies on a system that's very close to the original Unreal system, and is mostly handled in a single game update loop.

This means all the processing in a single district happens on a single core and in a single thread.

One way to think of this is that the engine fundamentally works like a turn-based game where each actor has 33ms to move per turn. Within the scope of a single server core/thread, the process gives each actor one chance to make a move (or combination of moves). When all actors have signaled their move, everyone is told of everyone else's updated moves, and the game proceeds to the next turn (though from the chart above you can see that we actually only spend about 7.5ms moving stuff around; the rest of the time is spent sharing that information).
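To make the turn-based analogy a bit more concrete, here is a minimal Python sketch of one single-threaded "turn". It is not APB's or Unreal's actual code: the Actor class, the run_tick function and the one-dimensional movement are all hypothetical simplifications made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Actor:
    name: str
    x: float = 0.0    # position along a single axis (simplification)
    vx: float = 0.0   # velocity in units per second


def run_tick(actors, inputs, dt):
    """Run one server 'turn' on a single thread and return the snapshot to broadcast."""
    # 1. Receive: apply whatever input arrived for each actor this turn.
    for actor in actors:
        if actor.name in inputs:
            actor.vx = inputs[actor.name]
    # 2. Update: every actor gets exactly one chance to move.
    for actor in actors:
        actor.x += actor.vx * dt
    # 3. Send: everyone is told about everyone else's updated state.
    return {actor.name: actor.x for actor in actors}


actors = [Actor("player"), Actor("npc_car", vx=3.0)]
snapshot = run_tick(actors, {"player": 30.0}, dt=1 / 30)
print(snapshot)
```

Because the three steps run one after another on one thread, a slow step anywhere delays the whole district's turn, no matter how many other cores are sitting idle.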

Human reaction time (studied under the name Mental Chronometry) is around 160ms, so processing everything in 33ms on the server, plus the packet roundtrip time (ideally no more than 40-80ms), for a total processing delay of under 120ms, should give us sufficient headroom to provide a good player experience.
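As a quick sanity check of that latency budget, here is the same sum in Python. The constants come straight from the paragraph above; the function names are invented for the example.

```python
SERVER_FRAME_MS = 33   # one server frame at a 30 FPS tick rate
REACTION_MS = 160      # approximate human reaction time quoted above


def total_delay_ms(round_trip_ms: float) -> float:
    """Server processing time plus network round trip."""
    return SERVER_FRAME_MS + round_trip_ms


def headroom_ms(round_trip_ms: float) -> float:
    """How far the total delay stays below human reaction time."""
    return REACTION_MS - total_delay_ms(round_trip_ms)


for rtt in (40, 80):
    print(rtt, total_delay_ms(rtt), headroom_ms(rtt))
# 40 73 87
# 80 113 47
```

Even at the upper end of the ideal round trip range the total stays under 120ms, leaving roughly 47ms of headroom.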

However, even just a slight improvement in server side processing will actually enhance the fluidity of the game. We humans are very good at processing sequential frames of information and can easily spot the visual difference between film at 24fps and video at 30fps (or as the case may be "the Hobbit" at 48fps for those of you who now hate Peter Jackson). This means we will notice visual processing hiccups long before we react to new on-screen events.

Why does all this single-threaded-ness matter to us? Well, it turns out that most of the performance gains in recent years in server processors from Intel and AMD have NOT come from performing more computations on a single core, but rather from having many parallel cores performing parallel tasks.

Sadly for APB Reloaded, that type of parallel task division does not improve individual district performance... But... there is hope...

New OTW Hardware Test World going live: OverKill

In the near future we will release a new OTW (Open Test World) called OverKill. OverKill is actually an apt name and is the result of a lot of hardware experimentation by our IT team (and the above computation tests were run on this hardware as well).

The current generation APB servers consist of Intel Xeon X5570 "Nehalem" based processors (operating in 3.2GHz Turbo Mode) with 4 cores x 2 processors each. We use Dell M610 blades like these (just recently pulled from our datacenter).

The benefits of blade servers are that we can increase the density of the hosting operation, since we can fit 16 servers in 10 "rack units." The drawbacks - the types of processors supported by blade servers and the inability to overclock those processors - have caused us some serious problems in optimizing the hardware for the game.

For quite some time we have been looking for a new processor solution specifically to handle Financial and Waterfront (and eventually Asylum) districts. Something that can live in our three datacenters, but at the same time give us a cost effective solution to run at much higher single-core clock speeds, while also taking advantage of the newer "Sandy Bridge" and "Ivy Bridge" Intel processor architectures.

After much playing around with various combinations of server chips, it turns out that server boards and server chips really don't like, or even permit, overclocking, and they are almost never engineered to optimize single-threaded performance (other than the incidental improvements that come from larger L2 and L3 cache systems). We also need at least 6 cores to be able to perform these calculations in a cost-effective manner (which lets us run 3 fully loaded districts on a single server). That left us with a conundrum.

After experimentation we have settled on a public test of a custom solution that uses a high-end desktop board (ASUS Rampage 4 Extreme) combined with an unlocked Intel i7-3930K 6-core processor that, in a datacenter setting (with lots of cold air), easily runs stable at 4.25GHz (technically we can push it to 5GHz, but we are starting small).

Will it work once we throw real APB district computations into these systems? The synthetic test indicates it will indeed work. Will I/O performance hold up (given the stranglehold that network I/O has on server CPU)? That's much harder to test, so we will find out as soon as we start running the OTW tests.

In a synthetic benchmark the i7-3930K OC (compared to the stock X5570) shows raw gains of nearly 70% in single-threaded performance (!). We do lose two cores per server, but the extra expense (more servers) seems worth the vast performance gain.

CPU Bench – Single Threaded:
[ORIGINAL X5570] – 1349
[EXPERIMENTAL 3930K] – 2284
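The "nearly 70%" figure follows directly from those two scores. A small Python sketch of the calculation, where relative_gain is just a helper name chosen for this example:

```python
ORIGINAL_X5570_SCORE = 1349
EXPERIMENTAL_3930K_SCORE = 2284


def relative_gain(old_score: float, new_score: float) -> float:
    """Fractional improvement of new_score over old_score."""
    return (new_score - old_score) / old_score


gain = relative_gain(ORIGINAL_X5570_SCORE, EXPERIMENTAL_3930K_SCORE)
print(f"{gain:.1%}")  # 69.3%
```

Keep in mind this is a synthetic single-threaded benchmark, so it says nothing yet about network I/O under a real district load.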

If we can capture some of these performance gains in the real world, and translate them into improved Action District performance, then our long-term goal is not only to ensure a stable 30 FPS server tick-rate, but also to gradually raise the CCU in each district.

From the graphs on software optimization, you can see that the new hardware with the new software 'COULD' run a theoretical server-side tickrate of 62 FPS, which is roughly 207% of (a little more than double) what we actually require for the 30 FPS tick rate target.

Our plan is to use the extra performance (again once we have run the real world tests) to ensure we can increase CCU in a single district. Since CCU taxes the server in a non-linear fashion, we expect to only increase CCU 25%-50% before dragging the server back down to 30 FPS tick. Of course this is still speculation, and is still to be determined during live testing.

Higher district CCU would mean better matchmaking (but THAT is a whole other blog entry, though needless to say 80 people in a district means 20 teams with potentially 10 ongoing matchups whereas 120 people in a district means 30 teams with 15 ongoing matchups, resulting in 50% improvement in match availability. Of course it's not quite that simple - but you get the gist). More players = better matchmaking.
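The matchmaking arithmetic can be sketched in a few lines of Python. The team size of 4 is an assumption implied by the numbers above (80 people making 20 teams), and ongoing_matchups is a made-up helper; real matchmaking is of course not this simple.

```python
def ongoing_matchups(players: int, team_size: int = 4) -> int:
    """Number of simultaneous team-vs-team matchups if everyone is in a match."""
    teams = players // team_size
    return teams // 2


before = ongoing_matchups(80)
after = ongoing_matchups(120)
print(before, after)                        # 10 15
print(f"{(after - before) / before:.0%}")   # 50%
```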

OUR hardware, software and network vs. YOUR hardware, software and network.

In this post we have only talked about server-side processing and optimization, and have not touched the OTHER things that also affect performance. First and foremost - you need a good gaming rig to play APB. We always recommend having 8GB of RAM and using 64-bit Windows 7. Anything less is asking for trouble. In particular, using 64-bit Windows is critical. Also, client-side FPS in most Unreal games tends to drop dramatically during very large semi-transparent VFX events (i.e. very big explosions where the player does NOT die - something APB of course has a lot of), so only higher-end graphics cards tend to perform OK during those big VFX events (and optimizing that part of the engine code is a whole other ball of wax, far beyond the scope of this current post).

Of course network connectivity and your latency to our core datacenters (Los Angeles, Washington DC and Frankfurt), or to the datacenters managed by our Russian (Moscow) and Brazilian (São Paulo) publishing partners, are critical as well.

I hope this article has shed some light on the optimization work currently being done. If you are one of our OTW testers, then expect to see the "OverKill" world come online in the next two weeks. And for everyone else we expect to release 1.10.2 very soon, which should have some immediate performance improvements.

Til Next Time!
