Edit: It was pointed out to me on reddit that there is actually proof that CCP is using Windows HPC. That made a few changes to this blog post necessary.
Edit 2: In the meantime I have been told by CCP Veritas that, despite what the video says, they have decided against Windows HPC. That means, of course, that my statements about that being a problem for performance do not apply. It's great to get new information and to learn from mistakes.
I recommend also reading the exchange on reddit, where CCP Veritas offers a few interesting insights.
In the last two weeks, EVE Online saw the development of a major conflict now commonly called the Halloween War. As usual with such events, there's lots of reporting, grandstanding and chest-beating, so all involved parties can feel a bit better about losing close to a trillion ISK and staying up all night playing a game in slow motion.
Most recently, this collective exercise in gaming masochism resulted in the inevitable node crash that massive fleet fights tend to cause whenever they do not happen on something CCP calls a "Reinforced Node".
The consequence of the node crash is a lot of finger-pointing at CCP, and all kinds of theories and myths arise about the Tranquility server's ability to cope with the growth of EVE.
At first glance, the Tranquility server system looks pretty impressive. CCP and the gaming press also like to further bedazzle the audience with spectacular terms like "Military Grade Hardware" and staggering numbers like "2500GHz Processing Power".
Certainly, Tranquility is a high-performance system that can do a lot. So why does it not manage to sustain large fleet battles or Burn Jita scale events?
Personally, I have some professional experience with High Performance Computing systems, better known under the catchier term "Supercomputing Clusters". I was a user and maintainer of such systems when I was with the military, and more recently I have built and installed them as a job.
Under scrutiny, and when compared with other HPC systems, Tranquility becomes quite a bit less impressive, and therein lies a possible explanation for why CCP suffers node crashes whenever there are massive conflicts.
Where to begin ...?
Processing Power
First of all, processing power is not measured in clock speed alone. The generally accepted baseline unit for measuring processing power is FLOPS (floating-point operations per second), which depends on clock speed, the number of cores available, and how many floating-point operations each core can complete per clock cycle.
Not only is clock speed alone an insufficient measure of processing power, it also doesn't simply add up: you cannot claim 2000 GHz of total speed just because you have 500 CPUs running at 4 GHz each. For such a statement to be even remotely applicable, all those CPUs would have to work on the same task in parallel, in a multithreaded way, and here we hit another stumbling block in CCP's setup.
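To put numbers on that (entirely made-up ones, not CCP's actual hardware), here is the usual back-of-the-envelope peak-FLOPS estimate: CPUs times cores times clock speed times floating-point operations per core per cycle.

```python
# Back-of-the-envelope peak FLOPS for a hypothetical cluster.
# All figures are illustrative assumptions, not CCP's real hardware.
cpus = 500
cores_per_cpu = 4
clock_hz = 4.0e9        # 4 GHz per core
flops_per_cycle = 8     # e.g. a SIMD unit retiring 8 FLOPs per cycle

peak = cpus * cores_per_cpu * clock_hz * flops_per_cycle
print(f"Theoretical peak: {peak / 1e12:.0f} TFLOPS")  # -> 64 TFLOPS
```

That figure is a ceiling, not a promise: real workloads only approach it if the software keeps every core busy in parallel.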
Bits and Threads
Multithreading - simply put - means that you can split the same operation among many processor cores at the same time so that they share the workload. CCP personnel rarely make statements about the specifics, but on the most recent episode of "Shit on Kugu" (a fitting name for that terrible podcast), CCP Dolan reiterated that "EVE can run on a single core" and that CCP are working on making EVE 64-bit so they can actually use multithreading (at about 45 minutes into the podcast).
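As a minimal illustration of the idea (a sketch, not CCP's code): split one big computation into chunks and let a pool of workers crunch them side by side. I'm using Python here since EVE's server code is largely Python, and processes rather than threads so that stock CPython can actually use several cores at once.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(bounds):
    """Compute one worker's share of the total."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n, workers = 8_000_000, 4
    step = n // workers
    # Split the range [0, n) into one chunk per worker.
    chunks = [(w * step, (w + 1) * step) for w in range(workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(partial_sum, chunks))
    # Same result as sum(i * i for i in range(n)), computed on 4 cores.
    print(total)
```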
There is a hidden piece of information in there: there is one commonly used server system that cannot run 32-bit applications multithreaded in parallel while making full use of a multicore architecture: Microsoft Windows. As a matter of fact, the Windows Server OS is limited by its codebase in the number of cores it can utilize.
I once worked with a Windows Server HPC system (TBH, "Windows HPC" sounds like an oxymoron to me to begin with), and I can tell you it was the worst-performing HPC system I have ever seen. I wouldn't expect to run CFC vs. Everyone Else on such a system without lots of sleepless nights because of crashes.
Nodes and Resources
In standard HPC terminology, a node is a single hardware platform with its standard setup (CPU, memory, coprocessors etc., all on one motherboard). Generally, you have one or more so-called master nodes which schedule processing tasks for all the compute nodes that do the actual number crunching. This is done by means of a so-called job scheduler, and the applications are built to support running in parallel. The job scheduler looks everywhere on the cluster - according to parameters you can set - for free resources to process the given workload.
So let's say you have 500 Ishtars which all launch Sentry Drones and fire at a target. That is a set of mathematical operations the cluster has to perform. The scheduler gets this submitted as a job and looks for cores, memory and clock cycles sufficient and optimal to finish those calculations - that is, if your system and your software support multithreaded parallel computation on multiple cores.
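A toy sketch of that flow, with all node names and numbers invented for illustration: the volley is submitted as a job, a first-fit scheduler picks a node with enough free cores, and a worker pool on that node shares the per-ship damage math.

```python
from multiprocessing import Pool

# Free cores per compute node -- a toy first-fit scheduler.
# Node names and core counts are invented for illustration.
free_cores = {"compute01": 16, "compute02": 32, "compute03": 8}

def schedule(cores_needed):
    """Place a job on the first node with enough free cores."""
    for node, cores in free_cores.items():
        if cores >= cores_needed:
            free_cores[node] -= cores_needed
            return node
    return None  # nothing free: the job waits in the queue

def volley_damage(ship_id):
    # Stand-in for the real damage math: 5 sentry drones,
    # flat base damage, no resists, tracking or falloff.
    return 5 * 60.0

if __name__ == "__main__":
    node = schedule(cores_needed=16)
    print(f"job 'ishtar-volley' placed on {node}")
    with Pool(processes=16) as pool:           # the cores we were granted
        damage = sum(pool.map(volley_damage, range(500)))
    print(f"total volley damage: {damage}")    # 500 ships x 5 drones x 60.0
```

Real schedulers (SLURM, PBS and friends) are vastly more sophisticated, but the division of labour is the same: the scheduler finds the resources, the parallel application uses them.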
CCP uses a slightly different terminology. They call an instance of the game a node, and such an instance can run anything from a dozen star systems down to a single one with everything in it. As a model for resource allocation, that sounds a bit awkward.
CCP Dolan mentions how great it would be if they could dynamically allocate resources without having to shut down a node. As it is now, they have to manually assign a solar system to something they call a "Reinforced Node", meaning that all the computation for one game instance in one solar system is placed on a very powerful machine. I'd guess they do that during downtime, because Dolan's statements suggest they actually have to take down that game instance and restart it on different hardware.
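As a toy model of the difference (all names hypothetical): what CCP describes resembles a static map from solar system to machine that can only change across a restart, whereas a dynamic scheduler would update that map while the cluster keeps running.

```python
# Hypothetical static mapping: changing it means taking the
# game instance down and restarting it on the new hardware.
system_to_node = {
    "Jita":   "reinforced01",   # pre-assigned "Reinforced Node"
    "HED-GP": "node0815",       # ordinary node
}

def reassign_at_downtime(system, new_node):
    # Placeholder for what the post describes: stop the instance,
    # move it, restart it -- no live migration.
    print(f"stopping {system} on {system_to_node[system]} ...")
    system_to_node[system] = new_node
    print(f"restarting {system} on {new_node}")

reassign_at_downtime("HED-GP", "reinforced02")
```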
All that would not be necessary if they had a more intelligent HPC system.
Now, don't get me wrong. I have great respect for game developers. If you look at the amazing graphics and concurrent operations they can squeeze out of a model like that, you can only be impressed. In most cases, a modern computer game has to run on a single PC or game console, and I can tell you, even your most pimped gaming rig will not outperform a low-grade HPC compute node.
CCP are actually doing quite well considering what they have to work with, but if I were one of the admins of that system, I'd wish a special hell on the people who decided to build things that way in the first place.
So where does that leave us?
Maybe they will manage to make game instances transferable from one machine to another on the fly, but they would be much better off if all of EVE ran across all available machines as separate jobs that make efficient use of the resources.
So, until CCP actually builds another cluster and transfers all of EVE onto that new system, the big powers of New Eden will be stuck with TiDi and with having to request a "Reinforced Node" whenever they know they want a massive fleet fight in a particular system.
Or they could come up with a doctrine that doesn't require them to drop thousands of ships in one place to win, but for now that seems to be the path to victory, even if it is a slow grind of many hours.
I am picking up what you are laying down. That makes a lot of sense. So why on earth are they using Windows?
By their standard I've got about 120GHz of processing power around my house, maybe more. It'll be interesting to see what happens when Star Citizen launches. They are using 64-bit Linux at least.
Why is anybody using Windows except for office/home PCs? Personally, I never found an answer to that question that didn't end up being some sort of smug sarcasm :)
Windows allows businesses to buy a platform that makes employees interchangeable. Hate your Linux-based authentication service? Good luck replacing the guy who built it. With Windows, the platform is rigid enough that an experienced admin can come into almost any environment, figure out what's going on, and fix it given some time.
They are using Windows because of the support. Microsoft works closely with the server guys, as well as IBM, Nvidia and a few others. Linux would probably work better, agreed, but there is no real support, certainly not like the support Microsoft would/could provide.
Don't speak of things you don't know. Red Hat (for instance) has a strong support service, and many hardware companies (like IBM, Dell, ...) provide drivers and support too.
Microsoft works closely with IBM, yes, but look how much IBM has invested in Linux before opening your big mouth.
MMOs operate on different time-scales than most other games, because a successful MMO will run for several years or more. As a result, developers have to base parts of their core technology on assumptions about how hardware will evolve over the next several years. Given that EVE is more than 10 years old at this point, I think it can be forgiven that their data center isn't built on the most efficient and modern architecture.
Another example:
Other games have had similar issues. EQ2 struggled with graphical performance because parts of its graphics engine were designed under the assumption that future processors would run at significantly higher clock speeds than those available at release (over the years leading up to EQ2's release, CPU speeds had ramped up from a couple hundred MHz to being counted in GHz). Unfortunately, that didn't happen: CPUs evolved along the lines of multiple cores rather than increased single-core speeds. The headroom built into the graphics engine for those theoretical clock-speed advances never got used, and it took years before SOE introduced multi-core support into their game client to compensate.
Certainly there are lots of considerations that go into the development of an MMO with a perspective for the future. I think one of the main problems is that there are not many people who even know how to install and run HPC systems today, and back in 2003 there would have been even fewer.
It is very much a legacy issue, and while CCP have constantly been upgrading hardware and optimizing software (there were a few devblogs and presentations about that), they are eventually going to reach a dead end. With a true HPC system, you can always plug in extra compute nodes and expand, which makes it a lot easier to meet growing demand.
I don't think the problem is plugging in extra compute power; it's more the licensing cost for the whole cluster in general.
Just to pick one point: as someone who works in IT architecture for the US military, I can tell you the "military grade hardware" is mostly off-the-shelf blades. The hardened stuff is NOT what you want in any commercial application, unless you think your staff is going to do things like drop servers from one meter or be really careless with the thermal environment.
The motherboard in my desktop is "Military Grade", or so it says. I always just scoff at people who use buzzwords; ironically, military experience showed me that buzzwords are often used by people with no idea what they actually mean.
Exactly my experience with "military grade hardware". The actual super high-grade stuff is usually way too sensitive to be deployed in the field or even anywhere near it.
The 'Military Grade' refers to the RAM-SAN they bought for the SQL server, back before RAM-SANs were used by anyone but the military. http://en.wikipedia.org/wiki/Texas_Memory_Systems
Military grade hardware = need a high government budget to afford it.
@Dax
Originally I had a whole two paragraphs on that subject written, but I left them out because I felt the post would become too long without adding much information.
The thing is, the RamSAN was indeed a product mostly sold to the military and government at that time. It is, however, not the case that "the technology only existed in the military" back then. Comparable SSD-based storage systems were available, but they were terribly expensive compared to TMS' product.
Another aspect is that a good chunk of the server code is in Stackless Python, which (as far as I could find) still suffers from the Global Interpreter Lock - meaning that no matter how many cores your server has, the Python process can effectively use only one of them, preventing true parallelism even inside a single game instance.
(This comment comes with the customary AFAIK disclaimer)
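To illustrate that point, a minimal CPython sketch (timings vary by machine): two threads doing pure CPU-bound work take roughly as long as doing the same work sequentially, because the GIL lets only one thread execute Python bytecode at a time.

```python
import time
from threading import Thread

def busy(n):
    while n:          # pure CPU work; the GIL is never released for long
        n -= 1

N = 20_000_000

t0 = time.perf_counter()
busy(N); busy(N)
print(f"sequential:  {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
a, b = Thread(target=busy, args=(N,)), Thread(target=busy, args=(N,))
a.start(); b.start(); a.join(); b.join()
print(f"two threads: {time.perf_counter() - t0:.2f}s  (about the same: GIL)")
```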
Bah. lost my post :(
Short version:
Microsoft based, because when they were starting out, Microsoft helped them a lot. Changing it now would be difficult.
Nodes are VMs which handle a single solar system. The main code for a system (excluding chat, market and so on) is a single thread. Multithreading it would be a bit of a nightmare, with race conditions out the wazoo.
Next major things to help:
Brain in a Box, to cut out the lag spikes from session changes. (Major source of load in Jita)
Moving client updates to a secondary thread, rather than inline.
Don't worry, the nodes do multithreading, at least to answer each connected client in parallel. Currently the philosophy of a single-threaded, event-driven server is not widely adopted.