Edit: It was pointed out to me on reddit that there is actually proof that CCP is using Windows HPC. That made a few changes to this blogpost necessary.
Edit 2: In the meantime I have been told by CCP Veritas that, despite what the video says, they have decided against Windows HPC. That means of course that my statements about that being a problem for performance do not apply. It's great to get new information and to learn from mistakes.
I recommend that you also read the exchange on reddit, where CCP Veritas offers a few interesting insights.
In the last two weeks EVE Online saw the development of a major conflict which is now commonly called the Halloween War. As usual with such events, there's lots of reporting, grandstanding and chest-beating so all involved parties can feel a bit better about losing close to a trillion ISK and staying up all night, playing a game in slow motion.
Most recently, this collective exercise in gaming masochism resulted in the inevitable node crash that massive fleet fights tend to cause whenever they do not happen on something that CCP calls a "Reinforced Node".
The consequence of the node crash is a lot of finger-pointing at CCP, along with all kinds of theories and myths about the Tranquility server's ability to cope with the growth of EVE.
At first glance, the Tranquility server system looks pretty impressive. Both CCP and the gaming press also like to further bedazzle the audience with spectacular terms like "Military Grade Hardware" and staggering numbers like "2500GHz Processing Power".
Certainly, Tranquility is a high-performance system that can do a lot. So why does it not manage to sustain large fleet battles or Burn Jita scale events?
Personally, I have some professional experience with High Performance Computing systems, better known by the catchier term "Supercomputing Clusters". I was a user and maintainer of such systems when I was with the military, and more recently I have actually built and installed systems like that as a job.
Under scrutiny, and when compared with other HPC systems, Tranquility becomes quite a bit less impressive, and therein lies a possible explanation for why CCP is suffering from node crashes when there are massive conflicts.
Where to begin ...?
Processing Power
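First of all, processing power is not measured in clock speed alone. The generally accepted baseline unit for measuring processing power is FLOPS (floating point operations per second). A system's theoretical peak FLOPS depends on the clock speed, the number of processing cores available, and how many floating point operations each core can complete per clock cycle.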
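To make that concrete, here is a minimal sketch of how a theoretical peak FLOPS figure is usually estimated: nodes times cores per node times clock rate times floating point operations per core per cycle. The function name and every number in the example are made up for illustration. They are not Tranquility's specifications.

```python
def peak_flops(nodes: int, cores_per_node: int, clock_ghz: float,
               flops_per_cycle: int) -> float:
    """Theoretical peak FLOPS of a cluster.

    peak = nodes * cores per node * clock rate in Hz * FLOPs per core per cycle
    Real workloads reach only a fraction of this number.
    """
    return nodes * cores_per_node * clock_ghz * 1e9 * flops_per_cycle


# Hypothetical cluster (assumed numbers, not any real system):
# 10 nodes, 8 cores each, 2.5 GHz, 8 floating point operations per cycle.
example = peak_flops(nodes=10, cores_per_node=8, clock_ghz=2.5, flops_per_cycle=8)
print(f"{example / 1e12:.1f} TFLOPS theoretical peak")
```

Note that clock speed is only one of four factors here, which is why a GHz figure on its own says very little.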
Not only is clock speed alone an insufficient measure of processing power, it also doesn't add up in a way that lets you say you have 2000 GHz of total speed just because you have 500 CPUs running at 4 GHz each. For that statement to be even remotely applicable, you would need all those CPUs working on the same task in parallel, in a multithreaded way, and here we hit another stumbling block in CCP's setup.
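One common way to put numbers on this is Amdahl's law, which I am bringing in here as an illustration: if only a fraction p of a task can run in parallel, the best possible speedup on n workers is 1 / ((1 - p) + p / n). The sketch below uses the 500 CPUs from the example above. The parallel fractions are assumptions picked for illustration, not measurements of EVE.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup according to Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumed parallel fractions, from fully serial to fully parallel.
for p in (0.0, 0.5, 0.95, 1.0):
    print(f"parallel fraction {p:.2f}: at most {amdahl_speedup(p, 500):.1f}x on 500 CPUs")
```

Only a perfectly parallel task gets anywhere near "500 times 4 GHz". A task that runs on a single core gets no benefit from the other 499 at all.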
Bits and Threads
Multithreading - simply put - means that you can split the same operation among many processor cores at the same time so that they share the workload. CCP personnel rarely make statements about the specifics, but on the most recent episode of "Shit on Kugu" (a fitting name for that terrible podcast) CCP Dolan reiterated that "EVE can run on a single core" and that CCP are working on making EVE 64-bit so they can actually use multithreading. (It's about 45 minutes into the podcast.)
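As a rough sketch of what "splitting the same operation" looks like in code, the example below cuts a list of numbers into chunks, hands each chunk to a worker thread, and combines the partial results. All names are hypothetical, and this says nothing about how EVE's server code is written. One caveat: in standard CPython the global interpreter lock keeps threads from executing Python code on several cores at once, so this shows the structure of the approach rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, parts):
    """Split items into at most `parts` slices of roughly equal size."""
    size = max(1, -(-len(items) // parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def sum_of_squares(chunk):
    """The work done by one worker on its share of the data."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, workers=4):
    """Share one operation among several workers and combine the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = pool.map(sum_of_squares, chunked(values, workers))
        return sum(partial_sums)


values = list(range(1000))
print(parallel_sum_of_squares(values) == sum_of_squares(values))
```

The important part is that the work has to be divisible into independent pieces in the first place. Code that was written to run as one sequential loop does not get this for free.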
There is a hidden piece of information there. One commonly used server system is incapable of running 32-bit applications in parallel, multithreaded, while making full use of a multicore architecture: Microsoft Windows. As a matter of fact, the Windows Server OS is limited by its codebase in the number of cores it can utilize.
I once worked with a Windows Server HPC system (TBH "Windows HPC" sounds like an oxymoron to me to begin with) and I can tell you it was the worst performing HPC system I have ever seen. I wouldn't expect to run CFC vs. Everyone Else on such a system without lots of sleepless nights because of crashes.
Nodes and Resources
In standard HPC terminology, a node is a single hardware platform with its standard setup (CPU, memory, coprocessors etc. all on one motherboard). Generally, you would have one or more so-called master nodes which schedule processing tasks for all the compute nodes, which do the actual number crunching. This is done by means of a so-called job scheduler, and the applications are built to support running in parallel. The job scheduler looks everywhere on the cluster - according to parameters which you can set - for free resources to process the given workload.
So let's say you have 500 Ishtars all launch Sentry Drones and fire at a target. That is a set of mathematical operations which have to be done by the cluster. The scheduler gets this submitted as a job, and it then looks for cores, memory and clock cycles which are sufficient and optimal to finish those calculations. That is, if your system and your software support multithreaded parallel computation on multiple cores.
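The idea of a job scheduler can be sketched with a deliberately simple first-fit placement: each job asks for a number of cores, and the scheduler puts it on the first compute node that has enough free. This is a toy model with made-up node and job names. Real schedulers also weigh memory, priorities and queue policies.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cores: int


@dataclass
class Job:
    name: str
    cores_needed: int


def schedule(jobs, nodes):
    """Greedy first-fit: place each job on the first node with enough free cores.

    Returns a dict of job name -> node name, or None if the job has to wait.
    """
    placement = {}
    for job in jobs:
        for node in nodes:
            if node.free_cores >= job.cores_needed:
                node.free_cores -= job.cores_needed  # reserve the cores
                placement[job.name] = node.name
                break
        else:
            placement[job.name] = None  # nothing free anywhere: job stays queued
    return placement


# Hypothetical cluster and workload.
nodes = [Node("compute-1", 4), Node("compute-2", 8)]
jobs = [Job("volley-a", 3), Job("volley-b", 3), Job("volley-c", 6)]
print(schedule(jobs, nodes))
```

In this run the first job lands on compute-1, the second spills over to compute-2, and the third has to wait because no single node has six cores left. The point is that the scheduler decides at submission time where the work goes, based on what is free right now.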
CCP uses somewhat different terminology. They call an instance of the game a node, and that instance can run anything from a dozen star systems down to a single star system with everything in it. That sounds a bit awkward as a model for resource allocation.
CCP Dolan mentions how great it would be if they could dynamically allocate resources without having to shut down a node. As it is now, they actually have to manually assign a solar system to something which they call a "Reinforced Node". What CCP means by that is that they move the computation of all tasks that happen for one game instance in one solar system onto a very powerful machine. I'd guess they do that during downtime, because it seems from Dolan's statements that they actually have to take down that game instance and restart it on different hardware.
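A toy model shows why a fixed mapping of solar systems to nodes is awkward. Everything below is an assumption for illustration: the system names, node names, load figures and capacities are invented, and this is not CCP's actual code or data.

```python
def node_load(assignment, system_load):
    """Add up the load of every solar system pinned to each node."""
    load = {}
    for system, node in assignment.items():
        load[node] = load.get(node, 0) + system_load.get(system, 0)
    return load


def overloaded(assignment, system_load, capacity):
    """Names of nodes whose pinned systems exceed the node's capacity."""
    loads = node_load(assignment, system_load)
    return sorted(node for node, used in loads.items() if used > capacity.get(node, 0))


# Hypothetical static mapping, decided ahead of time.
assignment = {"System-A": "node-1", "System-B": "node-1", "System-C": "node-2"}
capacity = {"node-1": 100, "node-2": 100, "reinforced": 1000}

# A big fight breaks out in System-B (arbitrary load units).
system_load = {"System-A": 10, "System-B": 400, "System-C": 5}
print(overloaded(assignment, system_load, capacity))

# The same fight after System-B was remapped in advance.
remapped = {**assignment, "System-B": "reinforced"}
print(overloaded(remapped, system_load, capacity))
```

In the first case node-1 is swamped while node-2 and the reinforced machine sit nearly idle, because the mapping cannot react to where the load actually is. The second case only works if somebody knew about the fight beforehand.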
All that would not be necessary if they had a more intelligent HPC system.
Now, don't get me wrong. I have great respect for game developers. If you look at the amazing graphics and concurrent operations they can squeeze out of a model like that, you can only be impressed. In most cases, a modern computer game has to be able to run on one PC or on one game console, and I can tell you, your most pimped gaming rig will not outperform even a low-grade HPC compute node.
CCP are actually doing quite well considering what they have to make do with, but if I were one of the admins of that system, I'd wish for a special hell for the people who decided to build things that way in the first place.
So where does that leave us?
Maybe they do manage to make it happen that game instances can be transferred dynamically from one machine to another on the fly, but they would be much better off if all of EVE ran on all available machines as separate jobs that make efficient use of the resources.
So, until CCP actually builds another cluster and transfers all of EVE onto that new system, the big powers of New Eden will be stuck with TiDi and with having to request a "Reinforced Node" whenever they know they want to have a massive fleet fight in a particular system.
Or they could come up with a doctrine that doesn't require them to drop thousands of ships in one place to win, but for now that seems to be the path to victory, even if it is a slow grind of many hours.