Automation

For anyone who has ever had to do boring, repetitive tasks, there is always the wish that they could be done without any interaction at all. For a number of professional applications, automation can be a primary requirement - the ability to press a button, let something go, and get the same result every time removes headaches and opens the door to scaling out the process.

When it comes to benchmarking, an automated test suite enables several benefits. Tests can have consistent delays between them, providing the same environment for temperature and turbo ramps; the cache and drive defragmentation state can be arranged and standardized the same way before each run; and the whole process lends itself to repeatable, consistent results. Bonus points are then awarded if the testing can be scaled out to multiple systems at once. Sitting at a system and starting tests at irregular intervals adds more degrees of freedom - things that might not be consistent and can affect the results. Plus it becomes incredibly dull, incredibly fast. I mean OEM product manufacturing line dull. To all my fellow reviewers out there, I know the pain when you have several hundred hours of gameplay on something like Far Cry 5, but it’s all just benchmarking.

This is where I point to the well-known graph about automation (original source unknown):

For small tasks or projects, sometimes doing the work manually is quicker. If a task takes 5 minutes to do by hand, but 8 hours to write a script that saves 5 seconds per run, the script has to be run 5760 times before it pays off. If the script is run 50 times a day, the payoff comes after about 115 days. This ignores scale-out: if the script allows multiple systems to run concurrently, the break-even point comes much sooner, and for a lot of tasks that makes it a no-brainer to put the effort in. Otherwise, three years later, it becomes ultimately depressing when running Cinebench for the 80,000th time. (Insert stories from TheDailyWTF about a boss who does not want automation because it might kill their job.) Insert obligatory XKCD.

When I first started at AnandTech, testing motherboards, I did not run anything automated. Going through a basic motherboard testing suite manually took three days, because when testing by hand you have to be alert and present every time a test finishes in order to start the next one (and if the mind wanders, that 2-minute test becomes 15 minutes until you realize it's done). For our 2015 CPU Benchmark suite, a basic script was written that performed about 20 tests and lasted around 4 hours. It looked like spaghetti code, and very quickly became annoying to manage and update, especially when a benchmark decided it wasn't going to work or needed to be bypassed – there was no easy way to add benchmarks either. On top of this, installing benchmarks was manual. Insert more XKCD. Thank you XKCD.

The new scripts for our Windows 10 testing are larger, modular, and more involved. The goal was essentially to automate everything that was feasibly possible within my knowledge (or didn't require much new learning), with no user interaction required. Over the course of two months, while testing which benchmarks were usable and applicable, two major scripts were written: CPU Tests and CPU Gaming Tests.

How to Automate: Batch Files, PowerShell, and AHK

There are many ways to automate in a system. Ganesh, for example, uses PowerShell almost exclusively to call benchmarks from the command line. To say that PowerShell is a glorified command prompt doesn't do it justice, but Ganesh ensures that his workloads for mini-PC testing can only ever run from the command line, and the results can be parsed therein. 

I'm not as au fait with PowerShell (if I had time for a crash course, it'd be on my to-do list), so I use a combination of batch files and a tool called AutoHotKey (AHK for short). AHK is a simple enough scripting language which can run programs, call command line functions, call PowerShell scripts, emulate mouse movements, emulate clicks and keyboard presses, and perform internal math, with subroutine support. It is like a poor man's C++, with an alarming number of foibles, such as poor type definition and zero type checking, but it can work if you treat it right.
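
To give a flavor of what that looks like in practice, here is a minimal AHK (v1) sketch of the kind of building blocks involved - launching a program, waiting on its window, sending a keystroke, and timing the run. The executable name and window title are placeholders rather than anything from our actual suite.

```
; Minimal AHK v1 sketch: launch a program, wait for its window, send a
; keystroke to start it, and time how long it takes. Names are placeholders.
#NoEnv
SetWorkingDir %A_ScriptDir%

start := A_TickCount
Run, ExampleBenchmark.exe              ; launch the benchmark executable
WinWait, Example Benchmark             ; wait for its window to appear
WinActivate, Example Benchmark
Send, {Enter}                          ; simulate the keypress that starts the run
WinWaitClose, Example Benchmark        ; block until the program closes
elapsed := (A_TickCount - start) / 1000.0
FileAppend, %elapsed%`n, results.txt   ; record wall-clock time in seconds
```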

For each benchmark I tested for suitability, either a fixed benchmark like Cinebench or a custom workload such as WinRAR or Blender, I tried to get the test to run from a simple batch file command line and manipulate the output. For Cinebench 15, the result is part of the stderr output; for Photoscan, a results file is written out by the Python script the test requires, which Agisoft provided (and I've edited). For WinRAR it is a timing function wrapped around a command line call pointing at the workload, and for Civilization 6 it's a simple flag after adjusting the settings file. For benchmarks like Gears Tactics, or Cinebench R10, there is no command line option and we have to turn to AHK to simulate keyboard presses.
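
As a rough illustration of the two command-line approaches - a timing wrapper around a workload, and redirecting stderr to a file for parsing - here is a hedged sketch. The executable paths, switches, and the regular expression are placeholders, not the exact settings of our tests.

```
; 1) A timed command-line workload, in the style of the WinRAR test.
RunCompressionTest() {
    start := A_TickCount
    RunWait, "C:\Program Files\WinRAR\WinRAR.exe" a -ep1 test.rar workload\, , Hide
    return (A_TickCount - start) / 1000.0          ; wall-clock seconds
}

; 2) A fixed benchmark that prints its score to stderr: redirect stderr
;    to a file via the command interpreter, then regex the value out.
RunFixedBenchmark() {
    FileDelete, stderr.txt
    RunWait, %ComSpec% /c ExampleBench.exe -run 2> stderr.txt, , Hide
    FileRead, output, stderr.txt
    if RegExMatch(output, "i)score[:=]\s*([\d\.]+)", m)
        return m1                                  ; first capture group
    return ""
}
```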

So with each benchmark profiled, the individual tests are written as separate functions in AHK with three stages: preparation/installation, execution, and result parsing.

Preparation involves ensuring that the benchmark can be run in its current state, installing it if it isn’t, and deleting any previous temporary results file (if present) to ensure the directory structure is valid where needed. With the right preparation, running each test in the same manner makes the results as consistent as possible. Parsing the output into something suitable usually means flicking through an output file and applying the appropriate regular expressions to pull out the required value. Some tests automatically allow for repeated results (Corona or 3DPMv2), whereas others need multiple runs specified (WinRAR); those results can be put into an array and averaged or geomeaned using AHK. A final function takes the results and files them into a custom results directory.
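
Put together, each test function in the script ends up looking something like the skeleton below. This is a hedged sketch rather than a copy of the real thing: the install path, silent-install switch, output file, regular expression, and the 30-second cool-down are all illustrative placeholders.

```
; Skeleton of a three-stage test function (prepare, execute, parse), plus a
; geometric mean helper. All names, paths, and values are placeholders.
TestExampleBench(repeats := 3) {
    ; --- Preparation: install if missing, clear any stale output ---
    if !FileExist("C:\Bench\ExampleBench\bench.exe")
        RunWait, installers\examplebench_setup.exe /S, , Hide
    FileDelete, temp_result.txt

    ; --- Execution: run the workload the requested number of times ---
    results := []
    Loop, %repeats%
    {
        RunWait, C:\Bench\ExampleBench\bench.exe -run -out temp_result.txt, , Hide

        ; --- Parsing: pull the score out of the output with a regex ---
        FileRead, raw, temp_result.txt
        if RegExMatch(raw, "i)score[:=]\s*([\d\.]+)", m)
            results.Push(m1)

        Sleep, 30000    ; fixed cool-down between runs for thermal/turbo consistency
    }
    return GeoMean(results)
}

GeoMean(arr) {
    logSum := 0
    for i, v in arr
        logSum += Ln(v)
    return Exp(logSum / arr.Length())
}
```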

Outside of the testing functions is a general preparation element to the script. For our testing we have four main modes: the full list of tests, a short list of tests (determined in the script), running a single test, and an option to continue from a certain point of a full test run (in case one benchmark needed attention and errored out the process, such as a web benchmark when the server host fails). The initialization of one of our scripts asks which benchmark suite is required, detects the CPU and GPU present in the system, and then offers a default location to save the results based on that CPU/GPU combination. By having the results location determined when the script is started, we can move results to the directory as each test finishes, and the results are parsed into an easy-to-read format for a mental check before they go into the database. For ease of use, I have a results location on a NAS, so as the script uploads benchmark results to it, I can start looking at them while the other benchmarks are still running. Useful when running to a deadline! We also do additional checks on the state of Spectre and Meltdown fixes in the OS, to ensure consistency.
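
The CPU/GPU detection itself is straightforward through WMI, which AHK can reach over COM. A minimal sketch of the idea is below; the NAS share path is a placeholder, and the real script does rather more sanity checking.

```
; Detect the CPU and GPU via WMI, then build a results directory named
; after the combination. The NAS path is a placeholder.
GetCPUName() {
    for cpu in ComObjGet("winmgmts:").ExecQuery("SELECT Name FROM Win32_Processor")
        return Trim(cpu.Name)
}

GetGPUName() {
    for gpu in ComObjGet("winmgmts:").ExecQuery("SELECT Name FROM Win32_VideoController")
        return Trim(gpu.Name)
}

resultsDir := "\\NAS\Results\" . GetCPUName() . " - " . GetGPUName()
FileCreateDir, %resultsDir%

; As each test finishes, its parsed result file gets copied across, e.g.:
; FileCopy, examplebench_result.txt, %resultsDir%, 1
```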

Sanity Checks of Results and Running Order

Mental checks of results become important - being able to spot an outlier, or identifying when a result seems abnormal. For example, through the initial testing, I noticed that the result from one of our web tests (scoring ~100ms) was staying in the clipboard for the next web test (scoring 700ms). This gave a much lower average for the second test - and it only happened on fast CPUs. Similarly with game tests, with the benchmark being repeated multiple times, sometimes a result (for whatever reason) might be 10% down on all the others. So either automatic detection of outliers needs to be in place (which doesn't work if two results out of four repeats are bad), or a manual mental check needs to take place. There are a few things that automation can’t replace easily, such as experience. This is also where for some tests an average might be representative, while for others a median might be more appropriate.
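
Automation can still flag the obvious cases, though. A minimal sketch of a median-based check is below - the 10% tolerance is an illustrative number, not a rule we apply blindly, and anything flagged still gets eyeballed.

```
; Flag any repeat that sits more than ~10% away from the median of the set.
Median(arr) {
    joined := ""
    for i, v in arr
        joined .= (joined = "" ? "" : "`n") . v
    Sort, joined, N                       ; numeric sort via the built-in command
    sorted := StrSplit(joined, "`n")
    n := sorted.Length()
    mid := Ceil(n / 2)
    return (Mod(n, 2) = 1) ? sorted[mid] : (sorted[mid] + sorted[mid + 1]) / 2
}

FlagOutliers(arr, tolerance := 0.10) {
    med := Median(arr)
    flagged := []
    for i, v in arr
        if (Abs(v - med) > tolerance * med)
            flagged.Push(i)               ; indices that need a manual look
    return flagged
}
```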

It is also worth noting how the benchmark running order is determined. Experience with our previous automation has shown that the shortest tests should run first, in order to populate our results directory on the NAS quicker, and the longer tests should be near the end but not right at the end. The tests that more frequently cause unpredictable errors (e.g. DLL support on a new platform causing a system to hang, or a benchmark that relies on online license servers which could be down for maintenance) go in last, so an overnight run will get through as many tests as possible before tackling potential breaks in the testing.
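
In the script this just comes down to the order of the function calls, something along these lines (the test names are placeholders):

```
; Illustrative running order: short tests first so results hit the NAS early,
; long tests near the end, and the historically fragile ones last.
testOrder := ["ShortTestA", "ShortTestB", "MediumTestC", "LongRenderTest", "OnlineLicenseTest"]
for i, testName in testOrder
{
    %testName%()        ; dynamic call to the test function of that name
    Sleep, 60000        ; fixed gap between tests for thermal consistency
}
```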

GPU Tests and Steam

The methods listed above work for our CPU and CPU Gaming tests. The CPU Gaming tests have an additional element, given that we are using games from Steam, and we are using only one log-in account for multiple systems under test at once. For the most part, if the game title runs nicely offline, the test can be run offline. Unfortunately there are some games (GTA, RDR2) where the benchmark script runs 95% smoother when the user is logged in, due to online DRM checks.

For this, the script I’ve written runs a test-and-lock mechanism when trying to log in to Steam, and only tries to run the online tests if the account is not already signed in elsewhere. If the account is already signed in on a different system, the first system will instead automatically run one of the offline tests and come back after that test to see if online access is available. If not, it will run another of the offline tests, check again, and so on, until there are no more offline tests to run, at which point it will sit and probe every 120 seconds for access to Steam. For the machine that is online, it will run both sets of the online tests back-to-back, and then go back offline to run the rest of the offline tests, freeing the lock for any other machine that needs it. Some of this uses Steam's APIs, probing how Steam’s login mechanism works, and undocumented features.
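
I won't reproduce the Steam-specific probing here, but the test-and-lock idea itself can be sketched with nothing more than a lock file on shared storage. This is an assumption-laden simplification of what the real script does - the share path, the timings, and the lack of a truly atomic lock are all shortcuts for illustration.

```
; Simplified test-and-lock sketch using a lock file on a shared NAS path.
; Not atomic, but close enough for a handful of test systems.
TryAcquireSteamLock(lockFile := "\\NAS\locks\steam_account.lock") {
    if FileExist(lockFile)
        return false                       ; another system holds the account
    FileAppend, %A_ComputerName%`n, %lockFile%
    return true
}

ReleaseSteamLock(lockFile := "\\NAS\locks\steam_account.lock") {
    FileDelete, %lockFile%
}

RunGamingSuite(onlineTests, offlineTests) {
    while (onlineTests.Length() > 0)
    {
        if TryAcquireSteamLock()
        {
            for i, test in onlineTests     ; run the DRM-sensitive titles back-to-back
                %test%()
            onlineTests := []
            ReleaseSteamLock()
        }
        else if (offlineTests.Length() > 0)
        {
            nextTest := offlineTests.Pop() ; do useful offline work while waiting
            %nextTest%()
        }
        else
            Sleep, 120000                  ; probe again in 120 seconds
    }
    for i, test in offlineTests            ; whatever offline tests remain
        %test%()
}
```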

Comments

  • ruthan - Monday, July 27, 2020 - link

    Well, lots of blah, blah, blah. I checked the graphs in the article and they are the classic kind, with just a few entries. There is a link to your benchmark database, but it comes up with some Crysis benchmark preselected, which is not part of the article, and it doesn't lead to some ultimate lots-of-CPUs graph. So it needs much more streamlining.

    I usually use old Geekbench for CPU tests, and there I can usually compare what I want. Not with real applications and games, but it's quick too. Otherwise I usually have enough knowledge to know whether a CPU is good enough for some games or not, so I don't need very old and very new comparisons. Something can be found at Phoronix.
    These benchmarks will always lose relevancy with new updates, unless all the CPUs were kept in their own machines, updated, running, and retesting constantly - which would be quite a waste of power and money.
    Maybe the golden path is some simple multithreaded testing utility with two benchmarks, one for integers and one for floats.
  • Ian Cutress - Wednesday, August 5, 2020 - link

    When you're in Bench, check the drop-down menu on your left for the individual tests.
  • hnlog - Wednesday, July 29, 2020 - link

    > For our testing on the 2020 suite, we have secured three RTX 2080 Ti GPUs direct from NVIDIA.
    Congrats!
  • Koenig168 - Saturday, August 1, 2020 - link

    It would be more efficient to focus on the more popular CPUs. Some of the less popular SKUs which differ only by clock speed can have their performance extrapolated. Testing 900 CPUs sounds nice, but it quickly hits diminishing returns in terms of usefulness after the first few hundred.

    You might also wish to set some minimum performance standards using just a few tests. Any CPU which failed to meet those standards should be marked as "obsolete, upgrade already dude!" and be done with them rather than spend the full 30 to 40 hours testing each of them.

    Finally, you need to ask yourself "How often do I wish to redo this project and how many resources will I be able to devote to it?" Bear in mind that with new drivers, games etc., the database needs to be updated periodically to stay relevant. This will provide a realistic estimate of how many CPUs to include in the database.
  • Meteor2 - Monday, August 3, 2020 - link

    I think it's a labour of love...
  • TrevorX - Thursday, September 3, 2020 - link

    My suggestion would be to bench the highest performing Xeons that supported DDR3 RAM. Why? Because the cost of DDR3 RDIMMs is so amazingly cheap (as in, less than 10%) compared with DDR4. I personally have a Xeon E5-1660v2 @4.1GHz with 128GB DDR3 1866MHz RDIMMs that's the most rock stable PC I've ever had. Moving up to a DDR4 system with similar memory capacity would be eye-wateringly expensive. I currently have 466 tabs open in Chrome, Outlook, Photoshop, Word, several Excel spreadsheets, and I'm only using 31.3% of physical RAM. I don't game, so I would be genuinely interested in what actual benefit would be derived from an upgrade to Ryzen / Threadripper.

    Also very keen to see server/hypervisor testing of something like Xeon E5-2667v2 vs Xeon W-1270P or Xeon Silver 4215R for evaluation of on-prem virtualisation hosts. A lot of server workloads are being shifted to the cloud for very good reasons, but for smaller businesses it might be difficult to justify the monthly expense of cloud hosting (and Azure licensing) when they still have a perfectly serviceable 5yo server with plenty of legs left on it. It would be great to be able to see what performance and efficiency improvements can be had jumping between generations.
  • Tilmitt - Thursday, October 8, 2020 - link

    When is this going to be done?
  • Mil0 - Friday, October 16, 2020 - link

    Well they launched with 12 results if I count correctly, and currently there are 38 listed, that's close to 10/month. With the goal of 900, that would mean over 7 years (in which ofc more CPUs would be released)
  • Mil0 - Friday, October 16, 2020 - link

    Well they launched with 12 results if I count correctly, and currently there are 44 listed, that's about a dozen a month. With the goal of 900, that would mean 6 years (in which ofc more CPUs would be released)
  • Mil0 - Friday, October 16, 2020 - link

    Caching hid my previous comment from me, so instead of a follow-up there are now 2 pretty similar ones. However, in the meantime I found Ian is actually posting updates on Twitter, which you can find here: https://twitter.com/IanCutress/status/131350328982...

    He actually did 36 CPUs in 2.5 months, so it should only take 5 years! :D
