My Experience with Precision Boost Overdrive 2 on a 5900X

Looking for the TL;DR? These are my everyday settings:

  • PPT – 185W, TDC – 125A, EDC – 170A. You’ll need a beefy cooler to run these power settings. If the CPU gets too hot, try PPT – 165W, TDC – 115A, EDC – 150A.
  • Negative 11 for the first preferred core on CCX 0 (as indicated by Ryzen Master)
  • Negative 15 for the second preferred core on CCX 0 (as indicated by Ryzen Master)
  • Negative 17 for the other cores.
  • These moved my multithreaded Cinebench R20 score from 8250 to around 8800-9000 (roughly 7-9% gain) and my single threaded Cinebench R20 score from 630 to 650 (3% gain).

__________________________________________________________________________________________________________________

AMD recently announced a new algorithm for Precision Boost Overdrive (PBO), aptly named Precision Boost Overdrive 2 (PBO2). You can read more here: AMD Ryzen™ Technology: Precision Boost 2 Performance Enhancement | AMD and here: AMD Introduces Precision Boost Overdrive 2, Boosts Single Thread Performance | Tom’s Hardware. This post is not intended to explain the technicalities of this feature, but rather to show how to take advantage of it.

To get started, you will need to navigate to the BIOS. Unfortunately, for now you cannot use Ryzen Master to do this, though AMD says this will be part of Ryzen Master in a future release. In the PBO section, you will need to adjust some settings.

Navigating to AMD Overclocking in the BIOS

My specs are as follows:

  • AMD Ryzen 5900X
  • ASRock X570 Steel Legend
  • 32GB C17 Memory
  • 750W PSU
  • 240mm AIO from be quiet!

At first, I naively set the power limits (PPT, TDC and EDC) to 0, which means unlimited. This has a negative effect: it lets the CPU draw as much power as it can, which translates into unnecessary power consumption and heat, and that in turn limits the maximum clock speed achieved. I’d suggest sticking to values which keep the CPU under (or close to) 80C under full load.

In my case, the maximum power settings I managed to sustain are: PPT – 185W, TDC – 125A, EDC – 170A. The recommended values for your CPU will vary according to the silicon quality and the cooling provided. Cooling 185W is not an easy feat; you’ll need a good cooler, such as an NH-D15 (noctua.at) or a good AIO (I am using the Pure Loop | 240mm silent essential Water coolers from be quiet!).

Setting PPT, TDC and EDC to well-balanced values is extremely important: it helps you strike a balance between the power the CPU needs and realistic temperatures. If the CPU gets too hot with these power settings, try PPT – 165W, TDC – 115A, EDC – 150A.

I have set the PBO scalar to manual and 10X. I will be honest: I am not sure what impact this has, but it looks like a setting which needs tweaking. I’ve tried 1X and honestly did not feel any difference. From what I understand, it controls how long the CPU keeps pushing high voltage / clocks before it dials them down. In burst scenarios, this should not have any impact.

Max CPU Boost Clock Override should be set to 200 MHz. This allows for higher clock speeds on single threaded workloads. My 5900X can hit 5.15 GHz on a single core with this setting. 5.15 GHz is not a one-off number; I regularly see it during light workloads.

Navigating to the Curve Optimizer in BIOS

Now, for the most important part: the Curve Optimizer. As a starting point, I set the best and second-best core of each CCD to negative 10, and the other cores to negative 15.

The next step is difficult to give instructions for, as it depends purely on your silicon quality. In my case, I found the following settings to work for me:

  • Negative 11 for the first preferred core on CCX 0 (as indicated by Ryzen Master)
  • Negative 15 for the second preferred core on CCX 0 (as indicated by Ryzen Master)
  • Negative 17 for the other cores

It took quite a lot of testing to arrive at these figures. You can find the first and second preferred cores in Ryzen Master.

Per Clock adjustments in the Curve Optimizer

Firstly, I started with negative 20 on all cores. This resulted in awesome Cinebench R20 scores but poor stability. I then went to negative 15 on all cores. This was not bad, but I was experiencing a crash every now and then, especially when the PC was running cold and able to push higher clocks. It would run all day, but pushing it right after boot would instantly result in a crash. This tells me that the algorithm was trying to push for higher clocks, but the undervolt was too aggressive.

I then went to negative 10 on all cores, which was fully stable. Finally, I pushed negative 15 on the cores which are not first or second preferred. This remained stable, and eventually I started changing the values slightly every day. Sometimes I go too far and get a WHEA BSOD (especially when the PC is cool and under light workloads).

These moved my multithreaded Cinebench R20 score from 8250 to around 8800-9000 (roughly 7-9% gain) and my single threaded Cinebench R20 score from 630 to 650 (3% gain). These are small gains, but since they come at no cost, it’s good to take advantage of them. And yes, they do not really translate to any tangible performance uplift in everyday computing.

Preferred Cores (Star is 1st, dot is second)

The performance uplift is thanks to higher sustained clocks. With PBO turned off, I was sustaining around 4.1 GHz core clock and with PBO on, I am sustaining between 4.4-4.5 GHz in Cinebench R20.

Cinebench scores with PBO2
Full load under Cinebench R20

Simpler workloads (non AVX) will clock past 4.5 GHz. I suspect that Ryzen calms down the clocks by a bit during AVX workloads, but I cannot confirm this.

Full load under a synthetic load – Memtest 64

Please let me know your experience with PBO2 and whether you find this post useful. If you have better settings than mine, I’d appreciate the feedback! Of course, keep in mind that, as AMD said, no two processors are the same; some might need more voltage than others to remain stable. It also depends on the power delivery quality, the sustained temperatures, the quality of the thermal paste, the overall case temperature and a plethora of other things, as mentioned in the first link to AMD’s site.

C# Micro Optimizations Part 2 – In Parameter Modifier

In this series of posts, we’re investigating micro-optimizations in C#. As previously mentioned, these may not be applicable to all; but it’s still fun looking at these concepts.

Let’s revisit the last post – Ref arguments. Ref arguments gave us the power of passing structs by reference, which is extremely efficient since no copy of the struct is made.

Mutability of a struct passed by ref

Passing structs by ref brings a major disadvantage – the callee might mutate the value of the struct without the caller ever knowing. What if we need to pass structs in an efficient manner, whilst having peace of mind that the callee doesn’t mutate the struct?

Meet the in parameter modifier – C# 7.2

What does the in parameter modifier do? It allows us to pass the argument by reference while giving us the guarantee that the argument cannot be modified by the callee. Excellent! Let’s run a quick test and make sure our performance is still comparable to passing by ref. Let’s have a struct with 2 properties and have some work done using two different methods – passing by ref and passing by in.

All code can be viewed here – https://github.com/albertherd/csharpmopt2-in

public class SixteenBitStructBenchmark
{
    [Benchmark]
    [Arguments(100000000)]
    public void BenchmarkIncrementByRef(int limit)
    {
        SixteenBitStruct sixteenBitStruct = new SixteenBitStruct();
        int counter = 0;
        do
        {
            IncrementByRef(ref sixteenBitStruct);
            counter++;
        }
        while (limit != counter);
    }

    [Benchmark]
    [Arguments(100000000)]
    public void BenchmarkIncrementByIn(int limit)
    {
        SixteenBitStruct sixteenBitStruct = new SixteenBitStruct();
        int counter = 0;
        do
        {
            IncrementIn(sixteenBitStruct);
            counter++;
        }
        while (limit != counter);
    }

    private void IncrementByRef(ref SixteenBitStruct sixteenBitStruct)
    {
        double sum = sixteenBitStruct.D1 + sixteenBitStruct.D2;
    }

    private void IncrementIn(in SixteenBitStruct sixteenBitStruct)
    {
        double sum2 = sixteenBitStruct.D1 + sixteenBitStruct.D2;
    }
}

public struct SixteenBitStruct
{
    public double D1 { get; }
    public double D2 { get; }
}

Let’s see how they perform.

Method                    limit       Mean        Error       StdDev
BenchmarkIncrementByRef   100000000   23.83 ms    0.0272 ms   0.0241 ms
BenchmarkIncrementByIn    100000000   238.21 ms   0.3108 ms   0.2755 ms

Performance loss?

Wait a second – why is BenchmarkIncrementByIn 10x slower than BenchmarkIncrementByRef when both are accessing the same 2 properties of the same struct? Let’s have a look at the generated IL.

IncrementByRef

IL_0000: ldarg.1
# Loads argument 1 (the address of the SixteenBitStruct)
IL_0001: call instance float64 InOperator.SixteenBitStruct::get_D1()
# Calls the getter directly on that address

IncrementIn

IL_0000: ldarg.1
# Loads argument 1 (the address of the SixteenBitStruct)
IL_0001: ldobj InOperator.SixteenBitStruct
# Copies the value of the SixteenBitStruct at that address onto the evaluation stack
IL_0006: stloc.0
# Pops the copy into local variable 0
IL_0007: ldloca.s V_0
IL_0009: call instance float64 InOperator.SixteenBitStruct::get_D1()
# Loads the address of local variable 0 (the new copy of SixteenBitStruct) and calls the getter on it

Interesting! When we’ve called the method by ref, the resultant IL just loads the argument and calls the getter. When we’ve called the method by in, the resultant IL creates a copy of the struct before the getter is called. It seems that each time we’re referencing the property, C# is generating a copy of the object for us? We’re facing a by-design feature – a defensive copy.

Why do we encounter a defensive copy?

When calling the getter of our properties, the compiler doesn’t know if the getter mutates the object. Although this is a getter, it’s only by convention that changes aren’t made; there is no language construct that prevents us from changing values in our getter. The compiler must honor the in keyword and generate a defensive copy, just in case the getter modifies the struct.

At the end of the day, a getter is just syntactic sugar for a method. Of course, defensive copies will also be generated if methods are called on the struct, since the compiler can’t provide any guarantee that the method call won’t mutate the struct.
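The defensive copy can also be observed without reading IL. The following is a minimal sketch, not taken from the linked repository: Counter, BumpByRef and BumpByIn are hypothetical names made up for illustration. It calls a mutating method on a non-readonly struct, once through ref and once through in.

```csharp
using System;

// Hypothetical struct for illustration: not readonly, with a mutating method.
public struct Counter
{
    public int Value;
    public void Increment() => Value++;
}

public static class DefensiveCopyDemo
{
    // Passed by ref: Increment() runs on the caller's struct.
    public static void BumpByRef(ref Counter counter) => counter.Increment();

    // Passed by in: the compiler cannot prove Increment() leaves the struct
    // unchanged, so it runs Increment() on a hidden copy instead.
    // (Writing counter.Value = 1 here would not compile at all.)
    public static void BumpByIn(in Counter counter) => counter.Increment();

    public static int ValueAfterRef()
    {
        var counter = new Counter();
        BumpByRef(ref counter);
        return counter.Value;
    }

    public static int ValueAfterIn()
    {
        var counter = new Counter();
        BumpByIn(counter);
        return counter.Value;
    }

    public static void Main()
    {
        Console.WriteLine(ValueAfterRef()); // 1: the caller's struct was mutated
        Console.WriteLine(ValueAfterIn());  // 0: only the defensive copy was mutated
    }
}
```

The in version silently does nothing to the caller's value, which is exactly the guarantee in promises, but it pays for a copy on every call to get there.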

How do we get around this?

We’ll need to instruct the compiler that our struct is immutable, so the compiler doesn’t need to worry about creating defensive copies since values cannot change. C# provides this exact functionality, in fact! We can slap the “readonly” keyword on the struct (and drop any setters) to guarantee that our struct is now immutable.

Here’s how it looks now:

public readonly struct SixteenBitStruct
{
    public double D1 { get; }
    public double D2 { get; }
}
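Since the struct above has no constructor, its properties always hold their default values. As a hedged sketch of how a readonly struct would typically be used with in, here is a variant with a constructor; Point2D and Sum are hypothetical names, not part of the benchmark code.

```csharp
using System;

// Hypothetical variant of the struct above, with a constructor so the
// get-only properties can hold non-default values.
public readonly struct Point2D
{
    public Point2D(double d1, double d2)
    {
        D1 = d1;
        D2 = d2;
    }

    public double D1 { get; }
    public double D2 { get; }
}

public static class ReadonlyStructDemo
{
    // The struct type itself promises immutability, so the compiler can call
    // the getters through the reference without a defensive copy.
    public static double Sum(in Point2D point) => point.D1 + point.D2;

    public static void Main()
    {
        var point = new Point2D(1.5, 2.5);
        Console.WriteLine(Sum(point)); // 4
    }
}
```

Note that the in keyword is optional at the call site; Sum(point) and Sum(in point) both pass by reference here.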

Revisiting our performance numbers

Let’s re-run our benchmarks and assess the performance.

Method                    limit       Mean       Error       StdDev
BenchmarkIncrementByRef   100000000   23.93 ms   0.1226 ms   0.1147 ms
BenchmarkIncrementByIn    100000000   24.06 ms   0.2183 ms   0.2042 ms

Far better! Performance is now equal (within margin of error). Some closing thoughts about this:

  • Using the in modifier is an excellent feature – it allows callers to safely assume that the values they pass will not be changed.
  • Using the readonly modifier with a struct is another excellent feature – it allows the developer to safely say that its value is immutable and no changes are allowed.
  • The performance uplift should be considered a bonus – the design and infrastructure wins from using the in / readonly keywords in these contexts carry far more value.
  • Don’t ever use the in keyword in conjunction with non-readonly structs. Chances are that the performance gained from passing by ref will be lost by accessing the struct’s properties and methods.

Until the next one!

C# Micro Optimizations Part 1 – Ref Arguments

In this series of posts, we’ll be investigating key areas for micro-optimizations. As the title implies, these are micro-optimizations and may not be applicable for you unless you are writing some high-performance library or have a piece of code running in a tight loop. Nonetheless, it’s still fun to investigate and find these micro-optimizations. Onwards!

Let’s start with a simple one – the ref keyword in method arguments. For this post, we’re only concerned with value type method arguments – structs.

Since structs are value types, by default, the entire struct is copied over to the callee, irrespective of the size of the struct. If the struct is big, this is typically a bottleneck since a copy must be created and passed for each call. C# provides a way of overriding this behavior by using the ref keyword. If an argument is marked as ref, a pointer to the struct will be passed rather than an actual copy!

This brings two major advantages:

  • If the struct is bigger than 4 bytes (on a 32 bit machine) or 8 bytes (on a 64 bit machine), passing a struct by ref means that less data copying is taking place.
  • We avoid copying back the data – we do not need to return the data since a reference is passed rather than a copy of the struct.
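Before looking at performance, here is a minimal sketch of the semantic difference between the two styles. Pair, IncrementCopy and IncrementInPlace are hypothetical names used only for this illustration.

```csharp
using System;

// Hypothetical 16 byte struct for illustration.
public struct Pair
{
    public long A, B;
}

public static class PassingDemo
{
    // By value: the callee receives its own copy, so the caller never sees the change.
    public static void IncrementCopy(Pair pair) { pair.A++; }

    // By ref: the callee receives the address of the caller's struct and changes it in place.
    public static void IncrementInPlace(ref Pair pair) { pair.A++; }

    public static long AfterByValue()
    {
        var pair = new Pair();
        IncrementCopy(pair);
        return pair.A;
    }

    public static long AfterByRef()
    {
        var pair = new Pair();
        IncrementInPlace(ref pair);
        return pair.A;
    }

    public static void Main()
    {
        Console.WriteLine(AfterByValue()); // 0: only the copy was incremented
        Console.WriteLine(AfterByRef());   // 1: the caller's struct was incremented
    }
}
```

This is why a by-value method has to return the struct if the caller wants the updated value, which is the second copy the list above refers to.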

Let’s see an example – let’s consider a struct containing two longs – a 16 byte struct. Let’s say we have two methods that increment one of the values for us (just to give the loop something to do and not get it optimised away).

One of them accepts a (copy of a) struct, increments its internal values and returns the copy back. This is passed by value, which is the default behavior for a struct.

The other method accepts a struct by ref and increments its internal values. There is no need to return the data back therefore no extra copies were needed. This is not the default behavior, so we’ll need to accompany it with the ref keyword.

The below is the source code in question – find the whole solution here: https://github.com/albertherd/csharpmopt1-ref

[CoreJob]
public class SixteenBytesStructBenchmark
{
    [Benchmark]
    [Arguments(1000000)]
    public void BenchmarkIncrementByRef(int limit)
    {
        SixteenBytesStruct value = new SixteenBytesStruct();
        int counter = 0;
        do
        {
            IncrementByRef(ref value);
            counter++;
        }
        while (limit != counter);
    }
    [Benchmark]
    [Arguments(1000000)]
    public void BenchmarkIncrementByVal(int limit)
    {
        SixteenBytesStruct value = new SixteenBytesStruct();
        int counter = 0;
        do
        {
            value = IncrementByVal(value);
            counter++;
        }
        while (limit != counter);
    }
    private void IncrementByRef(ref SixteenBytesStruct toIncrement)
    {
        toIncrement.d0++;
    }
    private SixteenBytesStruct IncrementByVal(SixteenBytesStruct toIncrement)
    {
        toIncrement.d0++;
        return toIncrement;
    }
}
public struct SixteenBytesStruct
{
    public long d0, d1;
}

The below is the time taken for 1000000 runs. This was executed on .NET Core 2.2.1, with benchmarks done using BenchmarkDotNet.

Method                    limit     Mean       Error       StdDev
BenchmarkIncrementByRef   1000000   1.663 ms   0.0139 ms   0.0130 ms
BenchmarkIncrementByVal   1000000   2.872 ms   0.0155 ms   0.0145 ms

We can see that, when running in a tight loop, doing the work by ref is in this case about 72% faster (the by-value version takes roughly 1.7x as long)! To what can we attribute this performance change? Let’s have a look at what’s happening behind the scenes.
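For reference, the comparison can be worked out from the two means in the table. This is only a small arithmetic sketch; SpeedupMath and its methods are hypothetical helpers, not part of the benchmark project.

```csharp
using System;

public static class SpeedupMath
{
    // How many times longer the slow version takes compared to the fast one.
    public static double Ratio(double slowMs, double fastMs) => slowMs / fastMs;

    // How much of the slow version's time is saved by the fast one, in percent.
    public static double TimeSavedPercent(double slowMs, double fastMs) =>
        (slowMs - fastMs) / slowMs * 100.0;

    public static void Main()
    {
        const double byValMs = 2.872; // BenchmarkIncrementByVal mean
        const double byRefMs = 1.663; // BenchmarkIncrementByRef mean

        Console.WriteLine(Ratio(byValMs, byRefMs));            // roughly 1.73
        Console.WriteLine(TimeSavedPercent(byValMs, byRefMs)); // roughly 42
    }
}
```

So the same result can be read either as by value taking about 73% longer, or as by ref saving about 42% of the time.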

Doing the work by value

Calling IncrementByVal

IL_000a: ldarg.0 # Load the “this” parameter on evaluation stack (implicit)
IL_000b: ldloc.0 # Load SixteenBytesStruct value on the stack (16 bytes worth of data) from location 0
IL_000c: call instance valuetype Ref.SixteenBytesStruct Ref.SixteenBytesStructBenchmark::IncrementByVal(valuetype Ref.SixteenBytesStruct) # Call IncrementByVal with the loaded arguments
IL_0011: stloc.0 # Captures the returned value and stores it in location 0

IncrementByVal Implementation

IL_0000: ldarga.s toIncrement # Load the argument’s address so processing can begin
..method work – removed for brevity
IL_001a: ldarg.1 # Load the value of the (updated) struct argument so it can be returned
IL_001b: ret

What’s happening here?

  • Push the value of SixteenBytesStruct ready to be captured by the upcoming method call
  • Call IncrementByVal
  • IncrementByVal loads the address of the received value from the caller and does the required work
  • Push the value of the SixteenBytesStruct after the work has been done, ready to be captured by the caller
  • IncrementByVal returns
  • Pop the returned value and replace the value of SixteenBytesStruct with the new one

Doing the work by ref

Calling IncrementByRef

IL_000a: ldarg.0 # Load the “this” parameter on evaluation stack (implicit)
IL_000b: ldloca.s V_0 # Load SixteenBytesStruct’s address on the stack (8 bytes worth of data)
IL_000d: call instance void Ref.SixteenBytesStructBenchmark::IncrementByRef(valuetype Ref.SixteenBytesStruct&) # Call IncrementByRef with the loaded arguments

IncrementByRef Implementation

IL_0000: ldarg.1 # Load the argument so processing can begin. We’re not calling ldarga.s since this is already the struct’s address rather than the actual value
..method work – removed for brevity
IL_0018: ret # Return

What’s happening here?

  • Push the address of SixteenBytesStruct ready to be captured by the upcoming method call
  • Call IncrementByRef
  • IncrementByRef gets the value received from the caller (the value is an address) and does the required work
  • IncrementByRef returns

What does this mean?

One can note that doing the work by ref involves significantly less work:

  • The caller pushes 8 bytes instead of 16 bytes
  • The callee loads 8 bytes onto the evaluation stack instead of 16 bytes
  • The callee doesn’t need to push the new value onto the evaluation stack
  • The caller doesn’t need to pop the stack and store the updated value

Therefore, doing the work by ref pushes less data when a method call takes place (a maximum of 8 bytes on a 64 bit machine, irrespective of the struct size) and avoids two data copy instructions, since there is no return value to push and pop.
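The two sizes being compared can be checked at runtime. This is a minimal sketch assuming .NET 5 or later, where System.Runtime.CompilerServices.Unsafe is available without an extra package; SizeDemo is a hypothetical helper class, and the struct is repeated only to keep the sketch self-contained.

```csharp
using System;
using System.Runtime.CompilerServices;

// Same layout as the struct used in the benchmark above.
public struct SixteenBytesStruct
{
    public long d0, d1;
}

public static class SizeDemo
{
    // Size of the struct itself: what gets copied when passing by value.
    public static int StructSize() => Unsafe.SizeOf<SixteenBytesStruct>();

    // Size of a pointer in the current process: what gets passed by ref
    // (8 in a 64 bit process, 4 in a 32 bit process).
    public static int PointerSize() => IntPtr.Size;

    public static void Main()
    {
        Console.WriteLine($"struct: {StructSize()} bytes, pointer: {PointerSize()} bytes");
    }
}
```

Adding more fields to the struct grows the first number but never the second.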

If you increase the size of the struct, the performance gains would be even bigger, as shown in the below graph.

[Graph: C#RefBenchmark]

We can observe some useful information from this graph

  • When it comes to doing operations by ref, performance is basically equivalent across the board, irrespective of the size of the struct.
  • 16 byte, 8 byte and 4 byte structs carry identical performance – they are just separated by the margin of error.
  • 16 byte, 8 byte and 4 byte structs are faster than 2 byte and 1 byte structs. In fact, a 1 byte struct ends up clearly slower than a 2 byte struct! It’s very interesting to explore why 1 and 2 byte structs exhibit performance degradation.
  • The rest of the results show a consistent upward trend, which reflects the amount of data copying taking place.

What’s very interesting is that a 4 byte integer operates faster by value when compared to 1 byte and 2 byte integers!