Performance Improvements in .NET 7
Stephen Toub - MSFT
August 31st, 2022
A year ago, I published Performance Improvements in .NET 6, following
on the heels of similar posts for .NET 5, .NET Core 3.0, .NET Core
2.1, and .NET Core 2.0. I enjoy writing these posts and love reading
developers' responses to them. One comment in particular last year
resonated with me. The commenter cited the Die Hard movie quote,
"'When Alexander saw the breadth of his domain, he wept for there
were no more worlds to conquer'," and questioned whether .NET
performance improvements were similar. Has the well run dry? Are
there no more "[performance] worlds to conquer"? I'm a bit giddy to
say that, even with how fast .NET 6 is, .NET 7 definitively
highlights how much more can be and has been done.
As with previous versions of .NET, performance is a key focus that
pervades the entire stack, whether it be features created explicitly
for performance or non-performance-related features that are still
designed and implemented with performance keenly in mind. And now
that a .NET 7 release candidate is just around the corner, it's a
good time to discuss much of it. Over the course of the last year,
every time I've reviewed a PR that might positively impact
performance, I've copied that link to a journal I maintain for the
purposes of writing this post. When I sat down to write this a few
weeks ago, I was faced with a list of almost 1000
performance-impacting PRs (out of more than 7000 PRs that went into
the release), and I'm excited to share almost 500 of them here with
you.
One thought before we dive in. In past years, I've received the odd
piece of negative feedback about the length of some of my
performance-focused write-ups, and while I disagree with the
criticism, I respect the opinion. So, this year, consider this a
"choose your own adventure." If you're here just looking for a super
short adventure, one that provides the top-level summary and a core
message to take away from your time here, I'm happy to oblige:
TL;DR: .NET 7 is fast. Really fast. A thousand
performance-impacting PRs went into runtime and core libraries
this release, never mind all the improvements in ASP.NET Core and
Windows Forms and Entity Framework and beyond. It's the fastest
.NET ever. If your manager asks you why your project should
upgrade to .NET 7, you can say "in addition to all the new
functionality in the release, .NET 7 is super fast."
Or, if you prefer a slightly longer adventure, one filled with
interesting nuggets of performance-focused data, consider skimming
through the post, looking for the small code snippets and
corresponding tables showing a wealth of measurable performance
improvements. At that point, you, too, may walk away with your head
held high and my thanks.
Both noted paths achieve one of my primary goals for spending the
time to write these posts, to highlight the greatness of the next
release and to encourage everyone to give it a try. But, I have other
goals for these posts, too. I want everyone interested to walk away
from this post with an upleveled understanding of how .NET is
implemented, why various decisions were made, tradeoffs that were
evaluated, techniques that were employed, algorithms that were
considered, and valuable tools and approaches that were utilized to
make .NET even faster than it was previously. I want developers to
learn from our own learnings and find ways to apply this new-found
knowledge to their own codebases, thereby further increasing the
overall performance of code in the ecosystem. I want developers to
take an extra beat, think about reaching for a profiler the next time
they're working on a gnarly problem, think about looking at the
source for the component they're using in order to better understand
how to work with it, and think about revisiting previous assumptions
and decisions to determine whether they're still accurate and
appropriate. And I want developers to be excited at the prospect of
submitting PRs to improve .NET not only for themselves but for every
developer around the globe using .NET. If any of that sounds
interesting, then I encourage you to choose the last adventure:
prepare a carafe of your favorite hot beverage, get comfortable, and
please enjoy.
(Oh, and please don't print this to paper. "Print to PDF" tells me it
would take a third of a ream.)
Table of Contents
* Setup
* JIT
* GC
* Native AOT
* Mono
* Reflection
* Interop
* Threading
* Primitive Types and Numerics
* Arrays, Strings, and Spans
* Regex
* Collections
* LINQ
* File I/O
* Compression
* Networking
* JSON
* XML
* Cryptography
* Diagnostics
* Exceptions
* Registry
* Analyzers
* What's Next?
Setup
The microbenchmarks throughout this post utilize benchmarkdotnet. To
make it easy for you to follow along with your own validation, I have
a very simple setup for the benchmarks I use. Create a new C#
project:
dotnet new console -o benchmarks
cd benchmarks
Your new benchmarks directory will contain a benchmarks.csproj file
and a Program.cs file. Replace the contents of benchmarks.csproj with
this:
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>net7.0;net6.0</TargetFrameworks>
    <LangVersion>Preview</LangVersion>
    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
    <ServerGarbageCollection>true</ServerGarbageCollection>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="BenchmarkDotNet" Version="0.13.2" />
  </ItemGroup>

</Project>
and the contents of Program.cs with this:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Microsoft.Win32;
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.ComponentModel;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.IO.MemoryMappedFiles;
using System.IO.Pipes;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Net.Security;
using System.Net.Sockets;
using System.Numerics;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Security.Authentication;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;
using System.Text;
using System.Text.Json;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;
using System.Xml;
[MemoryDiagnoser(displayGenColumns: false)]
[DisassemblyDiagnoser]
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public partial class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    // ... copy [Benchmark]s here
}
For each benchmark included in this write-up, you can then just copy
and paste the code into this test class, and run the benchmarks. For
example, to run a benchmark comparing performance on .NET 6 and .NET
7, do:
dotnet run -c Release -f net6.0 --filter '**' --runtimes net6.0 net7.0
This command says "build the benchmarks in release configuration
targeting the .NET 6 surface area, and then run all of the benchmarks
on both .NET 6 and .NET 7." Or to run just on .NET 7:
dotnet run -c Release -f net7.0 --filter '**' --runtimes net7.0
which instead builds targeting the .NET 7 surface area and then only
runs once against .NET 7. You can do this on any of Windows, Linux,
or macOS. Unless otherwise called out (e.g. where the improvements
are specific to Unix and I run the benchmarks on Linux), the results
I share were recorded on Windows 11 64-bit but aren't
Windows-specific and should show similar relative differences on the
other operating systems as well.
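As a concrete illustration of that workflow, here's a hypothetical benchmark (this one isn't from the post; it's just an example of the shape) that you could paste into the Program class and run with the commands above:
// Hypothetical example benchmark (not one from this post); any [Benchmark]
// snippet shown later can be pasted into the Program class the same way.
private readonly string _text = "Hello, world! How are you today?";

[Benchmark]
public bool ContainsComma() => _text.Contains(',');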
The release of the first .NET 7 release candidate is right around the
corner. All of the measurements in this post were gathered with a
recent daily build of .NET 7 RC1.
Also, my standard caveat: These are microbenchmarks. It is expected
that different hardware, different versions of operating systems, and
the way in which the wind is currently blowing can affect the numbers
involved. Your mileage may vary.
JIT
I'd like to kick off a discussion of performance improvements in the
Just-In-Time (JIT) compiler by talking about something that itself
isn't actually a performance improvement. Being able to understand
exactly what assembly code is generated by the JIT is critical when
fine-tuning lower-level, performance-sensitive code. There are
multiple ways to get at that assembly code. The online tool
sharplab.io is incredibly useful for this (thanks to @ashmind for
this tool); however it currently only targets a single release, so as
I write this I'm only able to see the output for .NET 6, which makes
it difficult to use for A/B comparisons. godbolt.org is also valuable
for this, with C# support added in compiler-explorer/
compiler-explorer#3168 from @hez2010, with similar limitations. The
most flexible solutions involve getting at that assembly code
locally, as it enables comparing whatever versions or local builds
you desire with whatever configurations and switches set that you
need.
One common approach is to use the [DisassemblyDiagnoser] in
benchmarkdotnet. Simply slap the [DisassemblyDiagnoser] attribute
onto your test class: benchmarkdotnet will find the assembly code
generated for your tests and some depth of functions they call, and
dump out the found assembly code in a human-readable form. For
example, if I run this test:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
[DisassemblyDiagnoser]
public partial class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    private int _a = 42, _b = 84;

    [Benchmark]
    public int Min() => Math.Min(_a, _b);
}
with:
dotnet run -c Release -f net7.0 --filter '**'
in addition to doing all of its normal test execution and timing,
benchmarkdotnet also outputs a Program-asm.md file that contains
this:
; Program.Min()
mov eax,[rcx+8]
mov edx,[rcx+0C]
cmp eax,edx
jg short M00_L01
mov edx,eax
M00_L00:
mov eax,edx
ret
M00_L01:
jmp short M00_L00
; Total bytes of code 17
Pretty neat. This support was recently improved further in dotnet/
benchmarkdotnet#2072, which allows passing a filter list on the
command-line to benchmarkdotnet to tell it exactly which methods'
assembly code should be dumped.
If you can get your hands on a "debug" or "checked" build of the .NET
runtime ("checked" is a build that has optimizations enabled but also
still includes asserts), and specifically of clrjit.dll, another
valuable approach is to set an environment variable that causes the
JIT itself to spit out a human-readable description of all of the
assembly code it emits. This can be used with any kind of
application, as it's part of the JIT itself rather than part of any
specific tool or other environment, it supports showing the code the
JIT generates each time it generates code (e.g. if it first compiles
a method without optimization and then later recompiles it with
optimization), and overall it's the most accurate picture of the
assembly code as it comes "straight from the horse's mouth," as it
were. The (big) downside of course is that it requires a non-release
build of the runtime, which typically means you need to build it
yourself from the sources in the dotnet/runtime repo.
... until .NET 7, that is. As of dotnet/runtime#73365, this assembly
dumping support is now available in release builds as well, which
means it's simply part of .NET 7 and you don't need anything special
to use it. To see this, try creating a simple "hello world" app like:
using System;
class Program
{
    public static void Main() => Console.WriteLine("Hello, world!");
}
and building it (e.g. dotnet build -c Release). Then, set the
DOTNET_JitDisasm environment variable to the name of the method we
care about, in this case "Main" (the exact syntax allowed is more
permissive and allows for some use of wildcards, optional namespace
and class names, etc.). As I'm using PowerShell, that means:
$env:DOTNET_JitDisasm="Main"
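(or, with bash on Linux or macOS: export DOTNET_JitDisasm=Main)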
and then running the app. You should see code like this output to the
console:
; Assembly listing for method Program:Main()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
G_M000_IG01: ;; offset=0000H
55 push rbp
4883EC20 sub rsp, 32
488D6C2420 lea rbp, [rsp+20H]
G_M000_IG02: ;; offset=000AH
48B9D820400A8E010000 mov rcx, 0x18E0A4020D8
488B09 mov rcx, gword ptr [rcx]
FF1583B31000 call [Console:WriteLine(String)]
90 nop
G_M000_IG03: ;; offset=001EH
4883C420 add rsp, 32
5D pop rbp
C3 ret
; Total bytes of code 36
Hello, world!
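The filter needn't be a single exact method name, either; as noted, wildcards are permitted, e.g. in PowerShell:
$env:DOTNET_JitDisasm="Program:*"
will dump the assembly for every method on the Program class (we'll use exactly that later when looking at on-stack replacement).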
This is immeasurably helpful for performance analysis and tuning,
even for questions as simple as "did my function get inlined" or "is
this code I expected to be optimized away actually getting optimized
away." Throughout the rest of this post, I'll include assembly
snippets generated by one of these two mechanisms, in order to help
exemplify concepts.
Note that it can sometimes be a little confusing figuring out what
name to specify as the value for DOTNET_JitDisasm, especially when
the method you care about is one that the C# compiler names or name
mangles (since the JIT only sees the IL and metadata, not the
original C#), e.g. the name of the entry point method for a program
with top-level statements, the names of local functions, etc. To both
help with this and to provide a really valuable top-level view of the
work the JIT is doing, .NET 7 also supports the new
DOTNET_JitDisasmSummary environment variable (introduced in dotnet/
runtime#74090). Set that to "1", and it'll result in the JIT emitting
a line every time it compiles a method, including the name of that
method which is copy/pasteable with DOTNET_JitDisasm. This feature is
useful in-and-of-itself, however, as it can quickly highlight for you
what's being compiled, when, and with what settings. For example, if
I set the environment variable and then run a "hello, world" console
app, I get this output:
1: JIT compiled CastHelpers:StelemRef(Array,long,Object) [Tier1, IL size=88, code size=93]
2: JIT compiled CastHelpers:LdelemaRef(Array,long,long):byref [Tier1, IL size=44, code size=44]
3: JIT compiled SpanHelpers:IndexOfNullCharacter(byref):int [Tier1, IL size=792, code size=388]
4: JIT compiled Program:Main() [Tier0, IL size=11, code size=36]
5: JIT compiled ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long [Tier0, IL size=490, code size=1187]
Hello, world!
We can see for "hello, world" there are only 5 methods that actually
get JIT compiled. There are of course many more methods that get
executed as part of a simple "hello, world," but almost all of them
have precompiled native code available as part of the "Ready To Run"
(R2R) images of the core libraries. The first three in the above list
(StelemRef, LdelemaRef, and IndexOfNullCharacter) don't because they
explicitly opted-out of R2R via use of the [MethodImpl
(MethodImplOptions.AggressiveOptimization)] attribute (despite the
name, this attribute should almost never be used, and is only used
for very specific reasons in a few very specific places in the core
libraries). Then there's our Main method. And lastly there's the
NarrowUtf16ToAscii method, which doesn't have R2R code, either, due
to using the variable-width Vector (more on that later). Every
other method that's run doesn't require JIT'ing. If we instead first
set the DOTNET_ReadyToRun environment variable to 0, the list is much
longer, and gives you a very good sense of what the JIT needs to do
on startup (and why technologies like R2R are important for startup
time). Note how many methods get compiled before "hello, world" is
output:
1: JIT compiled CastHelpers:StelemRef(Array,long,Object) [Tier1, IL size=88, code size=93]
2: JIT compiled CastHelpers:LdelemaRef(Array,long,long):byref [Tier1, IL size=44, code size=44]
3: JIT compiled AppContext:Setup(long,long,int) [Tier0, IL size=68, code size=275]
4: JIT compiled Dictionary`2:.ctor(int):this [Tier0, IL size=9, code size=40]
5: JIT compiled Dictionary`2:.ctor(int,IEqualityComparer`1):this [Tier0, IL size=102, code size=444]
6: JIT compiled Object:.ctor():this [Tier0, IL size=1, code size=10]
7: JIT compiled Dictionary`2:Initialize(int):int:this [Tier0, IL size=56, code size=231]
8: JIT compiled HashHelpers:GetPrime(int):int [Tier0, IL size=83, code size=379]
9: JIT compiled HashHelpers:.cctor() [Tier0, IL size=24, code size=102]
10: JIT compiled HashHelpers:GetFastModMultiplier(int):long [Tier0, IL size=9, code size=37]
11: JIT compiled Type:GetTypeFromHandle(RuntimeTypeHandle):Type [Tier0, IL size=8, code size=14]
12: JIT compiled Type:op_Equality(Type,Type):bool [Tier0, IL size=38, code size=143]
13: JIT compiled NonRandomizedStringEqualityComparer:GetStringComparer(Object):IEqualityComparer`1 [Tier0, IL size=39, code size=170]
14: JIT compiled NonRandomizedStringEqualityComparer:.cctor() [Tier0, IL size=46, code size=232]
15: JIT compiled EqualityComparer`1:get_Default():EqualityComparer`1 [Tier0, IL size=6, code size=36]
16: JIT compiled EqualityComparer`1:.cctor() [Tier0, IL size=26, code size=125]
17: JIT compiled ComparerHelpers:CreateDefaultEqualityComparer(Type):Object [Tier0, IL size=235, code size=949]
18: JIT compiled CastHelpers:ChkCastClass(long,Object):Object [Tier0, IL size=22, code size=72]
19: JIT compiled RuntimeHelpers:GetMethodTable(Object):long [Tier0, IL size=11, code size=33]
20: JIT compiled CastHelpers:IsInstanceOfClass(long,Object):Object [Tier0, IL size=97, code size=257]
21: JIT compiled GenericEqualityComparer`1:.ctor():this [Tier0, IL size=7, code size=31]
22: JIT compiled EqualityComparer`1:.ctor():this [Tier0, IL size=7, code size=31]
23: JIT compiled CastHelpers:ChkCastClassSpecial(long,Object):Object [Tier0, IL size=87, code size=246]
24: JIT compiled OrdinalComparer:.ctor(IEqualityComparer`1):this [Tier0, IL size=8, code size=39]
25: JIT compiled NonRandomizedStringEqualityComparer:.ctor(IEqualityComparer`1):this [Tier0, IL size=14, code size=52]
26: JIT compiled StringComparer:get_Ordinal():StringComparer [Tier0, IL size=6, code size=49]
27: JIT compiled OrdinalCaseSensitiveComparer:.cctor() [Tier0, IL size=11, code size=71]
28: JIT compiled OrdinalCaseSensitiveComparer:.ctor():this [Tier0, IL size=8, code size=33]
29: JIT compiled OrdinalComparer:.ctor(bool):this [Tier0, IL size=14, code size=43]
30: JIT compiled StringComparer:.ctor():this [Tier0, IL size=7, code size=31]
31: JIT compiled StringComparer:get_OrdinalIgnoreCase():StringComparer [Tier0, IL size=6, code size=49]
32: JIT compiled OrdinalIgnoreCaseComparer:.cctor() [Tier0, IL size=11, code size=71]
33: JIT compiled OrdinalIgnoreCaseComparer:.ctor():this [Tier0, IL size=8, code size=36]
34: JIT compiled OrdinalIgnoreCaseComparer:.ctor(IEqualityComparer`1):this [Tier0, IL size=8, code size=39]
35: JIT compiled CastHelpers:ChkCastAny(long,Object):Object [Tier0, IL size=38, code size=115]
36: JIT compiled CastHelpers:TryGet(long,long):int [Tier0, IL size=129, code size=308]
37: JIT compiled CastHelpers:TableData(ref):byref [Tier0, IL size=7, code size=31]
38: JIT compiled MemoryMarshal:GetArrayDataReference(ref):byref [Tier0, IL size=7, code size=24]
39: JIT compiled CastHelpers:KeyToBucket(byref,long,long):int [Tier0, IL size=38, code size=87]
40: JIT compiled CastHelpers:HashShift(byref):int [Tier0, IL size=3, code size=16]
41: JIT compiled BitOperations:RotateLeft(long,int):long [Tier0, IL size=17, code size=23]
42: JIT compiled CastHelpers:Element(byref,int):byref [Tier0, IL size=15, code size=33]
43: JIT compiled Volatile:Read(byref):int [Tier0, IL size=6, code size=16]
44: JIT compiled String:Ctor(long):String [Tier0, IL size=57, code size=155]
45: JIT compiled String:wcslen(long):int [Tier0, IL size=7, code size=31]
46: JIT compiled SpanHelpers:IndexOfNullCharacter(byref):int [Tier1, IL size=792, code size=388]
47: JIT compiled String:get_Length():int:this [Tier0, IL size=7, code size=17]
48: JIT compiled Buffer:Memmove(byref,byref,long) [Tier0, IL size=59, code size=102]
49: JIT compiled RuntimeHelpers:IsReferenceOrContainsReferences():bool [Tier0, IL size=2, code size=8]
50: JIT compiled Buffer:Memmove(byref,byref,long) [Tier0, IL size=480, code size=678]
51: JIT compiled Dictionary`2:Add(__Canon,__Canon):this [Tier0, IL size=11, code size=55]
52: JIT compiled Dictionary`2:TryInsert(__Canon,__Canon,ubyte):bool:this [Tier0, IL size=675, code size=2467]
53: JIT compiled OrdinalComparer:GetHashCode(String):int:this [Tier0, IL size=7, code size=37]
54: JIT compiled String:GetNonRandomizedHashCode():int:this [Tier0, IL size=110, code size=290]
55: JIT compiled BitOperations:RotateLeft(int,int):int [Tier0, IL size=17, code size=20]
56: JIT compiled Dictionary`2:GetBucket(int):byref:this [Tier0, IL size=29, code size=90]
57: JIT compiled HashHelpers:FastMod(int,int,long):int [Tier0, IL size=20, code size=70]
58: JIT compiled Type:get_IsValueType():bool:this [Tier0, IL size=7, code size=39]
59: JIT compiled RuntimeType:IsValueTypeImpl():bool:this [Tier0, IL size=54, code size=158]
60: JIT compiled RuntimeType:GetNativeTypeHandle():TypeHandle:this [Tier0, IL size=12, code size=48]
61: JIT compiled TypeHandle:.ctor(long):this [Tier0, IL size=8, code size=25]
62: JIT compiled TypeHandle:get_IsTypeDesc():bool:this [Tier0, IL size=14, code size=38]
63: JIT compiled TypeHandle:AsMethodTable():long:this [Tier0, IL size=7, code size=17]
64: JIT compiled MethodTable:get_IsValueType():bool:this [Tier0, IL size=20, code size=32]
65: JIT compiled GC:KeepAlive(Object) [Tier0, IL size=1, code size=10]
66: JIT compiled Buffer:_Memmove(byref,byref,long) [Tier0, IL size=25, code size=279]
67: JIT compiled Environment:InitializeCommandLineArgs(long,int,long):ref [Tier0, IL size=75, code size=332]
68: JIT compiled Environment:.cctor() [Tier0, IL size=11, code size=163]
69: JIT compiled StartupHookProvider:ProcessStartupHooks() [Tier-0 switched to FullOpts, IL size=365, code size=1053]
70: JIT compiled StartupHookProvider:get_IsSupported():bool [Tier0, IL size=18, code size=60]
71: JIT compiled AppContext:TryGetSwitch(String,byref):bool [Tier0, IL size=97, code size=322]
72: JIT compiled ArgumentException:ThrowIfNullOrEmpty(String,String) [Tier0, IL size=16, code size=53]
73: JIT compiled String:IsNullOrEmpty(String):bool [Tier0, IL size=15, code size=58]
74: JIT compiled AppContext:GetData(String):Object [Tier0, IL size=64, code size=205]
75: JIT compiled ArgumentNullException:ThrowIfNull(Object,String) [Tier0, IL size=10, code size=42]
76: JIT compiled Monitor:Enter(Object,byref) [Tier0, IL size=17, code size=55]
77: JIT compiled Dictionary`2:TryGetValue(__Canon,byref):bool:this [Tier0, IL size=39, code size=97]
78: JIT compiled Dictionary`2:FindValue(__Canon):byref:this [Tier0, IL size=391, code size=1466]
79: JIT compiled EventSource:.cctor() [Tier0, IL size=34, code size=80]
80: JIT compiled EventSource:InitializeIsSupported():bool [Tier0, IL size=18, code size=60]
81: JIT compiled RuntimeEventSource:.ctor():this [Tier0, IL size=55, code size=184]
82: JIT compiled Guid:.ctor(int,short,short,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):this [Tier0, IL size=86, code size=132]
83: JIT compiled EventSource:.ctor(Guid,String):this [Tier0, IL size=11, code size=90]
84: JIT compiled EventSource:.ctor(Guid,String,int,ref):this [Tier0, IL size=58, code size=187]
85: JIT compiled EventSource:get_IsSupported():bool [Tier0, IL size=6, code size=11]
86: JIT compiled TraceLoggingEventHandleTable:.ctor():this [Tier0, IL size=20, code size=67]
87: JIT compiled EventSource:ValidateSettings(int):int [Tier0, IL size=37, code size=147]
88: JIT compiled EventSource:Initialize(Guid,String,ref):this [Tier0, IL size=418, code size=1584]
89: JIT compiled Guid:op_Equality(Guid,Guid):bool [Tier0, IL size=10, code size=39]
90: JIT compiled Guid:EqualsCore(byref,byref):bool [Tier0, IL size=132, code size=171]
91: JIT compiled ActivityTracker:get_Instance():ActivityTracker [Tier0, IL size=6, code size=49]
92: JIT compiled ActivityTracker:.cctor() [Tier0, IL size=11, code size=71]
93: JIT compiled ActivityTracker:.ctor():this [Tier0, IL size=7, code size=31]
94: JIT compiled RuntimeEventSource:get_ProviderMetadata():ReadOnlySpan`1:this [Tier0, IL size=13, code size=91]
95: JIT compiled ReadOnlySpan`1:.ctor(long,int):this [Tier0, IL size=51, code size=115]
96: JIT compiled RuntimeHelpers:IsReferenceOrContainsReferences():bool [Tier0, IL size=2, code size=8]
97: JIT compiled ReadOnlySpan`1:get_Length():int:this [Tier0, IL size=7, code size=17]
98: JIT compiled OverrideEventProvider:.ctor(EventSource,int):this [Tier0, IL size=22, code size=68]
99: JIT compiled EventProvider:.ctor(int):this [Tier0, IL size=46, code size=194]
100: JIT compiled EtwEventProvider:.ctor():this [Tier0, IL size=7, code size=31]
101: JIT compiled EventProvider:Register(EventSource):this [Tier0, IL size=48, code size=186]
102: JIT compiled MulticastDelegate:CtorClosed(Object,long):this [Tier0, IL size=23, code size=70]
103: JIT compiled EventProvider:EventRegister(EventSource,EtwEnableCallback):int:this [Tier0, IL size=53, code size=154]
104: JIT compiled EventSource:get_Name():String:this [Tier0, IL size=7, code size=18]
105: JIT compiled EventSource:get_Guid():Guid:this [Tier0, IL size=7, code size=41]
106: JIT compiled EtwEventProvider:System.Diagnostics.Tracing.IEventProvider.EventRegister(EventSource,EtwEnableCallback,long,byref):int:this [Tier0, IL size=19, code size=71]
107: JIT compiled Advapi32:EventRegister(byref,EtwEnableCallback,long,byref):int [Tier0, IL size=53, code size=374]
108: JIT compiled Marshal:GetFunctionPointerForDelegate(__Canon):long [Tier0, IL size=17, code size=54]
109: JIT compiled Marshal:GetFunctionPointerForDelegate(Delegate):long [Tier0, IL size=18, code size=53]
110: JIT compiled EventPipeEventProvider:.ctor():this [Tier0, IL size=18, code size=41]
111: JIT compiled EventListener:get_EventListenersLock():Object [Tier0, IL size=41, code size=157]
112: JIT compiled List`1:.ctor(int):this [Tier0, IL size=47, code size=275]
113: JIT compiled Interlocked:CompareExchange(byref,__Canon,__Canon):__Canon [Tier0, IL size=9, code size=50]
114: JIT compiled NativeRuntimeEventSource:.cctor() [Tier0, IL size=11, code size=71]
115: JIT compiled NativeRuntimeEventSource:.ctor():this [Tier0, IL size=63, code size=184]
116: JIT compiled Guid:.ctor(int,ushort,ushort,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte,ubyte):this [Tier0, IL size=88, code size=132]
117: JIT compiled NativeRuntimeEventSource:get_ProviderMetadata():ReadOnlySpan`1:this [Tier0, IL size=13, code size=91]
118: JIT compiled EventPipeEventProvider:System.Diagnostics.Tracing.IEventProvider.EventRegister(EventSource,EtwEnableCallback,long,byref):int:this [Tier0, IL size=44, code size=118]
119: JIT compiled EventPipeInternal:CreateProvider(String,EtwEnableCallback):long [Tier0, IL size=43, code size=320]
120: JIT compiled Utf16StringMarshaller:GetPinnableReference(String):byref [Tier0, IL size=13, code size=50]
121: JIT compiled String:GetPinnableReference():byref:this [Tier0, IL size=7, code size=24]
122: JIT compiled EventListener:AddEventSource(EventSource) [Tier0, IL size=175, code size=560]
123: JIT compiled List`1:get_Count():int:this [Tier0, IL size=7, code size=17]
124: JIT compiled WeakReference`1:.ctor(__Canon):this [Tier0, IL size=9, code size=42]
125: JIT compiled WeakReference`1:.ctor(__Canon,bool):this [Tier0, IL size=15, code size=60]
126: JIT compiled List`1:Add(__Canon):this [Tier0, IL size=60, code size=124]
127: JIT compiled String:op_Inequality(String,String):bool [Tier0, IL size=11, code size=46]
128: JIT compiled String:Equals(String,String):bool [Tier0, IL size=36, code size=114]
129: JIT compiled ReadOnlySpan`1:GetPinnableReference():byref:this [Tier0, IL size=23, code size=57]
130: JIT compiled EventProvider:SetInformation(int,long,int):int:this [Tier0, IL size=38, code size=131]
131: JIT compiled ILStubClass:IL_STUB_PInvoke(long,int,long,int):int [FullOpts, IL size=62, code size=170]
132: JIT compiled Program:Main() [Tier0, IL size=11, code size=36]
133: JIT compiled Console:WriteLine(String) [Tier0, IL size=12, code size=59]
134: JIT compiled Console:get_Out():TextWriter [Tier0, IL size=20, code size=113]
135: JIT compiled Console:.cctor() [Tier0, IL size=11, code size=71]
136: JIT compiled Volatile:Read(byref):__Canon [Tier0, IL size=6, code size=21]
137: JIT compiled Console:g__EnsureInitialized|26_0():TextWriter [Tier0, IL size=63, code size=209]
138: JIT compiled ConsolePal:OpenStandardOutput():Stream [Tier0, IL size=34, code size=130]
139: JIT compiled Console:get_OutputEncoding():Encoding [Tier0, IL size=72, code size=237]
140: JIT compiled ConsolePal:get_OutputEncoding():Encoding [Tier0, IL size=11, code size=200]
141: JIT compiled NativeLibrary:LoadLibraryCallbackStub(String,Assembly,bool,int):long [Tier0, IL size=63, code size=280]
142: JIT compiled EncodingHelper:GetSupportedConsoleEncoding(int):Encoding [Tier0, IL size=53, code size=186]
143: JIT compiled Encoding:GetEncoding(int):Encoding [Tier0, IL size=340, code size=1025]
144: JIT compiled EncodingProvider:GetEncodingFromProvider(int):Encoding [Tier0, IL size=51, code size=232]
145: JIT compiled Encoding:FilterDisallowedEncodings(Encoding):Encoding [Tier0, IL size=29, code size=84]
146: JIT compiled LocalAppContextSwitches:get_EnableUnsafeUTF7Encoding():bool [Tier0, IL size=16, code size=46]
147: JIT compiled LocalAppContextSwitches:GetCachedSwitchValue(String,byref):bool [Tier0, IL size=22, code size=76]
148: JIT compiled LocalAppContextSwitches:GetCachedSwitchValueInternal(String,byref):bool [Tier0, IL size=46, code size=168]
149: JIT compiled LocalAppContextSwitches:GetSwitchDefaultValue(String):bool [Tier0, IL size=32, code size=98]
150: JIT compiled String:op_Equality(String,String):bool [Tier0, IL size=8, code size=39]
151: JIT compiled Encoding:get_Default():Encoding [Tier0, IL size=6, code size=49]
152: JIT compiled Encoding:.cctor() [Tier0, IL size=12, code size=73]
153: JIT compiled UTF8EncodingSealed:.ctor(bool):this [Tier0, IL size=8, code size=40]
154: JIT compiled UTF8Encoding:.ctor(bool):this [Tier0, IL size=14, code size=43]
155: JIT compiled UTF8Encoding:.ctor():this [Tier0, IL size=12, code size=36]
156: JIT compiled Encoding:.ctor(int):this [Tier0, IL size=42, code size=152]
157: JIT compiled UTF8Encoding:SetDefaultFallbacks():this [Tier0, IL size=64, code size=212]
158: JIT compiled EncoderReplacementFallback:.ctor(String):this [Tier0, IL size=110, code size=360]
159: JIT compiled EncoderFallback:.ctor():this [Tier0, IL size=7, code size=31]
160: JIT compiled String:get_Chars(int):ushort:this [Tier0, IL size=29, code size=61]
161: JIT compiled Char:IsSurrogate(ushort):bool [Tier0, IL size=17, code size=43]
162: JIT compiled Char:IsBetween(ushort,ushort,ushort):bool [Tier0, IL size=12, code size=52]
163: JIT compiled DecoderReplacementFallback:.ctor(String):this [Tier0, IL size=110, code size=360]
164: JIT compiled DecoderFallback:.ctor():this [Tier0, IL size=7, code size=31]
165: JIT compiled Encoding:get_CodePage():int:this [Tier0, IL size=7, code size=17]
166: JIT compiled Encoding:get_UTF8():Encoding [Tier0, IL size=6, code size=49]
167: JIT compiled UTF8Encoding:.cctor() [Tier0, IL size=12, code size=76]
168: JIT compiled Volatile:Write(byref,__Canon) [Tier0, IL size=6, code size=32]
169: JIT compiled ConsolePal:GetStandardFile(int,int,bool):Stream [Tier0, IL size=50, code size=183]
170: JIT compiled ConsolePal:get_InvalidHandleValue():long [Tier0, IL size=7, code size=41]
171: JIT compiled IntPtr:.ctor(int):this [Tier0, IL size=9, code size=25]
172: JIT compiled ConsolePal:ConsoleHandleIsWritable(long):bool [Tier0, IL size=26, code size=68]
173: JIT compiled Kernel32:WriteFile(long,long,int,byref,long):int [Tier0, IL size=46, code size=294]
174: JIT compiled Marshal:SetLastSystemError(int) [Tier0, IL size=7, code size=40]
175: JIT compiled Marshal:GetLastSystemError():int [Tier0, IL size=6, code size=34]
176: JIT compiled WindowsConsoleStream:.ctor(long,int,bool):this [Tier0, IL size=37, code size=90]
177: JIT compiled ConsoleStream:.ctor(int):this [Tier0, IL size=31, code size=71]
178: JIT compiled Stream:.ctor():this [Tier0, IL size=7, code size=31]
179: JIT compiled MarshalByRefObject:.ctor():this [Tier0, IL size=7, code size=31]
180: JIT compiled Kernel32:GetFileType(long):int [Tier0, IL size=27, code size=217]
181: JIT compiled Console:CreateOutputWriter(Stream):TextWriter [Tier0, IL size=50, code size=230]
182: JIT compiled Stream:.cctor() [Tier0, IL size=11, code size=71]
183: JIT compiled NullStream:.ctor():this [Tier0, IL size=7, code size=31]
184: JIT compiled EncodingExtensions:RemovePreamble(Encoding):Encoding [Tier0, IL size=25, code size=118]
185: JIT compiled UTF8EncodingSealed:get_Preamble():ReadOnlySpan`1:this [Tier0, IL size=24, code size=99]
186: JIT compiled UTF8Encoding:get_PreambleSpan():ReadOnlySpan`1 [Tier0, IL size=12, code size=87]
187: JIT compiled ConsoleEncoding:.ctor(Encoding):this [Tier0, IL size=14, code size=52]
188: JIT compiled Encoding:.ctor():this [Tier0, IL size=8, code size=33]
189: JIT compiled Encoding:SetDefaultFallbacks():this [Tier0, IL size=23, code size=65]
190: JIT compiled EncoderFallback:get_ReplacementFallback():EncoderFallback [Tier0, IL size=6, code size=49]
191: JIT compiled EncoderReplacementFallback:.cctor() [Tier0, IL size=11, code size=71]
192: JIT compiled EncoderReplacementFallback:.ctor():this [Tier0, IL size=12, code size=44]
193: JIT compiled DecoderFallback:get_ReplacementFallback():DecoderFallback [Tier0, IL size=6, code size=49]
194: JIT compiled DecoderReplacementFallback:.cctor() [Tier0, IL size=11, code size=71]
195: JIT compiled DecoderReplacementFallback:.ctor():this [Tier0, IL size=12, code size=44]
196: JIT compiled StreamWriter:.ctor(Stream,Encoding,int,bool):this [Tier0, IL size=201, code size=564]
197: JIT compiled Task:get_CompletedTask():Task [Tier0, IL size=6, code size=49]
198: JIT compiled Task:.cctor() [Tier0, IL size=76, code size=316]
199: JIT compiled TaskFactory:.ctor():this [Tier0, IL size=7, code size=31]
200: JIT compiled Task`1:.ctor(bool,VoidTaskResult,int,CancellationToken):this [Tier0, IL size=21, code size=75]
201: JIT compiled Task:.ctor(bool,int,CancellationToken):this [Tier0, IL size=70, code size=181]
202: JIT compiled <>c:.cctor() [Tier0, IL size=11, code size=71]
203: JIT compiled <>c:.ctor():this [Tier0, IL size=7, code size=31]
204: JIT compiled TextWriter:.ctor(IFormatProvider):this [Tier0, IL size=36, code size=124]
205: JIT compiled TextWriter:.cctor() [Tier0, IL size=26, code size=108]
206: JIT compiled NullTextWriter:.ctor():this [Tier0, IL size=7, code size=31]
207: JIT compiled TextWriter:.ctor():this [Tier0, IL size=29, code size=103]
208: JIT compiled String:ToCharArray():ref:this [Tier0, IL size=52, code size=173]
209: JIT compiled MemoryMarshal:GetArrayDataReference(ref):byref [Tier0, IL size=7, code size=24]
210: JIT compiled ConsoleStream:get_CanWrite():bool:this [Tier0, IL size=7, code size=18]
211: JIT compiled ConsoleEncoding:GetEncoder():Encoder:this [Tier0, IL size=12, code size=57]
212: JIT compiled UTF8Encoding:GetEncoder():Encoder:this [Tier0, IL size=7, code size=63]
213: JIT compiled EncoderNLS:.ctor(Encoding):this [Tier0, IL size=37, code size=102]
214: JIT compiled Encoder:.ctor():this [Tier0, IL size=7, code size=31]
215: JIT compiled Encoding:get_EncoderFallback():EncoderFallback:this [Tier0, IL size=7, code size=18]
216: JIT compiled EncoderNLS:Reset():this [Tier0, IL size=24, code size=92]
217: JIT compiled ConsoleStream:get_CanSeek():bool:this [Tier0, IL size=2, code size=12]
218: JIT compiled StreamWriter:set_AutoFlush(bool):this [Tier0, IL size=25, code size=72]
219: JIT compiled StreamWriter:CheckAsyncTaskInProgress():this [Tier0, IL size=19, code size=47]
220: JIT compiled Task:get_IsCompleted():bool:this [Tier0, IL size=16, code size=40]
221: JIT compiled Task:IsCompletedMethod(int):bool [Tier0, IL size=11, code size=25]
222: JIT compiled StreamWriter:Flush(bool,bool):this [Tier0, IL size=272, code size=1127]
223: JIT compiled StreamWriter:ThrowIfDisposed():this [Tier0, IL size=15, code size=43]
224: JIT compiled Encoding:get_Preamble():ReadOnlySpan`1:this [Tier0, IL size=12, code size=70]
225: JIT compiled ConsoleEncoding:GetPreamble():ref:this [Tier0, IL size=6, code size=27]
226: JIT compiled Array:Empty():ref [Tier0, IL size=6, code size=49]
227: JIT compiled EmptyArray`1:.cctor() [Tier0, IL size=12, code size=52]
228: JIT compiled ReadOnlySpan`1:op_Implicit(ref):ReadOnlySpan`1 [Tier0, IL size=7, code size=79]
229: JIT compiled ReadOnlySpan`1:.ctor(ref):this [Tier0, IL size=33, code size=81]
230: JIT compiled MemoryMarshal:GetArrayDataReference(ref):byref [Tier0, IL size=7, code size=24]
231: JIT compiled ConsoleEncoding:GetMaxByteCount(int):int:this [Tier0, IL size=13, code size=63]
232: JIT compiled UTF8EncodingSealed:GetMaxByteCount(int):int:this [Tier0, IL size=20, code size=50]
233: JIT compiled Span`1:.ctor(long,int):this [Tier0, IL size=51, code size=115]
234: JIT compiled ReadOnlySpan`1:.ctor(ref,int,int):this [Tier0, IL size=65, code size=147]
235: JIT compiled Encoder:GetBytes(ReadOnlySpan`1,Span`1,bool):int:this [Tier0, IL size=44, code size=234]
236: JIT compiled MemoryMarshal:GetNonNullPinnableReference(ReadOnlySpan`1):byref [Tier0, IL size=30, code size=54]
237: JIT compiled ReadOnlySpan`1:get_Length():int:this [Tier0, IL size=7, code size=17]
238: JIT compiled MemoryMarshal:GetNonNullPinnableReference(Span`1):byref [Tier0, IL size=30, code size=54]
239: JIT compiled Span`1:get_Length():int:this [Tier0, IL size=7, code size=17]
240: JIT compiled EncoderNLS:GetBytes(long,int,long,int,bool):int:this [Tier0, IL size=92, code size=279]
241: JIT compiled ArgumentNullException:ThrowIfNull(long,String) [Tier0, IL size=12, code size=45]
242: JIT compiled Encoding:GetBytes(long,int,long,int,EncoderNLS):int:this [Tier0, IL size=57, code size=187]
243: JIT compiled EncoderNLS:get_HasLeftoverData():bool:this [Tier0, IL size=35, code size=105]
244: JIT compiled UTF8Encoding:GetBytesFast(long,int,long,int,byref):int:this [Tier0, IL size=33, code size=119]
245: JIT compiled Utf8Utility:TranscodeToUtf8(long,int,long,int,byref,byref):int [Tier0, IL size=1446, code size=3208]
246: JIT compiled Math:Min(int,int):int [Tier0, IL size=8, code size=28]
247: JIT compiled ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long [Tier0, IL size=490, code size=1187]
248: JIT compiled WindowsConsoleStream:Flush():this [Tier0, IL size=26, code size=56]
249: JIT compiled ConsoleStream:Flush():this [Tier0, IL size=1, code size=10]
250: JIT compiled TextWriter:Synchronized(TextWriter):TextWriter [Tier0, IL size=28, code size=121]
251: JIT compiled SyncTextWriter:.ctor(TextWriter):this [Tier0, IL size=14, code size=52]
252: JIT compiled SyncTextWriter:WriteLine(String):this [Tier0, IL size=13, code size=140]
253: JIT compiled StreamWriter:WriteLine(String):this [Tier0, IL size=20, code size=110]
254: JIT compiled String:op_Implicit(String):ReadOnlySpan`1 [Tier0, IL size=31, code size=171]
255: JIT compiled String:GetRawStringData():byref:this [Tier0, IL size=7, code size=24]
256: JIT compiled ReadOnlySpan`1:.ctor(byref,int):this [Tier0, IL size=15, code size=39]
257: JIT compiled StreamWriter:WriteSpan(ReadOnlySpan`1,bool):this [Tier0, IL size=368, code size=1036]
258: JIT compiled MemoryMarshal:GetReference(ReadOnlySpan`1):byref [Tier0, IL size=8, code size=17]
259: JIT compiled Buffer:MemoryCopy(long,long,long,long) [Tier0, IL size=21, code size=83]
260: JIT compiled Unsafe:ReadUnaligned(long):long [Tier0, IL size=10, code size=17]
261: JIT compiled ASCIIUtility:AllCharsInUInt64AreAscii(long):bool [Tier0, IL size=16, code size=38]
262: JIT compiled ASCIIUtility:NarrowFourUtf16CharsToAsciiAndWriteToBuffer(byref,long) [Tier0, IL size=107, code size=171]
263: JIT compiled Unsafe:WriteUnaligned(byref,int) [Tier0, IL size=11, code size=22]
264: JIT compiled Unsafe:ReadUnaligned(long):int [Tier0, IL size=10, code size=16]
265: JIT compiled ASCIIUtility:AllCharsInUInt32AreAscii(int):bool [Tier0, IL size=11, code size=25]
266: JIT compiled ASCIIUtility:NarrowTwoUtf16CharsToAsciiAndWriteToBuffer(byref,int) [Tier0, IL size=24, code size=35]
267: JIT compiled Span`1:Slice(int,int):Span`1:this [Tier0, IL size=39, code size=135]
268: JIT compiled Span`1:.ctor(byref,int):this [Tier0, IL size=15, code size=39]
269: JIT compiled Span`1:op_Implicit(Span`1):ReadOnlySpan`1 [Tier0, IL size=19, code size=90]
270: JIT compiled ReadOnlySpan`1:.ctor(byref,int):this [Tier0, IL size=15, code size=39]
271: JIT compiled WindowsConsoleStream:Write(ReadOnlySpan`1):this [Tier0, IL size=35, code size=149]
272: JIT compiled WindowsConsoleStream:WriteFileNative(long,ReadOnlySpan`1,bool):int [Tier0, IL size=107, code size=272]
273: JIT compiled ReadOnlySpan`1:get_IsEmpty():bool:this [Tier0, IL size=10, code size=24]
Hello, world!
274: JIT compiled AppContext:OnProcessExit() [Tier0, IL size=43, code size=161]
275: JIT compiled AssemblyLoadContext:OnProcessExit() [Tier0, IL size=101, code size=442]
276: JIT compiled EventListener:DisposeOnShutdown() [Tier0, IL size=150, code size=618]
277: JIT compiled List`1:.ctor():this [Tier0, IL size=18, code size=133]
278: JIT compiled List`1:.cctor() [Tier0, IL size=12, code size=129]
279: JIT compiled List`1:GetEnumerator():Enumerator:this [Tier0, IL size=7, code size=162]
280: JIT compiled Enumerator:.ctor(List`1):this [Tier0, IL size=39, code size=64]
281: JIT compiled Enumerator:MoveNext():bool:this [Tier0, IL size=81, code size=159]
282: JIT compiled Enumerator:get_Current():__Canon:this [Tier0, IL size=7, code size=22]
283: JIT compiled WeakReference`1:TryGetTarget(byref):bool:this [Tier0, IL size=24, code size=66]
284: JIT compiled List`1:AddWithResize(__Canon):this [Tier0, IL size=39, code size=85]
285: JIT compiled List`1:Grow(int):this [Tier0, IL size=53, code size=121]
286: JIT compiled List`1:set_Capacity(int):this [Tier0, IL size=86, code size=342]
287: JIT compiled CastHelpers:StelemRef_Helper(byref,long,Object) [Tier0, IL size=34, code size=104]
288: JIT compiled CastHelpers:StelemRef_Helper_NoCacheLookup(byref,long,Object) [Tier0, IL size=26, code size=111]
289: JIT compiled Enumerator:MoveNextRare():bool:this [Tier0, IL size=57, code size=80]
290: JIT compiled Enumerator:Dispose():this [Tier0, IL size=1, code size=14]
291: JIT compiled EventSource:Dispose():this [Tier0, IL size=14, code size=54]
292: JIT compiled EventSource:Dispose(bool):this [Tier0, IL size=124, code size=236]
293: JIT compiled EventProvider:Dispose():this [Tier0, IL size=14, code size=54]
294: JIT compiled EventProvider:Dispose(bool):this [Tier0, IL size=90, code size=230]
295: JIT compiled EventProvider:EventUnregister(long):this [Tier0, IL size=14, code size=50]
296: JIT compiled EtwEventProvider:System.Diagnostics.Tracing.IEventProvider.EventUnregister(long):int:this [Tier0, IL size=7, code size=181]
297: JIT compiled GC:SuppressFinalize(Object) [Tier0, IL size=18, code size=53]
298: JIT compiled EventPipeEventProvider:System.Diagnostics.Tracing.IEventProvider.EventUnregister(long):int:this [Tier0, IL size=13, code size=187]
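For reference, that longer listing was produced simply by setting both of the environment variables discussed above before running the app again, e.g. in PowerShell:
$env:DOTNET_ReadyToRun="0"
$env:DOTNET_JitDisasmSummary="1"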
With that out of the way, let's move on to actual performance
improvements, starting with on-stack replacement.
On-Stack Replacement
On-stack replacement (OSR) is one of the coolest features to hit the
JIT in .NET 7. But to really understand OSR, we first need to
understand tiered compilation, so a quick recap...
One of the issues a managed environment with a JIT compiler has to
deal with is tradeoffs between startup and throughput. Historically,
the job of an optimizing compiler is to, well, optimize, in order to
enable the best possible throughput of the application or service
once running. But such optimization takes analysis, takes time, and
performing all of that work then leads to increased startup time, as
all of the code on the startup path (e.g. all of the code that needs
to be run before a web server can serve the first request) needs to
be compiled. So a JIT compiler needs to make tradeoffs: better
throughput at the expense of longer startup time, or better startup
time at the expense of decreased throughput. For some kinds of apps
and services, the tradeoff is an easy call, e.g. if your service
starts up once and then runs for days, several extra seconds of
startup time doesn't matter, or if you're a console application
that's going to do a quick computation and exit, startup time is all
that matters. But how can the JIT know which scenario it's in, and do
we really want every developer having to know about these kinds of
settings and tradeoffs and configure every one of their applications
accordingly? One answer to this has been ahead-of-time compilation,
which has taken various forms in .NET. For example, all of the core
libraries are "crossgen"'d, meaning they've been run through a tool
that produces the previously mentioned R2R format, yielding binaries
that contain assembly code that needs only minor tweaks to actually
execute; not every method can have code generated for it, but enough
that it significantly reduces startup time. Of course, such
approaches have their own downsides, e.g. one of the promises of a
JIT compiler is it can take advantage of knowledge of the current
machine / process in order to best optimize, so for example the R2R
images have to assume a certain baseline instruction set (e.g. what
vectorizing instructions are available) whereas the JIT can see
what's actually available and use the best. "Tiered compilation"
provides another answer, one that's usable with or without these
other ahead-of-time (AOT) compilation solutions.
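(If you want to experiment with R2R for your own app's assemblies, it's exposed as a publish-time option; a minimal sketch, noting that PublishReadyToRun requires specifying a runtime identifier such as win-x64:
dotnet publish -c Release -r win-x64 -p:PublishReadyToRun=true
The core libraries come crossgen'd as noted, but your own assemblies only get R2R code if you opt in like this.)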
Tiered compilation enables the JIT to have its proverbial cake and
eat it, too. The idea is simple: allow the JIT to compile the same
code multiple times. The first time, the JIT can use as few
optimizations as make sense (a handful of optimizations can actually
make the JIT's own throughput faster, so those still make sense to
apply), producing fairly unoptimized assembly code but doing so
really quickly. And when it does so, it can add some instrumentation
into the assembly to track how often the methods are called. As it
turns out, many functions used on a startup path are invoked once or
maybe only a handful of times, and it would take more time to
optimize them than it does to just execute them unoptimized. Then,
when the method's instrumentation triggers some threshold, for
example a method having been executed 30 times, a work item gets
queued to recompile that method, but this time with all the
optimizations the JIT can throw at it. This is lovingly referred to
as "tiering up." Once that recompilation has completed, call sites to
the method are patched with the address of the newly highly optimized
assembly code, and future invocations will then take the fast path.
So, we get faster startup and faster sustained throughput. At least,
that's the hope.
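To see tiering in action yourself, here's a minimal sketch (my own illustration, not code from the .NET sources): run it with the DOTNET_JitDisasmSummary=1 variable described earlier, and you should see Compute reported first with [Tier0 ...] and then again with [Tier1 ...] once it's been invoked enough times (exact thresholds and output will vary).
using System;
using System.Diagnostics;

class TieringDemo
{
    // Small method invoked many times, making it a candidate for tier-up.
    static int Compute(int x) => (x * 3) + 1;

    static void Main()
    {
        int total = 0;
        var sw = Stopwatch.StartNew();
        while (sw.ElapsedMilliseconds < 2000) // keep invoking long enough for the background Tier1 recompile to happen
            total += Compute(total);
        Console.WriteLine(total);
    }
}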
A problem, however, is methods that don't fit this mold. While it's
certainly the case that many performance-sensitive methods are
relatively quick and executed many, many, many times, there's also a
large number of performance-sensitive methods that are executed just
a handful of times, or maybe even only once, but that take a very
long time to execute, maybe even the duration of the whole process:
methods with loops. As a result, by default tiered compilation hasn't
applied to such methods, though it can be enabled by setting the
DOTNET_TC_QuickJitForLoops environment variable to 1. We can see the
effect of this by trying this simple console app with .NET 6. With
the default settings, run this app:
using System;

class Program
{
    static void Main()
    {
        var sw = new System.Diagnostics.Stopwatch();
        while (true)
        {
            sw.Restart();
            for (int trial = 0; trial < 10_000; trial++)
            {
                int count = 0;
                for (int i = 0; i < char.MaxValue; i++)
                    if (IsAsciiDigit((char)i))
                        count++;
            }
            sw.Stop();
            Console.WriteLine(sw.Elapsed);
        }

        static bool IsAsciiDigit(char c) => (uint)(c - '0') <= 9;
    }
}
I get numbers printed out like:
00:00:00.5734352
00:00:00.5526667
00:00:00.5675267
00:00:00.5588724
00:00:00.5616028
Now, try setting DOTNET_TC_QuickJitForLoops to 1. When I then run it
again, I get numbers like this:
00:00:01.2841397
00:00:01.2693485
00:00:01.2755646
00:00:01.2656678
00:00:01.2679925
In other words, with DOTNET_TC_QuickJitForLoops enabled, it's taking
2.5x as long as without (the default in .NET 6). That's because this
main function never gets optimizations applied to it. By setting
DOTNET_TC_QuickJitForLoops to 1, we're saying "JIT, please apply
tiering to methods with loops as well," but this method with a loop
is only ever invoked once, so for the duration of the process it ends
up remaining at "tier-0," aka unoptimized. Now, let's try the same
thing with .NET 7. Regardless of whether that environment variable is
set, I again get numbers like this:
00:00:00.5528889
00:00:00.5562563
00:00:00.5622086
00:00:00.5668220
00:00:00.5589112
but importantly, this method was still participating in tiering. In
fact, we can get confirmation of that by using the aforementioned
DOTNET_JitDisasmSummary=1 environment variable. When I set that and
run again, I see these lines in the output:
4: JIT compiled Program:Main() [Tier0, IL size=83, code size=319]
...
6: JIT compiled Program:Main() [Tier1-OSR @0x27, IL size=83, code size=380]
highlighting that Main was indeed compiled twice. How is that
possible? On-stack replacement.
The idea behind on-stack replacement is a method can be replaced not
just between invocations but even while it's executing, while it's
"on the stack." In addition to the tier-0 code being instrumented for
call counts, loops are also instrumented for iteration counts. When
the iterations surpass a certain limit, the JIT compiles a new highly
optimized version of that method, transfers all the local/register
state from the current invocation to the new invocation, and then
jumps to the appropriate location in the new method. We can see this
in action by using the previously discussed DOTNET_JitDisasm
environment variable. Set that to Program:* in order to see the
assembly code generated for all of the methods in the Program class,
and then run the app again. You should see output like the following:
; Assembly listing for method Program:Main()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
G_M000_IG01: ;; offset=0000H
55 push rbp
4881EC80000000 sub rsp, 128
488DAC2480000000 lea rbp, [rsp+80H]
C5D857E4 vxorps xmm4, xmm4
C5F97F65B0 vmovdqa xmmword ptr [rbp-50H], xmm4
33C0 xor eax, eax
488945C0 mov qword ptr [rbp-40H], rax
G_M000_IG02: ;; offset=001FH
48B9002F0B50FC7F0000 mov rcx, 0x7FFC500B2F00
E8721FB25F call CORINFO_HELP_NEWSFAST
488945B0 mov gword ptr [rbp-50H], rax
488B4DB0 mov rcx, gword ptr [rbp-50H]
FF1544C70D00 call [Stopwatch:.ctor():this]
488B4DB0 mov rcx, gword ptr [rbp-50H]
48894DC0 mov gword ptr [rbp-40H], rcx
C745A8E8030000 mov dword ptr [rbp-58H], 0x3E8
G_M000_IG03: ;; offset=004BH
8B4DA8 mov ecx, dword ptr [rbp-58H]
FFC9 dec ecx
894DA8 mov dword ptr [rbp-58H], ecx
837DA800 cmp dword ptr [rbp-58H], 0
7F0E jg SHORT G_M000_IG05
G_M000_IG04: ;; offset=0059H
488D4DA8 lea rcx, [rbp-58H]
BA06000000 mov edx, 6
E8B985AB5F call CORINFO_HELP_PATCHPOINT
G_M000_IG05: ;; offset=0067H
488B4DC0 mov rcx, gword ptr [rbp-40H]
3909 cmp dword ptr [rcx], ecx
FF1585C70D00 call [Stopwatch:Restart():this]
33C9 xor ecx, ecx
894DBC mov dword ptr [rbp-44H], ecx
33C9 xor ecx, ecx
894DB8 mov dword ptr [rbp-48H], ecx
EB20 jmp SHORT G_M000_IG08
G_M000_IG06: ;; offset=007FH
8B4DB8 mov ecx, dword ptr [rbp-48H]
0FB7C9 movzx rcx, cx
FF152DD40B00 call [Program:g__IsAsciiDigit|0_0(ushort):bool]
85C0 test eax, eax
7408 je SHORT G_M000_IG07
8B4DBC mov ecx, dword ptr [rbp-44H]
FFC1 inc ecx
894DBC mov dword ptr [rbp-44H], ecx
G_M000_IG07: ;; offset=0097H
8B4DB8 mov ecx, dword ptr [rbp-48H]
FFC1 inc ecx
894DB8 mov dword ptr [rbp-48H], ecx
G_M000_IG08: ;; offset=009FH
8B4DA8 mov ecx, dword ptr [rbp-58H]
FFC9 dec ecx
894DA8 mov dword ptr [rbp-58H], ecx
837DA800 cmp dword ptr [rbp-58H], 0
7F0E jg SHORT G_M000_IG10
G_M000_IG09: ;; offset=00ADH
488D4DA8 lea rcx, [rbp-58H]
BA23000000 mov edx, 35
E86585AB5F call CORINFO_HELP_PATCHPOINT
G_M000_IG10: ;; offset=00BBH
817DB800CA9A3B cmp dword ptr [rbp-48H], 0x3B9ACA00
7CBB jl SHORT G_M000_IG06
488B4DC0 mov rcx, gword ptr [rbp-40H]
3909 cmp dword ptr [rcx], ecx
FF1570C70D00 call [Stopwatch:get_ElapsedMilliseconds():long:this]
488BC8 mov rcx, rax
FF1507D00D00 call [Console:WriteLine(long)]
E96DFFFFFF jmp G_M000_IG03
; Total bytes of code 222
; Assembly listing for method Program:g__IsAsciiDigit|0_0(ushort):bool
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
G_M000_IG01: ;; offset=0000H
55 push rbp
488BEC mov rbp, rsp
894D10 mov dword ptr [rbp+10H], ecx
G_M000_IG02: ;; offset=0007H
8B4510 mov eax, dword ptr [rbp+10H]
0FB7C0 movzx rax, ax
83C0D0 add eax, -48
83F809 cmp eax, 9
0F96C0 setbe al
0FB6C0 movzx rax, al
G_M000_IG03: ;; offset=0019H
5D pop rbp
C3 ret
A few relevant things to notice here. First, the comments at the top
highlight how this code was compiled:
; Tier-0 compilation
; MinOpts code
So, we know this is the initial version ("Tier-0") of the method
compiled with minimal optimization ("MinOpts"). Second, note this
line of the assembly:
FF152DD40B00 call [Program:g__IsAsciiDigit|0_0(ushort):bool]
Our IsAsciiDigit helper method is trivially inlineable, but it's not
getting inlined; instead, the assembly has a call to it, and indeed
we can see below the generated code (also "MinOpts") for
IsAsciiDigit. Why? Because inlining is an optimization (a really
important one) that's disabled as part of tier-0 (because the
analysis for doing inlining well is also quite costly). Third, we can
see the code the JIT is outputting to instrument this method. This is
a bit more involved, but I'll point out the relevant parts. First, we
see:
C745A8E8030000 mov dword ptr [rbp-58H], 0x3E8
That 0x3E8 is the hex value for the decimal 1,000, which is the
default number of iterations a loop needs to iterate before the JIT
will generate the optimized version of the method (this is
configurable via the DOTNET_TC_OnStackReplacement_InitialCounter
environment variable). So we see 1,000 being stored into this stack
location. Then a bit later in the method we see this:
G_M000_IG03: ;; offset=004BH
8B4DA8 mov ecx, dword ptr [rbp-58H]
FFC9 dec ecx
894DA8 mov dword ptr [rbp-58H], ecx
837DA800 cmp dword ptr [rbp-58H], 0
7F0E jg SHORT G_M000_IG05
G_M000_IG04: ;; offset=0059H
488D4DA8 lea rcx, [rbp-58H]
BA06000000 mov edx, 6
E8B985AB5F call CORINFO_HELP_PATCHPOINT
G_M000_IG05: ;; offset=0067H
The generated code is loading that counter into the ecx register,
decrementing it, storing it back, and then seeing whether the counter
dropped to 0. If it didn't, the code skips to G_M000_IG05, which is
the label for the actual code in the rest of the loop. But if the
counter did drop to 0, the JIT proceeds to store relevant state into
the rcx and edx registers and then calls the
CORINFO_HELP_PATCHPOINT helper method. That helper is responsible for
triggering the creation of the optimized method if it doesn't yet
exist, fixing up all appropriate tracking state, and jumping to the
new method. And indeed, if you look again at your console output from
running the program, you'll see yet another output for the Main
method:
; Assembly listing for method Program:Main()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; OSR variant for entry point 0x23
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 1 inlinees with PGO data; 8 single block inlinees; 0 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
4883EC58 sub rsp, 88
4889BC24D8000000 mov qword ptr [rsp+D8H], rdi
4889B424D0000000 mov qword ptr [rsp+D0H], rsi
48899C24C8000000 mov qword ptr [rsp+C8H], rbx
C5F877 vzeroupper
33C0 xor eax, eax
4889442428 mov qword ptr [rsp+28H], rax
4889442420 mov qword ptr [rsp+20H], rax
488B9C24A0000000 mov rbx, gword ptr [rsp+A0H]
8BBC249C000000 mov edi, dword ptr [rsp+9CH]
8BB42498000000 mov esi, dword ptr [rsp+98H]
G_M000_IG02: ;; offset=0041H
EB45 jmp SHORT G_M000_IG05
align [0 bytes for IG06]
G_M000_IG03: ;; offset=0043H
33C9 xor ecx, ecx
488B9C24A0000000 mov rbx, gword ptr [rsp+A0H]
48894B08 mov qword ptr [rbx+08H], rcx
488D4C2428 lea rcx, [rsp+28H]
48B87066E68AFD7F0000 mov rax, 0x7FFD8AE66670
G_M000_IG04: ;; offset=0060H
FFD0 call rax ; Kernel32:QueryPerformanceCounter(long):int
488B442428 mov rax, qword ptr [rsp+28H]
488B9C24A0000000 mov rbx, gword ptr [rsp+A0H]
48894310 mov qword ptr [rbx+10H], rax
C6431801 mov byte ptr [rbx+18H], 1
33FF xor edi, edi
33F6 xor esi, esi
833D92A1E55F00 cmp dword ptr [(reloc 0x7ffcafe1ae34)], 0
0F85CA000000 jne G_M000_IG13
G_M000_IG05: ;; offset=0088H
81FE00CA9A3B cmp esi, 0x3B9ACA00
7D17 jge SHORT G_M000_IG09
G_M000_IG06: ;; offset=0090H
0FB7CE movzx rcx, si
83C1D0 add ecx, -48
83F909 cmp ecx, 9
7702 ja SHORT G_M000_IG08
G_M000_IG07: ;; offset=009BH
FFC7 inc edi
G_M000_IG08: ;; offset=009DH
FFC6 inc esi
81FE00CA9A3B cmp esi, 0x3B9ACA00
7CE9 jl SHORT G_M000_IG06
G_M000_IG09: ;; offset=00A7H
488B6B08 mov rbp, qword ptr [rbx+08H]
48899C24A0000000 mov gword ptr [rsp+A0H], rbx
807B1800 cmp byte ptr [rbx+18H], 0
7436 je SHORT G_M000_IG12
G_M000_IG10: ;; offset=00B9H
488D4C2420 lea rcx, [rsp+20H]
48B87066E68AFD7F0000 mov rax, 0x7FFD8AE66670
G_M000_IG11: ;; offset=00C8H
FFD0 call rax ; Kernel32:QueryPerformanceCounter(long):int
488B4C2420 mov rcx, qword ptr [rsp+20H]
488B9C24A0000000 mov rbx, gword ptr [rsp+A0H]
482B4B10 sub rcx, qword ptr [rbx+10H]
4803E9 add rbp, rcx
833D2FA1E55F00 cmp dword ptr [(reloc 0x7ffcafe1ae34)], 0
48899C24A0000000 mov gword ptr [rsp+A0H], rbx
756D jne SHORT G_M000_IG14
G_M000_IG12: ;; offset=00EFH
C5F857C0 vxorps xmm0, xmm0
C4E1FB2AC5 vcvtsi2sd xmm0, rbp
C5FB11442430 vmovsd qword ptr [rsp+30H], xmm0
48B9F04BF24FFC7F0000 mov rcx, 0x7FFC4FF24BF0
BAE7070000 mov edx, 0x7E7
E82E1FB25F call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
C5FB10442430 vmovsd xmm0, qword ptr [rsp+30H]
C5FB5905E049F6FF vmulsd xmm0, xmm0, qword ptr [(reloc 0x7ffc4ff25720)]
C4E1FB2CD0 vcvttsd2si rdx, xmm0
48B94B598638D6C56D34 mov rcx, 0x346DC5D63886594B
488BC1 mov rax, rcx
48F7EA imul rdx:rax, rdx
488BCA mov rcx, rdx
48C1E93F shr rcx, 63
48C1FA0B sar rdx, 11
4803CA add rcx, rdx
FF1567CE0D00 call [Console:WriteLine(long)]
E9F5FEFFFF jmp G_M000_IG03
G_M000_IG13: ;; offset=014EH
E8DDCBAC5F call CORINFO_HELP_POLL_GC
E930FFFFFF jmp G_M000_IG05
G_M000_IG14: ;; offset=0158H
E8D3CBAC5F call CORINFO_HELP_POLL_GC
EB90 jmp SHORT G_M000_IG12
; Total bytes of code 351
Here, again, we notice a few interesting things. First, in the header
we see this:
; Tier-1 compilation
; OSR variant for entry point 0x23
; optimized code
so we know this is both optimized "tier-1" code and is the "OSR
variant" for this method. Second, notice there's no longer a call to
the IsAsciiDigit helper. Instead, where that call would have been, we
see this:
G_M000_IG06: ;; offset=0090H
0FB7CE movzx rcx, si
83C1D0 add ecx, -48
83F909 cmp ecx, 9
7702 ja SHORT G_M000_IG08
This is loading a value into rcx, subtracting 48 from it (48 is the
decimal ASCII value of the '0' character) and comparing the resulting
value to 9. Sounds an awful lot like our IsAsciiDigit implementation
((uint)(c - '0') <= 9), doesn't it? That's because it is. The helper
was successfully inlined in this now-optimized code.
Great, so now in .NET 7, we can largely avoid the tradeoffs between
startup and throughput, as OSR enables tiered compilation to apply to
all methods, even those that are long-running. A multitude of PRs
went into enabling this, including many over the last few years,
though until now the functionality remained disabled in the shipping
bits. Thanks to
improvements like dotnet/runtime#62831 which implemented support for
OSR on Arm64 (previously only x64 support was implemented), and
dotnet/runtime#63406 and dotnet/runtime#65609 which revised how OSR
imports and epilogs are handled, dotnet/runtime#65675 enables OSR
(and as a result DOTNET_TC_QuickJitForLoops) by default.
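To make the mechanism concrete, here's a rough C# sketch (purely
conceptual; the real work happens in generated assembly and in the
runtime's CORINFO_HELP_PATCHPOINT helper) of what the tier-0
instrumentation we just walked through effectively does to a loop:
// Conceptual sketch of tier-0 OSR instrumentation, not actual runtime code.
int patchpointCounter = 1_000; // DOTNET_TC_OnStackReplacement_InitialCounter
for (int i = 0; i < 1_000_000_000; i++)
{
    if (--patchpointCounter <= 0)
    {
        // CORINFO_HELP_PATCHPOINT: create the optimized OSR version of this
        // method if it doesn't yet exist, fix up tracking state, and jump
        // into it, carrying over the live locals (i, the counter, etc.).
    }
    // ... original loop body ...
}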
But, tiered compilation and OSR aren't just about startup (though
they're of course very valuable there). They're also about further
improving throughput. Even though tiered compilation was originally
envisioned as a way to optimize startup while not hurting throughput,
it's become much more than that. There are various things the JIT can
learn about a method during tier-0 that it can then use for tier-1.
For example, the very fact that the tier-0 code executed means that
any statics accessed by the method will have been initialized, and
that means that any readonly statics will not only have been
initialized by the time the tier-1 code executes but their values
won't ever change. And that in turn means that any readonly statics
of primitive types (e.g. bool, int, etc.) can be treated like consts
instead of static readonly fields, and during tier-1 compilation the
JIT can optimize them just as it would have optimized a const. For
example, try running this simple program after setting
DOTNET_JitDisasm to Program:Test:
using System.Runtime.CompilerServices;
class Program
{
static readonly bool Is64Bit = Environment.Is64BitProcess;
static int Main()
{
int count = 0;
for (int i = 0; i < 1_000_000_000; i++)
if (Test())
count++;
return count;
}
[MethodImpl(MethodImplOptions.NoInlining)]
static bool Test() => Is64Bit;
}
When I do so, I get this output:
; Assembly listing for method Program:Test():bool
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible
G_M000_IG01: ;; offset=0000H
55 push rbp
4883EC20 sub rsp, 32
488D6C2420 lea rbp, [rsp+20H]
G_M000_IG02: ;; offset=000AH
48B9B8639A3FFC7F0000 mov rcx, 0x7FFC3F9A63B8
BA01000000 mov edx, 1
E8C220B25F call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
0FB60545580C00 movzx rax, byte ptr [(reloc 0x7ffc3f9a63ea)]
G_M000_IG03: ;; offset=0025H
4883C420 add rsp, 32
5D pop rbp
C3 ret
; Total bytes of code 43
; Assembly listing for method Program:Test():bool
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
G_M000_IG01: ;; offset=0000H
G_M000_IG02: ;; offset=0000H
B801000000 mov eax, 1
G_M000_IG03: ;; offset=0005H
C3 ret
; Total bytes of code 6
Note, again, we see two outputs for Program:Test. First, we see the
"Tier-0" code, which is accessing a static (note the call
CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE instruction). But then we see
the "Tier-1" code, where all of that overhead has vanished and is
instead replaced simply by mov eax, 1. Since the "Tier-0" code had to
have executed in order for it to tier up, the "Tier-1" code was
generated knowing that the value of the static readonly bool Is64Bit
field was true (1), and so the entirety of this method is storing the
value 1 into the eax register used for the return value.
This is so useful that components are now written with tiering in
mind. Consider the new Regex source generator, which is discussed
later in this post (Roslyn source generators were introduced a couple
of years ago; just as how Roslyn analyzers are able to plug into the
compiler and surface additional diagnostics based on all of the data
the compiler learns from the source code, Roslyn source generators
are able to analyze that same data and then further augment the
compilation unit with additional source). The Regex source generator
applies a technique based on this in dotnet/runtime#67775. Regex
supports setting a process-wide timeout that gets applied to Regex
instances that don't explicitly set a timeout. That means, even
though it's super rare for such a process-wide timeout to be set, the
Regex source generator still needs to output timeout-related code
just in case it's needed. It does so by outputting some helpers like
this:
static class Utilities
{
internal static readonly TimeSpan s_defaultTimeout = AppContext.GetData("REGEX_DEFAULT_MATCH_TIMEOUT") is TimeSpan timeout ? timeout : Timeout.InfiniteTimeSpan;
internal static readonly bool s_hasTimeout = s_defaultTimeout != Timeout.InfiniteTimeSpan;
}
which it then uses at call sites like this:
if (Utilities.s_hasTimeout)
{
base.CheckTimeout();
}
In tier-0, these checks will still be emitted in the assembly code,
but in tier-1 where throughput matters, if the relevant AppContext
switch hasn't been set, then s_defaultTimeout will be
Timeout.InfiniteTimeSpan, at which point s_hasTimeout will be false.
And since s_hasTimeout is a static readonly bool, the JIT will be
able to treat that as a const, and all conditions like if
(Utilities.s_hasTimeout) will be treated as equivalent to if (false) and be
eliminated from the assembly code entirely as dead code.
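You can watch the same effect yourself with a small repro (this is my
own hypothetical example, including the made-up MY_FLAG environment
variable). Set DOTNET_JitDisasm to Program:Work, and the tier-1 code
for Work should show no trace of the branch when the flag isn't set:
using System.Runtime.CompilerServices;

class Program
{
    // false unless the (hypothetical) MY_FLAG environment variable is "1"
    static readonly bool s_flag = Environment.GetEnvironmentVariable("MY_FLAG") == "1";

    static void Main()
    {
        long total = 0;
        for (int i = 0; i < 1_000_000_000; i++) total += Work(i);
        Console.WriteLine(total);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static long Work(int i)
    {
        // In tier-1, with s_flag treated as the constant false, this branch
        // should be eliminated entirely as dead code.
        if (s_flag) return SlowPath(i);
        return i;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static long SlowPath(int i) => i * 2L;
}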
But, this is somewhat old news. The JIT has been able to do such an
optimization since tiered compilation was introduced in .NET Core
3.0. Now in .NET 7, though, with OSR it's also able to do so by
default for methods with loops (and thus enable cases like the regex
one). However, the real magic of OSR comes into play when combined
with another exciting feature: dynamic PGO.
PGO
I wrote about profile-guided optimization (PGO) in my Performance
Improvements in .NET 6 post, but I'll cover it again here as it's
seen a multitude of improvements for .NET 7.
PGO has been around for a long time, in any number of languages and
compilers. The basic idea is you compile your app, asking the
compiler to inject instrumentation into the application to track
various pieces of interesting information. You then put your app
through its paces, running through various common scenarios, causing
that instrumentation to "profile" what happens when the app is
executed, and the results of that are then saved out. The app is then
recompiled, feeding those instrumentation results back into the
compiler, and allowing it to optimize the app for exactly how it's
expected to be used. This approach to PGO is referred to as "static
PGO," as the information is all gleaned ahead of actual deployment,
and it's something .NET has been doing in various forms for years.
From my perspective, though, the really interesting development in
.NET is "dynamic PGO," which was introduced in .NET 6, but off by
default.
Dynamic PGO takes advantage of tiered compilation. I noted that the
JIT instruments the tier-0 code to track how many times the method is
called, or in the case of loops, how many times the loop executes. It
can instrument it for other things as well. For example, it can track
exactly which concrete types are used as the target of an interface
dispatch, and then in tier-1 specialize the code to expect the most
common types (this is referred to as "guarded devirtualization," or
GDV). You can see this in this little example. Set the
DOTNET_TieredPGO environment variable to 1, and then run this on .NET
7:
class Program
{
static void Main()
{
IPrinter printer = new Printer();
for (int i = 0; ; i++)
{
DoWork(printer, i);
}
}
static void DoWork(IPrinter printer, int i)
{
printer.PrintIfTrue(i == int.MaxValue);
}
interface IPrinter
{
void PrintIfTrue(bool condition);
}
class Printer : IPrinter
{
public void PrintIfTrue(bool condition)
{
if (condition) Console.WriteLine("Print!");
}
}
}
The tier-0 code for DoWork ends up looking like this:
G_M000_IG01: ;; offset=0000H
55 push rbp
4883EC30 sub rsp, 48
488D6C2430 lea rbp, [rsp+30H]
33C0 xor eax, eax
488945F8 mov qword ptr [rbp-08H], rax
488945F0 mov qword ptr [rbp-10H], rax
48894D10 mov gword ptr [rbp+10H], rcx
895518 mov dword ptr [rbp+18H], edx
G_M000_IG02: ;; offset=001BH
FF059F220F00 inc dword ptr [(reloc 0x7ffc3f1b2ea0)]
488B4D10 mov rcx, gword ptr [rbp+10H]
48894DF8 mov gword ptr [rbp-08H], rcx
488B4DF8 mov rcx, gword ptr [rbp-08H]
48BAA82E1B3FFC7F0000 mov rdx, 0x7FFC3F1B2EA8
E8B47EC55F call CORINFO_HELP_CLASSPROFILE32
488B4DF8 mov rcx, gword ptr [rbp-08H]
48894DF0 mov gword ptr [rbp-10H], rcx
488B4DF0 mov rcx, gword ptr [rbp-10H]
33D2 xor edx, edx
817D18FFFFFF7F cmp dword ptr [rbp+18H], 0x7FFFFFFF
0F94C2 sete dl
49BB0800F13EFC7F0000 mov r11, 0x7FFC3EF10008
41FF13 call [r11]IPrinter:PrintIfTrue(bool):this
90 nop
G_M000_IG03: ;; offset=0062H
4883C430 add rsp, 48
5D pop rbp
C3 ret
and most notably, you can see the call [r11]IPrinter:PrintIfTrue
(bool):this doing the interface dispatch. But, then look at the code
generated for tier-1. We still see the call [r11]IPrinter:PrintIfTrue
(bool):this, but we also see this:
G_M000_IG02: ;; offset=0020H
48B9982D1B3FFC7F0000 mov rcx, 0x7FFC3F1B2D98
48390F cmp qword ptr [rdi], rcx
7521 jne SHORT G_M000_IG05
81FEFFFFFF7F cmp esi, 0x7FFFFFFF
7404 je SHORT G_M000_IG04
G_M000_IG03: ;; offset=0037H
FFC6 inc esi
EBE5 jmp SHORT G_M000_IG02
G_M000_IG04: ;; offset=003BH
48B9D820801A24020000 mov rcx, 0x2241A8020D8
488B09 mov rcx, gword ptr [rcx]
FF1572CD0D00 call [Console:WriteLine(String)]
EBE7 jmp SHORT G_M000_IG03
That first block is checking the concrete type of the IPrinter
(stored in rdi) and comparing it against the known type for Printer
(0x7FFC3F1B2D98). If they're different, it just jumps to the same
interface dispatch it was doing in the unoptimized version. But if
they're the same, it then jumps directly to an inlined version of
Printer.PrintIfTrue (you can see the call to Console:WriteLine right
there in this method). Thus, the common case (the only case in this
example) is super efficient at the expense of a single comparison and
branch.
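In C# terms, the guarded devirtualization is roughly equivalent to
rewriting DoWork like this (a conceptual sketch; the actual guard is
an exact method-table comparison rather than an is test):
static void DoWork(IPrinter printer, int i)
{
    if (printer is Printer p)                    // guard: the profiled common type?
    {
        p.PrintIfTrue(i == int.MaxValue);        // devirtualized, inlined fast path
    }
    else
    {
        printer.PrintIfTrue(i == int.MaxValue);  // fallback: interface dispatch
    }
}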
That all existed in .NET 6, so why are we talking about it now?
Several things have improved. First, PGO now works with OSR, thanks
to improvements like dotnet/runtime#61453. That's a big deal, as it
means hot long-running methods that do this kind of interface
dispatch (which are fairly common) can get these kinds of
devirtualization/inlining optimizations. Second, while PGO isn't
currently enabled by default, we've made it much easier to turn on.
Between dotnet/runtime#71438 and dotnet/sdk#26350, it's now possible
to simply put <TieredPGO>true</TieredPGO> into your .csproj, and
it'll have the same effect as if you set DOTNET_TieredPGO=1 prior to
every invocation of the app, enabling dynamic PGO (note that it
doesn't disable use of R2R images, so if you want the entirety of the
core libraries also employing dynamic PGO, you'll also need to set
DOTNET_ReadyToRun=0). Third, dynamic PGO has been taught
how to instrument and optimize additional things.
PGO already knew how to instrument virtual dispatch. Now in .NET 7,
thanks in large part to dotnet/runtime#68703, it can do so for
delegates as well (at least for delegates to instance methods).
Consider this simple console app:
using System.Runtime.CompilerServices;
class Program
{
static int[] s_values = Enumerable.Range(0, 1_000).ToArray();
static void Main()
{
for (int i = 0; i < 1_000_000; i++)
Sum(s_values, i => i * 42);
}
[MethodImpl(MethodImplOptions.NoInlining)]
static int Sum(int[] values, Func<int, int> func)
{
int sum = 0;
foreach (int value in values)
sum += func(value);
return sum;
}
}
Without PGO enabled, I get generated optimized assembly like this:
; Assembly listing for method Program:Sum(ref,Func`2):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
G_M000_IG01: ;; offset=0000H
4156 push r14
57 push rdi
56 push rsi
55 push rbp
53 push rbx
4883EC20 sub rsp, 32
488BF2 mov rsi, rdx
G_M000_IG02: ;; offset=000DH
33FF xor edi, edi
488BD9 mov rbx, rcx
33ED xor ebp, ebp
448B7308 mov r14d, dword ptr [rbx+08H]
4585F6 test r14d, r14d
7E16 jle SHORT G_M000_IG04
G_M000_IG03: ;; offset=001DH
8BD5 mov edx, ebp
8B549310 mov edx, dword ptr [rbx+4*rdx+10H]
488B4E08 mov rcx, gword ptr [rsi+08H]
FF5618 call [rsi+18H]Func`2:Invoke(int):int:this
03F8 add edi, eax
FFC5 inc ebp
443BF5 cmp r14d, ebp
7FEA jg SHORT G_M000_IG03
G_M000_IG04: ;; offset=0033H
8BC7 mov eax, edi
G_M000_IG05: ;; offset=0035H
4883C420 add rsp, 32
5B pop rbx
5D pop rbp
5E pop rsi
5F pop rdi
415E pop r14
C3 ret
; Total bytes of code 64
Note the call [rsi+18H]Func`2:Invoke(int):int:this in there that's
invoking the delegate. Now with PGO enabled:
; Assembly listing for method Program:Sum(ref,Func`2):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with Dynamic PGO: edge weights are valid, and fgCalledCount is 5628
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
4157 push r15
4156 push r14
57 push rdi
56 push rsi
55 push rbp
53 push rbx
4883EC28 sub rsp, 40
488BF2 mov rsi, rdx
G_M000_IG02: ;; offset=000FH
33FF xor edi, edi
488BD9 mov rbx, rcx
33ED xor ebp, ebp
448B7308 mov r14d, dword ptr [rbx+08H]
4585F6 test r14d, r14d
7E27 jle SHORT G_M000_IG05
G_M000_IG03: ;; offset=001FH
8BC5 mov eax, ebp
8B548310 mov edx, dword ptr [rbx+4*rax+10H]
4C8B4618 mov r8, qword ptr [rsi+18H]
48B8A0C2CF3CFC7F0000 mov rax, 0x7FFC3CCFC2A0
4C3BC0 cmp r8, rax
751D jne SHORT G_M000_IG07
446BFA2A imul r15d, edx, 42
G_M000_IG04: ;; offset=003CH
4103FF add edi, r15d
FFC5 inc ebp
443BF5 cmp r14d, ebp
7FD9 jg SHORT G_M000_IG03
G_M000_IG05: ;; offset=0046H
8BC7 mov eax, edi
G_M000_IG06: ;; offset=0048H
4883C428 add rsp, 40
5B pop rbx
5D pop rbp
5E pop rsi
5F pop rdi
415E pop r14
415F pop r15
C3 ret
G_M000_IG07: ;; offset=0055H
488B4E08 mov rcx, gword ptr [rsi+08H]
41FFD0 call r8
448BF8 mov r15d, eax
EBDB jmp SHORT G_M000_IG04
I chose the 42 constant in i => i * 42 to make it easy to see in the
assembly, and sure enough, there it is:
G_M000_IG03: ;; offset=001FH
8BC5 mov eax, ebp
8B548310 mov edx, dword ptr [rbx+4*rax+10H]
4C8B4618 mov r8, qword ptr [rsi+18H]
48B8A0C2CF3CFC7F0000 mov rax, 0x7FFC3CCFC2A0
4C3BC0 cmp r8, rax
751D jne SHORT G_M000_IG07
446BFA2A imul r15d, edx, 42
This is loading the target address from the delegate into r8 and is
loading the address of the expected target into rax. If they're the
same, it then simply performs the inlined operation (imul r15d, edx,
42), and otherwise it jumps to G_M000_IG07 which calls to the
function in r8. The effect of this is obvious if we run this as a
benchmark:
static int[] s_values = Enumerable.Range(0, 1_000).ToArray();
[Benchmark]
public int DelegatePGO() => Sum(s_values, i => i * 42);
static int Sum(int[] values, Func<int, int>? func)
{
int sum = 0;
foreach (int value in values)
{
sum += func(value);
}
return sum;
}
With PGO disabled, we get the same performance throughput for .NET 6
and .NET 7:
Method Runtime Mean Ratio
DelegatePGO .NET 6.0 1.665 us 1.00
DelegatePGO .NET 7.0 1.659 us 1.00
But the picture changes when we enable dynamic PGO
(DOTNET_TieredPGO=1). .NET 6 gets ~14% faster, but .NET 7 gets ~3x
faster!
Method Runtime Mean Ratio
DelegatePGO .NET 6.0 1,427.7 ns 1.00
DelegatePGO .NET 7.0 539.0 ns 0.38
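Conceptually, the PGO'd loop body behaves like the following sketch.
This is my own approximation: the real guard compares the delegate's
stored function pointer, which I'm standing in for here with a cached
MethodInfo comparison, and s_hot/s_profiledTarget are names I've
invented for illustration:
using System.Reflection;

class Sketch
{
    static readonly Func<int, int> s_hot = i => i * 42;
    static readonly MethodInfo s_profiledTarget = s_hot.Method; // profiled hot target

    static int Sum(int[] values, Func<int, int> func)
    {
        int sum = 0;
        foreach (int value in values)
        {
            if (func.Method == s_profiledTarget) // guard: still the expected target?
                sum += value * 42;               // inlined body of i => i * 42
            else
                sum += func(value);              // cold fallback: indirect invoke
        }
        return sum;
    }
}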
dotnet/runtime#70377 is another valuable improvement with dynamic
PGO, which enables PGO to play nicely with loop cloning and invariant
hoisting. To understand this better, a brief digression into what
those are. Loop cloning is a mechanism the JIT employs to avoid
various overheads in the fast path of a loop. Consider the Test
method in this example:
using System.Runtime.CompilerServices;
class Program
{
static void Main()
{
int[] array = new int[10_000_000];
for (int i = 0; i < 1_000_000; i++)
{
Test(array);
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static bool Test(int[] array)
{
for (int i = 0; i < 0x12345; i++)
{
if (array[i] == 42)
{
return true;
}
}
return false;
}
}
The JIT doesn't know whether the passed in array is of sufficient
length that all accesses to array[i] inside the loop will be in
bounds, and thus it would need to inject bounds checks for every
access. While it'd be nice to simply do the length check up front and
throw an exception early if it wasn't long enough, doing so
could also change behavior (imagine the method were writing into the
array as it went, or otherwise mutating some shared state). Instead,
the JIT employs "loop cloning." It essentially rewrites this Test
method to be more like this:
if (array is not null && array.Length >= 0x12345)
{
for (int i = 0; i < 0x12345; i++)
{
if (array[i] == 42) // no bounds checks emitted for this access :-)
{
return true;
}
}
}
else
{
for (int i = 0; i < 0x12345; i++)
{
if (array[i] == 42) // bounds checks emitted for this access :-(
{
return true;
}
}
}
return false;
That way, at the expense of some code duplication, we get our fast
loop without bounds checks and only pay for the bounds checks in the
slow path. You can see this in the generated assembly (if you can't
already tell, DOTNET_JitDisasm is one of my favorite features in .NET
7):
; Assembly listing for method Program:Test(ref):bool
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
G_M000_IG01: ;; offset=0000H
4883EC28 sub rsp, 40
G_M000_IG02: ;; offset=0004H
33C0 xor eax, eax
4885C9 test rcx, rcx
7429 je SHORT G_M000_IG05
81790845230100 cmp dword ptr [rcx+08H], 0x12345
7C20 jl SHORT G_M000_IG05
0F1F40000F1F840000000000 align [12 bytes for IG03]
G_M000_IG03: ;; offset=0020H
8BD0 mov edx, eax
837C91102A cmp dword ptr [rcx+4*rdx+10H], 42
7429 je SHORT G_M000_IG08
FFC0 inc eax
3D45230100 cmp eax, 0x12345
7CEE jl SHORT G_M000_IG03
G_M000_IG04: ;; offset=0032H
EB17 jmp SHORT G_M000_IG06
G_M000_IG05: ;; offset=0034H
3B4108 cmp eax, dword ptr [rcx+08H]
7323 jae SHORT G_M000_IG10
8BD0 mov edx, eax
837C91102A cmp dword ptr [rcx+4*rdx+10H], 42
7410 je SHORT G_M000_IG08
FFC0 inc eax
3D45230100 cmp eax, 0x12345
7CE9 jl SHORT G_M000_IG05
G_M000_IG06: ;; offset=004BH
33C0 xor eax, eax
G_M000_IG07: ;; offset=004DH
4883C428 add rsp, 40
C3 ret
G_M000_IG08: ;; offset=0052H
B801000000 mov eax, 1
G_M000_IG09: ;; offset=0057H
4883C428 add rsp, 40
C3 ret
G_M000_IG10: ;; offset=005CH
E81FA0C15F call CORINFO_HELP_RNGCHKFAIL
CC int3
; Total bytes of code 98
That G_M000_IG02 section is doing the null check and the length
check, jumping to the G_M000_IG05 block if either fails. If both
succeed, it's then executing the loop (block G_M000_IG03) without
bounds checks:
G_M000_IG03: ;; offset=0020H
8BD0 mov edx, eax
837C91102A cmp dword ptr [rcx+4*rdx+10H], 42
7429 je SHORT G_M000_IG08
FFC0 inc eax
3D45230100 cmp eax, 0x12345
7CEE jl SHORT G_M000_IG03
with the bounds checks only showing up in the slow-path block:
G_M000_IG05: ;; offset=0034H
3B4108 cmp eax, dword ptr [rcx+08H]
7323 jae SHORT G_M000_IG10
8BD0 mov edx, eax
837C91102A cmp dword ptr [rcx+4*rdx+10H], 42
7410 je SHORT G_M000_IG08
FFC0 inc eax
3D45230100 cmp eax, 0x12345
7CE9 jl SHORT G_M000_IG05
That's "loop cloning." What about "invariant hoisting"? Hoisting
means pulling something out of a loop to be before the loop, and
invariants are things that don't change. Thus invariant hoisting is
pulling something out of a loop to before the loop in order to avoid
recomputing, on every iteration, an answer that won't change.
Effectively, the previous example already showed invariant hoisting,
in that the bounds check is moved to be before the loop rather than
in the loop, but a more concrete example would be something like
this:
[MethodImpl(MethodImplOptions.NoInlining)]
private static bool Test(int[] array)
{
for (int i = 0; i < 0x12345; i++)
{
if (array[i] == array.Length - 42)
{
return true;
}
}
return false;
}
Note that the value of array.Length - 42 doesn't change on each
iteration of the loop, so it's "invariant" to the loop iteration and
can be lifted out, which the generated code does:
G_M000_IG02: ;; offset=0004H
33D2 xor edx, edx
4885C9 test rcx, rcx
742A je SHORT G_M000_IG05
448B4108 mov r8d, dword ptr [rcx+08H]
4181F845230100 cmp r8d, 0x12345
7C1D jl SHORT G_M000_IG05
4183C0D6 add r8d, -42
0F1F4000 align [4 bytes for IG03]
G_M000_IG03: ;; offset=0020H
8BC2 mov eax, edx
4439448110 cmp dword ptr [rcx+4*rax+10H], r8d
7433 je SHORT G_M000_IG08
FFC2 inc edx
81FA45230100 cmp edx, 0x12345
7CED jl SHORT G_M000_IG03
Here again we see the array being tested for null (test rcx, rcx) and
the array's length being checked (mov r8d, dword ptr [rcx+08H] then
cmp r8d, 0x12345), but then with the array's length in r8d, we then
see this up-front block subtracting 42 from the length (add r8d,
-42), and that's before we continue into the fast-path loop in the
G_M000_IG03 block. This keeps that additional set of operations out
of the loop, thereby avoiding the overhead of recomputing the value
per iteration.
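In other words, combined with the cloning we saw earlier, the fast
path ends up behaving as if the method had been written like this
(conceptual C#, in the same spirit as the earlier rewrite):
if (array is not null && array.Length >= 0x12345)
{
    int invariant = array.Length - 42; // hoisted: computed once, before the loop
    for (int i = 0; i < 0x12345; i++)
    {
        if (array[i] == invariant) // no bounds check, no recomputation
        {
            return true;
        }
    }
}
// ... else the slow-path loop with bounds checks, as before ...
return false;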
Ok, so how does this apply to dynamic PGO? Remember that with the
interface/virtual dispatch avoidance PGO is able to do, it does so by
doing a type check to see whether the type in use is the most common
type; if it is, it uses a fast path that calls directly to that
type's method (and in doing so that call is then potentially
inlined), and if it isn't, it falls back to normal interface/virtual
dispatch. That check can be invariant to a loop. So when a method is
tiered up and PGO kicks in, the type check can now be hoisted out of
the loop, making it even cheaper to handle the common case. Consider
this variation of our original example:
using System.Runtime.CompilerServices;
class Program
{
static void Main()
{
IPrinter printer = new BlankPrinter();
while (true)
{
DoWork(printer);
}
}
[MethodImpl(MethodImplOptions.NoInlining)]
static void DoWork(IPrinter printer)
{
for (int j = 0; j < 123; j++)
{
printer.Print(j);
}
}
interface IPrinter
{
void Print(int i);
}
class BlankPrinter : IPrinter
{
public void Print(int i)
{
Console.Write("");
}
}
}
When we look at the optimized assembly generated for this with
dynamic PGO enabled, we see this:
; Assembly listing for method Program:DoWork(IPrinter)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; partially interruptible
; with Dynamic PGO: edge weights are invalid, and fgCalledCount is 12187
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
57 push rdi
56 push rsi
4883EC28 sub rsp, 40
488BF1 mov rsi, rcx
G_M000_IG02: ;; offset=0009H
33FF xor edi, edi
4885F6 test rsi, rsi
742B je SHORT G_M000_IG05
48B9982DD43CFC7F0000 mov rcx, 0x7FFC3CD42D98
48390E cmp qword ptr [rsi], rcx
751C jne SHORT G_M000_IG05
G_M000_IG03: ;; offset=001FH
48B9282040F948020000 mov rcx, 0x248F9402028
488B09 mov rcx, gword ptr [rcx]
FF1526A80D00 call [Console:Write(String)]
FFC7 inc edi
83FF7B cmp edi, 123
7CE6 jl SHORT G_M000_IG03
G_M000_IG04: ;; offset=0039H
EB29 jmp SHORT G_M000_IG07
G_M000_IG05: ;; offset=003BH
48B9982DD43CFC7F0000 mov rcx, 0x7FFC3CD42D98
48390E cmp qword ptr [rsi], rcx
7521 jne SHORT G_M000_IG08
48B9282040F948020000 mov rcx, 0x248F9402028
488B09 mov rcx, gword ptr [rcx]
FF15FBA70D00 call [Console:Write(String)]
G_M000_IG06: ;; offset=005DH
FFC7 inc edi
83FF7B cmp edi, 123
7CD7 jl SHORT G_M000_IG05
G_M000_IG07: ;; offset=0064H
4883C428 add rsp, 40
5E pop rsi
5F pop rdi
C3 ret
G_M000_IG08: ;; offset=006BH
488BCE mov rcx, rsi
8BD7 mov edx, edi
49BB1000AA3CFC7F0000 mov r11, 0x7FFC3CAA0010
41FF13 call [r11]IPrinter:Print(int):this
EBDE jmp SHORT G_M000_IG06
; Total bytes of code 127
We can see in the G_M000_IG02 block that it's doing the type check on
the IPrinter instance and jumping to G_M000_IG05 if the check fails
(mov rcx, 0x7FFC3CD42D98 then cmp qword ptr [rsi], rcx then jne SHORT
G_M000_IG05), otherwise falling through to G_M000_IG03 which is a
tight fast-path loop with the inlined BlankPrinter.Print and no type
checks in sight!
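Put another way, the combination of GDV, cloning, and hoisting
effectively turns DoWork into something like this (conceptual C#;
again, the real guard is a method-table comparison):
static void DoWork(IPrinter printer)
{
    if (printer is BlankPrinter) // type check hoisted out of the loop
    {
        for (int j = 0; j < 123; j++)
        {
            Console.Write(""); // inlined BlankPrinter.Print, no per-iteration checks
        }
    }
    else
    {
        for (int j = 0; j < 123; j++)
        {
            printer.Print(j); // fallback path retains the dispatch
        }
    }
}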
Interestingly, improvements like this can bring with them their own
challenges. PGO leads to a significant increase in the number of type
checks, since call sites that specialize for a given type need to
compare against that type. However, common subexpression elimination
(CSE) hasn't historically worked for such type handles (CSE is a
compiler optimization where duplicate expressions are eliminated by
computing the result once and then storing it for subsequent use
rather than recomputing it each time). dotnet/runtime#70580 fixes
this by enabling CSE for such constant handles. For example, consider
this method:
[Benchmark]
[Arguments("", "", "", "")]
public bool AllAreStrings(object o1, object o2, object o3, object o4) =>
o1 is string && o2 is string && o3 is string && o4 is string;
On .NET 6, the JIT produced this assembly code:
; Program.AllAreStrings(System.Object, System.Object, System.Object, System.Object)
test rdx,rdx
je short M00_L01
mov rax,offset MT_System.String
cmp [rdx],rax
jne short M00_L01
test r8,r8
je short M00_L01
mov rax,offset MT_System.String
cmp [r8],rax
jne short M00_L01
test r9,r9
je short M00_L01
mov rax,offset MT_System.String
cmp [r9],rax
jne short M00_L01
mov rax,[rsp+28]
test rax,rax
je short M00_L00
mov rdx,offset MT_System.String
cmp [rax],rdx
je short M00_L00
xor eax,eax
M00_L00:
test rax,rax
setne al
movzx eax,al
ret
M00_L01:
xor eax,eax
ret
; Total bytes of code 100
Note the C# has four tests for string and the assembly code has four
loads with mov rax,offset MT_System.String. Now on .NET 7, the load
is performed just once:
; Program.AllAreStrings(System.Object, System.Object, System.Object, System.Object)
test rdx,rdx
je short M00_L01
mov rax,offset MT_System.String
cmp [rdx],rax
jne short M00_L01
test r8,r8
je short M00_L01
cmp [r8],rax
jne short M00_L01
test r9,r9
je short M00_L01
cmp [r9],rax
jne short M00_L01
mov rdx,[rsp+28]
test rdx,rdx
je short M00_L00
cmp [rdx],rax
je short M00_L00
xor edx,edx
M00_L00:
xor eax,eax
test rdx,rdx
setne al
ret
M00_L01:
xor eax,eax
ret
; Total bytes of code 69
Bounds Check Elimination
One of the things that makes .NET attractive is its safety. The
runtime guards access to arrays, strings, and spans such that you
can't accidentally corrupt memory by walking off either end; if you
do, rather than reading/writing arbitrary memory, you'll get
exceptions. Of course, that's not magic; it's done by the JIT
inserting bounds checks every time one of these data structures is
indexed. For example, this:
[MethodImpl(MethodImplOptions.NoInlining)]
static int Read0thElement(int[] array) => array[0];
results in:
G_M000_IG01: ;; offset=0000H
4883EC28 sub rsp, 40
G_M000_IG02: ;; offset=0004H
83790800 cmp dword ptr [rcx+08H], 0
7608 jbe SHORT G_M000_IG04
8B4110 mov eax, dword ptr [rcx+10H]
G_M000_IG03: ;; offset=000DH
4883C428 add rsp, 40
C3 ret
G_M000_IG04: ;; offset=0012H
E8E9A0C25F call CORINFO_HELP_RNGCHKFAIL
CC int3
The array is passed into this method in the rcx register, pointing to
the method table pointer in the object, and the length of an array is
stored in the object just after that method table pointer (which is 8
bytes in a 64-bit process). Thus the cmp dword ptr [rcx+08H], 0
instruction is reading the length of the array and comparing the
length to 0; that makes sense, since the length can't be negative,
and we're trying to access the 0th element, so as long as the length
isn't 0, the array has enough elements for us to access its 0th
element. In the event that the length was 0, the code jumps to the
end of the function, which contains call CORINFO_HELP_RNGCHKFAIL;
that's a JIT helper function that throws an IndexOutOfRangeException.
If the length was sufficient, however, it then reads the int stored
at the beginning of the array's data, which on 64-bit is 16 bytes
(0x10) past the pointer (mov eax, dword ptr [rcx+10H]).
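Put in picture form, the object layout being relied on here is
roughly:
// 64-bit int[] object layout, as referenced by the assembly above:
//   [rcx + 0x00]  method table pointer (8 bytes)
//   [rcx + 0x08]  int Length (4 bytes, the field the bounds check reads)
//   [rcx + 0x10]  element 0, element 1, ... (4 bytes each for int)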
While these bounds checks in and of themselves aren't super
expensive, do a lot of them and their costs add up. So while the JIT
needs to ensure that "safe" accesses don't go out of bounds, it also
tries to prove that certain accesses won't, in which case it needn't
emit the bounds check that it knows will be superfluous. In every
release of .NET, more and more cases have been added to find places
these bounds checks can be eliminated, and .NET 7 is no exception.
For example, dotnet/runtime#61662 from @anthonycanino enabled the JIT
to understand various forms of binary operations as part of range
checks. Consider this method:
[MethodImpl(MethodImplOptions.NoInlining)]
private static ushort[]? Convert(ReadOnlySpan<byte> bytes)
{
if (bytes.Length != 16)
{
return null;
}
var result = new ushort[8];
for (int i = 0; i < result.Length; i++)
{
result[i] = (ushort)(bytes[i * 2] * 256 + bytes[i * 2 + 1]);
}
return result;
}
It's validating that the input span is 16 bytes long and then
creating a new ushort[8] where each ushort in the array combines two
of the input bytes. To do that, it's looping over the output array,
and indexing into the bytes array using i * 2 and i * 2 + 1 as the
indices. On .NET 6, each of those indexing operations would result in
a bounds check, with assembly like:
cmp r8d,10
jae short G_M000_IG04
movsxd r8,r8d
where that G_M000_IG04 is the call CORINFO_HELP_RNGCHKFAIL we're now
familiar with. But on .NET 7, we get this assembly for the method:
G_M000_IG01: ;; offset=0000H
56 push rsi
4883EC20 sub rsp, 32
G_M000_IG02: ;; offset=0005H
488B31 mov rsi, bword ptr [rcx]
8B4908 mov ecx, dword ptr [rcx+08H]
83F910 cmp ecx, 16
754C jne SHORT G_M000_IG05
48B9302F542FFC7F0000 mov rcx, 0x7FFC2F542F30
BA08000000 mov edx, 8
E80C1EB05F call CORINFO_HELP_NEWARR_1_VC
33D2 xor edx, edx
align [0 bytes for IG03]
G_M000_IG03: ;; offset=0026H
8D0C12 lea ecx, [rdx+rdx]
448BC1 mov r8d, ecx
FFC1 inc ecx
458BC0 mov r8d, r8d
460FB60406 movzx r8, byte ptr [rsi+r8]
41C1E008 shl r8d, 8
8BC9 mov ecx, ecx
0FB60C0E movzx rcx, byte ptr [rsi+rcx]
4103C8 add ecx, r8d
0FB7C9 movzx rcx, cx
448BC2 mov r8d, edx
6642894C4010 mov word ptr [rax+2*r8+10H], cx
FFC2 inc edx
83FA08 cmp edx, 8
7CD0 jl SHORT G_M000_IG03
G_M000_IG04: ;; offset=0056H
4883C420 add rsp, 32
5E pop rsi
C3 ret
G_M000_IG05: ;; offset=005CH
33C0 xor rax, rax
G_M000_IG06: ;; offset=005EH
4883C420 add rsp, 32
5E pop rsi
C3 ret
; Total bytes of code 100
No bounds checks, which is most easily seen by the lack of the
telltale call CORINFO_HELP_RNGCHKFAIL at the end of the method. With
this PR, the JIT is able to understand the impact of certain
multiplication and shift operations and their relationships to the
bounds of the data structure. Since it can see that the result
array's length is 8 and the loop is iterating from 0 to that
exclusive upper bound, it knows that i will always be in the range
[0, 7], which means that i * 2 will always be in the range [0, 14]
and i * 2 + 1 will always be in the range [0, 15]. As such, it's able
to prove that the bounds checks aren't needed.
dotnet/runtime#61569 and dotnet/runtime#62864 also help to eliminate
bounds checks when dealing with constant strings and spans
initialized from RVA statics ("Relative Virtual Address" static
fields, basically a static field that lives in a module's data
section). For example, consider this benchmark:
[Benchmark]
[Arguments(1)]
public char GetChar(int i)
{
const string Text = "hello";
return (uint)i < Text.Length ? Text[i] : '\0';
}
On .NET 6, we get this assembly:
; Program.GetChar(Int32)
sub rsp,28
mov eax,edx
cmp rax,5
jl short M00_L00
xor eax,eax
add rsp,28
ret
M00_L00:
cmp edx,5
jae short M00_L01
mov rax,2278B331450
mov rax,[rax]
movsxd rdx,edx
movzx eax,word ptr [rax+rdx*2+0C]
add rsp,28
ret
M00_L01:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 56
The beginning of this makes sense: the JIT was obviously able to see
that the length of Text is 5, so it's implementing the (uint)i <
Text.Length check by doing cmp rax,5, and if i as an unsigned value
is greater than or equal to 5, it's then zero'ing out the return
value (to return the '\0') and exiting. If the length is less than 5
(in which case it's also at least 0 due to the unsigned comparison),
it then jumps to M00_L00 to read the value from the string... but we
then see another cmp against 5, this time as part of a range check.
So even though the JIT knew the index was in bounds, it wasn't able
to remove the bounds check. Now it is; in .NET 7, we get this:
; Program.GetChar(Int32)
cmp edx,5
jb short M00_L00
xor eax,eax
ret
M00_L00:
mov rax,2B0AF002530
mov rax,[rax]
mov edx,edx
movzx eax,word ptr [rax+rdx*2+0C]
ret
; Total bytes of code 29
So much nicer.
dotnet/runtime#67141 is a great example of how evolving ecosystem
needs drive specific optimizations into the JIT. The Regex compiler
and source generator handle some cases of regular expression
character classes by using a bitmap lookup stored in strings. For
example, to determine whether a char c is in the character class "
[A-Za-z0-9_]" (which will match an underscore or any ASCII letter or
digit), the implementation ends up generating an expression like the
body of the following method:
[Benchmark]
[Arguments('a')]
public bool IsInSet(char c) =>
c < 128 && ("\0\0\0\u03FF\uFFFE\u87FF\uFFFE\u07FF"[c >> 4] & (1 << (c & 0xF))) != 0;
The implementation is treating an 8-character string as a 128-bit
lookup table. If the character is known to be in range (such that
it's effectively a 7-bit value), it's then using the top 3 bits of
the value to index into the 8 elements of the string, and the bottom
4 bits to select one of the 16 bits in that element, giving us an
answer as to whether this input character is in the set or not. In
.NET 6, even though we know the character is in range of the string,
the JIT couldn't see through either the length comparison or the bit
shift.
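As a quick worked example of the bitmap lookup itself: for c = 'a'
(0x61), c >> 4 is 6, selecting the string's seventh element,
'\uFFFE'; c & 0xF is 1, and 0xFFFE has bit 1 set, so 'a' is correctly
reported as being in the set. Here's that .NET 6 assembly: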
; Program.IsInSet(Char)
sub rsp,28
movzx eax,dx
cmp eax,80
jge short M00_L00
mov edx,eax
sar edx,4
cmp edx,8
jae short M00_L01
mov rcx,299835A1518
mov rcx,[rcx]
movsxd rdx,edx
movzx edx,word ptr [rcx+rdx*2+0C]
and eax,0F
bt edx,eax
setb al
movzx eax,al
add rsp,28
ret
M00_L00:
xor eax,eax
add rsp,28
ret
M00_L01:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 75
The previously mentioned PR takes care of the length check. And this
PR takes care of the bit shift. So in .NET 7, we get this loveliness:
; Program.IsInSet(Char)
movzx eax,dx
cmp eax,80
jge short M00_L00
mov edx,eax
sar edx,4
mov rcx,197D4800608
mov rcx,[rcx]
mov edx,edx
movzx edx,word ptr [rcx+rdx*2+0C]
and eax,0F
bt edx,eax
setb al
movzx eax,al
ret
M00_L00:
xor eax,eax
ret
; Total bytes of code 51
Note the distinct lack of a call CORINFO_HELP_RNGCHKFAIL. And as you
might guess, this check can happen a lot in a Regex, making this a
very useful addition.
Bounds checks are an obvious source of overhead when talking about
array access, but they're not the only ones. There's also the need to
use the cheapest instructions possible. In .NET 6, with a method
like:
[MethodImpl(MethodImplOptions.NoInlining)]
private static int Get(int[] values, int i) => values[i];
assembly code like the following would be generated:
; Program.Get(Int32[], Int32)
sub rsp,28
cmp edx,[rcx+8]
jae short M01_L00
movsxd rax,edx
mov eax,[rcx+rax*4+10]
add rsp,28
ret
M01_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 27
This should look fairly familiar from our previous discussion; the
JIT is loading the array's length ([rcx+8]) and comparing that with
the value of i (in edx), and then jumping to the end to throw an
exception if i is out of bounds. Immediately after that jump we see a
movsxd rax, edx instruction, which is taking the 32-bit value of i
from edx and moving it into the 64-bit register rax. And as part of
moving it, it's sign-extending it; that's the "sxd" part of the
instruction name (sign-extending means the upper 32 bits of the new
64-bit value will be set to the value of the upper bit of the 32-bit
value, so that the number retains its signed value). The interesting
thing is, though, we know that the Length of an array and of a span
is non-negative, and since we just bounds checked i against the
Length, we also know that i is non-negative. That makes such
sign-extension useless, since the upper bit is guaranteed to be 0.
Since the mov instruction that zero-extends is a tad cheaper than
movsxd, we can simply use that instead. And that's exactly what
dotnet/runtime#57970 from @pentp does for both arrays and spans (
dotnet/runtime#70884 also similarly avoids some signed casts in other
situations). Now on .NET 7, we get this:
; Program.Get(Int32[], Int32)
sub rsp,28
cmp edx,[rcx+8]
jae short M01_L00
mov eax,edx
mov eax,[rcx+rax*4+10]
add rsp,28
ret
M01_L00:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 26
That's not the only source of overhead with array access, though. In
fact, there's a very large category of array access overhead that's
been there forever, but that's so well known there are even old FxCop
rules and newer Roslyn analyzers that warn against it:
multidimensional array accesses. The overhead in the case of a
multidimensional array isn't just an extra branch on every indexing
operation, or additional math required to compute the location of the
element, but rather that they currently pass through the JIT's
optimization phases largely unmodified. dotnet/runtime#70271 improves
the state of the world here by doing an expansion of a
multidimensional array access early in the JIT's pipeline, such that
later optimization phases can improve multidimensional accesses as
they would other code, including CSE and loop invariant hoisting. The
impact of this is visible in a simple benchmark that sums all the
elements of a multidimensional array.
private int[,] _square;
[Params(1000)]
public int Size { get; set; }
[GlobalSetup]
public void Setup()
{
int count = 0;
_square = new int[Size, Size];
for (int i = 0; i < Size; i++)
{
for (int j = 0; j < Size; j++)
{
_square[i, j] = count++;
}
}
}
[Benchmark]
public int Sum()
{
int[,] square = _square;
int sum = 0;
for (int i = 0; i < Size; i++)
{
for (int j = 0; j < Size; j++)
{
sum += square[i, j];
}
}
return sum;
}
Method Runtime Mean Ratio
Sum .NET 6.0 964.1 us 1.00
Sum .NET 7.0 674.7 us 0.70
The previous example assumes you know the size of each dimension of
the multidimensional array (it's referring to the Size directly in
the loops). That's obviously not always (or maybe even rarely) the
case. In such situations, you'd be more likely to use the
Array.GetUpperBound method, and because multidimensional arrays can
have a non-zero lower bound, Array.GetLowerBound. That would lead to
code like this:
private int[,] _square;
[Params(1000)]
public int Size { get; set; }
[GlobalSetup]
public void Setup()
{
int count = 0;
_square = new int[Size, Size];
for (int i = 0; i < Size; i++)
{
for (int j = 0; j < Size; j++)
{
_square[i, j] = count++;
}
}
}
[Benchmark]
public int Sum()
{
int[,] square = _square;
int sum = 0;
for (int i = square.GetLowerBound(0); i < square.GetUpperBound(0); i++)
{
for (int j = square.GetLowerBound(1); j < square.GetUpperBound(1); j++)
{
sum += square[i, j];
}
}
return sum;
}
In .NET 7, thanks to dotnet/runtime#60816, those GetLowerBound and
GetUpperBound calls become JIT intrinsics. An "intrinsic" to a
compiler is something the compiler has intrinsic knowledge of, such
that rather than relying solely on a method's defined implementation
(if it even has one), the compiler can substitute in something it
considers to be better. There are literally thousands of methods in
.NET known in this manner to the JIT, with GetLowerBound and
GetUpperBound being two of the most recent. Now as intrinsics, when
they're passed a constant value (e.g. 0 for the 0th rank), the JIT
can substitute the necessary assembly instructions to read directly
from the memory location that houses the bounds. Here's what the
assembly code for this benchmark looked like with .NET 6; the main
thing to see here are all of the calls out to GetLowerBound and
GetUpperBound:
; Program.Sum()
push rdi
push rsi
push rbp
push rbx
sub rsp,28
mov rsi,[rcx+8]
xor edi,edi
mov rcx,rsi
xor edx,edx
cmp [rcx],ecx
call System.Array.GetLowerBound(Int32)
mov ebx,eax
mov rcx,rsi
xor edx,edx
call System.Array.GetUpperBound(Int32)
cmp eax,ebx
jle short M00_L03
M00_L00:
mov rcx,[rsi]
mov ecx,[rcx+4]
add ecx,0FFFFFFE8
shr ecx,3
cmp ecx,1
jbe short M00_L05
lea rdx,[rsi+10]
inc ecx
movsxd rcx,ecx
mov ebp,[rdx+rcx*4]
mov rcx,rsi
mov edx,1
call System.Array.GetUpperBound(Int32)
cmp eax,ebp
jle short M00_L02
M00_L01:
mov ecx,ebx
sub ecx,[rsi+18]
cmp ecx,[rsi+10]
jae short M00_L04
mov edx,ebp
sub edx,[rsi+1C]
cmp edx,[rsi+14]
jae short M00_L04
mov eax,[rsi+14]
imul rax,rcx
mov rcx,rdx
add rcx,rax
add edi,[rsi+rcx*4+20]
inc ebp
mov rcx,rsi
mov edx,1
call System.Array.GetUpperBound(Int32)
cmp eax,ebp
jg short M00_L01
M00_L02:
inc ebx
mov rcx,rsi
xor edx,edx
call System.Array.GetUpperBound(Int32)
cmp eax,ebx
jg short M00_L00
M00_L03:
mov eax,edi
add rsp,28
pop rbx
pop rbp
pop rsi
pop rdi
ret
M00_L04:
call CORINFO_HELP_RNGCHKFAIL
M00_L05:
mov rcx,offset MT_System.IndexOutOfRangeException
call CORINFO_HELP_NEWSFAST
mov rsi,rax
call System.SR.get_IndexOutOfRange_ArrayRankIndex()
mov rdx,rax
mov rcx,rsi
call System.IndexOutOfRangeException..ctor(System.String)
mov rcx,rsi
call CORINFO_HELP_THROW
int 3
; Total bytes of code 219
Now here's what it is for .NET 7:
; Program.Sum()
push r14
push rdi
push rsi
push rbp
push rbx
sub rsp,20
mov rdx,[rcx+8]
xor eax,eax
mov ecx,[rdx+18]
mov r8d,ecx
mov r9d,[rdx+10]
lea ecx,[rcx+r9+0FFFF]
cmp ecx,r8d
jle short M00_L03
mov r9d,[rdx+1C]
mov r10d,[rdx+14]
lea r10d,[r9+r10+0FFFF]
M00_L00:
mov r11d,r9d
cmp r10d,r11d
jle short M00_L02
mov esi,r8d
sub esi,[rdx+18]
mov edi,[rdx+10]
M00_L01:
mov ebx,esi
cmp ebx,edi
jae short M00_L04
mov ebp,[rdx+14]
imul ebx,ebp
mov r14d,r11d
sub r14d,[rdx+1C]
cmp r14d,ebp
jae short M00_L04
add ebx,r14d
add eax,[rdx+rbx*4+20]
inc r11d
cmp r10d,r11d
jg short M00_L01
M00_L02:
inc r8d
cmp ecx,r8d
jg short M00_L00
M00_L03:
add rsp,20
pop rbx
pop rbp
pop rsi
pop rdi
pop r14
ret
M00_L04:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 130
Importantly, note there are no more calls (other than for the bounds
check exception at the end). For example, instead of that first
GetUpperBound call:
call System.Array.GetUpperBound(Int32)
we get:
mov r9d,[rdx+1C]
mov r10d,[rdx+14]
lea r10d,[r9+r10+0FFFF]
and it ends up being much faster:
Method Runtime Mean Ratio
Sum .NET 6.0 2,657.5 us 1.00
Sum .NET 7.0 676.3 us 0.25
Loop Hoisting and Cloning
We previously saw how PGO interacts with loop hoisting and cloning,
and those optimizations have seen other improvements, as well.
Historically, the JIT's support for hoisting has been limited to
lifting an invariant out one level. Consider this example:
[Benchmark]
public void Compute()
{
for (int thousands = 0; thousands < 10; thousands++)
{
for (int hundreds = 0; hundreds < 10; hundreds++)
{
for (int tens = 0; tens < 10; tens++)
{
for (int ones = 0; ones < 10; ones++)
{
int n = ComputeNumber(thousands, hundreds, tens, ones);
Process(n);
}
}
}
}
}
static int ComputeNumber(int thousands, int hundreds, int tens, int ones) =>
(thousands * 1000) +
(hundreds * 100) +
(tens * 10) +
ones;
[MethodImpl(MethodImplOptions.NoInlining)]
static void Process(int n) { }
At first glance, you might look at this and say "what could be
hoisted, the computation of n requires all of the loop inputs, and
all of that computation is in ComputeNumber." But from a compiler's
perspective, the ComputeNumber function is inlineable and thus
logically can be part of its caller; the computation of n is then
actually split into multiple pieces, and each of those pieces can be hoisted
to different levels, e.g. the tens computation can be hoisted out one
level, the hundreds out two levels, and the thousands out three
levels. Here's what [DisassemblyDiagnoser] outputs for .NET 6:
; Program.Compute()
push r14
push rdi
push rsi
push rbp
push rbx
sub rsp,20
xor esi,esi
M00_L00:
xor edi,edi
M00_L01:
xor ebx,ebx
M00_L02:
xor ebp,ebp
imul ecx,esi,3E8
imul eax,edi,64
add ecx,eax
lea eax,[rbx+rbx*4]
lea r14d,[rcx+rax*2]
M00_L03:
lea ecx,[r14+rbp]
call Program.Process(Int32)
inc ebp
cmp ebp,0A
jl short M00_L03
inc ebx
cmp ebx,0A
jl short M00_L02
inc edi
cmp edi,0A
jl short M00_L01
inc esi
cmp esi,0A
jl short M00_L00
add rsp,20
pop rbx
pop rbp
pop rsi
pop rdi
pop r14
ret
; Total bytes of code 84
We can see that some hoisting has happened here. After all, the inner
most loop (tagged M00_L03) is only five instructions: increment ebp
(which at this point is the ones counter value), and if it's still
less than 0xA (10), jump back to M00_L03 which adds whatever is in
r14 to ones. Great, so we've hoisted all of the unnecessary
computation out of the inner loop, being left only with adding the
ones position to the rest of the number. Let's go out a level.
M00_L02 is the label for the tens loop. What do we see there?
Trouble. The two instructions imul ecx,esi,3E8 and imul eax,edi,64
are performing the thousands * 1000 and hundreds * 100 operations,
highlighting that these operations which could have been hoisted out
further were left stuck in the next-to-innermost loop. Now, here's
what we get for .NET 7, where this was improved in dotnet/runtime#68061:
; Program.Compute()
push r15
push r14
push r12
push rdi
push rsi
push rbp
push rbx
sub rsp,20
xor esi,esi
M00_L00:
xor edi,edi
imul ebx,esi,3E8
M00_L01:
xor ebp,ebp
imul r14d,edi,64
add r14d,ebx
M00_L02:
xor r15d,r15d
lea ecx,[rbp+rbp*4]
lea r12d,[r14+rcx*2]
M00_L03:
lea ecx,[r12+r15]
call qword ptr [Program.Process(Int32)]
inc r15d
cmp r15d,0A
jl short M00_L03
inc ebp
cmp ebp,0A
jl short M00_L02
inc edi
cmp edi,0A
jl short M00_L01
inc esi
cmp esi,0A
jl short M00_L00
add rsp,20
pop rbx
pop rbp
pop rsi
pop rdi
pop r12
pop r14
pop r15
ret
; Total bytes of code 99
Notice now where those imul instructions live. There are four labels,
each one corresponding to one of the loops, and we can see the
outermost loop has the imul ebx,esi,3E8 (for the thousands
computation) and the next loop has the imul r14d,edi,64 (for the
hundreds computation), highlighting that these computations were
hoisted out to the appropriate level (the tens and ones computation
are still in the right places).
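Expressed back in C#, the .NET 7 code is effectively doing this (a
conceptual sketch of the shape of the generated code; the temporary
names are mine):
for (int thousands = 0; thousands < 10; thousands++)
{
    int t = thousands * 1000;              // hoisted out three levels
    for (int hundreds = 0; hundreds < 10; hundreds++)
    {
        int th = t + hundreds * 100;       // hoisted out two levels
        for (int tens = 0; tens < 10; tens++)
        {
            int tht = th + tens * 10;      // hoisted out one level
            for (int ones = 0; ones < 10; ones++)
            {
                Process(tht + ones);       // only the final add remains inside
            }
        }
    }
}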
More improvements have gone in on the cloning side. Previously, loop
cloning would only apply for loops iterating by 1 from a low to a
high value. With dotnet/runtime#60148, the comparison against the
upper value can be <= rather than just <. And with
dotnet/runtime#67930, loops that iterate downward can also be cloned, as can loops
that have increments and decrements larger than 1. Consider this
benchmark:
private int[] _values = Enumerable.Range(0, 1000).ToArray();
[Benchmark]
[Arguments(0, 0, 1000)]
public int LastIndexOf(int arg, int offset, int count)
{
int[] values = _values;
for (int i = offset + count - 1; i >= offset; i--)
if (values[i] == arg)
return i;
return 0;
}
Without loop cloning, the JIT can't assume that offset through
offset+count are in range, and thus every access to the array needs
to be bounds checked. With loop cloning, the JIT could generate one
version of the loop without bounds checks and only use that when it
knows all accesses will be valid. That's exactly what happens now in
.NET 7. Here's what we got with .NET 6:
; Program.LastIndexOf(Int32, Int32, Int32)
sub rsp,28
mov rcx,[rcx+8]
lea eax,[r8+r9+0FFFF]
cmp eax,r8d
jl short M00_L01
mov r9d,[rcx+8]
nop word ptr [rax+rax]
M00_L00:
cmp eax,r9d
jae short M00_L03
movsxd r10,eax
cmp [rcx+r10*4+10],edx
je short M00_L02
dec eax
cmp eax,r8d
jge short M00_L00
M00_L01:
xor eax,eax
add rsp,28
ret
M00_L02:
add rsp,28
ret
M00_L03:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 72
Notice how in the core loop, at label M00_L00, there's a bounds check
(cmp eax,r9d and jae short M00_L03, which jumps to a call
CORINFO_HELP_RNGCHKFAIL). And here's what we get with .NET 7:
; Program.LastIndexOf(Int32, Int32, Int32)
sub rsp,28
mov rax,[rcx+8]
lea ecx,[r8+r9+0FFFF]
cmp ecx,r8d
jl short M00_L02
test rax,rax
je short M00_L01
test ecx,ecx
jl short M00_L01
test r8d,r8d
jl short M00_L01
cmp [rax+8],ecx
jle short M00_L01
M00_L00:
mov r9d,ecx
cmp [rax+r9*4+10],edx
je short M00_L03
dec ecx
cmp ecx,r8d
jge short M00_L00
jmp short M00_L02
M00_L01:
cmp ecx,[rax+8]
jae short M00_L04
mov r9d,ecx
cmp [rax+r9*4+10],edx
je short M00_L03
dec ecx
cmp ecx,r8d
jge short M00_L01
M00_L02:
xor eax,eax
add rsp,28
ret
M00_L03:
mov eax,ecx
add rsp,28
ret
M00_L04:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 98
Notice how the code size is larger, and how there are now two
variations of the loop: one at M00_L00 and one at M00_L01. The second
one, M00_L01, has a branch to that same call CORINFO_HELP_RNGCHKFAIL,
but the first one doesn't, because that loop will only end up being
used after proving that the offset, count, and _values.Length are
such that the indexing will always be in bounds.
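Mapping that assembly back to C#, the cloned method behaves roughly
like this (a conceptual sketch of the fast/slow split):
int[] values = _values;
int start = offset + count - 1;
if (values is not null && offset >= 0 && start >= 0 && start < values.Length)
{
    // fast loop: all indices proven in bounds, no bounds checks emitted
    for (int i = start; i >= offset; i--)
        if (values[i] == arg)
            return i;
}
else
{
    // slow loop: each access is bounds checked
    for (int i = start; i >= offset; i--)
        if (values[i] == arg)
            return i;
}
return 0;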
Other changes also improved loop cloning. dotnet/runtime#59886
enables the JIT to choose different forms for how to emit the
conditions for choosing the fast or slow loop path, e.g. whether to
emit all the conditions, & them together, and then branch (if (!
(cond1 & cond2)) goto slowPath), or whether to emit each condition on
its own (if (!cond1) goto slowPath; if (!cond2) goto slowPath).
dotnet/runtime#66257 enables loop cloning to kick in when the loop
variable is initialized to more kinds of expressions (e.g. for (int
fromindex = lastIndex - lengthToClear; ...)). And
dotnet/runtime#70232 increases the JIT's willingness to clone loops with bodies that
do a broader set of operations.
Folding, propagation, and substitution
Constant folding is an optimization where a compiler computes the
value of an expression involving only constants at compile-time
rather than generating the code to compute the value at run-time.
There are multiple levels of constant folding in .NET, with some
constant folding performed by the C# compiler and some constant
folding performed by the JIT compiler. For example, given the C#
code:
[Benchmark]
public int A() => 3 + (4 * 5);
[Benchmark]
public int B() => A() * 2;
the C# compiler will generate IL for these methods like the
following:
.method public hidebysig instance int32 A () cil managed
{
.maxstack 8
IL_0000: ldc.i4.s 23
IL_0002: ret
}
.method public hidebysig instance int32 B () cil managed
{
.maxstack 8
IL_0000: ldarg.0
IL_0001: call instance int32 Program::A()
IL_0006: ldc.i4.2
IL_0007: mul
IL_0008: ret
}
You can see that the C# compiler has computed the value of 3 + (4*5),
as the IL for method A simply contains the equivalent of return 23;.
However, method B contains the equivalent of return A() * 2;,
highlighting that the constant folding performed by the C# compiler
was intramethod only. Now here's what the JIT generates:
; Program.A()
mov eax,17
ret
; Total bytes of code 6
; Program.B()
mov eax,2E
ret
; Total bytes of code 6
The assembly for method A isn't particularly interesting; it's just
returning that same value 23 (hex 0x17). But method B is more
interesting. The JIT has inlined the call from B to A, exposing the
contents of A to B, such that the JIT effectively sees the body of B
as the equivalent of return 23 * 2;. At that point, the JIT can do
its own constant folding, and it transforms the body of B to simply
return 46 (hex 0x2e). Constant propagation is intricately linked to
constant folding and is essentially just the idea that you can
substitute a constant value (typically one computed via constant
folding) into further expressions, at which point they may also be
able to be folded.
The JIT has long performed constant folding, but it improves further
in .NET 7. One of the ways constant folding can improve is by
exposing more values to be folded, which often means more inlining.
dotnet/runtime#55745 helped the inliner to understand that a method
call like M(constant + constant) (noting that those constants might
be the result of some other method call) is itself passing a constant
to M, and a constant being passed to a method call is a hint to the
inliner that it should consider being more aggressive about inlining,
since exposing that constant to the body of the callee can
potentially significantly reduce the amount of code required to
implement the callee. The JIT might have previously inlined such a
method anyway, but when it comes to inlining, the JIT is all about
heuristics and generating enough evidence that it's worthwhile to
inline something; this contributes to that evidence. This pattern
shows up, for example, in the various FromXx methods on TimeSpan. For
example, TimeSpan.FromSeconds is implemented as:
public static TimeSpan FromSeconds(double value) => Interval(value, TicksPerSecond); // TicksPerSecond is a constant
and, eschewing argument validation for the purposes of this example,
Interval is:
private static TimeSpan Interval(double value, double scale) => IntervalFromDoubleTicks(value * scale);
private static TimeSpan IntervalFromDoubleTicks(double ticks) => ticks == long.MaxValue ? TimeSpan.MaxValue : new TimeSpan((long)ticks);
which if everything gets inlined means FromSeconds is essentially:
public static TimeSpan FromSeconds(double value)
{
double ticks = value * 10_000_000;
return ticks == long.MaxValue ? TimeSpan.MaxValue : new TimeSpan((long)ticks);
}
and if value is a constant, let's say 5, that whole thing can be
constant folded (with dead code elimination on the ticks ==
long.MaxValue branch) to simply:
return new TimeSpan(50_000_000);
I'll spare you the .NET 6 assembly for this, but on .NET 7 with a
benchmark like:
[Benchmark]
public TimeSpan FromSeconds() => TimeSpan.FromSeconds(5);
we now get the simple and clean:
; Program.FromSeconds()
mov eax,2FAF080
ret
; Total bytes of code 6
Another change improving constant folding included dotnet/runtime#
57726 from @SingleAccretion, which unblocked constant folding in a
particular scenario that sometimes manifests when doing
field-by-field assignment of structs being returned from method
calls. As a small example, consider this trivial property, which
accesses the Color.DarkOrange property, which in turn does new Color(KnownColor.DarkOrange):
[Benchmark]
public Color DarkOrange() => Color.DarkOrange;
In .NET 6, the JIT generated this:
; Program.DarkOrange()
mov eax,1
mov ecx,39
xor r8d,r8d
mov [rdx],r8
mov [rdx+8],r8
mov [rdx+10],cx
mov [rdx+12],ax
mov rax,rdx
ret
; Total bytes of code 32
The interesting thing here is that some constants (39, which is the
value of KnownColor.DarkOrange, and 1, which is a private
StateKnownColorValid constant) are being loaded into registers (mov
eax, 1 then mov ecx, 39) and then later being stored into the
relevant location for the Color struct being returned (mov
[rdx+12],ax and mov [rdx+10],cx). In .NET 7, it now generates:
; Program.DarkOrange()
xor eax,eax
mov [rdx],rax
mov [rdx+8],rax
mov word ptr [rdx+10],39
mov word ptr [rdx+12],1
mov rax,rdx
ret
; Total bytes of code 25
with direct assignment of these constant values into their
destination locations (mov word ptr [rdx+12],1 and mov word ptr
[rdx+10],39). Other changes contributing to constant folding included
dotnet/runtime#58171 from @SingleAccretion and dotnet/runtime#57605
from @SingleAccretion.
However, a large category of improvement came from an optimization
related to propagation, that of forward substitution. Consider this
silly benchmark:
[Benchmark]
public int Compute1() => Value + Value + Value + Value + Value;
[Benchmark]
public int Compute2() => SomethingElse() + Value + Value + Value + Value + Value;
private static int Value => 16;
[MethodImpl(MethodImplOptions.NoInlining)]
private static int SomethingElse() => 42;
If we look at the assembly code generated for Compute1 on .NET 6, it
looks like what we'd hope for. We're adding Value 5 times, Value is
trivially inlined and returns a constant value 16, and so we'd hope
that the assembly code generated for Compute1 would effectively just
be returning the value 80 (hex 0x50), which is exactly what happens:
; Program.Compute1()
mov eax,50
ret
; Total bytes of code 6
But Compute2 is a bit different. The structure of the code is such
that the additional call to SomethingElse ends up slightly perturbing
something about the JIT's analysis, and .NET 6 ends up with this
assembly code:
; Program.Compute2()
sub rsp,28
call Program.SomethingElse()
add eax,10
add eax,10
add eax,10
add eax,10
add eax,10
add rsp,28
ret
; Total bytes of code 29
Rather than a single mov eax, 50 to put the value 0x50 into the
return register, we have 5 separate add eax, 10 to build up that same
0x50 (80) value. That's... not ideal.
It turns out that many of the JIT's optimizations operate on the tree
data structures created as part of parsing the IL. In some cases,
optimizations can do better when they're exposed to more of the
program, in other words when the tree they're operating on is larger
and contains more to be analyzed. However, various operations can
break up these trees into smaller, individual ones, such as with
temporary variables created as part of inlining, and in doing so can
inhibit these operations. Something is needed in order to effectively
stitch these trees back together, and that's forward substitution.
You can think of forward substitution almost like an inverse of CSE;
rather than trying to find duplicate expressions and eliminate them
by computing the value once and storing it into a temporary, forward
substitution eliminates that temporary and effectively moves the
expression tree into its use site. Obviously you don't want to do
this if it would then negate CSE and result in duplicate work, but
for expressions that are defined once and used once, this kind of
forward propagation is valuable. dotnet/runtime#61023 added an
initial limited version of forward substitution, and then dotnet/
runtime#63720 added a more robust generalized implementation.
Subsequently, dotnet/runtime#70587 expanded it to also cover some
SIMD vectors, and then dotnet/runtime#71161 improved it further to
enable substitutions into more places (in this case into call
arguments). And with those, our silly benchmark now produces the
following on .NET 7:
; Program.Compute2()
sub rsp,28
call qword ptr [7FFCB8DAF9A8]
add eax,50
add rsp,28
ret
; Total bytes of code 18
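To picture what forward substitution does at the source level, here's a rough sketch; it's hand-wavy, since the JIT performs the transformation on its internal trees rather than on C#, and the method names are made up for illustration:
class ForwardSubstitutionSketch
{
    // Before forward substitution (roughly): a temporary introduced during
    // IL import or inlining splits the expression into two small trees.
    static int Before(int x, int y)
    {
        int tmp = x * 2; // defined once...
        return tmp + y;  // ...used once
    }

    // After forward substitution: the temp's definition is moved into its
    // single use site, reconstituting one larger tree that downstream
    // optimizations (folding, propagation, etc.) can analyze as a whole.
    static int After(int x, int y) => (x * 2) + y;
}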
Vectorization
SIMD, or Single Instruction Multiple Data, is a kind of processing in
which one instruction applies to multiple pieces of data at the same
time. You've got a list of numbers and you want to find the index of
a particular value? You could walk the list comparing one element at
a time, and that would be fine functionally. But what if in the same
amount of time it takes you to read and compare one element, you
could instead read and compare two elements, or four elements, or 32
elements? That's SIMD, and the art of utilizing SIMD instructions is
lovingly referred to as "vectorization," where operations are applied
to all of the elements in a "vector" at the same time.
.NET has long had support for vectorization in the form of Vector<T>,
which is an easy-to-use type with first-class JIT support to enable a
developer to write vectorized implementations. One of Vector<T>'s
greatest strengths is also one of its greatest weaknesses. The type
is designed to adapt to whatever width vector instructions are
available in your hardware. If the machine supports 256-bit width
vectors, great, that's what Vector<T> will target. If not, if the
machine supports 128-bit width vectors, great, that's what Vector<T>
targets. But that flexibility comes with various downsides, at least
today; for example, the operations you can perform on a Vector<T> end
up needing to be agnostic to the width of the vectors used, since the
width is variable based on the hardware on which the code actually
runs. And that means the operations that can be exposed on Vector<T>
are limited, which in turn limits the kinds of operations that can be
vectorized with it. Also, because it's only ever a single size in a
given process, some data set sizes that fall in between 128 bits and
256 bits might not be processed as well as you'd hope. You write your
Vector<T>-based algorithm, and you run it on a machine with
support for 256-bit vectors, which means it can process 32 bytes at a
time, but then you feed it an input with 31 bytes. Had Vector<T>
mapped to 128-bit vectors, it could have been used to improve the
processing of that input, but as its vector size is larger than the
input data size, the implementation ends up falling back to one
that's not accelerated. There are also issues related to R2R and
Native AOT, since ahead-of-time compilation needs to know in advance
what instructions should be used for Vector<T> operations. You
already saw this earlier when discussing the output of
DOTNET_JitDisasmSummary; we saw that the NarrowUtf16ToAscii method
was one of only a few methods that was JIT compiled in a "hello,
world" console app, and that this was because it lacked R2R code due
to its use of Vector<T>.
Starting in .NET Core 3.0, .NET gained literally thousands of new
"hardware intrinsics" methods, most of which are .NET APIs that map
down to one of these SIMD instructions. These intrinsics enable an
expert to write an implementation tuned to a specific instruction
set, and if done well, get the best possible performance, but it also
requires the developer to understand each instruction set and to
implement their algorithm for each instruction set that might be
relevant, e.g. an AVX2 implementation if it's supported, or an SSE2
implementation if it's supported, or an ArmBase implementation if
it's supported, and so on.
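Concretely, that per-ISA approach tends to look something like the following hypothetical sketch (the helper and its name are made up for illustration, and an Arm64 branch is elided):
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class IsaDispatchSketch
{
    public static uint MostSignificantBits(Vector128<byte> v)
    {
        if (Sse2.IsSupported)
        {
            return (uint)Sse2.MoveMask(v); // x86/x64-specific path
        }

        // An AdvSimd.IsSupported branch with an Arm64-specific instruction
        // sequence would go here, and so on for each ISA being targeted...

        uint result = 0; // scalar fallback
        for (int i = 0; i < Vector128<byte>.Count; i++)
        {
            if ((v.GetElement(i) & 0x80) != 0)
            {
                result |= 1u << i;
            }
        }
        return result;
    }
}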
.NET 7 has introduced a middle ground. Previous releases saw the
introduction of the Vector128<T> and Vector256<T> types, but purely
as the vehicle by which data moved in and out of the hardware
intrinsics, since they're all tied to specific width vectors. Now in
.NET 7, exposed via dotnet/runtime#53450, dotnet/runtime#63414,
dotnet/runtime#60094, and dotnet/runtime#68559, a very large set of
cross-platform operations is defined over these types as well, e.g.
Vector128.ExtractMostSignificantBits, Vector256.ConditionalSelect,
and so on. A developer who wants or needs to go beyond what the
high-level Vector<T> offers can choose to target one or more of these
two types. Typically this would amount to a developer writing one
code path based on Vector128<T>, as that has the broadest reach and
achieves a significant amount of the gains from vectorization, and
then, if motivated to do so, adding a second path for Vector256<T>
in order to potentially double throughput further on platforms that
have 256-bit width vectors. Think of these types and methods as a
platform-abstraction layer: you code to these methods, and then the
JIT translates them into the most appropriate instructions for the
underlying platform. Consider this simple code as an example:
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal class Program
{
    private static void Main()
    {
        Vector128<byte> v = Vector128.Create((byte)123);
        while (true)
        {
            WithIntrinsics(v);
            WithVector(v);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static int WithIntrinsics(Vector128<byte> v) => Sse2.MoveMask(v);

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static uint WithVector(Vector128<byte> v) => v.ExtractMostSignificantBits();
}
I have two functions: one that directly uses the Sse2.MoveMask
hardware intrinsic and one that uses the new
Vector128.ExtractMostSignificantBits method. Using
DOTNET_JitDisasm=Program.*, here's what the optimized tier-1 code for
these looks like on my x64 Windows machine:
; Assembly listing for method Program:WithIntrinsics(Vector128`1):int
G_M000_IG01: ;; offset=0000H
C5F877 vzeroupper
G_M000_IG02: ;; offset=0003H
C5F91001 vmovupd xmm0, xmmword ptr [rcx]
C5F9D7C0 vpmovmskb eax, xmm0
G_M000_IG03: ;; offset=000BH
C3 ret
; Total bytes of code 12
; Assembly listing for method Program:WithVector(Vector128`1):int
G_M000_IG01: ;; offset=0000H
C5F877 vzeroupper
G_M000_IG02: ;; offset=0003H
C5F91001 vmovupd xmm0, xmmword ptr [rcx]
C5F9D7C0 vpmovmskb eax, xmm0
G_M000_IG03: ;; offset=000BH
C3 ret
; Total bytes of code 12
Notice anything? The code for the two methods is identical, both
resulting in a vpmovmskb (Move Byte Mask) instruction. Yet the former
code will only work on a platform that supports SSE2 whereas the
latter code will work on any platform with support for 128-bit
vectors, including Arm64 and WASM (and any future platforms
on-boarded that also support SIMD); it'll just result in different
instructions being emitted on those platforms.
To explore this a bit more, let's take a simple example and vectorize
it. We'll implement a Contains method, where we want to search a span
of bytes for a specific value and return whether it was found:
static bool Contains(ReadOnlySpan<byte> haystack, byte needle)
{
    for (int i = 0; i < haystack.Length; i++)
    {
        if (haystack[i] == needle)
        {
            return true;
        }
    }

    return false;
}
How would we vectorize this with Vector<T>? First things first, we
need to check whether it's even supported, and fall back to our
existing implementation if it's not (Vector.IsHardwareAccelerated).
We also need to fall back if the length of the input is less than the
size of a vector (Vector<byte>.Count).
static bool Contains(ReadOnlySpan<byte> haystack, byte needle)
{
    if (Vector.IsHardwareAccelerated && haystack.Length >= Vector<byte>.Count)
    {
        // ...
    }
    else
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (haystack[i] == needle)
            {
                return true;
            }
        }
    }

    return false;
}
Now that we know we have enough data, we can get to coding our
vectorized loop. In this loop, we'll be searching for the needle,
which means we need a vector that contains that value for every
element; Vector<T>'s constructor provides that (new
Vector<byte>(needle)). And we need to be able to slice off a vector's
width of data at a time; for a bit more efficiency, I'll use
pointers. We need a current iteration pointer, and we need to iterate
until the point where we couldn't form another vector because we're
too close to the end; a straightforward way to do that is to get a
pointer that's exactly one vector's width from the end, so that we
can just iterate until our current pointer is equal to or greater
than that threshold. And finally, in our loop body, we need to
compare our current vector with the target vector to see if any
elements are the same (Vector.EqualsAny), returning true if so, and
if not, bumping our current pointer to the next location. At this
point we have:
static unsafe bool Contains(ReadOnlySpan<byte> haystack, byte needle)
{
    if (Vector.IsHardwareAccelerated && haystack.Length >= Vector<byte>.Count)
    {
        fixed (byte* haystackPtr = &MemoryMarshal.GetReference(haystack))
        {
            Vector<byte> target = new Vector<byte>(needle);
            byte* current = haystackPtr;
            byte* endMinusOneVector = haystackPtr + haystack.Length - Vector<byte>.Count;
            do
            {
                if (Vector.EqualsAny(target, *(Vector<byte>*)current))
                {
                    return true;
                }

                current += Vector<byte>.Count;
            }
            while (current < endMinusOneVector);

            // ...
        }
    }
    else
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (haystack[i] == needle)
            {
                return true;
            }
        }
    }

    return false;
}
And we're almost done. The last issue to handle is we may still have
a few elements at the end we haven't searched. There are a couple of
ways we could handle that. One would be to just continue with our
fall back implementation and process each of the remaining elements
one at a time. Another would be to employ a trick that's common when
vectorizing idempotent operations. Our operation isn't mutating
anything, which means it doesn't matter if we compare the same
element multiple times, which means we can just do one final vector
compare for the last vector in the search space; that might or might
not overlap with elements we've already looked at, but it won't hurt
anything if it does. And with that, our implementation is complete:
static unsafe bool Contains(ReadOnlySpan<byte> haystack, byte needle)
{
    if (Vector.IsHardwareAccelerated && haystack.Length >= Vector<byte>.Count)
    {
        fixed (byte* haystackPtr = &MemoryMarshal.GetReference(haystack))
        {
            Vector<byte> target = new Vector<byte>(needle);
            byte* current = haystackPtr;
            byte* endMinusOneVector = haystackPtr + haystack.Length - Vector<byte>.Count;
            do
            {
                if (Vector.EqualsAny(target, *(Vector<byte>*)current))
                {
                    return true;
                }

                current += Vector<byte>.Count;
            }
            while (current < endMinusOneVector);

            if (Vector.EqualsAny(target, *(Vector<byte>*)endMinusOneVector))
            {
                return true;
            }
        }
    }
    else
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (haystack[i] == needle)
            {
                return true;
            }
        }
    }

    return false;
}
Congratulations, we've vectorized this operation, and fairly decently
at that. We can throw this into benchmarkdotnet and see really nice
speedups:
private byte[] _data = Enumerable.Repeat((byte)123, 999).Append((byte)42).ToArray();
[Benchmark(Baseline = true)]
[Arguments((byte)42)]
public bool Find(byte value) => Contains(_data, value); // just the fallback path in its own method
[Benchmark]
[Arguments((byte)42)]
public bool FindVectorized(byte value) => Contains_Vectorized(_data, value); // the implementation we just wrote
Method Mean Ratio
Find 484.05 ns 1.00
FindVectorized 20.21 ns 0.04
A 24x speedup! Woo hoo, victory, all your performance are belong to
us!
You deploy this in your service, and you see Contains being called on
your hot path, but you don't see the improvements you were expecting.
You dig in a little more, and you discover that while you tested this
with an input array with 1000 elements, typical inputs had more like
30 elements. What happens if we change our benchmark to have just 30
elements? That's not long enough to form a vector, so we fall back to
the one-at-a-time path, and we don't get any speedups at all.
One thing we can now do is switch from using Vector<byte> to
Vector128<byte>. That will then lower the threshold from 32 bytes to
16 bytes, such that inputs in that range will still have some amount
of vectorization applied. As these Vector128<T> and Vector256<T>
types have been designed very recently, they also utilize all the
cool new toys, and thus we can use refs instead of pointers. Other
than that, we can keep the shape of our implementation almost the
same, substituting Vector128<byte> where we were using Vector<byte>,
and using some methods on Unsafe to manipulate our refs instead of
pointer arithmetic on the span we fixed.
static bool Contains(ReadOnlySpan<byte> haystack, byte needle)
{
    if (Vector128.IsHardwareAccelerated && haystack.Length >= Vector128<byte>.Count)
    {
        ref byte current = ref MemoryMarshal.GetReference(haystack);
        Vector128<byte> target = Vector128.Create(needle);
        ref byte endMinusOneVector = ref Unsafe.Add(ref current, haystack.Length - Vector128<byte>.Count);
        do
        {
            if (Vector128.EqualsAny(target, Vector128.LoadUnsafe(ref current)))
            {
                return true;
            }

            current = ref Unsafe.Add(ref current, Vector128<byte>.Count);
        }
        while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));

        if (Vector128.EqualsAny(target, Vector128.LoadUnsafe(ref endMinusOneVector)))
        {
            return true;
        }
    }
    else
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (haystack[i] == needle)
            {
                return true;
            }
        }
    }

    return false;
}
With that in hand, we can now try it on our smaller 30 element data
set:
private byte[] _data = Enumerable.Repeat((byte)123, 29).Append((byte)42).ToArray();
[Benchmark(Baseline = true)]
[Arguments((byte)42)]
public bool Find(byte value) => Contains(_data, value);
[Benchmark]
[Arguments((byte)42)]
public bool FindVectorized(byte value) => Contains_Vectorized(_data, value);
Method Mean Ratio
Find 15.388 ns 1.00
FindVectorized 1.747 ns 0.11
Woo hoo, victory, all your performance are belong to us... again!
What about on the larger data set again? Previously with Vector<byte>
we had a 24x speedup, but now:
Method Mean Ratio
Find 484.25 ns 1.00
FindVectorized 32.92 ns 0.07
... closer to 15x. Nothing to sneeze at, but it's not the 24x we
previously saw. What if we want to have our cake and eat it, too?
Let's also add a Vector256<byte> path. To do that, we literally
copy/paste our Vector128<byte> code, search/replace all references to
Vector128 in the copied code with Vector256, and just put it into an
additional condition that uses the Vector256<byte> path if it's
supported and there are enough elements to utilize it.
static bool Contains(ReadOnlySpan<byte> haystack, byte needle)
{
    if (Vector128.IsHardwareAccelerated && haystack.Length >= Vector128<byte>.Count)
    {
        ref byte current = ref MemoryMarshal.GetReference(haystack);

        if (Vector256.IsHardwareAccelerated && haystack.Length >= Vector256<byte>.Count)
        {
            Vector256<byte> target = Vector256.Create(needle);
            ref byte endMinusOneVector = ref Unsafe.Add(ref current, haystack.Length - Vector256<byte>.Count);
            do
            {
                if (Vector256.EqualsAny(target, Vector256.LoadUnsafe(ref current)))
                {
                    return true;
                }

                current = ref Unsafe.Add(ref current, Vector256<byte>.Count);
            }
            while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));

            if (Vector256.EqualsAny(target, Vector256.LoadUnsafe(ref endMinusOneVector)))
            {
                return true;
            }
        }
        else
        {
            Vector128<byte> target = Vector128.Create(needle);
            ref byte endMinusOneVector = ref Unsafe.Add(ref current, haystack.Length - Vector128<byte>.Count);
            do
            {
                if (Vector128.EqualsAny(target, Vector128.LoadUnsafe(ref current)))
                {
                    return true;
                }

                current = ref Unsafe.Add(ref current, Vector128<byte>.Count);
            }
            while (Unsafe.IsAddressLessThan(ref current, ref endMinusOneVector));

            if (Vector128.EqualsAny(target, Vector128.LoadUnsafe(ref endMinusOneVector)))
            {
                return true;
            }
        }
    }
    else
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (haystack[i] == needle)
            {
                return true;
            }
        }
    }

    return false;
}
And, boom, we're back:
Method Mean Ratio
Find 484.53 ns 1.00
FindVectorized 20.08 ns 0.04
We now have an implementation that is vectorized on any platform with
either 128-bit or 256-bit vector instructions (x86, x64, Arm64, WASM,
etc.), that can use either based on the input length, and that can be
included in an R2R image if that's of interest.
There are many factors that impact which path you go down, and I
expect we'll have guidance forthcoming to help navigate all the
factors and approaches. But the capabilities are all there, and
whether you choose to use Vector<T>, Vector128<T> and/or
Vector256<T>, or the hardware intrinsics directly, there are some
amazing performance opportunities ready for the taking.
I already mentioned several PRs that exposed the new cross-platform
vector support, but that only scratches the surface of the work done
to actually enable these operations and to enable them to produce
high-quality code. As just one example of a category of such work, a
set of changes went in to help ensure that zero vector constants are
handled well, such as dotnet/runtime#63821 that "morphed" (changed)
Vector128/256<T>.Create(default) into Vector128/256<T>.Zero, which
then enables subsequent optimizations to focus only on Zero; dotnet/
runtime#65028 that enabled constant propagation of
Vector128/256<T>.Zero; dotnet/runtime#68874 and dotnet/runtime#70171
that add first-class knowledge of vector constants to the JIT's
intermediate representation; and dotnet/runtime#62933,
dotnet/runtime#65632, dotnet/runtime#55875, dotnet/runtime#67502, and
dotnet/runtime#64783 that all improve the code quality of
instructions generated for zero vector comparisons.
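As a rough illustration of the pattern those changes target (just an example shape, not code from the runtime):
using System.Runtime.Intrinsics;

static class ZeroVectorSketch
{
    // dotnet/runtime#63821 normalizes the first form into the second, so
    // later optimization phases only ever need to recognize Zero.
    static readonly Vector128<byte> A = Vector128.Create(default(byte));
    static readonly Vector128<byte> B = Vector128<byte>.Zero;

    // Comparisons against the zero vector, like this one, are the shape
    // whose generated instructions the later PRs in that list improve.
    static bool AnyBitsSet(Vector128<byte> v) => v != Vector128<byte>.Zero;
}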
Inlining
Inlining is one of the most important optimizations the JIT can do.
The concept is simple: instead of making a call to some method, take
the code from that method and bake it into the call site. This has
the obvious advantage of avoiding the overhead of a method call, but
except for really small methods on really hot paths, that's often on
the smaller side of the wins inlining brings. The bigger wins are due
to the callee's code being exposed to the caller's code, and vice
versa. So, for example, if the caller is passing a constant as an
argument to the callee, if the method isn't inlined, the compilation
of the callee has no knowledge of that constant, but if the callee is
inlined, all of the code in the callee is then aware of its argument
being a constant value, and can do all of the optimizations possible
with such a constant, like dead code elimination, branch elimination,
constant folding and propagation, and so on. Of course, if it were
all rainbows and unicorns, everything possible to be inlined would be
inlined, and that's obviously not happening. Inlining brings with it
the cost of potentially increased binary size. If the code being
inlined would result in the same amount or less assembly code in the
caller than it takes to call the callee (and if the JIT can quickly
determine that), then inlining is a no-brainer. But if the code being
inlined would increase the size of the callee non-trivially, now the
JIT needs to weigh that increase in code size against the throughput
benefits that could come from it. That code size increase can itself
result in throughput regressions, due to increasing the number of
distinct instructions to be executed and thereby putting more
pressure on the instruction cache. As with any cache, the more times
you need to read from memory to populate it, the less effective the
cache will be. If you have a function that gets inlined into 100
different call sites, every one of those call sites' copies of the
callee's instructions are unique, and calling each of those 100
functions could end up thrashing the instruction cache; in contrast,
if all of those 100 functions "shared" the same instructions by
simply calling the single instance of the callee, it's likely the
instruction cache would be much more effective and lead to fewer
trips to memory.
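As a small made-up sketch of that caller/callee exposure:
static string Describe(int value, bool hex) =>
    hex ? value.ToString("X") : value.ToString();

// If Describe is inlined here, the JIT sees `hex` as the constant true,
// can eliminate the branch and the untaken path, and effectively compiles
// this method as if it were written: value.ToString("X")
static string DescribeHex(int value) => Describe(value, true);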
All that is to say, inlining is really important, it's important that
the "right" things be inlined and that it not overinline, and as such
every release of .NET in recent memory has seen nice improvements
around inlining. .NET 7 is no exception.
One really interesting improvement around inlining is dotnet/runtime#
64521, and it might be surprising. Consider the Boolean.ToString
method; here's its full implementation:
public override string ToString()
{
    if (!m_value) return "False";
    return "True";
}
Pretty simple, right? You'd expect something this trivial to be
inlined. Alas, on .NET 6, this benchmark:
private bool _value = true;
[Benchmark]
public int BoolStringLength() => _value.ToString().Length;
produces this assembly code:
; Program.BoolStringLength()
sub rsp,28
cmp [rcx],ecx
add rcx,8
call System.Boolean.ToString()
mov eax,[rax+8]
add rsp,28
ret
; Total bytes of code 23
Note the call System.Boolean.ToString(). The reason for this is,
historically, the JIT has been unable to inline methods across
assembly boundaries if those methods contain string literals (like
the "False" and "True" in that Boolean.ToString implementation). This
restriction had to do with string interning and the possibility that
such inlining could lead to visible behavioral differences. Those
concerns are no longer valid, and so this PR removes the restriction.
As a result, that same benchmark on .NET 7 now produces this:
; Program.BoolStringLength()
cmp byte ptr [rcx+8],0
je short M00_L01
mov rax,1DB54800D20
mov rax,[rax]
M00_L00:
mov eax,[rax+8]
ret
M00_L01:
mov rax,1DB54800D18
mov rax,[rax]
jmp short M00_L00
; Total bytes of code 38
No more call System.Boolean.ToString().
dotnet/runtime#61408 made two changes related to inlining. First, it
taught the inliner to better see what methods were being called in an
inlining candidate, and in particular when tiered
compilation is disabled or when a method would bypass tier-0 (such as
a method with loops before OSR existed or with OSR disabled); by
understanding what methods are being called, it can better understand
the cost of the method, e.g. if those method calls are actually
hardware intrinsics with a very low cost. Second, it enabled CSE in
more cases with SIMD vectors.
dotnet/runtime#71778 also impacted inlining, and in particular in
situations where a typeof() could be propagated to the callee (e.g.
via a method argument). In previous releases of .NET, various members
on Type like IsValueType were turned into JIT intrinsics, such that
the JIT could substitute a constant value for calls where it could
compute the answer at compile time. For example, this:
[Benchmark]
public bool IsValueType() => IsValueType<int>();

private static bool IsValueType<T>() => typeof(T).IsValueType;
results in this assembly code on .NET 6:
; Program.IsValueType()
mov eax,1
ret
; Total bytes of code 6
However, change the benchmark slightly:
[Benchmark]
public bool IsValueType() => IsValueType(typeof(int));
private static bool IsValueType(Type t) => t.IsValueType;
and it's no longer as simple:
; Program.IsValueType()
sub rsp,28
mov rcx,offset MT_System.Int32
call CORINFO_HELP_TYPEHANDLE_TO_RUNTIMETYPE
mov rcx,rax
mov rax,[7FFCA47C9560]
cmp [rcx],ecx
add rsp,28
jmp rax
; Total bytes of code 38
Effectively, as part of inlining the JIT loses the notion that the
argument is a constant and fails to propagate it. This PR fixes that,
such that on .NET 7, we now get what we expect:
; Program.IsValueType()
mov eax,1
ret
; Total bytes of code 6
Arm64
A huge amount of effort in .NET 7 went into making code gen for Arm64
as good or better than its x64 counterpart. I've already discussed a
bunch of PRs that are relevant regardless of architecture, and others
that are specific to Arm, but there are plenty more. To rattle off
some of them:
* Addressing modes. "Addressing mode" is the term used to refer to
how the operands of instructions are specified. It could be the
actual value, it could be the address from which a value should
be loaded, it could be the register containing the value, and so
on. Arm supports a "scaled" addressing mode, typically used for
indexing into an array, where the size of each element is
supplied and the instruction "scales" the provided offset by the
specified scale. dotnet/runtime#60808 enables the JIT to utilize
this addressing mode. More generally, dotnet/runtime#70749
enables the JIT to use addressing modes when accessing elements
of managed arrays. dotnet/runtime#66902 improves the use of
addressing modes when the element type is byte. dotnet/runtime#
65468 improves addressing modes used for floating point. And
dotnet/runtime#67490 implements addressing modes for SIMD
vectors, specifically for loads with unscaled indices.
* Better instruction selection. Various techniques go into ensuring
that the best instructions are selected to represent input code.
dotnet/runtime#61037 teaches the JIT how to recognize the pattern
(a * b) + c with integers and fold that into a single madd or
msub instruction (see the sketch after this list), while
dotnet/runtime#66621 does the same for a
- (b * c) and msub. dotnet/runtime#61045 enables the JIT to
recognize certain constant bit shift operations (either explicit
in the code or implicit to various forms of managed array access)
and emit sbfiz/ubfiz instructions. dotnet/runtime#70599, dotnet/
runtime#66407, and dotnet/runtime#65535 all handle various forms
of optimizing a % b. dotnet/runtime#61847 from @SeanWoo removes
an unnecessary movi emitted as part of setting a dereferenced
pointer to a constant value. dotnet/runtime#57926 from
@SingleAccretion enables computing a 64-bit result as the
multiplication of two 32-bit integers to be done with smull/
umull. And dotnet/runtime#61549 folds adds with sign extension or
zero extension into a single add instruction with uxtw/sxtw/lsl,
while dotnet/runtime#62630 drops redundant zero extensions after
a ldr instruction.
* Vectorization. dotnet/runtime#64864 adds new
AdvSimd.LoadPairVector64/AdvSimd.LoadPairVector128 hardware
intrinsics.
* Zeroing. Lots of operations require state to be set to zero, such
as initializing all reference locals in a method to zero as part
of the method's prologue (so that the GC doesn't see and try to
follow garbage references). While such functionality was
previously vectorized, dotnet/runtime#63422 enables this to be
implemented using 128-bit width vector instructions on Arm. And
dotnet/runtime#64481 changes the instruction sequences used for
zeroing in order to avoid unnecessary zeroing, free up additional
registers, and enable the CPU to recognize various instruction
sequences and better optimize.
* Memory Model. dotnet/runtime#62895 enables store barriers to be
used wherever possible instead of full barriers, and uses one-way
barriers for volatile variables. dotnet/runtime#67384 enables
volatile reads/writes to be implemented with the ldapr
instruction, while dotnet/runtime#64354 uses a cheaper
instruction sequence to handle volatile indirections. There's
dotnet/runtime#70600, which enables LSE Atomics to be used for
Interlocked operations; dotnet/runtime#71512, which enables using
the atomics instruction on Unix machines; and dotnet/runtime#
70921, which enables the same but on Windows.
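To make one of those instruction-selection items concrete, here's the shape of code the madd folding mentioned above applies to, in the style of the earlier benchmarks (a minimal sketch; the method is made up for illustration):
[Benchmark]
[Arguments(3, 4, 5)]
public int MultiplyAdd(int a, int b, int c) => a * b + c; // (a * b) + c can fold into a single madd on Arm64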
JIT helpers
While logically part of the runtime, the JIT is actually isolated
from the rest of the runtime, only interacting with it through an
interface that enables communication between the JIT and the rest of
the VM (Virtual Machine). There's a large amount of VM functionality
then that the JIT relies on for good performance.
dotnet/runtime#65738 rewrote various "stubs" to be more efficient.
Stubs are tiny bits of code that serve to perform some check and then
redirect execution somewhere else. For example, when an interface
dispatch call site is expected to only ever be used with a single
implementation of that interface, the JIT might employ a "dispatch
stub" that compares the type of the object against the single one
it's cached, and if they're equal simply jumps to the right target.
You know you're in the corest of the core areas of the runtime when a
PR contains lots of assembly code for every architecture the runtime
targets. And it paid off; there's a virtual group of folks from
around .NET that review performance improvements and regressions in
our automated performance test suites, and attribute these back to
the PRs likely to be the cause (this is mostly automated but requires
some human oversight). It's always nice then when a few days after a
PR is merged and performance information has stabilized that you see
a rash of comments like there were on this PR:
[Image: comments on the GitHub PR reporting performance test suite improvements]
For anyone familiar with generics and interested in performance, you
may have heard the refrain that generic virtual methods are
relatively expensive. They are, comparatively. For example on .NET 6,
this code:
private Example _example = new Example();

[Benchmark(Baseline = true)] public void GenericNonVirtual() => _example.GenericNonVirtual<int>();
[Benchmark] public void GenericVirtual() => _example.GenericVirtual<int>();

class Example
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public void GenericNonVirtual<T>() { }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public virtual void GenericVirtual<T>() { }
}
results in:
Method Mean Ratio
GenericNonVirtual 0.4866 ns 1.00
GenericVirtual 6.4552 ns 13.28
dotnet/runtime#65926 eases the pain a tad. Some of the cost comes
from looking up some cached information in a hash table in the
runtime, and as is the case with many map implementations, this one
involves computing a hash code and using a mod operation to map to
the right bucket. Other hash table implementations around dotnet/
runtime, including Dictionary<,>, HashSet<>, and
ConcurrentDictionary<,> previously switched to a "fastmod"
implementation; this PR does the same for this EEHashtable, which is
used as part of the CORINFO_GENERIC_HANDLE JIT helper function
employed:
Method Runtime Mean Ratio
GenericVirtual .NET 6.0 6.475 ns 1.00
GenericVirtual .NET 7.0 6.119 ns 0.95
Not enough of an improvement for us to start recommending people use
them, but a 5% improvement takes a bit of the edge off the sting.
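For the curious, "fastmod" replaces a comparatively expensive division instruction with multiplications and shifts using a precomputed per-divisor multiplier. A minimal sketch of the technique (roughly the shape used by the managed collections, for 32-bit operands in a 64-bit process):
// Computed once, when the divisor (e.g. a hash table's bucket count) is chosen.
static ulong GetFastModMultiplier(uint divisor) => ulong.MaxValue / divisor + 1;

// Then `value % divisor` becomes multiplications and shifts, avoiding the
// much slower div/idiv instruction on every lookup.
static uint FastMod(uint value, uint divisor, ulong multiplier) =>
    (uint)(((((multiplier * value) >> 32) + 1) * divisor) >> 32);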
Grab Bag
It's near impossible to cover every performance change that goes into
the JIT, and I'm not going to try. But there were so many more PRs, I
couldn't just leave them all unsung, so here's a few more quickies:
* dotnet/runtime#58727 from @benjamin-hodgson. Given an expression
like (byte)x | (byte)y, that can be morphed into (byte)(x | y),
which can optimize away some movs.
[Benchmark]
[Arguments(1, 2)]
public int Test(int x, int y) => (byte)x | (byte)y;
; *** .NET 6 ***
; Program.Test(Int32, Int32)
movzx eax,dl
movzx edx,r8b
or eax,edx
ret
; Total bytes of code 10
; *** .NET 7 ***
; Program.Test(Int32, Int32)
or edx,r8d
movzx eax,dl
ret
; Total bytes of code 7
* dotnet/runtime#67182. On a machine with support for BMI2, 64-bit
shifts can be performed with the shlx, sarx, and shrx
instructions.
[Benchmark]
[Arguments(123, 1)]
public ulong Shift(ulong x, int y) => x << y;
; *** .NET 6 ***
; Program.Shift(UInt64, Int32)
mov ecx,r8d
mov rax,rdx
shl rax,cl
ret
; Total bytes of code 10
; *** .NET 7 ***
; Program.Shift(UInt64, Int32)
shlx rax,rdx,r8
ret
; Total bytes of code 6
* dotnet/runtime#69003 from @SkiFoD. The pattern ~x + 1 can be
changed into a two's-complement negation.
[Benchmark]
[Arguments(42)]
public int Neg(int i) => ~i + 1;
; *** .NET 6 ***
; Program.Neg(Int32)
mov eax,edx
not eax
inc eax
ret
; Total bytes of code 7
; *** .NET 7 ***
; Program.Neg(Int32)
mov eax,edx
neg eax
ret
; Total bytes of code 5
* dotnet/runtime#61412 from @SkiFoD. An expression X & 1 == 1 to
test whether the bottom bit of a number is set can be changed to the
cheaper X & 1 (which isn't actually expressible without a
following != 0 in C#).
[Benchmark]
[Arguments(42)]
public bool BitSet(int x) => (x & 1) == 1;
; *** .NET 6 ***
; Program.BitSet(Int32)
test dl,1
setne al
movzx eax,al
ret
; Total bytes of code 10
; *** .NET 7 ***
; Program.BitSet(Int32)
mov eax,edx
and eax,1
ret
; Total bytes of code 6
* dotnet/runtime#63545 from @Wraith2. The expression x & (x - 1)
can be lowered to the blsr instruction.
[Benchmark]
[Arguments(42)]
public int ResetLowestSetBit(int x) => x & (x - 1);
; *** .NET 6 ***
; Program.ResetLowestSetBit(Int32)
lea eax,[rdx+0FFFF]
and eax,edx
ret
; Total bytes of code 6
; *** .NET 7 ***
; Program.ResetLowestSetBit(Int32)
blsr eax,edx
ret
; Total bytes of code 6
* dotnet/runtime#62394. The handling of / and % by a vector's .Count
wasn't recognizing that Count can be treated as unsigned, and doing
so leads to better code gen.
[Benchmark]
[Arguments(42u)]
public long DivideByVectorCount(uint i) => i / Vector<byte>.Count;
; *** .NET 6 ***
; Program.DivideByVectorCount(UInt32)
mov eax,edx
mov rdx,rax
sar rdx,3F
and rdx,1F
add rax,rdx
sar rax,5
ret
; Total bytes of code 21
; *** .NET 7 ***
; Program.DivideByVectorCount(UInt32)
mov eax,edx
shr rax,5
ret
; Total bytes of code 7
* dotnet/runtime#60787. Loop alignment in .NET 6 provides a very
nice exploration of why and how the JIT handles loop alignment.
This PR extends that further by trying to "hide" an emitted align
instruction behind an unconditional jmp that might already exist,
in order to minimize the impact of the processor having to fetch
and decode nops.
GC
"Regions" is a feature of the garbage collector (GC) that's been in
the works for multiple years. It's enabled by default in 64-bit
processes in .NET 7 as of dotnet/runtime#64688, but as with other
multi-year features, a multitude of PRs went into making it a
reality. At a 30,000 foot level, "regions" replaces the current
"segments" approach to managing memory on the GC heap; rather than
having a few gigantic segments of memory (e.g. each 1GB), often
associated 1:1 with a generation, the GC instead maintains many, many
smaller regions (e.g. each 4MB) as their own entity. This enables the
GC to be more agile with regards to operations like repurposing
regions of memory from one generation to another. For more
information on regions, the blog post Put a DPAD on that GC! from the
primary developer on the GC is still the best resource.
Native AOT
To many people, the word "performance" in the context of software is
about throughput. How fast does something execute? How much data per
second can it process? How many requests per second can it process?
And so on. But there are many other facets to performance. How much
memory does it consume? How fast does it start up and get to the
point of doing something useful? How much space does it consume on
disk? How long would it take to download? And then there are related
concerns. In order to achieve these goals, what dependencies are
required? What kinds of operations does it need to perform to achieve
these goals, and are all of those operations permitted in the target
environment? If any of this paragraph resonates with you, you are the
target audience for the Native AOT support now shipping in .NET 7.
.NET has long had support for AOT code generation. For example, .NET
Framework had it in the form of ngen, and .NET Core has it in the
form of crossgen. Both of those solutions involve a standard .NET
executable that has some of its IL already compiled to assembly code,
but not all methods will have assembly code generated for them,
various things can invalidate the assembly code that was generated,
external .NET assemblies without any native assembly code can be
loaded, and so on, and in all of those cases, the runtime continues
to utilize a JIT compiler. Native AOT is different. It's an evolution
of CoreRT, which itself was an evolution of .NET Native, and it's
entirely free of a JIT. The binary that results from publishing a
build is a completely standalone executable in the target platform's
platform-specific file format (e.g. COFF on Windows, ELF on Linux,
Mach-O on macOS) with no external dependencies other than ones
standard to that platform (e.g. libc). And it's entirely native: no
IL in sight, no JIT, no nothing. All required code is compiled and/or
linked in to the executable, including the same GC that's used with
standard .NET apps and services, and a minimal runtime that provides
services around threading and the like. All of that brings great
benefits: super fast startup time, small and entirely-self contained
deployment, and ability to run in places JIT compilers aren't allowed
(e.g. because memory pages that were writable can't then be
executable). It also brings limitations: no JIT means no dynamic
loading of arbitrary assemblies (e.g. Assembly.LoadFile) and no
reflection emit (e.g. DynamicMethod), everything compiled and linked
in to the app means the more functionality that's used (or might be
used) the larger is your deployment, etc. Even with those
limitations, for a certain class of application, Native AOT is an
incredibly exciting and welcome addition to .NET 7.
Too many PRs to mention have gone into bringing up the Native AOT
stack, in part because it's been in the works for years (as part of
the archived dotnet/corert project and then as part of dotnet/
runtimelab/feature/NativeAOT) and in part because there have been
over a hundred PRs just in dotnet/runtime that have gone into
bringing Native AOT up to a shippable state since the code was
originally brought over from dotnet/runtimelab in dotnet/runtime#
62563 and dotnet/runtime#62611. Between that and there not being a
previous version to compare its performance to, instead of focusing
PR by PR on improvements, let's just look at how to use it and the
benefits it brings.
Today, Native AOT is focused on console applications, so let's create
a console app:
dotnet new console -o nativeaotexample
We now have our nativeaotexample directory containing a
nativeaotexample.csproj and a "hello, world" Program.cs. To enable
publishing the application with Native AOT, edit the .csproj to
include this in the existing <PropertyGroup>...</PropertyGroup>:
<PublishAot>true</PublishAot>
And then... actually, that's it. Our app is now fully configured to be
able to target Native AOT. All that's left is to publish. As I'm
currently writing this on my Windows x64 machine, I'll target that:
dotnet publish -r win-x64 -c Release
I now have my generated executable in the output publish directory:
Directory: C:\nativeaotexample\bin\Release\net7.0\win-x64\publish
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 8/27/2022 6:18 PM 3648512 nativeaotexample.exe
-a--- 8/27/2022 6:18 PM 14290944 nativeaotexample.pdb
That ~3.5MB .exe is the executable, and the .pdb next to it is debug
information, which needn't actually be deployed with the app. I can
now copy that nativeaotexample.exe to any 64-bit Windows machine,
regardless of what .NET may or may not be installed anywhere on the
box, and my app will run. Now, if what you really care about is size,
and 3.5MB is too big for you, you can start making more tradeoffs.
There are a bunch of switches you can pass to the Native AOT compiler
(ILC) and to the trimmer that impact what code gets included in the
resulting image. Let me turn the dial up a bit:
<PublishAot>true</PublishAot>
<InvariantGlobalization>true</InvariantGlobalization>
<UseSystemResourceKeys>true</UseSystemResourceKeys>
<IlcOptimizationPreference>Size</IlcOptimizationPreference>
<IlcGenerateStackTraceData>false</IlcGenerateStackTraceData>
<DebuggerSupport>false</DebuggerSupport>
<EnableUnsafeBinaryFormatterSerialization>false</EnableUnsafeBinaryFormatterSerialization>
<EnableUnsafeUTF7Encoding>false</EnableUnsafeUTF7Encoding>
<EventSourceSupport>false</EventSourceSupport>
<HttpActivityPropagationSupport>false</HttpActivityPropagationSupport>
I republish, and now I have:
Directory: C:\nativeaotexample\bin\Release\net7.0\win-x64\publish
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a--- 8/27/2022 6:19 PM 2061824 nativeaotexample.exe
-a--- 8/27/2022 6:19 PM 14290944 nativeaotexample.pdb
so ~2MB instead of ~3.5MB. Of course, for that significant reduction
I've given up some things:
* Setting InvariantGlobalization to true means I'm now not
respecting culture information and am instead using a set of
invariant data for most globalization operations.
* Setting UseSystemResourceKeys to true means nice exception
messages are stripped away.
* Setting IlcGenerateStackTraceData to false means I'm going to get
fairly poor stack traces should I need to debug an exception.
* Setting DebuggerSupport to false... good luck debugging things.
* ... you get the idea.
One of the potentially mind-boggling aspects of Native AOT for a
developer used to .NET is that, as it says on the tin, it really is
native. After publishing the app, there is no IL involved, and
there's no JIT that could even process it. This makes some of the
other investments in .NET 7 all the more valuable, for example
everywhere investments are happening in source generators. Code that
previously relied on reflection emit for good performance will need
another scheme. We can see that, for example, with Regex.
Historically for optimal throughput with Regex, it's been recommended
to use RegexOptions.Compiled, which uses reflection emit at run-time
to generate an optimized implementation of the specified pattern. But
if you look at the implementation of the Regex constructor, you'll
find this nugget:
if (RuntimeFeature.IsDynamicCodeCompiled)
{
    factory = Compile(pattern, tree, options, matchTimeout != InfiniteMatchTimeout);
}
With the JIT, IsDynamicCodeCompiled is true. But with Native AOT,
it's false. Thus, with Native AOT and Regex, there's no difference
between specifying RegexOptions.Compiled and not, and another
mechanism is required to get the throughput benefits promised by
RegexOptions.Compiled. Enter [GeneratedRegex(...)], which, along with
the new regex source generator shipping in the .NET 7 SDK, emits C#
code into the assembly using it. That C# code takes the place of the
reflection emit that would have happened at run-time, and is thus
able to work successfully with Native AOT.
private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;
private Regex _interpreter = new Regex(@"^.*elementary.*$", RegexOptions.Multiline);
private Regex _compiled = new Regex(@"^.*elementary.*$", RegexOptions.Compiled | RegexOptions.Multiline);
[GeneratedRegex(@"^.*elementary.*$", RegexOptions.Multiline)]
private partial Regex SG();
[Benchmark(Baseline = true)] public int Interpreter() => _interpreter.Count(s_haystack);
[Benchmark] public int Compiled() => _compiled.Count(s_haystack);
[Benchmark] public int SourceGenerator() => SG().Count(s_haystack);
Method Mean Ratio
Interpreter 9,036.7 us 1.00
Compiled 9,064.8 us 1.00
SourceGenerator 426.1 us 0.05
So, yes, there are some constraints associated with Native AOT, but
there are also solutions for working with those constraints. And
further, those constraints can actually bring further benefits.
Consider dotnet/runtime#64497. Remember how we talked about "guarded
devirtualization" in dynamic PGO, where via instrumentation the JIT
can determine the most likely type to be used at a given call site
and special-case it? With Native AOT, the entirety of the program is
known at compile time, with no support for Assembly.LoadFrom or the
like. That means at compile time, the compiler can do whole-program
analysis to determine what types implement what interfaces. If a
given interface only has a single type that implements it, then every
call site through that interface can be unconditionally
devirtualized, without any type-check guards.
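For example, given a shape like the following (a hypothetical sketch):
public interface IAnimal
{
    string Speak();
}

public sealed class Dog : IAnimal
{
    public string Speak() => "Woof";
}

// If whole-program analysis proves Dog is the only IAnimal implementation
// linked into the app, this interface call can be devirtualized into a
// direct call (and then potentially inlined), with no type-check guard.
public static string MakeSound(IAnimal animal) => animal.Speak();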
This is a really exciting space, one we expect to see flourish in
coming releases.
Mono
Up until now I've referred to "the JIT," "the GC," and "the runtime,"
but in reality there are actually multiple runtimes in .NET. I've
been talking about "coreclr," which is the runtime that's recommended
for use on Linux, macOS, and Windows. However, there's also "mono,"
which powers Blazor wasm applications, Android apps, and iOS apps.
It's also seen significant improvements in .NET 7.
Just as with coreclr (which can JIT compile, AOT compile partially
with JIT fallback, and fully Native AOT compile), mono has multiple
ways of actually executing code. One of those ways is an interpreter,
which enables mono to execute .NET code in environments that don't
permit JIT'ing and without requiring ahead-of-time compilation or
incurring any limitations it may bring. Interestingly, though, the
interpreter is itself almost a full-fledged compiler, parsing the IL,
generating its own intermediate representation (IR) for it, and doing
one or more optimization passes over that IR; it's just that at the
end of the pipeline when a compiler would normally emit code, the
interpreter instead saves off that data for it to interpret when the
time comes to run. As such, the interpreter has a very similar
conundrum to the one we discussed with coreclr's JIT: the time it
takes to optimize vs the desire to start up quickly. And in .NET 7,
the interpreter employs a similar solution: tiered compilation.
dotnet/runtime#68823 adds the ability for the interpreter to
initially compile with minimal optimization of that IR, and then once
a certain threshold of call counts has been hit, then take the time
to do as much optimization on the IR as possible for all future
invocations of that method. This yields the same benefits as it does
for coreclr: improved startup time while also having efficient
sustained throughput. When this merged, we saw Blazor wasm app
startup times improve by 10-20%. Here's one example from an app being
tracked in our benchmarking system:
[Chart: Time to first UI (ms)]
The interpreter isn't just used for entire apps, though. Just as how
coreclr can use the JIT when an R2R image doesn't contain code for a
method, mono can use the interpreter when there's no AOT code for a
method. One such case that occurred on mono was with generic
delegate invocation, where the presence of a generic delegate being
invoked would trigger falling back to the interpreter; for .NET 7,
that gap was addressed with dotnet/runtime#70653. A more impactful
case, however, is dotnet/runtime#64867. Previously, any methods with
catch or filter exception handling clauses couldn't be AOT compiled
and would fall back to being interpreted. With this PR, the method is
now able to be AOT compiled, and it only falls back to using the
interpreter when an exception actually occurs, switching over to the
interpreter for the remainder of that method call's execution. Since
many methods contain such clauses, this can make a big difference in
throughput and CPU consumption. In the same vein, dotnet/runtime#
63065 enabled methods with finally exception handling clauses to be
AOT compiled; just the finally block gets interpreted rather than the
entire method being interpreted.
Beyond such backend improvements, another class of improvement came
from further unification between coreclr and mono. Years ago, coreclr
and mono had their own entire library stack built on top of them.
Over time, as .NET was open sourced, portions of mono's stack got
replaced by shared components, bit by bit. Fast forward to today, all
of the core .NET libraries above System.Private.CoreLib are the same
regardless of which runtime is being employed. In fact, the source
for CoreLib itself is almost entirely shared, with ~95% of the source
files being compiled into the CoreLib that's built for each runtime,
and just a few percent of the source specialized for each (which
means that the vast majority of the performance improvements
discussed in the rest of this post apply equally whether running on
mono or coreclr). Even so, every release now we try to
chip away at that few remaining percent, for reasons of
maintainability, but also because the source used for coreclr's
CoreLib has generally had more attention paid to it from a
performance perspective. dotnet/runtime#71325, for example, moves
mono's generic sorting utility class for arrays and spans over to the
more efficient implementation used by coreclr.
One of the biggest categories of improvements, however, is in
vectorization. This comes in two pieces. First, Vector<T> and
Vector128<T> are now fully accelerated on both x64 and Arm64, thanks
to PRs like dotnet/runtime#64961, dotnet/runtime#65086, dotnet/
runtime#65128, dotnet/runtime#66317, dotnet/runtime#66391, dotnet/
runtime#66409, dotnet/runtime#66512, dotnet/runtime#66586, dotnet/
runtime#66589, dotnet/runtime#66597, dotnet/runtime#66476, and dotnet
/runtime#67125; that significant amount of work means all that code
that gets vectorized using these abstractions will light-up on mono
and coreclr alike. Second, thanks primarily to dotnet/runtime#70086,
mono now knows how to translate Vector128<T> operations to WASM's
SIMD instruction set, such that code vectorized with Vector128<T>
will also be accelerated when running in Blazor wasm applications and
anywhere else WASM might be executed.
Reflection
Reflection is one of those areas you either love or hate (I find it a
bit humorous to be writing this section immediately after writing the
Native AOT section). It's immensely powerful, providing the ability
to query all of the metadata for code in your process and for
arbitrary assemblies you might encounter, to invoke arbitrary
functionality dynamically, and even to emit dynamically-generated IL
at run-time. It's also difficult to handle well in the face of
tooling like a linker or a solution like Native AOT that needs to be
able to determine at build time exactly what code will be executed,
and it's generally quite expensive at run-time; thus it's both
something we strive to avoid when possible but also invest in
reducing the costs of, as it's so popular in so many different kinds
of applications because it is incredibly useful. As with most
releases, it's seen some nice improvements in .NET 7.
One of the most impacted areas is reflection invoke. Available via
MethodBase.Invoke, this functionality lets you take a MethodBase
(e.g. MethodInfo) object that represents some method for which the
caller previously queried, and call it, with arbitrary arguments that
the runtime needs to marshal through to the callee, and with an
arbitrary return value that needs to be marshaled back. If you know
the signature of the method ahead of time, the best way to optimize
invocation speed is to create a delegate from the MethodBase via
CreateDelegate and then use that delegate for all future
invocations. But in some circumstances, you don't know the signature
at compile time, and thus can't easily rely on delegates with known
matching signatures. To address this, some libraries have taken to
using reflection emit to generate code at run-time specific to the
target method. This is extremely complicated and it's not something
we want apps to have to do. Instead, in .NET 7 via dotnet/runtime#
66357, dotnet/runtime#69575, and dotnet/runtime#74614, Invoke will
itself use reflection emit (in the form of DynamicMethod) to generate
a delegate that is customized for invoking the target, and then
future invocation via that MethodInfo will utilize that generated
method. This gives developers most of the performance benefits of a
custom reflection emit-based implementation but without having the
complexity or challenges of such an implementation in their own code
base.
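(For the known-signature case mentioned earlier, a minimal sketch of the CreateDelegate approach looks like the following; MyMethod is just a stand-in.)
using System.Reflection;

MethodInfo mi = typeof(Program).GetMethod(
    "MyMethod", BindingFlags.NonPublic | BindingFlags.Static)!;

// Pay the reflection cost once to build a strongly-typed delegate...
Action invoke = mi.CreateDelegate<Action>();

// ...then every subsequent call is just a delegate invocation, with no
// per-call argument marshaling or signature checks.
invoke();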
private MethodInfo _method;
[GlobalSetup]
public void Setup() => _method = typeof(Program).GetMethod("MyMethod", BindingFlags.NonPublic | BindingFlags.Static);
[Benchmark]
public void MethodInfoInvoke() => _method.Invoke(null, null);
private static void MyMethod() { }
Method Runtime Mean Ratio
MethodInfoInvoke .NET 6.0 43.846 ns 1.00
MethodInfoInvoke .NET 7.0 8.078 ns 0.18
Reflection also involves lots of manipulation of objects that
represent types, methods, properties, and so on, and tweaks here and
there can add up to a measurable difference when using these APIs.
For example, I've talked in past performance posts about how,
potentially counterintuitively, one of the ways we've achieved
performance boosts is by porting native code from the runtime back
into managed C#. There are a variety of ways in which doing so can
help performance, but one is that there is some overhead associated
with calling from managed code into the runtime, and eliminating such
hops avoids that overhead. This can be seen in full effect in dotnet/
runtime#71873, which moves several of these "FCalls" related to Type,
RuntimeType (the Type-derived class used by the runtime to represent
its types), and Enum out of native into managed.
[Benchmark]
public Type GetUnderlyingType() => Enum.GetUnderlyingType(typeof(DayOfWeek));
Method Runtime Mean Ratio
GetUnderlyingType .NET 6.0 27.413 ns 1.00
GetUnderlyingType .NET 7.0 5.115 ns 0.19
Another example of this phenomenon comes in dotnet/runtime#62866,
which moved much of the underlying support for AssemblyName out of
native runtime code into managed code in CoreLib. That in turn has an
impact on anything that uses it, such as when using
Activator.CreateInstance overloads that take assembly names that need
to be parsed.
private readonly string _assemblyName = typeof(MyClass).Assembly.FullName;
private readonly string _typeName = typeof(MyClass).FullName;
public class MyClass { }
[Benchmark]
public object CreateInstance() => Activator.CreateInstance(_assemblyName, _typeName);
Method Runtime Mean Ratio
CreateInstance .NET 6.0 3.827 us 1.00
CreateInstance .NET 7.0 2.276 us 0.60
Other changes contributed to Activator.CreateInstance improvements as
well. dotnet/runtime#67148 removed several array and list allocations
from inside of the RuntimeType.CreateInstanceImpl method that's used
by CreateInstance (using Type.EmptyTypes instead of allocating a new
Type[0], avoiding unnecessarily turning a builder into an array,
etc.), resulting in less allocation and faster throughput.
[Benchmark]
public void CreateInstance() => Activator.CreateInstance(typeof(MyClass), BindingFlags.NonPublic | BindingFlags.Instance, null, Array.Empty