Optimising Entity Framework for fun and profit

June 9, 2020

Published by Rick Bolton

Ye cannae break the laws of physics

The core of any modern line-of-business application is a collection of behaviours and operational data and - despite recent trends toward document databases to contend with new problems of massive horizontal scaling - the traditional relational database with a SQL interface resoundingly continues to be the system of choice for storing and managing that data. The shift to cloud infrastructure and language improvements such as async/await have pushed us towards easy parallelism and horizontal scaling in the application domain, but this push also means that our systems are spending less and less time running business logic and instead pushing more load downstream onto the database and the network link. The database we can scale up, and possibly shard and replicate out, to deal with the increased workload, but the network is subject to the laws of physics: there is a limit to how much data we can pass through it, and how long we can expect to wait for a response. This manifests as network latency, so it is in our interests to minimise both the number of database calls we make from the application and the amount of data we pass over the wire.

As .NET developers we're likely to be using Entity Framework as our ORM, and that's what I'm going to focus on in this article. Entity Framework does a lot for us, layering both the Repository and Unit of Work patterns on top of the core ADO.NET libraries. This level of abstraction aids rapid development by insulating us from the concerns of writing SQL and mapping data tables into in-memory objects. The cost of this broad-stroke approach, though, is that we cede control of the queries we actually run against the database, and they are frequently not going to be as efficient as hand-crafted queries (incidentally, if you're not familiar with SQL as a developer, I strongly recommend putting aside some time to learn it). EF is very much a trade-off of development time against operational efficiency, but in a complex system managing hundreds or thousands of hand-written queries may prove unpalatable, so we don't necessarily want to switch to a so-called micro ORM like Dapper or, god forbid, go back to working directly with ADO.NET. Fortunately, there's quite a lot we can do with Entity Framework to slim down our queries and hopefully strike a good balance between ease of development and performance.

Querying entities

Let's begin with probably the most basic operation - loading a single entity from a table. Say we have a Holiday table and a DbContext with a DbSet<Holiday> that maps to it. Loading the entity into memory is as simple as running a find against the primary key:

var holiday = _context.Holidays.Find(12345);

The generated query will look something like a SELECT of every column in the table, filtered by the key column in the WHERE clause. This is, of course, perfectly fine if we're using most of the entity, and actually pretty useful if we're going to modify it, as we want EF to correctly identify what has changed when crafting UPDATE statements. If our table grows wide, the amount of data we retrieve here will obviously increase too. There are scenarios where we could intentionally not map the full set of columns into our application entity, though caution should be taken with this approach, as EF will fail on insert operations where it is not aware of all the non-nullable columns it needs to set. Most of the time, we're going to have a one-to-one mapping between the table columns and entity properties.

Let's say the Holiday table has a many-to-one relationship to an Employee table, and we want to retrieve the employee's name for display in our UI. Our Holiday entity has an Employee navigation property and the Employee a corresponding ICollection<Holiday>, so by loading the holiday entity we've actually hooked into an object graph. If we access the Employee property getter before the corresponding table row has been loaded into memory, EF has to generate and send another query, and the getter then returns the newly loaded entity. Note that this second database call was not explicit in our code - the property getter lazy loads the Employee entity as a black-box behaviour, and each subsequent access (over the lifetime of the DbContext) returns the same in-memory instance.
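
To make that concrete, here's a minimal sketch of where the hidden query fires, assuming the Holiday and Employee model described above:

var holiday = _context.Holidays.Find(12345); // query 1: loads the holiday row
var name = holiday.Employee.Name;            // query 2: the getter lazy loads the employee row
var again = holiday.Employee.Name;           // no query: the same in-memory instance is returned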

Expanding on this, it's likely we'll want to display more than one holiday at a time in a list view, so let's start by loading all of them:

var holidays = _context.Holidays.ToList();

This is just the previous SQL statement without the filter, so we're now loading all of the columns for more than one row. Again, let's say we want to display the name of the associated employee for each item. We can iterate through the holidays and interrogate the Employee property of each to extract the name. This works as before, and there's actually a hidden efficiency: any time we come across a holiday belonging to an Employee we have already loaded, the DbContext knows that Employee is in memory and links up the navigation property to the same instance. The problem is that the number of database requests now scales with the number of holiday records we load, so as our database grows, the performance of this code is only going to get worse. This is the "N+1 Problem" - loading all the data we want takes the initial (1) load, plus another N loads (where N is the number of rows loaded).
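
In code, the trap looks innocuous (a sketch using the same assumed model):

var holidays = _context.Holidays.ToList(); // 1 query for the holidays

foreach (var holiday in holidays)
{
    // The first encounter with each employee lazy loads that row: up to N further queries
    Console.WriteLine(holiday.Employee.Name);
}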

Solutions to the N+1 problem

The N+1 Problem is a very well-known concern in application development, and people have developed a number of ways to counter it. There is no silver bullet that is the best solution in every situation, so it pays to be aware of our options and weigh them up case by case. If we boil the problem down to its essence, we want to load the set of Employee records associated with our Holiday data set - so how about we explicitly retrieve all of those employees in a second query? We want to move away from loading entire tables, so let's try loading just the employee entities we require:

var employeeIds = holidays.Select(x => x.EmployeeId).Distinct();
var employees = _context.Employees.Where(x => employeeIds.Contains(x.Id)).ToList();

"Contains" here is an extension method on IEnumerable<T> that Linq-to-Entities is aware of, and EF knows to map this filter lambda to something more like this:

FROM Employee
WHERE Id IN (4, 7, 9)

Now that we have an employee set in memory, it's trivial to put the names we want into the display. We could simply search this in-memory set by employee id as we need each one, but it's interesting to note that each of these entities is now loaded into and managed by the DbContext. If we iterate through our holiday collection now and access each Employee property, the lazy loading logic will check the context, find that each Employee is already loaded, and so we don't need to change our display code from before at all. If we're no longer using the employees variable to access the loaded set, we can even remove it entirely, apparently throwing the result of the query away. Of course, we are relying on implicit behaviour here, so our happiness about this may vary with our stomach for that sort of thing.
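
If we'd rather not lean on that implicit fix-up, a small sketch of the explicit lookup approach (using the same assumed model):

// Index the loaded employees so name lookups never touch the database
var employeesById = employees.ToDictionary(e => e.Id);

foreach (var holiday in holidays)
{
    var name = employeesById[holiday.EmployeeId].Name; // pure in-memory lookup
}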

Using joins

If you were retrieving this data by hand for an ad-hoc report, you'd most likely go straight to a SQL join - so if it works there, how about we try that in our application too? Entity Framework can indeed be coerced into using joins - we use the .Include extension method of the IQueryable<T> interface implemented by the data sets:

var holidays = _context.Holidays.Include(x => x.Employee).ToList();

This will load the holidays with a LEFT JOIN on the Employee table, so now we'll get each Holiday row together with its associated Employee row. Any RDBMS will handle this operation easily as long as the foreign key is properly indexed, though we may now be receiving a lot of duplicate data over the wire due to the repeated Employees. The good news is that we're now fetching all of the data we want in a single request-response, which means we no longer have to query the Holidays, wait, query the Employees, wait... The cost of this denormalisation falls not really on either the database or application side, but on the network bandwidth consumed by the duplicate data. The cost of a kilobyte in memory is negligible compared to the cost of a kilobyte over the wire.

In terms of the DbContext object graph, this behaves much the same as before. We have the collection of Holidays, and we can happily navigate into the Employee properties without triggering any further hidden network operations. In addition, the duplicate employee data is simply discarded - no more than one Employee entity will be generated for each primary key loaded, and all associated properties will dereference to that single instance.

Lazy loading and the object graph

Note that, even though we have loaded a partial object graph, further use of navigation properties to entities that are not in memory will still trigger additional lazy loading. In our example so far, the back reference from an Employee to its Holidays will trigger a load, because the DbContext cannot be sure it has the full set of associated holidays without a query. A one-to-one relationship is simpler - both sides of the relationship are known and hold a reference to each other, once loaded. It's because of all this implicit behaviour and reliance on an underlying open database connection that it is not generally considered good practice to pass ORM entities to the presentation layer of an application - once you have one entity, you can begin pulling on the thread and navigate pretty much anywhere inside the DbContext. There is no guarantee either that your entities will not outlive their database connection, and you may find unexpected exceptions bubbling out of your view logic. In fact, entities don't play nicely with serialization code at all, which may get stuck in loops following the two-way navigation properties.

If you really want to be sure you're not triggering unseen lazy loading, you can turn the feature off entirely by adding a line to your DbContext constructor:

this.Configuration.LazyLoadingEnabled = false;
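
In full, that might look something like this (HrContext is a hypothetical context for our example model):

public class HrContext : DbContext
{
    public HrContext()
    {
        // No navigation property will ever trigger a hidden query on this context;
        // anything not eagerly loaded will simply be left null
        this.Configuration.LazyLoadingEnabled = false;
    }

    public DbSet<Holiday> Holidays { get; set; }
    public DbSet<Employee> Employees { get; set; }
}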

If this is too much of a blunt-instrument approach for you, it is also possible to turn off lazy loading on an individual navigation property basis by removing the virtual keyword. The virtual keyword in this case is what allows EF to override the property's behaviour using IL magic - the materialisation behaviour is added in a generated subclass of your POCO entity at runtime. Non-virtual navigation properties can still be set by using the Include statement and eager loading.
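
For illustration, a minimal pair of entities matching the model we've been assuming:

public class Holiday
{
    public int Id { get; set; }
    public DateTime Date { get; set; }
    public int EmployeeId { get; set; }

    // virtual: EF's generated proxy subclass can override this getter to lazy load
    public virtual Employee Employee { get; set; }
}

public class Employee
{
    public int Id { get; set; }
    public string Name { get; set; }

    // Remove the virtual keyword and this collection will only ever be populated
    // by eager loading (Include) or an explicit query
    public virtual ICollection<Holiday> Holidays { get; set; }
}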

Entity tracking

So we've seen how the DbContext keeps a working object graph in memory, and how each successive query populates it further. This graph lives until the DbContext - which acts in EF as a unit of work - is disposed. Having a working graph of entities is important for write operations, as it allows us to perform transformations on them before synchronising the changes back to the database via a call to SaveChanges. In the more traditional database-first modes of EF, entities were generated code that kept track of every property change, so the synchronisation process could easily generate UPDATEs to send back to the database. For POCO approaches this is not possible, so the initial state of each entity must instead also be kept in memory for comparison. Because of this, the memory overhead of a large list query may actually be quite a lot more than just having the entities in memory.
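
As a brief sketch of that write path (same assumed model):

var holiday = _context.Holidays.Find(12345); // loaded and tracked, initial state snapshotted
holiday.Date = holiday.Date.AddDays(1);      // change detected by comparison against the snapshot
_context.SaveChanges();                      // an UPDATE is generated for the modified column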

Fortunately, we know that when we run a large multi-row query, we are nearly always retrieving data for a list view, and will therefore not be making any changes to synchronise. As we may be performing many other read operations in parallel, all this tracking could become a performance bottleneck. In this case, we can disable tracking for the query:

var holidays = _context.Holidays.Include(x => x.Employee).AsNoTracking().ToList();

This will not impact our performance over the wire - it may even cause further lazy loading to occur if we're not careful, as our context is no longer keeping an in-memory set of entities to draw from - but it may help our memory profile. We can still safely reference our Employee property from each Holiday though, as we have already joined that in the include. In more complex situations it may be sensible to query the various required tables separately into variables and join them in process instead as required. This approach can be quite effective in read-write separation patterns such as CQRS, where we organize our code into operations that are explicitly either read-write or read-only.

Projection

There's another aspect of optimisation that we can explore when we have operations that are explicitly read-only. So far we've looked at reducing the volume of queries, and at using joins to load related data together at the cost of some duplication. The metric we haven't really considered yet is the width of the data set we load. If we're writing a read-only query and we already don't need entity tracking, but we're still loading entire rows of each database entity, do we really need to be doing that? Without even considering how much of the Holiday entity is actually relevant to our list display, our simple use case clearly does not need to load the entire Employee row just to retrieve a name. An Employee record may well contain all sorts of information we don't care about for this operation, so can we avoid pulling it back at all?

In Entity Framework, we can indeed select more limited data sets using projection. We can entirely forgo the entity system by selecting into an anonymous type:

var query = _context.Holidays.Select(x => new { Id = x.Id, Date = x.Date });

We're not loading or tracking entities here, but still utilising the data mappings to query our tables. We can also run simple aggregation queries like this, for instance counting the total rows:

var count = _context.Holidays.Count();

Obviously, loading the entire table as entities to do this could be brutally punishing on both our memory and network bandwidth usage!
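
Aggregations compose with projection too; for instance, a sketch of counting holidays per employee without loading a single row into memory:

var counts = _context.Holidays
                     .GroupBy(x => x.EmployeeId)
                     .Select(g => new { EmployeeId = g.Key, Total = g.Count() })
                     .ToList();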

Our first example still hasn't loaded the holidays with the associated employee name, so let's rectify that. The syntax is a little more complicated as we can't use .Include() here. Instead we join using the SelectMany extension method, passing a result selector so that both the employee and the holiday are in scope for the projection:

var query = _context.Employees
                    .SelectMany(employee => employee.Holidays,
                                (employee, holiday) => new
                                {
                                    Id = holiday.Id,
                                    Date = holiday.Date,
                                    Name = employee.Name
                                });
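
Since navigation properties can also be used inside projections - EF translates the property access into a join rather than lazy loading it - the same shape can arguably be written more directly from the Holidays set:

var query = _context.Holidays.Select(holiday => new
{
    Id = holiday.Id,
    Date = holiday.Date,
    Name = holiday.Employee.Name
});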

So now we're really leveraging our asymmetrical architecture to create potentially significantly slimmed-down read sets. Of course, if you're writing simple CRUD operations, you may find that retrieving a single entity is still the way to go, but it's definitely worth considering whether you really need everything you're pulling back.

Wrapping up

We've considered the cost of our database access operations and, accepting that our system will often want to work with up-to-date information, run through a few of Entity Framework's built-in features to try and load data as efficiently as possible. Reducing the footprint of our data access is definitely a laudable goal, but we have not yet really begun to consider whether we should be loading all this data every time in the first place. For instance, it is likely not the end of the world if a system user has changed the name of an employee and this is not reflected in the display for other users for a few minutes. However, keeping a cache of data where we've identified that less fresh data can be used is a complex subject in itself, and beyond the scope of this article.

As a final word, I'd caution that none of these techniques is necessarily going to be the best approach in all situations, and it may be a matter of trial and error with profiling to truly understand the performance characteristics of your queries. Though it's often said that premature optimisation is the root of much evil, if you are more mindful in your approach to data access from the get-go, you will hopefully be able to wring more performance out of your infrastructure and avoid the truly dire situation where the entire data access layer is problematic and requires a rewrite. Even then, you're likely to find complex query situations that are inherently expensive, such as search pages and calendar views combining multiple sets of data - in these cases, it may well be time to let go of optimising the query, step outside single-source-of-truth database paradigms entirely, and look to caching and denormalised read models.
