2019-10-04

Entity Framework 6, SQL, and nullable strings

I ran into an issue that appears to be caused by Microsoft attempting to protect me from myself. Although, truth be told, it wouldn't have been an issue if things were a little better designed.

Imagine, if you will, a SQL Server database with a table of transactions. One of the fields on this table is a CorrelationId. It's a text field that is populated by a different system to tie transactions together (for example, two sides of a transfer from one customer to another). This field always gets populated on new transactions; an uncorrelated transaction is simply the only one with its CorrelationId. However, this system is not new; it was converted to replace an older system that did not have a defined CorrelationId. So, although the five million or so transactions created by this system have a CorrelationId, there are 12 million "legacy" records that have a CorrelationId of NULL.

So, say, for a given transaction, you want to find all correlated transactions. In SQL Server, you might use a simple query like this:

SELECT *
FROM dbo.TransactionTable
WHERE CorrelationId =
(SELECT CorrelationId FROM dbo.TransactionTable WHERE Id = @TransactionId)

And this would work, for the most part (except for legacy records, since SQL will fail to match on the NULL value, but we can ignore this for now). If you took this query into SQL Server Management Studio and looked at the execution plan, you'd see a nice thin line from the index seek on the CorrelationId, showing that it found and processed a tiny number of matching records, resulting in a very quick response.
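To see the NULL behavior in isolation, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for SQL Server (an assumption made so the example runs anywhere), and the five rows are made up. It says nothing about execution plans; it only shows which rows the plain equality query returns.

```python
import sqlite3

# SQLite stands in for SQL Server here (assumption); both treat
# "NULL = NULL" as unknown, so such rows are not matched.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TransactionTable (Id INTEGER PRIMARY KEY, CorrelationId TEXT)"
)
conn.executemany(
    "INSERT INTO TransactionTable (Id, CorrelationId) VALUES (?, ?)",
    # Ids 1 and 2 play the role of legacy rows with a NULL CorrelationId.
    [(1, None), (2, None), (3, "abc"), (4, "abc"), (5, "xyz")],
)

QUERY = """
SELECT Id
FROM TransactionTable
WHERE CorrelationId =
    (SELECT CorrelationId FROM TransactionTable WHERE Id = ?)
ORDER BY Id
"""


def correlated_ids(transaction_id):
    """Return the Ids sharing a CorrelationId with the given transaction."""
    return [row[0] for row in conn.execute(QUERY, (transaction_id,))]


correlated = correlated_ids(3)  # transaction with a CorrelationId
legacy = correlated_ids(1)      # legacy transaction, CorrelationId is NULL

print(correlated)  # [3, 4]
print(legacy)      # []
```

The legacy lookup comes back empty because the subquery yields NULL, and comparing anything to NULL with = never evaluates to true.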

However, if you were trying to do this programmatically from a C# application using Entity Framework 6, you might write some code like:

var query = from txn in transactionTable.Entities
where txn.Id == transactionId
join txn2 in transactionTable.Entities on txn.CorrelationId equals txn2.CorrelationId
select txn2;

The problem is that, in C# code, null values are equal to one another, while in SQL, "null" is considered "unknown" and doesn't equal itself. (The theory is that you can't know whether one "null", or unknown value, equals another "null", so an equality test between "null" and "null" evaluates to unknown rather than true, and the row is not matched.) Instead of leaving it up to the programmer to explicitly code for this condition, Entity Framework "helpfully" writes the join clause that it gives to SQL Server in this manner:

JOIN [dbo].[TransactionTable] AS [Extent2] ON (
([Extent1].[CorrelationId] = [Extent2].[CorrelationId])
OR
(([Extent1].[CorrelationId] IS NULL) AND ([Extent2].[CorrelationId] IS NULL))
)

The extra check for IS NULL on both sides has two unfortunate side effects in this case:

  1. If the transaction is one of the legacy records, it will return a positive match on all 12 million other legacy records with a null CorrelationId.
  2. If the transaction has a CorrelationId, because of the IS NULL, SQL Server will investigate the 12 million null values in the CorrelationId index, resulting in a big fat line from the index seek in the execution plan, and a return time of a couple of seconds or more.
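The first side effect is easy to reproduce on a small scale. The sketch below again uses sqlite3 as a stand-in for SQL Server (an assumption), with ten made-up legacy rows instead of 12 million, and compares a plain equality join against the null-safe form that Entity Framework generates. It does not reproduce the second side effect, since that depends on SQL Server's query planner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TransactionTable (Id INTEGER PRIMARY KEY, CorrelationId TEXT)"
)
rows = [(i, None) for i in range(1, 11)]          # Ids 1-10: legacy rows
rows += [(11, "abc"), (12, "abc"), (13, "xyz")]   # Ids 11-13: newer rows
conn.executemany("INSERT INTO TransactionTable VALUES (?, ?)", rows)

# Hand-written join: plain SQL equality, so NULL never matches NULL.
PLAIN_JOIN = """
SELECT t2.Id
FROM TransactionTable AS t1
JOIN TransactionTable AS t2 ON t1.CorrelationId = t2.CorrelationId
WHERE t1.Id = ?
ORDER BY t2.Id
"""

# Same shape as the join EF6 emits: equality OR both sides NULL.
NULL_SAFE_JOIN = """
SELECT t2.Id
FROM TransactionTable AS t1
JOIN TransactionTable AS t2 ON (
    t1.CorrelationId = t2.CorrelationId
    OR (t1.CorrelationId IS NULL AND t2.CorrelationId IS NULL)
)
WHERE t1.Id = ?
ORDER BY t2.Id
"""


def matches(sql, transaction_id):
    """Run one of the joins above and return the matching Ids."""
    return [row[0] for row in conn.execute(sql, (transaction_id,))]


legacy_plain = matches(PLAIN_JOIN, 1)
legacy_null_safe = matches(NULL_SAFE_JOIN, 1)
correlated_plain = matches(PLAIN_JOIN, 11)
correlated_null_safe = matches(NULL_SAFE_JOIN, 11)

print(len(legacy_plain), len(legacy_null_safe))   # 0 10
print(correlated_plain, correlated_null_safe)     # [11, 12] [11, 12]
```

For a transaction that has a CorrelationId, both joins return the same rows, so the extra branch changes only how much work the database does, not the result.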

The really annoying part is that there doesn't appear to be a way to stop this. Even if you explicitly add a not-equal-to-null check on your target table, Entity Framework still wraps the equality test with checks for IS NULL. The result is almost comical. For instance, adding txn2.CorrelationId != null, either in the join statement or as a where clause, results in this (note that the IS NOT NULL check on the first line contradicts the IS NULL check on [Extent2] in the second-to-last line):

[Extent2].[CorrelationId] IS NOT NULL
AND (
([Extent1].[CorrelationId] = [Extent2].[CorrelationId])
OR
(([Extent1].[CorrelationId] IS NULL) AND ([Extent2].[CorrelationId] IS NULL))
)

Even trying to break up the work into two statements didn't help. This code:

// FirstOrDefault() runs the first query, so corrId is a plain string (or null)
var corrId = (from txn in transactionTable.Entities where txn.Id == transactionId select txn.CorrelationId).FirstOrDefault();
var txns = from txn in transactionTable.Entities where txn.CorrelationId == corrId select txn;

Resulted in this SQL:

WHERE ([Extent1].[CorrelationId] = @p__linq__0)
OR
(([Extent1].[CorrelationId] IS NULL) AND (@p__linq__0 IS NULL))
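The parameterized form behaves the same way as the join. In this sketch, sqlite3 again stands in for SQL Server (an assumption), the named parameter :p plays the role of @p__linq__0, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TransactionTable (Id INTEGER PRIMARY KEY, CorrelationId TEXT)"
)
conn.executemany(
    "INSERT INTO TransactionTable VALUES (?, ?)",
    [(1, None), (2, None), (3, "abc"), (4, "abc"), (5, "xyz")],
)

# Same shape as the WHERE clause EF6 generated; :p stands in for @p__linq__0.
EF_STYLE_WHERE = """
SELECT Id
FROM TransactionTable
WHERE (CorrelationId = :p)
   OR (CorrelationId IS NULL AND :p IS NULL)
ORDER BY Id
"""


def find(corr_id):
    """Return the Ids matched for a CorrelationId value, which may be None."""
    return [row[0] for row in conn.execute(EF_STYLE_WHERE, {"p": corr_id})]


with_value = find("abc")  # only the equality branch can match
with_null = find(None)    # only the IS NULL branch can match

print(with_value)  # [3, 4]
print(with_null)   # [1, 2]
```

Passing a null parameter returns every row with a NULL CorrelationId, which in the real table means all of the legacy records.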

Granted, this is a really bad situation to be in to begin with. Indexes on text fields tend to perform poorly, and having such a huge number of null values in the index is likewise unhelpful. A better design would be to move the text field out into another table, or otherwise convert it into an integer that would be easier to index (something we've had to do in other tables on this very same project, where we've had more control of the data).

I'm willing to bet that Microsoft's translation goes completely unnoticed in over 99% of the cases where it occurs. And, if I had the time to make a design change (with all of the necessary changes to all points that hit this table, some of which I don't have direct control over), it could have been resolved without fighting Entity Framework. Even just populating all of the legacy transactions' CorrelationId with random, unique garbage would've solved the problem (though with a lot of wasted storage space that would've made the infrastructure team cry).

In the end, it was solved by creating stored procedures in the database to do correlated transaction lookups (where the behavior could be controlled and expected), and having C# code execute those directly (bypassing EF6) to get transaction IDs. Standard LINQ queries would then use those IDs, instead of trying to search the CorrelationId.
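The shape of that workaround can be sketched as two steps: a hand-written lookup that returns only IDs, followed by an ordinary fetch by primary key. This is not the actual stored procedure or C# code. The function names correlated_transaction_ids and load_transactions are hypothetical, and sqlite3 (which has no stored procedures) stands in for SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TransactionTable (Id INTEGER PRIMARY KEY, CorrelationId TEXT)"
)
conn.executemany(
    "INSERT INTO TransactionTable VALUES (?, ?)",
    [(1, None), (2, None), (3, "abc"), (4, "abc"), (5, "xyz")],
)


def correlated_transaction_ids(conn, transaction_id):
    """Hypothetical stand-in for the stored procedure.

    Uses plain SQL equality with no IS NULL branch, so a legacy
    transaction simply has no correlated rows.
    """
    sql = """
    SELECT t2.Id
    FROM TransactionTable AS t1
    JOIN TransactionTable AS t2 ON t1.CorrelationId = t2.CorrelationId
    WHERE t1.Id = ?
    ORDER BY t2.Id
    """
    return [row[0] for row in conn.execute(sql, (transaction_id,))]


def load_transactions(conn, ids):
    """Hypothetical stand-in for the follow-up query: filter by Id only."""
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)
    sql = (
        "SELECT Id, CorrelationId FROM TransactionTable "
        f"WHERE Id IN ({placeholders}) ORDER BY Id"
    )
    return conn.execute(sql, ids).fetchall()


ids = correlated_transaction_ids(conn, 3)
transactions = load_transactions(conn, ids)
legacy_transactions = load_transactions(conn, correlated_transaction_ids(conn, 1))

print(ids)                  # [3, 4]
print(transactions)         # [(3, 'abc'), (4, 'abc')]
print(legacy_transactions)  # []
```

Because the second step filters only on the primary key, no nullable column is compared, so there is nothing for a null-safe rewrite to expand.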

This whole exercise was prompted by a script that I had to run to get a bunch of data from a decent number of transactions. It took nearly eleven hours to complete, finishing close to 1am. If I'd had the time to go through this debugging and implement the fix beforehand, it turns out I could've gotten it done in about a third of the time.