UPDATE: I have to discourage using this trick. For reasons I do not yet know, it doesn't seem to work with a large dataset. I do not know the exact point at which it fails, I just know that it does. I noticed that a significant number of values that should have been updated with text, actually got updated with nulls. As much as I would love to investigate this and try to see what is wrong and whether it's a failure in C#, SQL, or some combination, unfortunately, it's more important that my work actually get done; so I've had to abandon the XML route entirely.
Original post follows.
It's been a while since I've posted, well, anything. But I learned of a neat trick that I thought I'd post.
I'm currently working on a program that is converting data from two different sources into a single database. A lot of it is just done with carefully crafted SQL statements, but there are a few steps where I have to take data from one source and use some C# code to do some kind of processing before storing it in the target database. Since the data set is on the order of millions of rows, processing these records one at a time can be prohibitively time-consuming. And, since I have limited access to the SQL Server itself, using SQL CLR isn't a great option. (I probably could get the access if I needed to, but it will be an additional step to have to remember and configure when this goes to production, and the fewer moving parts I create for myself, the better.)
One of the tricks I've implemented is to use multi-threading to let the different steps run simultaneously — one thread extracts the records and puts them into a ConcurrentQueue<>
, another thread processes that and puts the results into another queue, and a third thread updates the records in the database.
I've been trying to come up with ways to do the update in batches. There are ways to create a stored procedure that will take a table parameter, and ways to call that stored procedure by binding the parameter to an equivalent DataSet, but I didn't like the idea of creating a DataSet object just to pass the records in. It just seemed too "heavy" to me. (Though it might've been faster than calling a command object in a loop for records one-by-one.)
Another option was to create a VALUES table and build the command text dynamically. But, since I was working with strings, I didn't like the idea of building dynamic SQL and having to escape quotes or any other special characters that might cause SQL to choke. (Not to mention it's just bad practice, even if my code is unlikely to be used as a SQL injection vector.)
So, I came up with the idea of passing in values as an XML document. By building the XML with Linq-to-XML C# code, all necessary character escapes would be performed automatically. I could pass in as many values at once as I felt comfortable with, and let SQL do the work in a batch instead of one at a time.
To give some context to this code, I am taking email addresses that were encrypted in the source database, and converting them to their decrypted values in the target database. At this point, my queue consists of objects that have two properties: EncryptedEmail and DecryptedEmail. Earlier in my conversion work, I've simply copied the encrypted strings over into the Email field of the table, so all this method has to do is update the table and changing the Email field to its decrypted value.
var recordsToUpdate = GetRecordsToUpdateBatch(250); //Retrieves up to 250 records off of the queue at a time
if (recordsToUpdate.Any()) {
var xdoc =
new XDocument(
new XElement(
"emails",
recordsToUpdate.Select(r => new XElement("email", new XAttribute("encrypted", r.EncryptedEmail), new XAttribute("decrypted", r.DecryptedEmail)))
)
);
using (
var connection =
new SqlConnection(
GlobalSettings.DatabaseConnectionString)) {
await connection.OpenAsync();
using (
var command = connection.CreateCommand()) {
command.CommandText =
@"
WITH emails AS (
--Convert the XML document into a table that SQL can use normally
SELECT Tbl.email.value('@encrypted','varchar(256)') AS Encrypted, Tbl.email.value('@decrypted','varchar(256)') AS Decrypted
FROM @emails.nodes('/emails/email') AS Tbl(email)
)
UPDATE cust SET cust.Email = emails.Decrypted
FROM dbo.Customer cust
INNER JOIN emails ON cust.Email = emails.Encrypted;
";
command.CommandType = System.Data.
CommandType.Text;
var param = command.CreateParameter();
param.ParameterName =
"@emails";
param.SqlDbType = System.Data.
SqlDbType.Xml;
param.Value =
new SqlXml(xdoc.CreateReader());
command.Parameters.Add(param);
await command.ExecuteNonQueryAsync();
}
}
await Task.Run(() =>
Thread.Sleep(1));
}