Our trade desk encounters a good number of Excel files that require a specific set of Excel Add-Ins. Because the spreadsheets depend on those Add-Ins, libraries that simply read Excel data were not going to help us in our application design; we needed to be able to use real Excel instances. Fortunately, Microsoft makes that a simple enough proposition through the C# Office Primary Interop Assemblies. But we also needed to work through those spreadsheets as quickly as possible.
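For reference, everything below assumes a project reference to the Office PIA and the usual namespace aliases; the exact assembly version depends on your Office installation, so treat this as a sketch of the setup rather than a prescription.

// Reference Microsoft.Office.Interop.Excel (the Office PIA) in the project,
// then alias the namespace so the later snippets read cleanly.
using System.Collections.Generic;              // List<T>
using System.Reflection;                       // Missing.Value
using System.Threading.Tasks;                  // Parallel.ForEach
using Excel = Microsoft.Office.Interop.Excel;  // the Excel PIA namespace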
#1 – Using Task Parallel Library
This requirement quickly led to a classic performance question: how to work through a cumbersome collection quickly yet efficiently. Processing the files sequentially would obviously get the job done, albeit slowly. Running them concurrently would in theory get things moving faster, but resource management becomes an issue. The Task Parallel Library's Parallel.ForEach, in conjunction with the creation of a unique Excel instance per iteration, provided a quick and elegant solution to this problem, handing off the resource management to the OS while accomplishing the necessary concurrent behavior.
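In its simplest form, that looks something like the sketch below; the file list and the per-file work are placeholders (listOfFiles and the commented-out processing step are my own names for illustration), so adapt them to your own workbooks.

// One Excel instance per parallel iteration; listOfFiles is a hypothetical
// List<string> of workbook paths and the processing step is left as a comment.
Parallel.ForEach(listOfFiles, thisFileLocation =>
{
    Excel.Application xlApp = new Excel.Application();
    xlApp.Visible = false;
    try
    {
        Excel.Workbook workbook = xlApp.Workbooks.Open(thisFileLocation);
        // ... do your work against the workbook here ...
        workbook.Close(false);
    }
    finally
    {
        xlApp.Quit();  // make sure the background Excel process exits
        System.Runtime.InteropServices.Marshal.ReleaseComObject(xlApp);
    }
});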
TPL is pretty intelligent about detecting when to stop spinning up new iterations, wait for existing ones to complete, and then start back in on the task list. In a separate project, though, I found it would overwhelm itself by first opening a batch of Excel instances and then opening an extremely heavy workbook inside each of them. TPL only controls your program, not what Excel might do natively, so there is the potential for it to run wild and crash your machine depending on what you end up opening or doing inside of Excel. In those circumstances, I had to break my collection of files into a list of lists, or 'batches', run a Parallel.ForEach over the list of lists, and then a traditional foreach over each 'batch' inside that parallel loop.
In that situation, you want something like the code below, using data structures and batch sizes appropriate for your needs (my project has a small object library for this). You would probably generate the list of lists in a loop to keep the code maintainable, but I have spelled out the steps here for illustration.
var getLength = ListOfFiles.Count;
var getFirstThird = getLength / 3;

var ThreeGroups = new List<SmallerList>();

ThreeGroups.Add(new SmallerList());
for (var i = 0; i < getFirstThird; i++)
{
    ThreeGroups[0].Add(ListOfFiles[i]);
}

ThreeGroups.Add(new SmallerList());
var getMidThird = getFirstThird * 2;
for (var i = getFirstThird; i < getMidThird; i++)
{
    ThreeGroups[1].Add(ListOfFiles[i]);
}

ThreeGroups.Add(new SmallerList());
for (var i = getMidThird; i < getLength; i++)
{
    ThreeGroups[2].Add(ListOfFiles[i]);
}

Parallel.ForEach(ThreeGroups, thisList =>
{
    /** make your Excel Instance **/
    foreach (var thisObject in thisList)
    {
        /** make your Excel Instance do something **/
    }
});
TPL does provide a means of overriding its defaults, which in theory should allow you to avoid this sort of preprocessing, but the general consensus I can find on the web is, as put by Reed Copsey, "… this is rarely required or useful in practice. The default behavior of the TPL is very good, often better than any hand written partitioning strategy." In practice, I found this batching method also simply performed more quickly than messing with MaxDegreeOfParallelism.
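For completeness, if you do want to cap the parallelism yourself, the override lives on ParallelOptions; a minimal sketch (the cap of 3 is arbitrary):

// Cap the number of concurrent iterations, and therefore concurrent Excel instances.
var options = new ParallelOptions { MaxDegreeOfParallelism = 3 };  // arbitrary cap
Parallel.ForEach(ThreeGroups, options, thisList =>
{
    /** make your Excel Instance and work the batch as above **/
});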
One final issue was a cryptic message from Excel: "Microsoft Excel is waiting for another application to complete an OLE action." No one in either of our application development groups has any idea why we get this message, nor has Stack Overflow provided a good answer. I imagine the approach of VSTO+TPL is causing some conflict, but I'm not entirely sure what or where or even to what degree. The message produces nothing in the debugger, and dismissing the dialog allows the program to complete without affecting the final outputs. Setting DisplayAlerts = false on every Excel instance, including the one executing your Add-In, lets the message be dismissed in the background so it does not interfere with operation.
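In practice that is just one property set on each Application object you spin up; a minimal sketch:

Excel.Application xlApp = new Excel.Application();
xlApp.DisplayAlerts = false;  // dismisses dialogs such as the OLE-action message in the background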
#2 – Creating Excel Instances
Ideally, the PIA would allow you to create a new instance of Excel simply by calling new Excel.Application(). I have found, however, that when Add-Ins are required this method can cause them to load improperly, so I have ended up using a solution along these lines to work around the problem.
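I won't reproduce our exact code here, but one workaround in this spirit (a sketch only, not necessarily what your particular Add-Ins need) is to toggle the Installed flag on the registered Add-Ins after the new instance comes up, which forces Excel to reload them:

// Sketch of one possible workaround: force registered Add-Ins to reload
// by toggling their Installed flag on the new instance.
Excel.Application xlApp = new Excel.Application();
foreach (Excel.AddIn addIn in xlApp.AddIns)
{
    if (addIn.Installed)          // only touch Add-Ins that are already registered
    {
        addIn.Installed = false;  // unload...
        addIn.Installed = true;   // ...then reload so it initializes in this instance
    }
}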
Inside a TPL loop, several independent Excel instances can be created, each performing work, connecting to APIs, and generally behaving like a normal Excel application. The difference is that with a TPL loop you can have many of them working on different parts of a heavy workload as if nothing else were happening (minding, of course, the performance considerations mentioned above).
#3 – Using Excel Instances
Obviously having your Add-Ins working is great, but the real reason behind the work is most likely to get at data that would otherwise be inaccessible.
There are a number of ways to get at Excel data, but I strongly recommend using get_Range or get_Value and working on the data in memory (the alternatives include using COM to engage with Excel cell by cell, but I have the speed tests to prove that is not the way to go).
Excel.Range manyRangeObject = YourWorkSheet.get_Range("A1:Z1000", Missing.Value);
if (manyRangeObject != null)
{
    foreach (Excel.Range singleRangeObject in manyRangeObject)
    {
        // do things using your new object
        // and when you want a value...
        var thisVal = singleRangeObject.Value2;
    }
}
Alternatively…
object[,] manyRangeObjectArray = (object[,])thisExcel.get_Range(
    "A1:Z1000", Missing.Value
).get_Value(
    Excel.XlRangeValueDataType.xlRangeValueDefault
);

for (int y = 1; y < manyRangeObjectArray.GetLength(0) + 1; y++)
{
    for (int x = 1; x < manyRangeObjectArray.GetLength(1) + 1; x++)
    {
        // do other stuff
        // and when you want a value...
        var thisVal = manyRangeObjectArray[y, x];
    }
}
While trying to avoid getting too far into the weeds of the Excel Interop, I should point out that this method ends up being straightforward array iteration rather than jumping through a lot of Excel hoops, which is nice. Keep in mind, though, that Excel, for whatever reason, starts its returned object arrays at 1 instead of 0. If you want to create an object array as a 1-indexed array to match it, you will need to build it using something like the following, where the lengths and bounds are adjusted for your needs.
// lower bounds of { 1, 1 } make the array 1-indexed to match Excel; adjust the lengths for your data
int[] myLengthsArray = new int[2] { 1, 1 };
int[] myBoundsArray = new int[2] { 1, 1 };
theGoods = (object[,])Array.CreateInstance(
    typeof(String), myLengthsArray, myBoundsArray
);
theGoods[1, 1] = temp;
#4 – Deeper Excel Work
Sometimes the data is spread across a number of worksheets inside the workbooks (one of many reasons we needed to work out this problem). The solution, at least syntactically, is not difficult to follow: (Excel._Worksheet)thisExcel.Workbooks[1].Worksheets[x].
But the point of this sort of implementation is to work seamlessly through any number of Excel files, so specific declarations like that are not really going to help. Fortunately, Workbook includes a Sheets property that will do just the trick.
Parallel.ForEach(ThreeGroups, thisList =>
{
    Excel.Application thisExcel = new Excel.Application();
    foreach (var thisFileLocation in thisList)
    {
        Excel.Workbook thisWorkbook = thisExcel.Workbooks.Open(thisFileLocation);
        foreach (Excel.Worksheet xlWorksheet in thisWorkbook.Sheets)
        {
            /** make your Excel Instance do something **/
        }
    }
});
Finally, I just want to throw in a quick word about finding Excel ranges that hold your data.
In theory, the Interop offers a nice clean way to know how big your used data sets are in the form of .UsedRange. If you read the spec on the property, however, you will notice a key phrase: "represents all the cells that have contained a value at any time." I have found that as spreadsheets get used over and over again, saved and resaved, the reliability of UsedRange degrades. Personally, I would recommend employing UsedRange to get an idea of your data range, but then writing your own method that works on the resulting object array to find the actual area in use. Here again is another great argument for pulling your data into memory in large sets and avoiding overworking the COM.
Additionally, you may occasionally find extra information below the structured data. When collecting the structured data, you want to know when you have run out of it and are running into that extra area. A method that detects changes in structure near the bottom of your real UsedRange can be a great tool for your development library, as in the sketch below.
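As one illustration (a sketch only, and FindLastPopulatedRow is a hypothetical helper name), you can pull the UsedRange values into an object array and walk up from the bottom until you hit a row that actually contains data. Note that the row index returned is relative to the top of the UsedRange, and a single-cell UsedRange would come back as a scalar rather than an array.

// Hypothetical helper: find the last row of the UsedRange that actually holds data.
static int FindLastPopulatedRow(Excel.Worksheet sheet)
{
    object[,] values = (object[,])sheet.UsedRange.get_Value(
        Excel.XlRangeValueDataType.xlRangeValueDefault);

    for (int row = values.GetLength(0); row >= 1; row--)      // Excel's arrays are 1-based
    {
        for (int col = 1; col <= values.GetLength(1); col++)
        {
            var cell = values[row, col];
            if (cell != null && cell.ToString().Trim().Length > 0)
            {
                return row;   // first non-empty row, counting up from the bottom
            }
        }
    }
    return 0;   // nothing populated at all
}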
Update: See my follow-up on working around .UsedRange issues here.
In Conclusion
This is still a work in progress in many ways, and I am still working on communicating some of the thoughts behind it, so if you have any questions or feedback feel free to leave a comment.
Generally, if you can avoid running Excel instances as part of another application, do so. But there are definitely moments where there is simply no other option. Using TPL to build the process turned out to be a great solution for what we needed, and I hope these notes save others time in the future.