For better or for worse, the trade desk runs on spreadsheets. Yes, there are web apps and messaging tools, but when it comes down to it most of the position traders (and definitely their counter parties) are going to throw something in a spreadsheet more often than not.
As a result, before I can even begin to start to do any ‘real’ data work, I have to first make sense of the spreadsheets themselves, which became a project in and of itself. Through the course of the work, I learned a good deal about what work you can and cannot hand off to Excel Interop functions.
In my previous post, I briefly touched on the fact that UsedRange can really throw off your program. In a fresh spreadsheet it should be fine, but in a real world environment of dirty data, it often blows up. As I’ve described it before:
In theory, the Interop offers a nice clean way to know how big your used data sets are in the form of .UsedRange. If you read the spec on the property, however, you will notice a key phrase: “represents all the cells that have contained a value at any time.” I have found that as spreadsheets get used over and over and over again, saved and resaved, the reliability of UsedRange degrades. I would recommend employing UsedRange to get an idea of your data range, but then creating your own method that works on the Object to find the actual area in use.
I got a couple of questions about how one might do that, so I thought I’d share some example code on what I do.
I also have peppered this post with some of my thoughts from a statistical perspective in italics. If you are strictly looking for code, you can feel free to skip them, but if you are interested in the rational behind some of the approach you may find it interesting. It will help if you are familiar with Logistic Regression and I am working on a post to share some additional, more focused insights as well.
#0 – You need to interact with Excel
This post assumes you are comfortable with the Excel Interop and using C# to control Excel instances. I have a class I call singleExcel, inside of which I have a function that can create an Excel instance for me.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
public Excel.Application xlApp { get; set; } public Excel.Workbook xlWorkbook { get; set; } public singleExcel createExcel(string fileLocation) { var thisExcel = new singleExcel(); thisExcel.xlApp = new Excel.Application(); // var thisExcel.xlApp = ExcelInteropService.GetExcelInterop(thisExcel); thisExcel.xlWorkbook = thisExcel.xlApp.Workbooks.Open(fileLocation); thisExcel.xlApp.Visible = false; thisExcel.xlApp.DisplayAlerts = false; bool failed = false; do { try { thisExcel.xlWorkbook.DoNotPromptForConvert = true; thisExcel.xlWorkbook.CheckCompatibility = false; thisExcel.xlWorkbook.Unprotect(); failed = false; } catch (System.Runtime.InteropServices.COMException e) { failed = true; } } while (failed); return thisExcel; } |
You’ll notice there is a commented line. In some instances, Excel will not load all your resources – particularly certain Add Ins. In those situations, you may need to implement something like this solution.
I will continue to refer to singleExcel throughout the post as there are some helpful Excel-specific functions that any good Excel data dive will need and should get your own tool set started well.
#1 – Set some base values and classes
To get started, you need to initialize some base values and establish a couple of classes to keep everything organized.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
public class FileList { public FileList() { ListofFiles = new List<AnalysisObject>(); rowCounts = new List<int>(); colCounts = new List<int>(); } public List<AnalysisObject> ListofFiles { get; set; } public List<int> rowCounts { get; set; } public List<int> colCounts { get; set; } } public class AnalysisObject { public AnalysisObject() { } public string FileName { get; set; } public int rowCount { get; set; } public int colCount { get; set; } public object[,] allTheData { get; set; } public string startCol { get; set; } public string startRow { get; set; } } |
The base values you set may need to be tweaked based on your data sets and the resulting feedback you receive after running through an iteration. As with all data analysis, it starts with some sort of assumption, then analysis of results, and retesting.
If you are more familiar with statistics, this is setting up some sort of decision boundary to be used later. I am creating a tool that intends to do work on a set of data I know little to nothing about, so a initial assumption is established. This assumption will be updated through the course of the program’s operation, reducing the threat that a bad assumption will damage the results. It also will produce a more valid set of numbers at the end of the operation, which can be used in the future or a production version.
And while we’re at it, let’s collect our work set.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
private static FileList Library = new FileList(); static void Main(string[] args) { // Set some bases. These need to be tested for each sample. int avgRowCount = 100; int avgColCount = 20; int fileCount = 0; // Get our files, Create a library to hold everything DirectoryInfo di = new DirectoryInfo("location_of_files"); string searchPattern = "*"; Library.ListofFiles = new List<AnalysisObject>(); foreach (FileInfo f in di.GetFiles(searchPattern)) { if (f.ToString()[0] != '~') Library.ListofFiles.Add(new AnalysisObject { FileName = f.ToString() }); } } |
With that, we can start analysis of each individual file.
#2 – Use Excel for What It IS Good For
The Excel methods aren’t all garbage, so let’s use some of the good ones.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 |
static void Main(string[] args) { . . . foreach (var a in Library.ListofFiles) { Console.WriteLine(a.FileName); string fileLocation = di + a.FileName; // Open The File, Set the Page singleExcel thisExcel = new singleExcel().createExcel(fileLocation); ExcelWorkSheetChange(thisExcel, 1); // Get Some Generic, Untested Data About The Page From Excel var theTotalColumns = thisExcel.excelRange.Columns.Count; var theTotalRows = thisExcel.excelRange.Rows.Count; // Everything in UsedRange a.allTheData = (object[,])thisExcel.excelRange.get_Value(Excel.XlRangeValueDataType.xlRangeValueDefault); // Everything with a constant in it in UsedRange var constantCells = thisExcel.excelRange.SpecialCells(Excel.XlCellType.xlCellTypeConstants, Type.Missing); // Do a little preprossing on that Data var splitUpAddresses = Regex.Split(constantCells.Address, @"(?:\,|\:)"); // Create an Array of the Constant Addresses in the Order The Were Found splitUpAddresses = splitUpAddresses.Where(s => !string.IsNullOrWhiteSpace(s)).Distinct().ToArray(); var count = splitUpAddresses.Length; string[] tempArr = new string[count]; // Create a Second Array that is Sorted Array.Copy(splitUpAddresses, tempArr, count); // Create a Second Array that is Sorted Array.Sort(tempArr); var firstAddress = Regex.Split(tempArr[0], @"(?:\$)"); // Get Column of First Address, in case it isn't A a.startCol = firstAddress[1]; firstAddress = Regex.Split(splitUpAddresses[0], @"(?:\$)"); // Get Row of First Address, in case it isn't 1 a.startRow = firstAddress[2]; . . . |
Let me explain some of this logic.
We are essentially using Excel methods to gather generic data on the spreadsheet. I don’t really consider all of this data to be reliable, but they are good starting places – it’s definitely better than just starting on the absolute extreme bounds and working your way in.
I add a class and a couple of Excel properties to singleExcel at this point.
1 2 3 4 5 6 7 8 9 10 11 |
public Excel.Range excelRange { get; set; } public Excel._Worksheet xlWorksheet { get; set; } public static singleExcel ExcelWorkSheetChange(BL.singleExcel thisExcel, int x) { thisExcel.xlWorksheet = (Excel._Worksheet)thisExcel.xlApp.Workbooks[1].Worksheets[x]; thisExcel.xlWorksheet.Columns.ClearFormats(); thisExcel.xlWorksheet.Rows.ClearFormats(); thisExcel.excelRange = thisExcel.xlWorksheet.UsedRange; return thisExcel; } |
In some cases ClearFormats is enough to give you a clean and accurate UsedRange, but even in my basic sample sets from the desk it proved insufficient. But sometimes is something of an improvement on never, so I run it before grabbing UsedRange. You will also notice that ExcelWorkSheetChange requires an int. This function allows you to ‘turn the page’ so to speak on your workbook. My example will not get into that, but it should be fairly self-evident.
Once we have the range, we get the column and row counts according to Excel and record them to memory ( theTotalColumns, theTotalRows ). We then get all the data out of the spreadsheet, which I store into a.allTheData. I’ve written in other places about data extraction methods. Do yourself a favor, use this one.
Next, I get all cells with a constant. Well, not exactly all the cells. The Range object that is returned is a little less than I’d like it to be. It gives you a series of Excel Ranges that contain constants. In Excel representation, that means you will get something like [”A1:B12, C3:E:18, F2:G38”]. It’s something, but not awesome.
You could write something that parsed all of that and filled in the blanks, but that felt like a lot of work without much gain, so I went a different route. I created two Arrays, one sorted, one kept in the order the constants were located by Excel.
I then took the letter of the first position of the sorted Array, so I would know the first filled column. I also took the number of that position, so I would know the first filled row. UsedRange gets thrown when the rows or columns are left empty even at the A or 1 bounds, like below, where the blue is your ‘Excel range’ and the red is your ‘UsedRange’.
In the above example, if you want to find ‘Name’ in your array object, it will be [1,1], while if you want to reference it by an Excel address through the COM it will be ‘B2’. If you want to have a prayer of evaluating the object effectively, you need to know where your UsedRange sits in relation to the broader Excel coordinates.
#3 – Consider Out-sized Ranges
Unabstracted, thisExcel.excelRange.Rows.Count is (Excel._Worksheet)thisExcel.xlApp.Workbooks[1].Worksheets[1].UsedRange.Rows.Count, which, in a perfect world, returns the number of rows that contain data in your spreadsheet. But on a practical level, any number of things can go wrong and you can end up with bad numbers, lots of null values, and irrelevant or unstructured data. In my experience, the top two things to watch out for is that you can either end up with 1) the default maximum possible or 2) an unusually large number either due to overwritten data or stray keys.
I account for that by checking for a number near the maximum or outside an acceptable bounds related to my larger data set, then doing some data munging depending on which condition is true.
First the base code, then the munging.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
if (theTotalRows > 65000 || theTotalRows > (avgRowCount * 1.5)) { int centerRowNum = 0; Console.WriteLine("Total Rows: " + theTotalRows); // Most common error is that UsedRange will return back over 65k rows. if (theTotalRows > 65000) { Console.WriteLine("Option 1"); // As this is the default error response, it is assumed that number is wrong and skipped, thus the value 2 sent to ExcelWorkSheetNullCheck centerRowNum = ExcelWorkSheetNullCheck(a, splitUpAddresses, count, 2); } // It is also possible that the UsedRange is technically correct, but there is bad data (there are 30 rows of data and a stray key on row 300). else { Console.WriteLine("Option 2"); // As the row count may simply be significantly larger than the average, this version is more forgiving by sending value 1 centerRowNum = ExcelWorkSheetNullCheck(a, splitUpAddresses, count, 1); } . . . |
I do not check for exactly 65536 rows since more modern versions of Excel can support a higher number of rows, and future versions could support more. So I just went for a certain amount of flexibility. At least for my data sets, if a row count in an Excel spreadsheet returned that high, something is already off.
I also keep a running average on the data set. The base average was set in step #1, and should be tweaked for your particular experiment or the feedback you get after your tests. For my tests, I also found that the row count almost never got far outside of 150% the average. Just because it did, this was not an immediate sign of error, but it did warrant some additional preprocessing – many times there was an issue.
From a statistical standpoint, we are beginning to actually use the decision boundaries that we established earlier, but we are still operating with a fair degree of ambiguity on the areas of actual data. As such, I inflate the boundary to capture a larger group into the ‘acceptable’ category as this process is meant more to save processing time and overcome real issues with UsedRange or human error than to perform any high confidence work – though it has performed fairly well by itself.
Depending on the condition, splitUpAddresses is then sent through ExcelWorkSheetNullCheck on a certain starting point.
This begins some of the data munging. Through UsedRange, Excel returns an Object Array full of data, but if our row count is off the charts it is probably full of null values as well. xlCellTypeConstants provides a set of insights improved over simply iterating through what could be an Array [65536, 100] in size; it gives us the bounds of groupings of constants. By using those groupings and feeling around the positions of their bounds, we can get a sense of what our Object Array actually contains.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
public static int ExcelWorkSheetNullCheck(BL.AnalysisObject a, String[] splitUpAddresses, int count, int lessThis) { string theVal = null; int centerRowNum = 0; while (theVal == null || theVal.Equals("")) { var thisAddress = Regex.Split(splitUpAddresses[count - lessThis], @"(?:\$)"); int centerColNum = BL.singleExcel.ExcelColumnNameToNumber(thisAddress[1], a.startCol); centerRowNum = Convert.ToInt32(thisAddress[2]); int failCount = 0; var tempVal = WiggleRoom(a, centerRowNum, centerColNum, failCount); if (tempVal != null) theVal = a.allTheData[centerRowNum, centerColNum].ToString().Trim(); lessThis++; } return centerRowNum; } public static int ExcelColumnNameToNumber(string columnName, string startCol) { columnName = columnName.ToUpperInvariant(); int sum = 0; for (int i = 0; i < columnName.Length; i++) { sum *= 26; sum += (columnName[i] - Convert.ToChar(startCol) + 1); } return sum; } |
You’ll notice a function that I call ExcelColumnNameToNumber – this allows you to convert Excel address column names (like ‘A’) into an int, which you’ll need to navigate the Object Array that contains your data. Handy since xlCellTypeConstants returns Excel addresses, but UsedRange builds it’s Object Array off of cells that actually contain data thus might not start at ‘A’.
I also have a version of this that I call StaticExcelColumnNameToNumber, where I replace startCol with ‘A’. This is useful to just have in a toolkit library where you are working off of Excel data for when your conditions are more perfect.
From here, you need to feel around the position. I built a function I call WiggleRoom. It essentially searches around the data bounds, looking inward for filled cells or even existing bounds. If it happens to be at ‘A’, then it goes in the opposite direction. If it just doesn’t find anything, it reports that back. But if it does find data, then it just returns that existence.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
private static object WiggleRoom(BL.AnalysisObject a, int centerRowNum, int centerColNum, int failCount) { object tempVal = null; if (failCount == 3) return tempVal; try { tempVal = a.allTheData[centerRowNum, centerColNum]; } catch (System.IndexOutOfRangeException e) { if (BL.singleExcel.GetExcelColumnName(centerColNum - 1) != "A") { centerColNum--; failCount++; WiggleRoom(a, centerRowNum, centerColNum, failCount); } else { centerColNum++; failCount++; WiggleRoom(a, centerRowNum, centerColNum, failCount); } } return tempVal; } public static string GetExcelColumnName(int columnNumber) { int dividend = columnNumber; string columnName = String.Empty; int modulo; while (dividend > 0) { modulo = (dividend - 1) % 26; columnName = Convert.ToChar(65 + modulo).ToString() + columnName; dividend = (int)((dividend - modulo) / 26); } return columnName; } |
It isn’t a perfect picture, but at this point, I am not aiming for perfect, but much improved. I don’t need to know what data exists where. I only need to know if data is close to there. If I can get an improved sense of the actual lower bounds, I can save some performance time as well as operate more tightly against the space I’m actually interested in.
Something like this operation can cut a row count down from 65536 to 300. That is far fewer.
#4 – Getting More Exact
Once that data is found, we can then put our new found numbers through a bit of a finer comb.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
if (theTotalRows > 65000 || theTotalRows > (avgRowCount * 1.5)) { . . . Console.WriteLine("Row 1 " + centerRowNum); // ExcelRowChecker samples the object for data to determine the actual used range centerRowNum = ExcelRowChecker(a, centerRowNum, theTotalColumns, avgColCount); Console.WriteLine("Row 2 " + centerRowNum); var temp = MakeAverage(fileCount, avgRowCount, centerRowNum, avgColCount, theTotalColumns); avgRowCount = temp[0]; avgColCount = temp[1]; Console.WriteLine("Avg Row " + avgRowCount); theTotalRows = centerRowNum; } |
At it’s core, ExcelRowChecker does two things: 1) iterates through the columns and rows of the Array Object representing the UsedRange, less the rows we have already ruled out and 2) account for all kinds of human creativity and non-uniformity. This example runs through the data agnostically, simply looking for its presence or lack thereof. It does, however, provide a basis for working one’s way through a junky spreadsheet.
Rather than commenting the code itself, I will share the function, and then speak to it a bit.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 |
public static int ExcelRowChecker(BL.AnalysisObject a, int centerRowNum, int theTotalColumns, int avgColCount) { int falseRowCount = 0; int altRowCount = 0; int upperAltRowCount = 0; int goodRow = 0; double thePiece = .05; double checkRowRange = centerRowNum * thePiece; if (theTotalColumns >= avgColCount * 1.5) theTotalColumns = avgColCount; for (int y = centerRowNum + 10; y > centerRowNum - checkRowRange; y--) { Double emptyCellCount = 0; for (int x = 1; x < theTotalColumns + 1; x++) { Double x2 = Convert.ToDouble(x); try { if (a.allTheData[y, x] == null || a.allTheData[y, x].ToString().Trim().Equals("")) emptyCellCount++; } catch (System.IndexOutOfRangeException e) { emptyCellCount++; } if (x >= (theTotalColumns * .6) && emptyCellCount / x2 > .6) { if (goodRow == 0) falseRowCount++; else altRowCount++; break; } if (x == theTotalColumns) goodRow++; } if (falseRowCount > 1 && goodRow > 3) { if (falseRowCount + upperAltRowCount < 5 || (falseRowCount < 10 && upperAltRowCount < 10 && goodRow > 5)) { centerRowNum = centerRowNum + 50; centerRowNum = ExcelRowChecker(a, centerRowNum, theTotalColumns, avgColCount); return centerRowNum; } else if (falseRowCount != 10) { centerRowNum = centerRowNum + 10 - falseRowCount + upperAltRowCount; return centerRowNum; } break; } else if (y - 1 <= Math.Ceiling(centerRowNum - checkRowRange)) { thePiece = thePiece + .05; checkRowRange = centerRowNum * thePiece; falseRowCount = falseRowCount + altRowCount; if (goodRow > 0) upperAltRowCount = upperAltRowCount + altRowCount; altRowCount = 0; } if (y <= 0) { if (goodRow != 0) break; else { centerRowNum = 0; return centerRowNum; } } } if (falseRowCount < 10) Console.WriteLine("Odd Result"); centerRowNum = centerRowNum + 10 - falseRowCount; return centerRowNum; } |
You will notice a series of variables established at the top of function, and one more set up right inside of the for loop. Basically, the function samples the row, looking for a certain threshold of empty ‘cells’ (keep in mind that at this point we are really dealing with an Object Array that represents the tabular data). Once that threshold is broken, it is logged as irrelevant for consumption. I use a count of falseRowCount and altRowCount to keep track – falseRowCount are rows that definitely sniff out as bogus, while altRowCount are rows that seem empty but were found after also detecting a potentially valid row. Should the sample run out without finding a valid range of data, then altRowCount is added in with falseRowCount, then a second larger sample is taken. The function can also expand the sample out, should the initial attempt to discover the range bounds have been too aggressive.
Important Note
These are thresholds that I found acceptable for my data sets for a specific project. A 5% sample of the rows, and a 60% threshold held well for my work. You may find that a larger sample or higher threshold may be required. This allowed my program to skip over irrelevant rows like contact information or special notes put at the bottom. My program also ‘reaches’ out 10 more rows beyond its assumed start range, but you may need a higher reach for to start if things are going off the rails with initial tests.
Essentially what this code attempts to do is perform a gradient descent method against the data to find the global minimum and sniff out any local optima. The issue is that we really have a multiclass problem represented by clusters of data (real data cells, signal data cells, out of range cells) that for the time being we need to treat as a binary problem (data cells and out of range cells). Two of the clusters are mixed in together (real data cells, signal data cells) and two of the clusters can be mistaken for each other (signal data cells, out of range cells).
I first work out a discrete choice on the row – this is either a blank row or it is not. I then contribute it to one of the three clusters. Finally, I compare various groups against one another to see the updated size of those clusters and if the hypothesis variables are acceptable given the current internal learning rate. The comparison testing is an implementation of the one-vs-all approach.
#5 – Getting Some Final Analysis
Part of the beauty of a project such as this is that you can setup a nice set of exploratory tools. You have to start with some sort of base assumption, like there will be an average of 100 rows, but I find it useful for those assumptions to eventually be based in reality. To test these assumptions, you can run the numbers through some basic stats and both let your program make adjustments on the fly and make adjustments yourself as you go.
I have been using R.Net to do some quick math for me. For instance, to get the mode of row and column counts, I would use the below.
1 2 3 4 5 6 7 8 9 10 11 |
IntegerVector rowR = RBlock.CreateIntegerVector(Library.rowCounts); IntegerVector colR = RBlock.CreateIntegerVector(Library.colCounts); RBlock.SetSymbol("rowR", rowR); RBlock.SetSymbol("colR", colR); int[] thisTemp; RBlock.Evaluate("temp <- table(as.vector(rowR))"); thisTemp = RBlock.Evaluate("names(temp)[temp == max(temp)]").AsInteger().ToArray(); Library.groupStats.modeRow = thisTemp[0]; RBlock.Evaluate("temp <- table(as.vector(colR))"); thisTemp = RBlock.Evaluate("names(temp)[temp == max(temp)]").AsInteger().ToArray(); Library.groupStats.modeCol = thisTemp[0]; |
By performing a variety of scans against the data we have collected, we can create new decision boundaries and run the program again. Ideally, some learning rate is set and the program can be run for those number of iterations using the update parameters now found – with each iteration, we should be able to find a more accurate answer while avoiding overfitting the system.
And to output that data for examination, I would use:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
int x = 0; object[,] excelDrop = new object[Library.groupStats.GetType().GetProperties().Length, 2]; foreach (PropertyInfo objPart in Library.groupStats.GetType().GetProperties()) { excelDrop[x, 0] = objPart.Name; excelDrop[x, 1] = objPart.GetValue(Library.groupStats, null); x++; } BL.singleExcel outputExcel = new BL.singleExcel().createExcel(); string secondRange = "B" + Library.groupStats.GetType().GetProperties().Length; string thisCell = "A1:" + secondRange; var cell = outputExcel.xlWorksheet.Range[thisCell, Type.Missing]; cell.set_Value(Excel.XlRangeValueDataType.xlRangeValueDefault, excelDrop); string path = "C:\\Users\\" + Environment.UserName + "\\Documents\\Email Attachments\\"; string fileName2 = "output"; outputExcel.xlWorkbook.SaveAs(path + fileName2); BL.singleExcel.CloseSheet(outputExcel); |