As a data scientist (of sorts), I’m feeling quite at home in the layers and layers of spreadsheets that represent all the research we’ve done to date. I’ve learned a few things during my time amongst the data, so I figured I’d share them here.
1. Format is everything. What separates a good dataset from a “meh” dataset is how it reads. Make sure that each sheet has its columns in the same order, and that each column has a unified format. For instance, our data goes latitude (decimal), longitude (decimal), location name (legal), city, state (abbreviation), date (YYYY-MM-DD), performer name (First Last), street address, etc. This makes it easy to switch between sheets and still be able to read the dataset. Beyond layout, unified spacing, font, coloring, etc. are equally important. They help the eye move quickly over the dataset without getting caught on inconsistencies. The same goes for sorting: whatever order you use should be obvious at a glance. I always recommend sorting chronologically, as that can show trends without the need for a map. Regardless of the method you choose, make sure you include a key of some sort to explain your data layout to readers, both in your group and outside it.
2. Do the research yourself sometimes. It’s OK to look things up yourself, and it can even be helpful for maintaining the unified format that you want. It also helps to avoid missing data, which is the bane of any data scientist’s existence. For the unknowns that remain, choosing a unified method of imputation is important. Blank cells in a spreadsheet can be problematic for many programs, and they also break the lines and flow of the spreadsheet. I recommend putting “Unknown” for any categorical (word-based) variables, and some easily identified number for any quantitative variables. For dates I prefer using 01 for any unknown fields, i.e. 1801-01-01 would be a default date for the 19th century. The most difficult part of my job has been data imputation, as finding historical locations that no longer exist can be troublesome at best. If this becomes the case, see point 3.
3. Delegate! Other people have other specialties that can complement yours. Being a data specialist doesn’t mean being an expert in research or mapping. Other people can do those tasks better and faster, so ask them to do it. In other words, it’s a good thing to stay in your lane, since the other lanes are occupied.
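For anyone who would rather have a script do the checking, here is a rough Python sketch of points 1 and 2: it checks that a sheet's header matches the column order described above, then fills blank cells. The snake_case column names, the -9999 placeholder, and the function names are my own assumptions for illustration, not something taken from our actual spreadsheets, so swap in whatever your key says.

```python
from datetime import date

# Column order from point 1; the snake_case names are illustrative assumptions.
EXPECTED_COLUMNS = [
    "latitude", "longitude", "location_name", "city", "state",
    "date", "performer_name", "street_address",
]

UNKNOWN_TEXT = "Unknown"    # placeholder for categorical cells (point 2)
UNKNOWN_NUMBER = "-9999"    # assumed sentinel; use any value that cannot occur in real data
QUANTITATIVE = {"latitude", "longitude"}


def check_layout(header):
    """Return True if a sheet's header row matches the expected column order."""
    return list(header) == EXPECTED_COLUMNS


def fill_date(value, default_year="1801"):
    """Fill unknown date parts with 01, so '1850' becomes '1850-01-01'.

    default_year is an assumption: 1801 stands in for "sometime in the 19th century".
    """
    parts = [p for p in value.strip().split("-") if p] if value else []
    year = parts[0] if len(parts) > 0 else default_year
    month = parts[1] if len(parts) > 1 else "01"
    day = parts[2] if len(parts) > 2 else "01"
    filled = f"{year}-{month}-{day}"
    date.fromisoformat(filled)  # raises ValueError if the result is not a valid YYYY-MM-DD date
    return filled


def fill_row(row):
    """Return a copy of a row (a dict) with every blank cell replaced."""
    filled = {}
    for column in EXPECTED_COLUMNS:
        value = (row.get(column) or "").strip()
        if column == "date":
            filled[column] = fill_date(value)
        elif value:
            filled[column] = value
        elif column in QUANTITATIVE:
            filled[column] = UNKNOWN_NUMBER
        else:
            filled[column] = UNKNOWN_TEXT
    return filled
```

Rows read with `csv.DictReader` can be passed straight to `fill_row`, and `check_layout(reader.fieldnames)` flags a sheet whose columns have drifted out of order. Whatever placeholder values you settle on, write them into your key so readers know that -9999 or 1801-01-01 means "unknown" rather than real data.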