
Digital History: Compiling Data

This is a short guide to different ways to approach history in a digital environment using diverse digital tools.

Compiling a Dataset

Compiling a dataset that covers the 1960s to the present can be as easy as accessing OECD Statistics. Older datasets can be much more complicated to identify and assemble. Many historians spend years developing their own datasets as they work in different archives and collect information. Others choose a printed set to interpret, type it up, and put it to use.

This page takes readers through the steps to develop a dataset that includes numbers, dates, and text. Contact your libraries with questions or concerns.

To read a dataset, it is important to understand how that specific dataset was developed. It is a rare dataset that has complete information. The who, the how, and the what of the collection change how researchers should use and read the information collected. Demographic information, for example, often relies on self-reporting. That means (among other things) there will be gaps wherever people chose, for whatever reason, not to report. Changing or contested definitions can also be an issue. Employment statistics, for example, change dramatically depending on whether "traditional woman's labor" such as housekeeping is counted as work or whether part-time labor is counted as employment.

What's on This Page

This page has a lot on it. You can use the following anchors (links down the page) to jump straight to a relevant box.

    Identifying a Possible Dataset

    Social and/or economic historians are more likely to compile and use datasets than cultural or political historians, but there are useful datasets and possible visualizations for almost any historian. 

    To identify a potentially useful dataset, the first step is (no surprise) to identify your research question and consider your possible audiences. Regardless of your specific audience, most people don't comprehend larger numbers clearly. What they do understand is comparatives. In short, if you are dealing with larger numbers particularly over multiple decades or centuries, then you should consider different forms of data visualization. 

    This example set is drawn from the "Information Wanted" classifieds in the Boston Pilot between 1840 and 1850. 

    Creating a Dataset

    To read a dataset, it is important to understand how that specific dataset was developed. It is a rare dataset that has complete information. The who, the how, and the what of the collection change how researchers should use and read the information collected. Demographic information, for example, often relies on self-reporting, which means that there will be gaps wherever people chose, for whatever reason, not to report. Likewise, employment statistics change depending on whether "traditional woman's labor" such as housekeeping is counted as work and whether part-time labor is counted as employment. 

    Unlike demographic census data, the dataset for "Information Wanted" was first developed for Dr. Ruth-Anne Harris in the 1980s. The Burns Library staff and student workers then developed the project over the subsequent decades, using a Google form to extract information from more than 40,000 entries between the 1830s and 1870s. The dataset is, however, incomplete; it is strongest for the 1840s through the 1860s. 
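
    One way to see where a dataset is strong or thin is to count its entries by decade. The following is a minimal sketch in Python, not the method used for "Information Wanted"; the years in `entry_years` are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter

# Invented sample years for illustration only; not taken from the real dataset.
entry_years = [1834, 1841, 1843, 1847, 1847, 1852, 1858, 1861, 1866, 1872]


def decade_of(year: int) -> int:
    """Return the first year of the decade containing `year` (1847 -> 1840)."""
    return year - year % 10


def entries_per_decade(years):
    """Count how many entries fall in each decade, in chronological order."""
    counts = Counter(decade_of(y) for y in years)
    return dict(sorted(counts.items()))


coverage = entries_per_decade(entry_years)
for decade, count in coverage.items():
    print(f"{decade}s: {count} entries")
```

    A decade with very few entries may reflect a gap in collection rather than a real historical change, so it is worth checking before drawing conclusions.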

    Text Recognition Software (OCR)

    [Image: Transkribus screenshot]

    Depending on your dataset, you may want to run text recognition software (Optical Character Recognition, OCR) on your sources, be they text or numbers.

    Be aware that OCR output often needs to be cleaned up before it can be used for textual analysis or to transcribe a chart. In the example on the right, Transkribus did decent work with this newspaper scan, but the result is far from perfect. 

    For more on OCR, see the text analysis subpage. 
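
    Some OCR cleanup can be scripted before you correct the rest by hand. The sketch below, in Python, assumes the OCR output is plain text; the function name `tidy_ocr_text` and the sample string are invented for illustration. It rejoins words split by a hyphen at the end of a line and collapses stray line breaks and spaces. It will also join genuinely hyphenated compounds that happen to break across lines, so review the result.

```python
import re


def tidy_ocr_text(raw: str) -> str:
    """Apply a few mechanical fixes to raw OCR text."""
    # Rejoin words hyphenated across a line break: "infor-\nmation" -> "information".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Replace the remaining line breaks (and the whitespace around them) with one space.
    text = re.sub(r"\s*\n\s*", " ", text)
    # Collapse runs of spaces or tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


# Invented sample, not a quotation from any newspaper.
sample = "INFOR-\nMATION  WANTED\nof  a person"
cleaned = tidy_ocr_text(sample)
print(cleaned)
```

    Misread characters (for example, a letter mistaken for a digit) still need a human reader or a more specialized tool.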

    Data Cleaning

    Once a dataset has been compiled, it must be read through and cleaned before it can be used, particularly for data visualization. Misspellings, typos, differing nomenclature, etc., will all affect results. 

    Useful tools include OpenRefine and Excel (or Google Sheets).
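
    If you prefer to script part of the cleaning, the same idea can be sketched in Python with the standard library. This is an illustration under stated assumptions, not a replacement for OpenRefine: the sample values and the list of accepted spellings are invented, and the function names are hypothetical. The sketch trims and lowercases each value, then uses `difflib.get_close_matches` to suggest a correction for values that nearly match an accepted spelling.

```python
import difflib


def normalize(value: str) -> str:
    """Trim, collapse internal whitespace, and lowercase a cell value."""
    return " ".join(value.split()).lower()


def suggest_corrections(values, known_spellings, cutoff=0.8):
    """Suggest the closest accepted spelling for each value that is not already accepted."""
    suggestions = {}
    for value in values:
        cleaned = normalize(value)
        if cleaned in known_spellings:
            continue  # already an accepted spelling once normalized
        matches = difflib.get_close_matches(cleaned, known_spellings, n=1, cutoff=cutoff)
        if matches:
            suggestions[value] = matches[0]
    return suggestions


# Invented sample values and accepted spellings, for illustration only.
raw_occupations = ["Baker", " baker ", "Bakr", "Servant", "Servnt", "Mason"]
accepted = ["baker", "servant", "cook"]

suggestions = suggest_corrections(raw_occupations, accepted)
print(suggestions)
```

    Treat the output as suggestions to review rather than automatic fixes. A value with no close match (here, "Mason") is simply left for you to decide on.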

    Controlled Vocabulary

    One of the challenges of working with data is making different pieces of the data speak to one another. For example, within the "Information Wanted" classifieds, those seeking their family members and loved ones noted that those they sought had worked (to name only a few occupations) as bakers, bread makers, cooks, chefs, and kitchen help as well as domestics, servants, household staff, and more. For qualitative data interpretation, the differences between those word choices and those positions can be important. For a data analysis, you may want to have "household labor" versus "agriculture" or "construction" and simplify ten categories into two or three. 

    As you work with your data, consider employing a "controlled vocabulary" limiting all household occupations simply to "household" in an "occupation" category to make your visualizations clearer. How you limit these categories, however, should depend on your research questions and goals. There is no single, right answer but rather choices depending on your interests. 
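
    A controlled vocabulary can be as simple as a lookup table. The Python sketch below is one possible arrangement, not a recommended scheme: the groupings in `CONTROLLED_VOCABULARY`, the "unclassified" fallback, and the sample occupations are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical mapping from terms found in the source to broader categories.
# Your own groupings should follow your research question.
CONTROLLED_VOCABULARY = {
    "baker": "household",
    "bread maker": "household",
    "cook": "household",
    "chef": "household",
    "kitchen help": "household",
    "domestic": "household",
    "servant": "household",
    "household staff": "household",
    "farm laborer": "agriculture",
    "carpenter": "construction",
}


def classify(occupation: str, vocabulary=CONTROLLED_VOCABULARY) -> str:
    """Return the controlled category for an occupation, or 'unclassified'."""
    key = " ".join(occupation.split()).lower()
    return vocabulary.get(key, "unclassified")


# Invented sample entries, for illustration only.
raw_occupations = ["Baker", "servant", " Kitchen help", "Carpenter", "farm laborer", "Sailor"]
category_counts = Counter(classify(o) for o in raw_occupations)
print(category_counts)
```

    Keeping the original wording in its own column and adding the category as a new column preserves the distinctions that matter for qualitative reading.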

    Go Forth and Create

    Your work should start with a research question and continue based on what you have and can find. The questions and processes above are not a one-time step but rather questions you'll have to return to again and again. Whether you compile your data or use someone else's work, you have to decide how to work with it, change it, or maintain it. 

    Let us know if you have questions.