There are multiple kinds of text analysis in the world of digital scholarship. A basic requirement for all of them, however, is to have a corpus of digital text. This guide takes users through digitizing text from print material using Optical Character Recognition (OCR) and then beginning basic textual analysis in the form of word clouds.
When working with digital facsimiles of printed material (i.e., PDFs or images of newspapers, books, or other typeset material), researchers can choose among a variety of OCR programs, ranging from Adobe Acrobat Pro to Google Docs to Transkribus. These programs convert the images to machine-encoded text, which makes the document searchable.
Adobe Acrobat Pro is easy to use but can be expensive. Transkribus (pictured at right) takes a little time to learn, as users need to familiarize themselves with the interface. Google Docs is available as a Google App through your BC email account.
For scholars working with hundreds of pages written in the same hand, there are now a few optical character recognition programs that can attempt to interpret handwriting. With programs like Transkribus, users can teach the program to read a specific hand by supplying some transcriptions and then correcting the program's initial attempts. The more corrected material users provide, the more accurate programs like Transkribus become. The process must, however, be repeated for each new writer.
Tool(s): Transkribus (includes excellent user documentation)
Word clouds present the frequency of individual words within a dataset, typically by displaying more frequent words at a larger size, which highlights how central a given term is within the corpus.
In the word cloud to the right, the words come from the 1840-1850 "Information Wanted" advertisements. Here, viewers can see that people are seeking news about family, with an emphasis on children. They are anxious about recently seen loved ones.
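The word counts behind a word cloud can be computed with a few lines of Python. The following is a minimal sketch rather than the method used for the word cloud described above: the sample text is invented for illustration (it is not taken from the actual advertisements), and the function name `word_frequencies` and the short stop-word list are assumptions. In practice, the input would be the text produced by OCR.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list; real projects usually use a longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "was",
              "for", "on", "his", "her", "by", "be", "will", "any"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each word appears, ignoring case and stop words."""
    # Lowercase the text and pull out runs of letters (and apostrophes) as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)

# Invented sample standing in for OCR output.
sample = (
    "Information wanted of Mary Kelly, daughter of John Kelly. "
    "Any information of her will be received by her brother. "
    "Information wanted of Patrick, son of John, last seen in Boston."
)

frequencies = word_frequencies(sample)

# The most frequent words are the ones a word cloud would draw largest.
for word, count in frequencies.most_common(5):
    print(word, count)
```

Removing common function words such as "of" and "the" before counting keeps them from crowding out the terms that carry meaning. A word cloud tool would then scale each remaining word according to its count.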