Digital History

OCR and Text Analysis

This short guide introduces ways to approach history in a digital environment using a range of digital tools.

Digital scholarship includes many kinds of text analysis, but all of them require a corpus of digital text. This guide takes users through digitizing text from print material using Optical Character Recognition (OCR) and then beginning basic textual analysis in the form of word clouds.

OCR with Print

When working with digital facsimiles of printed material (that is, PDFs or images of newspapers, books, or other typeset material), researchers can use a variety of programs for Optical Character Recognition (OCR), ranging from Adobe Acrobat Professional to Google Docs to Transkribus. These programs convert the images to machine-encoded text, which makes the document searchable.

Adobe Acrobat Pro is easy to use but can be expensive. Transkribus (pictured at right) has a short learning curve while users familiarize themselves with the interface. Google Docs is available as a Google App through your BC email account.
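Once any of these programs has produced machine-encoded text, searching it takes only a few lines of code. The sketch below is a minimal illustration, not a feature of the tools named above: it assumes the OCR output has been saved or pasted as plain text, and the function name `keyword_in_context` and the sample text are invented for this example.

```python
import re


def keyword_in_context(text, keyword, width=30):
    """Return each match of `keyword` with up to `width` characters of context per side."""
    hits = []
    # Whole-word, case-insensitive match; re.escape guards against special characters
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Collapse line breaks so each hit reads as a single line
        hits.append(" ".join(text[start:end].split()))
    return hits


# Hypothetical OCR output, invented for illustration only
sample_text = (
    "INFORMATION WANTED of Mary Kelly, who left home\n"
    "in June last. Any information will be thankfully\n"
    "received by her brother."
)

for hit in keyword_in_context(sample_text, "information"):
    print(hit)
```

Matching is case-insensitive because OCR output from newspapers often mixes capitalized headings with ordinary running text.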

Skill: Basic

Tool(s): Acrobat Professional (paid, available in the Digital Studio); Transkribus (free); Google Docs (free); ABBYY FineReader (paid, available in the Digital Studio)

OCR with Handwriting

Scholars working with hundreds of pages written in the same hand can now try a few OCR programs to interpret that handwriting. With programs like Transkribus, users teach the program to read a specific hand by transcribing some pages themselves and then correcting the program's initial attempts. The more corrected text the program receives, the more accurate it becomes. The process must, however, be repeated for each new writer.

Skill: Intermediate

Tool(s): Transkribus (includes excellent user documentation)

Word Clouds as Text Analysis

Word clouds display how often individual words appear within a dataset, sizing the most frequent words largest to suggest which concepts are most central to a given corpus.

The words in the word cloud to the right come from the 1840-1850 "Information Wanted" advertisements. Here, viewers can see that people are seeking news about family, with an emphasis on children. They are anxious about recently seen loved ones.
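The counting that underlies a word cloud can be sketched in a few lines of Python. This is an illustration of the general idea, not a description of how Tableau builds its visualizations. The sample text, the function name `word_frequencies`, and the short stop-word list are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumed, deliberately short stop-word list; real projects use a longer one
STOP_WORDS = {
    "a", "an", "and", "any", "be", "by", "her", "his",
    "in", "of", "the", "to", "who", "will",
}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each word appears, ignoring case and common stop words."""
    # Lowercase first so "Information" and "information" count as the same word
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stop_words)


# Hypothetical advertisement text, invented for illustration only
sample_text = (
    "Information wanted of John Daly, who left his wife and children. "
    "Information wanted of Ellen Daly and her children."
)

# Print the five most frequent words with their counts
for word, count in word_frequencies(sample_text).most_common(5):
    print(word, count)
```

A word cloud tool then scales each word's display size by its count. Removing stop words such as "of" and "the" keeps common function words from crowding out the terms that carry meaning.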

Skill: Basic

Tool(s): Tableau
