Text & Data Mining

Tools and Techniques

This guide is intended to help researchers and librarians find the content, tools, training, and other assistance available for successful text mining research at Boston College.

The languages, tools, and methods listed below represent a very small portion of the resources available to those interested in text and data mining and analysis. Although they are grouped under specific headings, many are not limited to one type of task.

Extracting & Scraping

Beautiful Soup - Python library used for web-scraping

Import.io - browser-based data extractor that can be used for automatic and bulk extraction and for running APIs

R - a language and environment for statistical computing and graphics that enables data manipulation, calculation, and graphical display (available via BC Citrix server)

RegEx - define and search for patterns in data or text using find and replace operations; also useful for cleaning messy data

Tabula - extract data tables from PDF files

Web Scraper - Chrome browser extension for extracting data from web pages
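
As an illustration of the RegEx entry above, the following is a minimal sketch of pattern-based cleaning and searching using Python's standard `re` module. The function names (`clean_text`, `find_years`), the sample string, and the year range are assumptions chosen for the example, not part of any tool listed here.

```python
import re

def clean_text(raw: str) -> str:
    """Replace tag-like spans with spaces, then collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)        # anything that looks like <...>
    return re.sub(r"\s+", " ", no_tags).strip()   # spaces, tabs, newlines -> one space

def find_years(text: str) -> list[str]:
    """Return four-digit years from 1500 to 2099 (an arbitrary example range)."""
    return re.findall(r"\b(?:1[5-9]\d{2}|20\d{2})\b", text)

# Hypothetical messy input, as might come from a scraped page
sample = "<p>Founded  in 1863,\n the college moved in 1913.</p>"
cleaned = clean_text(sample)
print(cleaned)
print(find_years(cleaned))
```

A pattern this simple is fine for quick cleanup, but a real HTML parser such as Beautiful Soup is usually the safer choice for full web pages.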

Cleaning & Processing

Lexos - integrated workflow of pre-processing, analysis, and visualization tools for finding and exploring patterns in texts

OpenRefine - tool for working with messy data: clean, transform, reconcile, normalize, and extend data; compatible with expression languages (e.g., GREL, Jython)

Stanford Parser - probabilistic natural language parser

Stanford Part-of-Speech Tagger - assigns parts of speech to words or tokens

Analysis & Visualization

AntConc - corpus analysis toolkit for text analysis and creation of concordances

Gephi - visualization toolkit for exploring graphs and networks (available in Digital Studio, O'Neill second floor)

Mallet - Java-based toolkit for statistical natural language processing, including tasks such as document classification, clustering, topic modeling, and information extraction

Textexture - visualize texts as a network

Voyant - suite of web-based tools for text reading, analysis, and visualization
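
To make the idea of a concordance more concrete, here is a rough keyword-in-context (KWIC) sketch in plain Python. It is not how AntConc or Voyant work internally; the `kwic` function, its `width` parameter, the simple tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re

def kwic(text: str, keyword: str, width: int = 3) -> list[str]:
    """List each occurrence of keyword with up to `width` words on either side."""
    tokens = re.findall(r"[A-Za-z']+", text)      # naive tokenizer: letters and apostrophes
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():        # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Made-up sample text
sample = "The whale surfaced. Ahab watched the whale dive, and the whale was gone."
for line in kwic(sample, "whale", width=2):
    print(line)
```

Dedicated tools add sorting, collocation statistics, and support for large corpora, which this sketch leaves out.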

Tool Directories

DH Toychest - extensive list of digital humanities tools, including those for text extraction, analysis, mining, and visualization

DiRT Directory - general directory of digital humanities tools with descriptions and metadata, including fields such as development status, cost, platforms, and categories

TAPoR - directory of tools specifically used for text analysis, retrieval, and visualization

Tutorials

Codecademy - learn to code in Python for data extraction and manipulation

Basic Text Mining in R - tutorial on text mining with R

Programming Historian - learn how to extract, clean, manipulate, and transform data; also includes lessons on topic modeling and text analysis

Scikit Tutorial - learn how to use scikit-learn to analyze topics within a collection of texts

Text Analysis Tutorial - tutorials on how to use topic models for quantitative text analysis in the humanities and social sciences
