What is text and data mining?
Text and data mining (TDM) is the computational analysis of vast quantities of digital information, whether free-form natural language text or structured data.
Using specialized software, researchers can extract data, identify trends, look for patterns and better understand the relationships of terms within and between documents. Analysis might focus on word frequency, words that frequently appear near each other, contextual information for key words, common phrases and other patterns.
Materials to be analyzed range from websites (such as publicly available Facebook posts) and 16th-century manuscripts to DNA sequences and old newspapers.
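As a rough illustration of the kinds of analysis described above (word frequency, words that appear near each other, and contextual information for key words), here is a minimal sketch in Python using only the standard library. The function names, the window sizes, and the two-sentence sample text are all assumptions made up for this example; they are not drawn from any particular TDM tool, and real projects typically rely on dedicated software and much larger corpora.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens, top_n=5):
    """Return the top_n most common words with their counts."""
    return Counter(tokens).most_common(top_n)


def cooccurrences(tokens, window=2):
    """Count unordered pairs of distinct words that appear
    within `window` tokens of each other."""
    pairs = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                pairs[tuple(sorted((word, other)))] += 1
    return pairs


def keyword_in_context(tokens, keyword, width=2):
    """Return each occurrence of keyword with up to `width`
    words of context on either side."""
    lines = []
    for i, word in enumerate(tokens):
        if word == keyword:
            left = " ".join(tokens[max(0, i - width) : i])
            right = " ".join(tokens[i + 1 : i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


# Hypothetical sample text, invented for illustration only.
sample = "The governess taught the children. The children ignored the governess."
tokens = tokenize(sample)

print(word_frequencies(tokens, top_n=3))
print(cooccurrences(tokens).most_common(3))
for line in keyword_in_context(tokens, "governess"):
    print(line)
```

In practice, a project would usually also remove very common words (such as "the") before counting, since they tend to dominate raw frequency lists.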
Policies for Mining Licensed Content
If you wish to undertake a text or data mining project with content from the Libraries’ licensed databases, please contact a Subject Librarian to investigate options, which may include negotiating with the vendor or purchasing access to the data. Although many database licenses prohibit text and data mining and the use of software such as scripts, agents, or robots, we are actively negotiating text mining rights with database vendors. Unauthorized text or data mining in violation of our licenses can result in loss of access for the entire Boston College community.
Please also see our Best Practice Tips for mining licensed databases.
How can we help?
You can begin your inquiry with your Subject Librarian. They can help you find and interpret the terms and conditions that apply to resources you might want to mine.
Your Subject Librarian may also refer you to a Digital Scholarship specialist for help with planning your TDM project and process. If your questions are primarily about tools and techniques, you can set up a consultation with the Digital Scholarship librarians directly.
This is a graphic analysis, constructed using Voyant, of the frequency of terms in the novel Agnes Grey by Anne Brontë.