Explore text and data mining with HathiTrust at a series of workshops

This event is in the past.

Date: October 20, 2020
Time: 11:00 a.m. - 12:30 p.m.
Location: Zoom
Category: Workshop

The Wayne State University Libraries are offering three workshops on text mining with HathiTrust, a repository of more than 17 million digitized items. This massive collection of text is available for computational text mining, primarily through the tools and services of the HathiTrust Research Center (HTRC). Data mining is the process of discovering patterns in large data sets, using methods at the intersection of machine learning, statistics, and database systems. The workshops are spread over four days, October 20-23, and each addresses a different aspect of text and data mining using HathiTrust data and HTRC services. Attendees are not required to attend all workshops and can choose the events that best match their interests and schedules. Those who are entirely new to text mining and to HathiTrust are, however, encouraged to attend the introductory session.

The workshops will be held via Zoom and will include a mix of presentations, discussion and hands-on activities, with breakout rooms to support the hands-on work. You will not be required to install any software to participate in the workshops.

The workshops are open to faculty, graduate students, postdoctoral researchers, librarians and academic staff.

Register here: https://docs.google.com/forms/d/13evQ65qzw-b5GgYlOCXAEZ3bpmD5TEsQ5V_k7bFdh40/viewform?edit_requested=true


October 20 | 11-12:30

“Introduction to HTRC for text and data mining” description

In this session, we will explore the basics of HathiTrust as a data source and how to utilize HTRC as a resource for text and data mining. The workshop will address the various tools and services of the Research Center, and options for accessing text data from HathiTrust for text analysis research. The session will be helpful for those who want a general overview, or who want a solid foundation for the other workshops in the series.

Prerequisites: None.

October 21 | 11-12:30

“HTRC Extracted Features dataset” description

This session will introduce you to the Extracted Features data model and the kinds of research it enables. HTRC recently released an updated version of the Extracted Features dataset (v.2.0) that includes 17+ million files, with each file representing a volume in the HathiTrust Digital Library. The Extracted Features files contain metadata about the volumes, as well as tokens (words), their parts of speech, and their per-page counts. The dataset can be used for text analysis projects in which the algorithm needs the words and word counts in a volume, such as topic modeling or certain kinds of machine learning projects. This session will include a hands-on activity using the dataset.
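
To give a rough sense of the kind of data described above, here is a minimal sketch in Python. It does not use the real Extracted Features file format or any HTRC tool: the `volume` dictionary, its field names (`metadata`, `pages`, `seq`, `tokens`) and the `volume_word_counts` function are all hypothetical, chosen only to show how per-page token counts tagged by part of speech could be summed into volume-level word counts.

```python
from collections import Counter

# Hypothetical, simplified stand-in for one volume's record.
# The field names are assumptions for illustration, not the real HTRC schema.
volume = {
    "metadata": {"title": "Example Volume"},
    "pages": [
        {"seq": 1, "tokens": {"whale": {"NN": 2}, "sea": {"NN": 1}}},
        {"seq": 2, "tokens": {"whale": {"NN": 1}, "sail": {"VB": 1, "NN": 1}}},
    ],
}


def volume_word_counts(vol):
    """Sum per-page, per-part-of-speech counts into volume-level word counts."""
    totals = Counter()
    for page in vol["pages"]:
        for token, pos_counts in page["tokens"].items():
            # Collapse the part-of-speech breakdown into a single count per word.
            totals[token] += sum(pos_counts.values())
    return totals


counts = volume_word_counts(volume)
print(counts.most_common(3))
```

Volume-level counts of this kind are the sort of input that bag-of-words methods such as topic modeling typically expect.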

Prerequisites: either the “Introduction to HTRC for text and data mining” workshop, or some previous experience with HathiTrust or HTRC.

October 22 | 11-12:30

“HTRC Data Capsules environment” description

This session will introduce you to the HTRC’s capsule environment and how it can be used by intermediate and advanced researchers. An HTRC Data Capsule is a virtual machine with special security settings that allows researchers to access text data from HathiTrust, analyze it using the text and data mining methods of their choice, and then export only the results of their analysis. This session will include a hands-on activity using an HTRC Data Capsule.

Prerequisites: either the “Introduction to HTRC for text and data mining” workshop, or some previous experience with HathiTrust or HTRC.


To learn more about text and data mining, visit the libguides:




Mike Hawthorne



