# Text Mining in Python through the HTRC Feature Reader

Reviewed by Stéfan Sinclair, Catherine DeRose, Ian Milligan

Summary: We introduce a toolkit for working with the 13.6 million volume Extracted Features Dataset from the HathiTrust Research Center. You will learn how to peer at the words and trends of any book in the collection, while developing broadly useful Python data analysis skills.

The HathiTrust holds nearly 15 million digitized volumes from libraries around the world. In addition to their individual value, these works in aggregate are extremely valuable for historians. Spanning many centuries and genres, they offer a way to learn about large-scale trends in history and culture, as well as evidence for changes in language or even the structure of the book. To simplify access to this collection the HathiTrust Research Center (HTRC) has released the Extracted Features dataset (Capitanu et al. 2015): a dataset that provides quantitative information describing every page of every volume in the collection.

In this lesson, we introduce the HTRC Feature Reader, a library for working with the HTRC Extracted Features dataset using the Python programming language. The HTRC Feature Reader is structured to support work using popular data science libraries, particularly Pandas. Pandas provides simple structures for holding data and powerful ways to interact with it. The HTRC Feature Reader uses these data structures, so learning how to use it will also cover general data analysis skills in Python.

Today, you’ll learn:

• How to work with notebooks, an interactive environment for data science in Python;
• Methods to read and visualize text data for millions of books with the HTRC Feature Reader; and
• Data malleability, the skills to select, slice, and summarize extracted features data using the flexible “DataFrame” structure.

## Background

The HathiTrust Research Center (HTRC) is the research arm of the HathiTrust, tasked with supporting research usage of the works held by the HathiTrust. Particularly, this support involves mediating large-scale access to materials in a non-consumptive manner, which aims to allow research over a work without enabling that work to be traditionally enjoyed or read by a human reader. Huge digital collections can be of public benefit by allowing scholars to discover insights about history and culture, and the non-consumptive model allows for these uses to be sought within the restrictions of intellectual property law.

As part of its mission, the HTRC has released the Extracted Features (EF) dataset containing features derived for every page of 13.6 million ‘volumes’ (a generalized term referring to the different types of materials in the HathiTrust collection, of which books are the most prevalent type).

What is a feature? A feature is a quantifiable marker of something measurable, a datum. A computer cannot understand the meaning of a sentence implicitly, but it can understand the counts of various words and word forms, or the presence or absence of stylistic markers, from which it can be trained to better understand text. Many text features are non-consumptive in that they don’t retain enough information to reconstruct the book text.

Not all features are useful, and not all algorithms use the same features. With the HTRC EF Dataset, we have tried to include the most generally useful features, as well as adapt to scholarly needs. We include per-page information such as counts of words tagged by part of speech (e.g. how many times does the word jaguar appear as a lowercase noun on this page), line and sentence counts, and counts of characters at the leftmost and rightmost sides of a page. No positional information is provided, so the data would not specify if ‘brown’ is followed by ‘dog’, though the information is shared for every single page, so you can at least infer how often ‘brown’ and ‘dog’ occurred in the same general vicinity within a text.

Freely accessible and preprocessed, the Extracted Features dataset offers a great entry point to programmatic text analysis and text mining. To further simplify beginner usage, the HTRC has released the HTRC Feature Reader. The HTRC Feature Reader scaffolds use of the dataset with the Python programming language.

This tutorial teaches the fundamentals of using the Extracted Features dataset with the HTRC Feature Reader. The HTRC Feature Reader is designed to make use of data structures from the most popular scientific tools in Python, so the skills taught here will apply to other settings of data analysis. In this way, the Extracted Features dataset is a particularly good use case for learning more general text analysis skills. We will look at data structures for holding text, patterns for querying and filtering that information, and ways to summarize, group, and visualize the data.

## Possibilities

Though it is relatively new, the Extracted Features dataset is already seeing use by scholars, as seen on a page collected by the HTRC.

Underwood leveraged the features for identifying genres, such as fiction, poetry, and drama (2014). Associated with this work, he has released a dataset of 178k books classified by genre alongside genre-specific word counts (Underwood 2015).

The Underwood subset of the Extracted Features dataset was used by Forster (2015) to observe gender in literature, illustrating the decline of women authors through the 19th century.

The Extracted Features dataset also underlies higher-level analytic tools. Mimno processed word co-occurrence tables per year, allowing others to view how correlations between topics change over time (2014). The HT Bookworm project has developed an API and visualization tools to support exploration of trends within the HathiTrust collection across various classes, genres, and languages. Finally, we have developed an approach to within-book topic modelling which functions as a mnemonic accompaniment to a previously-read book (Organisciak 2014).

## Suggested Prior Skills

This lesson provides a gentle but technical introduction to text analysis in Python with the HTRC Feature Reader. Most of the code is provided, but it is most useful if you are comfortable tinkering with it and seeing how the output changes when you do.

We recommend a baseline knowledge of Python conventions, which can be learned with Turkel and Crymble’s series of Python lessons on Programming Historian.

The skills taught here are focused on flexibly accessing and working with already-computed text features. For a better understanding of the process of deriving word features, Programming Historian provides a lesson on Counting Frequencies, by Turkel and Crymble.

A more detailed look at text analysis with Python is provided in the Art of Literary Text Analysis (Sinclair). The Art of Literary Text Analysis (ALTA) provides a deeper introduction to foundation Python skills, as well as introduces further text analytics concepts to accompany the skills we cover in this lesson. This includes lessons on extracting features (tokenization, collocations), and visualizing trends.

The lesson files include a sample of files from the HTRC Extracted Features dataset. After you learn to use the feature data in this lesson, you may want to work with the entirety of the dataset. The details on how to do this are described in Appendix: rsync.

## Installation

For this lesson, you need to install the HTRC Feature Reader library for Python alongside the data science libraries that it depends on.

For ease, this lesson will focus on installing Python through a scientific distribution called Anaconda. Anaconda is an easy-to-install Python distribution that already includes most of the dependencies for the HTRC Feature Reader.

To install Anaconda, download the installer for your system from the Anaconda download page and follow their instructions for installation of either the Windows 64-bit Graphical Installer or the Mac OS X 64-bit Graphical Installer. You can choose either version of Python for this lesson. If you have followed earlier lessons on Python at the Programming Historian, you are using Python 2, but the HTRC Feature Reader also supports Python 3.

### Installing the HTRC Feature Reader

The HTRC Feature Reader can be installed by command line. First open a terminal application:

• Windows: Open ‘Command Prompt’ from the Start Menu and type activate.
• Mac OS/Linux: Open ‘Terminal’ from Applications and type source activate.

If Anaconda was properly installed, you should see something similar to this:

Now, you need to type one command:

conda install -c htrc htrc-feature-reader


This command installs the HTRC Feature Reader and its necessary dependencies. We specify -c htrc so the installation command knows to find the library from the htrc organization.

That’s it! At this point you have everything necessary to start reading HTRC Feature Reader files.

psst, advanced users: You can install the HTRC Feature Reader without Anaconda with pip install htrc-feature-reader, though for this lesson you’ll need to install two additional libraries: pip install matplotlib jupyter. Also, note that not all manual installations are alike because of hard-to-configure system optimizations: this is why we recommend Anaconda. If you think your code is running slowly, you should check that Numpy has access to BLAS and LAPACK libraries and install Pandas’ recommended packages. The rest is up to you, advanced user!

## Start a Notebook

Using Python the traditional way – writing a script to a file and running it – can become clunky for text analysis, where the ability to look at and interact with data is invaluable. This lesson uses an alternative approach: Jupyter notebooks.

Jupyter gives you an interactive version of Python (called IPython) that you can access in a “notebook” format in your web browser. This format has many benefits. The interactivity means that you don’t need to re-run an entire script each time: you can run or re-run blocks of code as you go along, without losing your environment (i.e. the variables and code that are already loaded). The notebook format also makes it easier to examine bits of information as you go along, and allows text blocks to be interspersed with a narrative.

Jupyter was installed alongside Anaconda in the previous section, so it should be available to load now.

From the Start Menu (Windows) or Applications directory (Mac OS), open “Jupyter notebook”. This will start Jupyter on your computer and open a browser window. Keep the console window in the background; the browser is where the magic happens.

If your web browser does not open automatically, Jupyter can be accessed by going to the address “localhost:8888” - or a different port number, which is noted in the console (“The Jupyter Notebook is running at…”):

Jupyter is now showing a directory structure from your home folder. Navigate to the lesson folder where you unzipped lesson_files.zip.

In the lesson folder, open Start Here.ipynb: your first notebook!

Here there are instructions for editing a cell of text or code, and running it. Try editing and running a cell, and notice that it only affects itself. Here are a few tips for using the notebook as the lesson continues:

• New cells are created with the Plus button in the toolbar. When not editing, this can be done by pressing ‘b’ on your keyboard.
• New cells are “code” cells by default, but can be changed to “Markdown” (a type of text input) in a dropdown menu on the toolbar. In edit mode, you can paste in code from this lesson or type it yourself.
• Switching a cell to edit mode is done by pressing Enter.
• Running a cell is done by clicking Play in the toolbar, or with Ctrl+Enter (Ctrl+Return on Mac OS). To run a cell and immediately move forward, use Shift+Enter instead.

An example of a full-fledged notebook is included with the lesson files in example/Lesson Draft.ipynb.

In this notebook, it’s time to give the HTRC Feature Reader a try. When you want to try some code, start a new cell with Plus, and run the code with Play. Before continuing, click on the title to change it to something more descriptive than “Start Here”.

The HTRC Feature Reader library has three main objects: FeatureReader, Volume, and Page.

The FeatureReader object is the interface for loading the dataset files and making sense of them. The files are originally formatted in a notation called JSON (which Programming Historian discusses here) and compressed, which FeatureReader makes sense of and returns as Volume objects. A Volume is a representation of a single book or other work. This is where you access features about a work. Many features for a volume are collected from individual pages; to access Page information, you can use the Page object.

Let’s load two volumes to understand how the FeatureReader works. Create a cell in the already-open Jupyter notebook and run the following code. This should give you the output shown below.

from htrc_features import FeatureReader
import os
paths = [os.path.join('data', 'sample-file1.json.bz2'), os.path.join('data', 'sample-file2.json.bz2')]
fr = FeatureReader(paths)
for vol in fr.volumes():
    print(vol.title)

June / by Edith Barnard Delano ; with illustrations.
You never know your luck; being the story of a matrimonial deserter, by Gilbert Parker ... illustrated by W.L. Jacobs.


Here, the FeatureReader is imported and initialized with file paths pointing to two Extracted Features files. The files are in a directory called ‘data’. Different systems format file paths differently: Windows uses backslashes (‘data\…’) while Linux and Mac OS use forward slashes (‘data/…’). os.path.join is used to make sure that the file path is correctly structured, a convention that ensures code works across these different platforms.
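If you are curious what os.path.join actually produces, you can try it in a cell of its own. A quick standalone sketch, using the sample file names from this lesson:

```python
import os

# os.path.join inserts the separator for the current system, so the same
# code produces 'data\...' on Windows and 'data/...' on Mac OS/Linux.
path = os.path.join('data', 'sample-file1.json.bz2')
print(path)

# Building several paths at once with a list comprehension:
filenames = ['sample-file1.json.bz2', 'sample-file2.json.bz2']
paths = [os.path.join('data', fn) for fn in filenames]
print(paths)
```

Run on Mac OS or Linux, the first print shows `data/sample-file1.json.bz2`; on Windows it shows the same path with a backslash.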

With fr = FeatureReader(paths), the FeatureReader is initialized, meaning it is ready to use. An initialized FeatureReader is holding references to the file paths that we gave it, and will load them into Volume objects when asked.

Consider the last bit of code:

for vol in fr.volumes():
    print(vol.title)


This code asks for volumes in a way that can be iterated through. The for loop is saying to fr.volumes(), “give me every single volume that you have, one by one.” Each time the for loop gets a volume, it starts calling it vol, runs what is inside the loop on it, then asks for the next one. In this case, we just told it to print the title of the volume.

You may recognize for loops from past experience iterating through what is known as a list in Python. However, it is important to note that fr.volumes() is not a list. If you try to access it directly, it won’t print all the volumes; rather, it identifies itself as something known as a generator:

What is a generator, and why do we iterate over it?

Generators are the key to working with lots of data. They allow you to iterate over a set of items that don’t exist yet, preparing them only when it is their turn to be acted upon.

Remember that there are 13.6 million volumes in the Extracted Features dataset. When coding at that scale, you need to be mindful of two rules:

1. Don’t hold everything in memory: you can’t. Use it, reduce it, and move on.
2. Don’t devote cycles to processing something before you need it.

A generator simplifies such on-demand, short term usage. Think of it like a pizza shop making pizzas when a customer orders, versus one that prepares them beforehand. The traditional approach to iterating through data is akin to making all the pizzas for the day before opening. Doing so would make the buying process quicker, but it also adds a huge upfront time cost, needs larger ovens, and necessitates the space to hold all the pizzas at once. An alternate approach is to make pizzas on-demand when customers buy them, allowing the pizza place to work with smaller capacities and without pizzas lying around the shop. This is the type of approach that a generator allows.

Volumes need to be prepared before you do anything with them: read, decompressed, and parsed. This ‘initialization’ of a volume is done when you ask for the volume, not when you create the FeatureReader. In the above code, after you run fr = FeatureReader(paths), there are still no Volume objects held behind the scenes: just the references to the file locations. The files are only read when their turn comes in the loop over the generator fr.volumes(). Note that because of this one-by-one reading, the items of a generator cannot be accessed out of order (e.g. you cannot ask for the third item of fr.volumes() without going through the first two).
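A small standalone example, unrelated to the Feature Reader, makes the pizza analogy concrete: a generator function does no work until the loop asks for the next item.

```python
def make_pizzas(orders):
    """Yield one 'pizza' at a time, doing the work only when asked."""
    for order in orders:
        # In real code, this is where the expensive step would happen,
        # e.g. reading and decompressing a volume file.
        yield 'pizza with ' + order

pizzas = make_pizzas(['mushrooms', 'olives'])
print(pizzas)         # a generator object: no pizzas have been made yet

for pizza in pizzas:  # each pizza is prepared only at this point
    print(pizza)
```

The first print shows a generator object rather than a list of results; only the for loop triggers the work inside the function, one item per iteration.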

## What’s in a Volume?

Let’s take a closer look at what features are accessible for a Volume object. For clarity, we’ll grab the first Volume to focus on, which can conveniently be accessed with the first() method. Any code you write can easily be run later with a for vol in fr.volumes() loop.

Again here, start a new code cell in the same notebook that you had open before and run the following code. The FeatureReader does not need to be loaded again: it is still initialized and accessible as fr from earlier.

# Reading a single volume
vol = fr.first()
vol

<htrc_features.feature_reader.Volume at 0x1cf355a60f0>


While the majority of the HTRC Extracted Features dataset is features, quantitative abstractions of a book’s written content, there is also a small amount of metadata included for each volume. We already saw Volume.title accessed earlier. Other metadata attributes include:

• Volume.id: A unique identifier for the volume in the HathiTrust and the HathiTrust Research Center.
• Volume.year: The publishing date of the volume.
• Volume.language: The classified language of the volume.
• Volume.oclc: The OCLC control number(s).

The volume id can be used to pull more information from other sources. The scanned copy of the book can be found in the HathiTrust Digital Library, when available, by accessing http://hdl.handle.net/2027/{VOLUME ID}. In the Feature Reader, this URL is retrieved by calling vol.handle_url:

print(vol.handle_url)

http://hdl.handle.net/2027/nyp.33433075749246
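vol.handle_url is a convenience; the same URL can be assembled by hand from the volume id. A quick standalone sketch, using the id printed above as a stand-in:

```python
# Build the HathiTrust handle URL from a volume id by hand.
# The id below is the sample volume's id shown above.
vol_id = 'nyp.33433075749246'
url = 'http://hdl.handle.net/2027/' + vol_id
print(url)  # http://hdl.handle.net/2027/nyp.33433075749246
```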


Hopefully by now you are growing more comfortable with the process of running code in a Jupyter notebook, starting a cell, writing code, and running the cell. A valuable property of this type of interactive coding is that there is room for error. An error doesn’t cause the whole program to crash, requiring you to rerun everything from the start. Instead, just fix the code in your cell and try again.

In Jupyter, pressing the ‘TAB’ key will guess at what you want to type next. Typing vo then TAB will fill in vol, typing Fea then TAB will fill in FeatureReader.

Auto-completion with the tab key also provides more information about what you can get from an object. Try typing vol. (with the period) in a new cell, then press TAB. Jupyter shows everything that you can access for that Volume.

The Extracted Features dataset does not hold all the metadata that the HathiTrust has for the book. More in-depth metadata like genre and subject class needs to be grabbed from other sources, such as the HathiTrust Bibliographic API. The URL to access this information can be retrieved with vol.ht_bib_url.

An additional data source for metadata is the HTRC Solr Proxy, which allows searches for many books at a time, but only for Public Domain books. vol.metadata can ask this source for metadata straight from your code. Remember that pinging HTRC adds overhead, so an efficient large-scale algorithm should avoid vol.metadata.

## Our First Feature Access: Visualizing Words Per Page

It’s time to access the first features of vol: a table of total words for every single page. These can be accessed by calling vol.tokens_per_page(). Try the following code.

If you are using a Jupyter notebook, returning this table at the end of a cell formats it nicely in the browser. Below, you’ll see us append .head() to the tokens table, which allows us to look at just the top few rows: the ‘head’ of the data.

tokens = vol.tokens_per_page()
# Show just the first few rows, so we can see what the table looks like
tokens.head()

count
page
1 5
2 0
3 1
4 0
5 1

No print! We didn’t call ‘print()’ to make Jupyter show the table. Instead, it automatically guessed that you want to display the information from the last code line of the cell.

This is a straightforward table of information, similar to what you would see in Excel or Google Spreadsheets. Listed in the table are page numbers and the count of words on each page. With only two dimensions, it is trivial to plot the number of words per page. The table structure holding the data has a plot method for data graphics. Without extra arguments, tokens.plot() will assume that you want a line chart with the page on the x-axis and word count on the y-axis.

%matplotlib inline
tokens.plot()


%matplotlib inline tells Jupyter to show the plotted image directly in the notebook web page. It only needs to be called once, and isn’t needed if you’re not using notebooks.

On some systems, generating the first plot may take some time. It is clear that pages at the start of a book have fewer words per page, after which the count is fairly steady except for occasional valleys.

You may have some guesses for what these patterns mean. A look at the scans confirms that the large valleys are often illustration pages or blank pages, small valleys are chapter headings, and the upward pattern at the start is from front matter.

Not all books will have the same patterns, so we can’t simply codify these correlations for millions of books. However, looking at this plot makes clear an important assumption in text and data mining: that there are patterns underlying even the basic statistics derived from a text. The trick is to identify the consistent and interesting patterns and teach them to a computer.

### Understanding DataFrames

Wait… how did we get here so quickly!? We went from a volume to a data visualization in two lines of code. The magic is in the data structure used to hold our table of data: a DataFrame.

A DataFrame is a type of object provided by the data analysis library, Pandas. Pandas is very common for data analysis, allowing conveniences in Python that are found in statistical languages like R or Matlab.

In the first line, vol.tokens_per_page() returns a DataFrame, something that can be confirmed if you ask Python about its type with type(tokens). This means that after setting tokens, we’re no longer working with HTRC-specific code, just book data held in a common and very robust table-like construct from Pandas. tokens.head() used a DataFrame method to look at the first few rows of the dataset, and tokens.plot() uses a method from Pandas to visualize data.
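Since tokens is an ordinary DataFrame, you can experiment with the same methods even without a volume loaded. This standalone sketch rebuilds a miniature version of the table by hand (the first five counts echo the table shown earlier; the last two are invented for illustration):

```python
import pandas as pd

# A stand-in for vol.tokens_per_page(): page numbers as the index and a
# single 'count' column, built from made-up data.
tokens = pd.DataFrame({'count': [5, 0, 1, 0, 1, 322, 331]},
                      index=pd.Index(range(1, 8), name='page'))

print(type(tokens))           # a plain pandas DataFrame
print(tokens.head())          # the first five rows, as before
print(tokens['count'].sum())  # total words across these pages
```

Everything here — head(), sum(), and plot() — is Pandas, not HTRC-specific code, which is why the same methods worked on the real tokens table.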

Many of the methods in the HTRC Feature Reader return DataFrames. The aim is to fit into the workflow of an experienced user, rather than requiring them to learn proprietary new formats. For new Python data mining users, learning to use the HTRC Feature Reader means learning many data mining skills that will translate to other uses.

The information contained in vol.tokens_per_page() is minimal, a sum of all words in the body of each page. The Extracted Features dataset also provides token counts with much more granularity: for every part of speech (e.g. noun, verb) of every occurring capitalization of every word of every section (i.e. header, footer, body) of every page of the volume.

tokens_per_page() only kept the “for every page” grouping; vol.tokenlist() can be called to return section-, part-of-speech-, and word-specific details:

tl = vol.tokenlist()
# Let's look at some words deeper into the book:
# from 1000th to 1100th row, skipping by 15 [1000:1100:15]
tl[1000:1100:15]

count
page section token pos
27 body those DT 1
within IN 1
28 body a DT 3
be VB 1
deserted VBN 1
faintly RB 1
important JJ 1

As before, the data is returned as a Pandas DataFrame. This time, there is much more information. Consider a single row:

The columns in bold are an index. Unlike the typical one-dimensional index seen before, here there are four dimensions to the index: page, section, token, and pos. This row says that for the 24th page, in the body section (i.e. ignoring any words in the header or footer), the word ‘years’ occurs 1 time as a plural noun. The part-of-speech tag for a plural noun, NNS, follows the Penn Treebank definition.
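The four-level index is plain Pandas (a MultiIndex), so its behaviour can be explored with a toy DataFrame modeled on the rows shown above:

```python
import pandas as pd

# A toy version of vol.tokenlist(): a four-level index
# (page, section, token, pos) and a 'count' column, with rows
# copied from the sample output above.
index = pd.MultiIndex.from_tuples(
    [(27, 'body', 'those', 'DT'),
     (27, 'body', 'within', 'IN'),
     (28, 'body', 'a', 'DT'),
     (28, 'body', 'be', 'VB')],
    names=['page', 'section', 'token', 'pos'])
tl = pd.DataFrame({'count': [1, 1, 3, 1]}, index=index)

print(tl)
# Selecting on the first index level pulls every row for page 28:
print(tl.loc[28])
```

Selecting by the outermost level (the page) is often the first step in per-page analysis; the same loc-based selection works on the real tokenlist DataFrame.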

The “words” on the first page seems to be OCR errors for the cover of the book. The HTRC Feature Reader refers to “pages” as the