by Nathan Gerth, PhD, University of Nevada, Reno
In 2019, the University of Nevada, Reno Libraries acquired Senator Dean Heller’s electronic records: 3,549 digital objects that constitute the press records for his full tenure in Congress. The Senator elected to open the records immediately, thereby making it a priority to process and post these materials to our digital repository. However, our repository functions best when users can filter by authority headings. As a result, we faced a challenge: how can we efficiently assign metadata to these records, while at the same time avoiding a scenario in which we manually review each file?
This blog post outlines our use of the topic modeling application MALLET to assign headings to the Heller records. While the process was not without its difficulties, the test shows that topic modeling allowed us to enhance the metadata for one third of the records, all while leaving the reading of the texts to MALLET.
For the uninitiated, topic modeling involves looking for statistically meaningful relationships, i.e., topics, within a corpus of text. In the past, I have used topic modeling in an archival setting when assigning subject headings to finding aids or for specific research projects. Yet, in this case I sought to go beyond broad characterizations of a corpus, and instead find a means to associate individual topics with files in the Senator’s press archive.
The Heller records include five file types: Microsoft Word, PDF, JPEG, PNG, and HTML. Of these, the Word documents are by far the most voluminous: 89% of the total number of files. The predominance of these .doc and .docx files, combined with the ease of harvesting text from those files, led me to select them for this test.
To harvest the text, I used textutil, a command-line utility that ships with macOS. Free and open source alternatives exist for Windows and Linux as well, including LibreOffice, which can convert files in batch from the command line. When run in a given directory, textutil converts every file matching a designated pattern to a selected format. For example, the following converts all .docx files to plain text:
$ textutil -convert txt *.docx
After running this command for both .doc and .docx files, I then transferred the resulting .txt files to a staging directory.
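For readers who would rather script the conversion and staging steps together, the following Python sketch wraps the same textutil call for both extensions and then moves the resulting .txt files. It is an illustration only, not the procedure used for this project: the directory names `heller_source` and `heller_data` and both helper functions are hypothetical, and it assumes textutil is available on the PATH (macOS only).

```python
import shutil
import subprocess
from pathlib import Path


def convert_word_files(source_dir):
    """Run textutil on every .doc and .docx file in source_dir (macOS only)."""
    for pattern in ("*.doc", "*.docx"):
        files = sorted(str(p) for p in Path(source_dir).glob(pattern))
        if files:
            # Same call as above; textutil writes each .txt next to its source.
            subprocess.run(["textutil", "-convert", "txt", *files], check=True)


def stage_text_files(source_dir, staging_dir):
    """Move every .txt file from source_dir into staging_dir; return new paths."""
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    moved = []
    for txt in sorted(Path(source_dir).glob("*.txt")):
        target = staging / txt.name
        shutil.move(str(txt), str(target))
        moved.append(target)
    return moved


if __name__ == "__main__":
    source = Path("heller_source")  # hypothetical folder holding the Word files
    if source.is_dir():
        convert_word_files(source)
        staged = stage_text_files(source, "heller_data")  # assumed staging folder
        print(f"Staged {len(staged)} text files")
```

One caveat with this approach: a .doc and a .docx file that share a base name would both be written to the same .txt file, so such pairs need to be renamed before conversion.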
For the topic modeling itself, I closely followed the MALLET tutorial by Shawn Graham, Scott Weingart, and Ian Milligan, edited by Adam Crymble, in the Programming Historian (https://programminghistorian.org/en/lessons/topic-modeling-and-mallet). I began by importing the corpus of text documents:
./bin/mallet import-dir --input heller_data --output heller.mallet --keep-sequence --remove-stopwords --extra-stopwords heller_stopwords.txt
It is worth noting that I included extra stop words, e.g., senator, senate, Heller, etc. Doing so helped ensure that the statistical “noise” of these words did not drown out more meaningful associations.
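A stopword file of this kind is simply a plain-text list of words. As a minimal sketch, the Python below writes one. The list holds only the three examples named above, since the full list is not reproduced in this post; lowercasing them is an assumption on my part, and the helper name `write_stopwords` is hypothetical. It relies on MALLET reading whitespace-separated words from the file passed to --extra-stopwords.

```python
from pathlib import Path

# Only the examples mentioned in this post; the list actually used was longer.
EXTRA_STOPWORDS = ["senator", "senate", "Heller"]


def write_stopwords(words, path):
    """Write de-duplicated, lowercased stopwords to path, one per line."""
    unique = sorted({w.strip().lower() for w in words if w.strip()})
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique
```

Calling `write_stopwords(EXTRA_STOPWORDS, "heller_stopwords.txt")` would produce a file matching the name used in the import command.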
I then generated ten topics, along with a “composition file” that listed the association, by percentage, of each file with a given topic (heller_composition.txt).
./bin/mallet train-topics --input heller.mallet --num-topics 10 --output-state topic-state.gz --output-topic-keys heller_keys.txt --output-doc-topics heller_composition.txt
Of the resulting ten topics, we found that seven lent themselves to clear association with a broader idea or issue:
0 transportation federal commerce tourism international safety neal infrastructure commission fcc travel public nevada’s national access industry security heller’s law technology
1 tax businesses health small families insurance nevadans relief jobs housing business credit care americans workers increase employees financial home law
2 land public lands water reid development lake tahoe federal conservation management economic resources important county online gaming poker projects national
3 vegas las media press bybee stewart advisory day reno august news rural conference speak date pdt event heller’s satellite july
4 law bipartisan education violence sexual enforcement students victims human children women trafficking services national health programs signed cancer assault domestic
5 energy yucca mountain nuclear waste chairman appropriations funding project secretary storage water million member request repository ranking nation’s budget department
6 budget pay unemployment work nation jobs stewart bybee economy economic job spending nevadans debt press back nation’s country proposal date
7 americans public law concerns transparency department rights general rules proposed jerusalem issue administration irs process security secretary agency data open
8 program las funding vegas million federal city department funds project county grant community local rural provide resources communities security secretary
9 veterans affairs care military claims backlog service veteran department health nevada’s receive women working medical member ensure secretary disability homeless
With each of the viable topics identified, I worked with our Metadata and Cataloging department to select appropriate authority headings. In the end, we settled on the following options, some of which were applied in combination to better characterize a topic:
Topic 0: Infrastructure (Economics)--United States
Topic 1: Tax administration and procedure--United States
Topic 2: Public lands|Water conservation|Land conservation
Topic 3: Excluded
Topic 4: Women--United States--Social conditions
Topic 5: Radioactive waste disposal--Nevada--Yucca Mountain|Groundwater--Nevada--Yucca Mountain
Topic 6: Excluded
Topic 7: Excluded
Topic 8: Federal aid to community development--United States
Topic 9: Veterans--Legal status, laws, etc.--United States
I then mapped each of these headings to the corresponding topic column in the composition file. Sorting each column by its probabilities floated the files with the closest ties to that topic to the top.
I then filled each heading down its column until the file names began to suggest that the topic was no longer relevant. After completing this work, I crosswalked these values into our ingest spreadsheet for the collection.
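This sorting and filling was done by hand in a spreadsheet. For readers who want to approximate it in code, here is a minimal Python sketch. It rests on assumptions that are not part of the workflow described in this post. First, it assumes the composition file has one tab-separated row per document: a document number, the file name, then one proportion per topic in topic order. Recent MALLET releases write this layout, but older releases use a different one. Second, a fixed cutoff (`threshold`, with a hypothetical default of 0.5) stands in for the judgment call of reading file names. That cutoff is a cruder substitute for the manual review, not an equivalent. The function names are likewise hypothetical.

```python
# Headings chosen for the seven viable topics; topics 3, 6, and 7 were excluded.
HEADINGS = {
    0: "Infrastructure (Economics)--United States",
    1: "Tax administration and procedure--United States",
    2: "Public lands|Water conservation|Land conservation",
    4: "Women--United States--Social conditions",
    5: ("Radioactive waste disposal--Nevada--Yucca Mountain"
        "|Groundwater--Nevada--Yucca Mountain"),
    8: "Federal aid to community development--United States",
    9: "Veterans--Legal status, laws, etc.--United States",
}


def read_composition(path):
    """Return (file name, [proportion per topic]) pairs from a composition file."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip blank lines and any "#"-prefixed header line.
            if len(fields) < 3 or fields[0].startswith("#"):
                continue
            rows.append((fields[1], [float(v) for v in fields[2:] if v]))
    return rows


def assign_headings(rows, headings, threshold=0.5):
    """Map each file name to the headings of topics at or above threshold."""
    assigned = {}
    for name, proportions in rows:
        matches = [headings[topic] for topic, share in enumerate(proportions)
                   if topic in headings and share >= threshold]
        if matches:
            # Join with "|" to match the delimiter used in the heading list.
            assigned[name] = "|".join(matches)
    return assigned
```

Under those assumptions, `assign_headings(read_composition("heller_composition.txt"), HEADINGS)` returns a dictionary of file names and headings that could be crosswalked into an ingest spreadsheet. Files whose strongest topic is excluded, or that fall below the cutoff, are simply left without a heading.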
In the end, this test showed that MALLET allowed us to look beyond the limited information in the file names when creating metadata. For example, the topic modeling successfully associated our “veterans” topic with a file whose name simply referred to “depleted uranium,” a subject associated with Gulf War Syndrome and, by extension, veterans. Granted, selecting authority headings was a tricky process. However, making those difficult decisions is still preferable to toiling through a manual review of each document.