Bringing the Hammer Down: Using MALLET to Assign Authority Records to Senator Heller’s Press Records

by Nathan Gerth, PhD, University of Nevada, Reno


In 2019, the University of Nevada, Reno Libraries acquired Senator Dean Heller’s electronic records: 3,549 digital objects that constitute the press records for his full tenure in Congress. The Senator elected to open the records immediately, thereby making it a priority to process and post these materials to our digital repository. However, our repository functions best when users can filter by authority headings. As a result, we faced a challenge: how can we efficiently assign metadata to these records, while at the same time avoiding a scenario in which we manually review each file?

This blog post outlines our use of the topic modeling application MALLET to assign headings to the Heller records. While the process was not without its difficulties, the test shows that topic modeling allowed us to enhance the metadata for one third of the records, all while leaving the reading of the texts to MALLET.

Topic Modeling

For the uninitiated, topic modeling involves looking for statistically meaningful relationships, i.e., topics, within a corpus of text. In the past, I have used topic modeling in an archival setting when assigning subject headings to finding aids or for specific research projects. Yet, in this case I sought to go beyond broad characterizations of a corpus, and instead find a means to associate individual topics with files in the Senator’s press archive.

The Process

The Heller records include five file types: Microsoft Word, PDF, JPEG, PNG, and HTML. Of these, the Word documents are by far the most voluminous: 89% of the total number of files. The predominance of these .doc and .docx files, combined with the ease of harvesting text from those files, led me to select them for this test.

For harvesting the text, I chose the textutil utility, which ships with the operating system on Macs. There are numerous free and open-source utilities available for Windows and Linux as well, including LibreOffice, which can convert files in batch from the command line. When run in a given directory, textutil converts all instances of a designated file type to a selected format. For example, the following converts all .docx files:

$ textutil -convert txt *.docx

After running this command for both .doc and .docx files, I then transferred the resulting .txt files to a staging directory.

For the topic modeling itself, I closely followed the MALLET tutorial Adam Crymble wrote for the Programming Historian. I began by importing the corpus of text documents:

./bin/mallet import-dir --input heller_data --output heller.mallet --keep-sequence --remove-stopwords --extra-stopwords heller_stopwords.txt

It is worth noting that I included extra stop words, e.g., "senator," "senate," and "Heller." Doing so helped ensure that the statistical "noise" of these words did not drown out more meaningful associations.

I then generated ten topics, along with a “composition file” that listed the association, by percentage, of each file with a given topic (heller_composition.txt).

./bin/mallet train-topics --input heller.mallet --num-topics 10 --output-state topic-state.gz --output-topic-keys heller_keys.txt --output-doc-topics heller_composition.txt
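As a sketch of how the resulting composition file can be handled downstream (my own illustration, not code from the post), the file can be read into a simple mapping of file names to topic proportions. The exact column layout varies by MALLET version; this assumes the common tab-separated layout of a document id, the file name, and then one proportion column per topic:

```python
# Sketch (assumption, not from the post): parse MALLET's doc-topics
# "composition" output into {filename: [topic0_prop, topic1_prop, ...]}.
# Assumes tab-separated rows of doc-id, filename, then one proportion
# column per topic (the layout differs across MALLET versions).
import csv

def read_composition(path):
    docs = {}
    with open(path, newline="") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            if not row or row[0].startswith("#"):  # skip blanks and the header comment
                continue
            doc_id, filename, *props = row
            docs[filename] = [float(p) for p in props]
    return docs
```

With ten topics, each file then maps to a list of ten proportions that sum to roughly 1.0.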

Of the resulting ten topics, we found that seven lent themselves to clear association with a broader idea or issue:

0          transportation federal commerce tourism international safety neal infrastructure commission fcc travel public nevada’s national access industry security heller’s law technology

1          tax businesses health small families insurance nevadans relief jobs housing business credit care americans workers increase employees financial home law

2          land public lands water reid development lake tahoe federal conservation management economic resources important county online gaming poker projects national

3          vegas las media press bybee stewart advisory day reno august news rural conference speak date pdt event heller’s satellite july

4          law bipartisan education violence sexual enforcement students victims human children women trafficking services national health programs signed cancer assault domestic

5          energy yucca mountain nuclear waste chairman appropriations funding project secretary storage water million member request repository ranking nation’s budget department

6          budget pay unemployment work nation jobs stewart bybee economy economic job spending nevadans debt press back nation’s country proposal date

7          americans public law concerns transparency department rights general rules proposed jerusalem issue administration irs process security secretary agency data open

8          program las funding vegas million federal city department funds project county grant community local rural provide resources communities security secretary

9          veterans affairs care military claims backlog service veteran department health nevada’s receive women working medical member ensure secretary disability homeless

With each of the viable topics identified, I worked with our Metadata and Cataloging department to select appropriate authority headings. In the end, we settled on the following options, some of which were applied in combination to better characterize a topic:

Topic 0: Infrastructure (Economics)--United States

Topic 1: Tax administration and procedure--United States

Topic 2: Public lands|Water conservation|Land conservation

Topic 3: Excluded

Topic 4: Women--United States--Social conditions

Topic 5: Radioactive waste disposal--Nevada--Yucca Mountain|Groundwater--Nevada--Yucca Mountain

Topic 6: Excluded

Topic 7: Excluded

Topic 8: Federal aid to community development--United States

Topic 9: Veterans--Legal status, laws, etc.--United States

I then mapped each of these headings onto the corresponding topic column in the composition document. Sorting each column by its listed probabilities floated the file names with the closest ties to a given topic to the top of the column.

I then dragged a given heading down its column until the file titles began to suggest that the topic was no longer relevant. After completing this work, I crosswalked these values into our ingest spreadsheet for the collection.
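The sort-and-cutoff step above can be sketched in code (my own illustration; the author did this by hand in a spreadsheet, and the 0.3 threshold below is an arbitrary stand-in for the eyeballed cutoff):

```python
# Sketch (assumption): rank files by their association with one topic
# column of the composition data and keep those above a cutoff,
# mimicking the manual sort-and-drag step. The threshold is illustrative.
def files_for_topic(docs, topic, threshold=0.3):
    """docs maps filename -> list of per-topic proportions; return the
    filenames whose proportion for `topic` meets the threshold,
    strongest association first."""
    ranked = sorted(docs.items(), key=lambda kv: kv[1][topic], reverse=True)
    return [name for name, props in ranked if props[topic] >= threshold]
```

Each surviving filename would then receive the authority heading mapped to that topic (e.g., topic 9 to the veterans heading) in the ingest spreadsheet.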


In the end, this test showed that using MALLET allowed us to look beyond the limited information in the filenames when creating metadata. For example, the topic modeling successfully associated our "veterans" topic with a file whose name referred only to "depleted uranium," a subject associated with Gulf War Syndrome and, by extension, veterans. Granted, selecting authority headings was a tricky process. However, making those difficult decisions is still preferable to toiling through a manual review of each document.

CPS Day 2018: Diversity, Advocacy, and Data

The Congressional Papers Section (CPS) met during the Society of American Archivists (SAA) annual conference in Washington, D.C. on August 15. The venue in the nation’s capital provided session organizers a chance to bring in congressional staffers and gave the opportunity for archivists to visit with their congressional delegations.  The most prominent theme of the day was “advocacy”—advocacy for the history of undocumented groups, advocacy for saving important data, and advocacy for congressional offices to keep and donate their records—and the various ways this can be achieved.

The first session, sponsored by the CPS Diversity Task Force, was based on “archival silences” or the “empty spaces in…archival records and/or practices.” Panel contributors were from a public research library (Andrea Copeland), a history/memory preservation project (Adrena Ifill), a public and civic programs foundation (Scott St. Louis), and the Congressman Charles Rangel Archive Project (Kimberly Peach).  Talks were on the following: the records of the Indianapolis AME Bethel Church (founded in 1836) and how Copeland persuaded their custodians to donate them to her institution; how Ifill helps African-American members of Congress to place their records in archives and draws documents from a variety of sources to tell their stories; how Peach uses the Rangel collection to illustrate local history; and how public policy foundations and archives can further civic education and illuminate previously under-studied areas.  Copeland’s “lessons learned” included the need to diversify the profession (to better work with minority communities and access otherwise forgotten collections), getting participation from heritage groups and third party organizations, and being open to working with living donors.  She introduced the idea of “participatory” heritage, a new and more activist way of thinking about archives as “service to the community.”

Diversity Panel (with speaker Andrea Copeland)

The CPS Electronic Records Committee (ERC), in partnership with the CPS Constituent Services System (CSS) Task Force, put together the second session on preserving and accessing CSS data, which can contain important topical, administrative, and demographic information. The session's aim was to "demystify" CSS data and argue for its inclusion in a collection.  Along with a summary of the Task Force's recent report and a Senate Historical Office survey of current CSS use, the session introduced West Virginia University's promising new tool to make CSS data accessible to researchers.  The tool has now been granted a Lyrasis Catalyst Grant for further research and development, and both the ERC and the Task Force plan to follow its progress.[1]  The group then got an opportunity to talk with a Senate systems administrator (Vik Kulkarni) and a House office legislative correspondent (Lucy Shaw).  Questions covered duplication between CSS and shared drives, who uses CSS, the differences between the House and Senate vendor contracts and downloads, and when to start the conversation about getting CSS data (Vik joked you should ask before the Member is elected!).  Vik urged archivists to be thorough in discussions about the CSS with offices: find out the context of the data, get internal codes defined, and gather technical facts and any system migration information.  Lucy pointed out that she sometimes finds her CSS clunky and glitchy, and that archivists should be aware of past problems and the presence of "junk data."

CSS Panel (with Jessica Tapia showing the WVU CSS Tool)

The final panel, moderated by Nathan Gerth of the Advocacy Task Force, focused on the idea and practice of advocacy, on which CPS has a new brief, "CPS Advocacy Day 2018."[2]  The session featured "insider perspectives" from several Senate staffers from offices in various stages of archiving, along with the experiences of a repository archivist, Leigh McWhite of the University of Mississippi. It was an open and often frank discussion, ranging from how we as archivists can better advise offices, getting the attention of the Member, useful communication strategies, and why the word "archivist" still scares staffers (and what to do about it), to the challenges of finding the most suitable repository, including whether a repository will even want the collection. On that last issue, which McWhite joked was a "dirty secret," some institutions have internal collecting policies that limit what they can take, and bureaucratic barriers on their end may prevent more active outreach.

Hopefully, attendees left motivated by new ways to share and showcase their collections and to work with legislators and their staffers, and convinced that saving CSS data (which can contain those communications and interactions with underrepresented groups!) may be worth it.

[1] The WVU CSS tool is available for download.  They have used Senate and House office data sets: one using the "Archival" download (a limited number of fields) and one using the more robust "SDIFF" interchange.  More information about the tool can be found in the CSS Task Force report.

[2] Should be available on the CPS website.

“Ask A Systems Administrator” Series: Volume Four – Constituent Services Systems (CSS)

The Electronic Records Committee (ERC) of the Congressional Papers Section (CPS) started a blog series in 2016 entitled "Ask a Systems Administrator," in which repository archivists working with congressional electronic records submit questions for systems administrators working in congressional offices. The goal of this project was to help congressional archivists better understand the congressional environment.

For the ERC’s session on preserving and accessing CSS data held during CPS Day 2018, Committee member Elisabeth Butler of the Senate Historical Office put together a simple survey of U.S. Senate systems administrators (SA) and correspondence managers on whether offices were using CSS for purposes beyond managing constituent correspondence and casework.  Her colleague, Alison White, presented a quick summary of the survey results at the session.  The survey provides a snapshot of how congressional offices are currently employing their CSS.

The survey (built using the free version of Survey Monkey) was sent to the SA and correspondence manager listservs on July 23 and again on July 31. A total of 34 offices responded (answers were kept anonymous).

Question 1: Is your office using CSS beyond basic constituency work?
[Chart of survey responses omitted]
Question 2: If you answered "yes," what additional functions have been added to your CSS? Please check all that apply.

[Chart of survey responses omitted]
Note that “legislative correspondence/drafting” and “calendar” were the most popular choices. The 8 “other” responses included:

  • “tele town hall, e-newsletters and targeted emails, tours, flags. We have also created customized services to track scheduling requests and stories from constituents so we can better categorize to pull for Senator’s remarks and floor speeches.”
  • “gift tracking, award tracking, meeting follow-up to dos, projects and grants tracking.”
  • “email outreach/press releases.”

Question 3: Use This Box to Provide Additional Information or Elaborate on Your Response

This question received nine responses, most having to do with the Senator's scheduling; others included:

  • “we utilize an in-house sharepoint 2016 on premise installation for our document management, calendars, and CMS.”
  • “We use our CSS to draft legislative response letters to reply to our constituent mail (physical and electronic)”[response also reported that legislative staff pair their Outlook accounts to the CSS to receive notifications of letter approvals]
  • “I think we might use it for more things if the functionality was there…and if staff were more strongly compelled to use it”!

Question 4 was for those administrators who wanted to further discuss their offices’ use of CSS with the Senate Historical Office’s archivists.

Vik Kulkarni, a Senate Systems Administrator, who was on the ERC session panel, backed up these responses. Vik, who has set up several personal offices’ IT systems, has found that staffers will often start using the CSS for document management because they find it better structured than the shared drives! They also find it a good way to keep “final” versions of documents organized and accessible.  Want to know more about what Vik (and Lucy Shaw from the House side) said about the CSS? Check out the CPS Day 2018 recap.

The “Future of Email Archives” Report: A Summary and Review

In 2016 the Task Force on Technical Approaches to Email Archives was formed to "construct a working agenda for the community" and articulate "a conceptual and technical framework" for archiving email.  A Task Force feature of note is that, alongside members from academic libraries and institutional archives, there are representatives from two private-sector tech companies, Microsoft and Google.  Its final report, "The Future of Email Archives," is now out.[1]  The report's "primary objective…is assessing and recommending methods by which archivists can engage with these new…technologies as well as identifying gaps and recommending additional development." The report recognizes that, while most archives have a basic handle on capturing, preserving, and providing access to email, they do not yet do so systematically enough to be truly effective. It is not enough to have the shared goal of archiving email; there must also be a deployment of new techniques and approaches, complemented by advanced technology and improved advocacy.  The following is a synopsis and review of the report.

In the first section, the "Untapped Potential of Email," the Task Force observes that, far from being a dying medium, email is still pervasive and powerful, and still contains the important communications that archivists need to capture and maintain to fulfill their critical responsibility of preserving our history. It is imperative for archivists to understand email in order to become fully knowledgeable about their collections.  The nature of email—its ephemerality, size, tendency to disorganization, software requirements—means that it is fundamentally different from other electronic records. Among the many other points made is that email is already changing traditional archival practices, in that it is drawing more repositories to acquire born-digital accessions.  The following section, the "Email Stewardship Lifecycle," extensively goes through the email lifecycle model, with the associated preservation challenges and suggestions for managing each stage.

The more technical and substantive material exists in the next three sections, which are about defining email, assessing the landscape of email preservation, and identifying problems and promulgating potential solutions.  The section "Email as a Documentary Technology" explains an email's architectural characteristics (the structure of a message, operational and administrative features, software systems, and why email became so widely adopted), security vulnerabilities and their complications, and additional components (such as attachments and links).  The next section, "Current Services and Trends," is an excellent summary of the state of email communications in today's society and how the IT industry and the archival/library community are addressing various challenges.  Included are problems with email abuse and security (which can have a deleterious effect on ensuring authenticity) and the features industry is introducing to combat them (the report notes which solutions are the most and least valuable to archives). Also considered is the range of existing records management solutions and how useful they are (or not) to long-term email archiving, and the additional work the archival and library community has taken on in capturing, authenticating, and preserving messages. This section notes that the archival community has already consolidated around disk imaging and various types of data exports as effective ways to capture emails.  Current preservation tools and strategies are discussed in "Potential Solutions and Sample Workflows."  As the report states, "regardless of the chosen approach, tools should be able to exchange data regarding both the email and relevant preservation actions."  The pros and cons of various preservation strategies, and suggestions for what a community data model should look like, are thoroughly discussed.

The report closes with the “Path Forward: Recommendations and Next Steps,” which provides both short-term and long-term proposals in the areas of community development/advocacy and tools support, testing, and development, helpfully targeted to repositories of varying levels of electronic records archiving and tech-savviness.

Some immediate takeaways from the report:

  • The documenting of process should be central to any email preservation strategy, and is part of an overall need to develop new kinds of archival practices (such as aggregate description).
  • The unique demands of email require extensive human intervention, and much earlier in the lifecycle than other electronic records because “decisions made or not made by those who send, receive, and manage email [at the start]…determine if and how it can be made available to future researchers.”
  • The openness of email that is key to its success and popularity is the primary complication of preserving it in the long-term, in accordance with archival principles.
  • Archivists should be aware that current "email archiving" industry solutions are largely geared to legal or records management issues of compliance, retention, and risk management, i.e., deletion of emails as soon as possible!
  • Since email is bulky and unorganized, automation (such as predictive coding) and even artificial intelligence in some form will play an increasing role in its management.

Scattered throughout the report are workflows/scenarios, term and format definitions, technical explanations, and case studies. One of the most valuable parts of the report is the series of appendices on email preservation tools and current email preservation research projects, including an automating system processes model.  It’s a great resource for those who want to know about the latest initiatives and standards in email archiving.

The report provides one of the most comprehensive assessments of the subject and many readers will find their challenges, concerns, and experiences with archiving email reflected and verified. It certainly succeeds in establishing an "interoperable toolkit" that provides useful information and solid ideas for creating or furthering an email archiving plan or strategy without getting bogged down by policy decision specifics or limitations of tools.  The report not only "gets down in the weeds," but also considers the bigger picture by making connections between current technology and process and available financial and labor resources.  The report is not a one-off but part of a larger Task Force program to secure stakeholders, follow up on promising approaches, and augment existing recommendations.  The Task Force urges that, at whatever level you perform email preservation and access, you can achieve a workable solution.

You can download or purchase the report from the CLIR website (CLIR Pub 175).

[1] Officially announced at the session "Email Archiving Comes of Age" #201 on August 16, 2018 at the Society of American Archivists annual conference in Washington, D.C.

Why I Added “DAS” to My Signature File

by Mary Goolsby, MLIS, CA, DAS, Baylor University

The Digital Archives Specialist (DAS) is one of two certificate programs available through the Society of American Archivists (SAA). Through both online and in-person courses, the curriculum is “designed to provide you with the information and tools you need to manage the demands of born-digital records.”[1] SAA provides excellent information on their website for those interested in checking out individual courses to meet a specific need and for those who want to pursue the DAS certificate. To earn a DAS certification, you must complete courses in four areas: Foundational (four), Tactical and Strategic (three), Tools and Services (one) and Transformational (one). SAA allows 24 months to complete the coursework, then up to five months after that to pass the comprehensive exam. All tests for individual courses are timed and taken online, with testing out of Foundational courses as an option. The comprehensive exam is taken the same way, but is only available during the months of April, August and December.

I’ve worked in the Baylor Libraries in Texas since 1995, first as a staff member in the fine arts library when I began my MLIS, and then as an assistant in library development when I completed my degree in December 2000. I moved to the W. R. Poage Legislative Library in June 2008, but didn’t begin working with archives until five years after arriving. I was strongly encouraged by the interim Dean to take the Certified Archivist (CA) exam and passed in August 2016.

I decided to pursue the DAS certificate to increase my understanding of the digital records received by congressional archives and to make better decisions about what to digitize. The nuts and bolts of getting the certificate were not overly taxing. SAA requires that you take at least two courses in person; however, I ended up taking seven on site and four online.[2] Two of the in-person courses were taught here at Baylor and hosted by the Poage Library. Hosting provided a good way for me to take two classes I needed, and also benefited other library staff members on campus. The farthest in-person course I took was in Kentucky. It was my last class and I wanted to take the comprehensive exam in April 2018, so I traveled for the timing to work out. The other courses were closer for me in Houston and Austin. Poage Library paid for most of the courses and I paid for travel.

Earning a DAS certificate facilitated my growth in knowledge of digital archives. I took courses in the essentials of electronic records – appraisal, arrangement and description, and preservation and access. I learned how to manage, curate and preserve digital archives. I even took a class on command line interface which was new to me, yet seemed familiar in an HTML kind of way.

What the process didn’t do was teach me the technical skills to be my own digital archive shop.[3] A few of the courses provided hands-on experiments with ingesting, arranging, and describing digital records. I did not possess a lot of technical skills other than HTML going in, and the DAS curriculum was not designed to teach them to me since the certificate requires only one “Tools and Services” course. I am very fortunate that at Baylor we have a digitization center with staff who possess the technical skills necessary to work with digital objects. However, they are not archivists trained in archival principles with an understanding of which artifacts possess significant research value, and are thus worthy of sustaining digitally across generations of technology. Digitizing material or retaining digital objects without applying archival principles is obviously expensive and unsustainable. The DAS curriculum has taught me to create the practices, procedures, and plans to complement and enhance the digital archives at my institution and library.

The DAS provided professional growth and will assist with my CA recertification. I am glad I made the effort to earn it, and I look forward to taking more courses next year to meet the 5-year recertification. Never stop growing. Never stop learning!

[1] Society of American Archivists. Digital Archives Specialist (DAS) Curriculum and Certificate Program. Accessed July 28, 2018.

[2] A number of courses have been added since I started in October 2016, and will make it easier to complete more classes online.

[3] DAS courses can help with learning technical skills with classes such as Advanced Digital Forensics for Archivists, Research Data Curation, Preservation Formats, Tool Selection and Management, and Archival Collections Management Systems.



Electronic Records Modular Manual: Two More Modules!

The Electronic Records Committee (ERC) of the Congressional Papers Section is pleased to announce a new set of modules for its electronic records manual.

The modules, by new entrants John Caldwell (University of Delaware) and Nathan Gerth (University of Nevada), are on workflows involving BitCurator (a suite of digital preservation tools) and a descriptive tool for digital records called Brunnhilde. The BitCurator module sets out the steps needed to create a disk image, and the Brunnhilde module explains how an appraisal report can be generated from a disk image. Because these modules work with different aspects of disk images using BitCurator, they closely complement each other.  BitCurator is an important resource in the archival community and we are pleased to have some examples of possible workflows for it.  The modules feature diagrams, screenshots, and command line instructions, and cover the areas of accessioning and description.  These modules bring our collection to 15!

The idea of the modular manual is to provide documentation for a possible method to address a need in an electronic records workflow. An institution can mix and match them to create an electronic records workflow that meets its needs. Community members are invited to contribute their own process documentation. The goal is to build up a collection of modules that offer alternatives to each task that makes up the electronic records workflow, from donor discussions through access.

Modules are offered separately or complete in a PDF portfolio.

When formulating this initiative, the ERC put together an outline for further modules. Please contact the ERC if you would like to contribute a module and we will provide you with suggestions.

DigiPres17: Congressional Archivists Discuss “From the Hill to the Home State.”

Alison White, of the Senate Historical Office, and other current or former Congressional archivists, participated in a session on issues involving digital preservation of congressional records at the National Digital Stewardship Alliance’s Digital Preservation conference back in October.  Instead of doing just a regular blog post about the session, she put together a special “storify” of the archivists’ tweets.

Alison’s complete post can be found here.