Lisa Nguyen
Lisa is the digital archivist at the UCSF Library.

2024 Senior Summer Fellow Reflection: Theo Zhang

Photo of Theo Zhang

The UCSF Archives and Special Collections and Industry Documents Library teams are pleased to highlight the work of 2024 senior summer fellow Theo Zhang. Theo is a senior studying computer science at the University of California, Los Angeles (UCLA). Over the past year, they developed a passion for research on responsible artificial intelligence and its applications. As a result, they were excited about the summer research opportunity with the UCSF Archives and Special Collections. Theo’s primary interests include machine learning and artificial intelligence, specifically in relation to humanistic approaches and ethical considerations. 

Theo Zhang shares more about their project and final report below.

Project details

My final report, “Silence in OCR: What Could Handwritten Documents Tell Us?”, examines the quality of optical character recognition (OCR) technology for handwritten and typewritten documents and the key differences between the two.

OCR performs worse on handwritten documents than on typewritten ones. This creates biases and gaps in datasets, leaving handwritten documents less accessible to researchers and underutilized. I analyzed three OCR tools, Tesseract, Document AI, and Textract, to assess their performance on different document types. The goal was to identify the tool that can most effectively unlock handwritten data within the UCSF Archives and Special Collections. Many documents are “hidden” because of the sheer amount of material one would have to look through to discover a useful or relevant piece, but OCR makes this data more readily accessible.
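As an illustration only, one common way to compare OCR tools is character error rate (CER): the edit distance between a human transcription and the OCR output, divided by the length of the transcription. The sketch below is an assumption about how such a comparison could be set up, not the method used in the report. The tool names `tool_a` and `tool_b` and the sample strings are hypothetical placeholders, not real output from Tesseract, Document AI, or Textract.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a character from ref
                            curr[j - 1] + 1,      # insert a character into ref
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """CER = edit distance / length of the human transcription."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical data: one hand-made transcription and two made-up OCR outputs.
reference = "meeting notes"
outputs = {
    "tool_a": "meeting notes",
    "tool_b": "meetinq nots",
}

scores = {name: character_error_rate(reference, text)
          for name, text in outputs.items()}
best = min(scores, key=scores.get)  # lower CER means fewer character errors
```

A lower score means the OCR text is closer to the human transcription. Scoring handwritten, typewritten, and mixed pages separately would show where each tool falls short.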

To illustrate the need for better OCR tools in the library, I compared the content of different document types to assess the value of handwritten documents in research and the gaps created when they are excluded from datasets. The dataset consists of documents and notes from three different sources concerning organizations and organizers working on the AIDS/HIV epidemic from the 1980s to the 1990s.

There are three categories of documents: handwritten, typewritten, and mixed. 

Textract emerged as the best-performing OCR tool for both handwritten and typewritten text. However, the challenge lies not in the quality of Textract’s OCR output but in the search capabilities of Calisphere, the University of California’s digital platform for research, teaching, and exploration. While Calisphere offers discovery through metadata, it lacks full-text search functionality. This affects how users interact with digital collections on Calisphere, including those from UCSF Archives and Special Collections. Despite these limitations, there may be alternative channels for accessing Textract’s OCR outputs in the future.

Final reflection

Programmatic analysis and close reading of the documents make clear that OCR quality greatly impacts downstream tasks such as sentiment analysis. This finding supports using the best OCR tools for improved accuracy and reliability. Additionally, the results imply that handwritten documents offer deeper context to a research project, specifically concerning details that may not appear in typewritten documents. By giving handwritten documents less scrutiny, researchers risk missing intimate details about their research subjects and contextual information related to organizational dynamics and tensions within and between groups.

If the project is repeated with different foci, the conclusions on content analysis may be different. The era studied encompasses a period when typewritten and handwritten materials were equally important in organizational work; other eras may be different. Ultimately, the biases created by automated tools can have enormous downstream impacts. The project reflects the growing relationship between technology and archival research, and shows why scrutinizing automated tools for biases is important for preserving the integrity of the archival field.

For more details, read my full report, “Silence in OCR: What Could Handwritten Documents Tell Us?”, via eScholarship.

Acknowledgments  

I would like to extend my sincere gratitude to the Industry Documents Library and Archives and Special Collections teams for their support during the internship:

  • Lisa Nguyen, digital archivist 
  • Sean Purcell, digital health humanities program coordinator 
  • Geoffrey Boushey, head of data engineering
  • Rebecca Tang, Industry Documents Library software developer  
  • Peggy Tran-Le, research and technical services managing archivist  
  • Kate Tasker, director of the Industry Documents Library 
  • Rachel Taketa, Industry Documents Library processing and reference archivist  
  • Gordon Lichtstein, 2024 industry documents summer fellow

Image credits

  • Featured image by Noah Berger, 2023
  • Handwritten image: Notes on meetings and contact information by John S. James (file name: 35), AIDS Treatment News Records, Box 8 Folder 16, UCSF Archives and Special Collections
  • Typed image: Title page from “California’s War on AIDS” (file name: ucsf_mss2015-01_011_061) in the Donald P. Francis papers, Box 11 Folder 61, UCSF Archives and Special Collections
  • Mixed image: Letter thanking the San Francisco AIDS Foundation for a postcard book contribution, with a sticky note from San Francisco AIDS Foundation records, UCSF Archives and Special Collections