ProQuest is a content aggregator and research and learning hub with deep historical newspaper content whose optical character recognition (OCR) quality varies widely. Students on this team will use machine learning to deliver an end-to-end system that reliably improves the OCR quality of ProQuest content.

Abstract:
ProQuest is a content aggregator and research and learning hub for students, librarians, and instructors. New methods for exploring and analyzing large amounts of text data are changing the way our users and buyers access and analyze our content. One of the great strengths of ProQuest as a company is its depth of historical newspaper content. However, one weakness of this content is the uneven quality of its optical character recognition (OCR). Differences in document scanning methods, and their limitations over time, have produced content that often contains errors. These errors degrade downstream natural language processing and machine learning methods within ProQuest products such as TDM Studio for researchers and students.

To combat these issues, we need to develop methods for improving our OCR quality. This project will include:

  • Modeling OCR improvement via sequence-to-sequence neural networks.
  • Building and leveraging word and character n-gram embeddings trained on OCR text.
  • Developing novel approaches to OCR improvements for older content with severe degradation and a lack of ground-truth data.
  • Developing reliable document and corpus-level OCR cleanliness metrics.
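
As an illustration of the last item, one common starting point for a document-level metric is the character error rate (CER): the edit distance between the OCR text and a clean reference, divided by the reference length. The sketch below is not a ProQuest method; it assumes a ground-truth version of the document exists (which is not the case for much of the older content), and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def char_error_rate(ocr_text: str, clean_text: str) -> float:
    """Hypothetical document-level metric: edit distance per reference character."""
    if not clean_text:
        # Assumption: an empty reference counts any OCR output as fully wrong.
        return 0.0 if not ocr_text else 1.0
    return levenshtein(ocr_text, clean_text) / len(clean_text)
```

A corpus-level figure could then be an average of per-document values, weighted by document length. For content without ground truth, a different signal would be needed, for example the share of tokens found in a period-appropriate dictionary.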

Content to be analyzed will include:

  • More than 10M New York Times articles from 1867 to 2015 as well as selections from other major English newspapers. (Note: the earlier documents of this series have very challenging OCR problems.)
  • Overlapping newspaper articles from OCR and electronic-delivered (clean) newspaper versions from recent decades to generate training and evaluation sets.
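
To show how overlapping OCR and clean versions of the same article might yield training examples, here is a minimal sketch that aligns the two texts at the token level with Python's standard `difflib.SequenceMatcher` and keeps the spans where they disagree. It assumes the two versions are already matched at the article level and are whitespace-tokenized; the function name `aligned_pairs` is hypothetical.

```python
from difflib import SequenceMatcher


def aligned_pairs(ocr_text: str, clean_text: str) -> list[tuple[str, str]]:
    """Return (noisy, clean) span pairs where the OCR text differs from the clean text."""
    ocr_tokens = ocr_text.split()
    clean_tokens = clean_text.split()
    # autojunk=False disables difflib's heuristic that ignores very frequent tokens.
    matcher = SequenceMatcher(a=ocr_tokens, b=clean_tokens, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":  # tokens present in both versions but different
            noisy = " ".join(ocr_tokens[i1:i2])
            clean = " ".join(clean_tokens[j1:j2])
            pairs.append((noisy, clean))
    return pairs


pairs = aligned_pairs("Tbe quick hrown fox", "The quick brown fox")
```

Only `replace` spans are kept here; `insert` and `delete` spans (text missing from one version) would need separate handling, since they may reflect layout differences rather than character-level OCR errors.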

This work will build on previous work in language modeling and language translation that has been applied to similar tasks such as spell checking and automatic text generation.

One particular challenge of this problem is that we would like to improve OCR quality without introducing new errors. Additional challenges include model transfer across different datasets and time frames, as well as evaluation on time periods and newspapers for which we do not have ground-truth examples.
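
One way to make the "no new errors" requirement measurable is to count, token by token, how a correction model changes the text relative to ground truth. The sketch below is only an illustration: it assumes the OCR, corrected, and ground-truth token sequences have already been aligned to equal length, and the function name and outcome labels are hypothetical.

```python
from collections import Counter


def correction_outcomes(ocr_tokens, corrected_tokens, truth_tokens):
    """Count how each token changed relative to ground truth after correction."""
    if not (len(ocr_tokens) == len(corrected_tokens) == len(truth_tokens)):
        raise ValueError("token sequences must be aligned to the same length")
    counts = Counter()
    for ocr, corrected, truth in zip(ocr_tokens, corrected_tokens, truth_tokens):
        was_right = ocr == truth
        is_right = corrected == truth
        if was_right and is_right:
            counts["kept_correct"] += 1
        elif was_right:
            counts["introduced"] += 1   # the model broke a correct token
        elif is_right:
            counts["fixed"] += 1        # the model repaired an OCR error
        else:
            counts["still_wrong"] += 1
    return counts


outcomes = correction_outcomes(
    ["tbe", "cat", "sat"],   # OCR output
    ["the", "cot", "sat"],   # hypothetical model output
    ["the", "cat", "sat"],   # ground truth
)
```

A model that fixes many errors but has a high `introduced` count would fail the stated goal even if its overall error rate drops, so the two counts are worth reporting separately.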

The student team will deliver an end-to-end system/engine that reliably improves the OCR quality of ProQuest content.

Example of Historical Scanned Content:

Current Optical Character Recognition:

Electronic Text:
