ProQuest is a content aggregator and research and learning hub with deep historical newspaper content, which creates challenges for optical character recognition (OCR). Students on this team will use machine learning to deliver an end-to-end system that reliably improves the quality of OCR ProQuest content.
Abstract:
ProQuest is a content aggregator and research and learning hub for students, librarians, and instructors. New methods for exploring and analyzing large amounts of text data are changing the way our users and buyers access and analyze our content. One of the great strengths of ProQuest as a company is its depth of historical newspaper content. However, one weakness of this content is its varying degrees of quality of optical character recognition (OCR) performance. Various document scanning methods and limitations over time have produced content which often contains errors. These errors impact downstream natural language processing and machine learning methods within ProQuest products such as TDM Studio for researchers and students.
To combat these issues we need to develop methods for improving our OCR quality. This project will include:
- Modeling OCR improvement via sequence-to-sequence neural networks.
- Building and leveraging word and character n-gram embeddings built on OCR text.
- Developing novel approaches to OCR improvements for older content with severe degradation and a lack of ground-truth data.
- Developing reliable document and corpus-level OCR cleanliness metrics.
Content to be analyzed will include:
- More than 10M New York Times articles from 1867 to 2015 as well as selections from other major English newspapers. (Note: the earlier documents of this series have very challenging OCR problems.)
- Overlapping newspaper articles from OCR and electronic-delivered (clean) newspaper versions from recent decades to generate training and evaluation sets.
This work will build off of previous work in language modeling and language translation used for similar tasks such as spell checking and automatic text generation.
One unique challenge of this problem is that we would like to improve the OCR quality without introducing new errors. Additional challenges to this problem include model transfer across different datasets and time frames as well as evaluation of time periods and newspapers for which we do not have ground-truth examples.
The student team will deliver an end-to-end system/engine that reliably improves quality of OCR ProQuest content.
Example of Historical Scanned Content:
Current Optical Character Recognition:
Electronic Text:
More Information
Natural Language Processing (2 Students)
Specific Skills: Strong interest in Natural Language Processing and Statistical Language Modeling. Please highlight your experience in your personal statement
Likely Majors: CS, EE, MATH
Machine Learning (2 Students)
Specific Skills: Experience / Strong interest in Machine Learning
Likely Majors: CS, EE, MATH
Programming (2 – 3 Students)
Specific Skills: Solid programming experience (i.e. EECS 281 or equivalent). Key Skills: Python
Likely Majors: CS (ALL)
Sponsor Mentors
John Dillon
Text and Data Mining Product Manager
John Dillon, Ph.D., is the Text and Data Mining Product Manager at ProQuest. His work focuses on pairing computational text analysis methods with traditional Humanities and Cultural Studies disciplines. He has published papers on Machine Learning and Sentiment Analysis and has worked previously as a postdoctoral researcher with the University of Notre Dame, USAID, and IBM Research.
Dan Hepp
Data Scientist Lead
Dan has thirty years of experience in research and production settings developing complex systems. He has a demonstrated track record of finding creative solutions to difficult technical problems and making them effective in real-world situations. Dan has expertise in machine learning, data mining, information extraction, pattern recognition, information retrieval, natural language processing, computer vision, artificial intelligence, and optical character recognition.
Executive Mentor
Roger Valade
VP of Engineering, ProQuest
Senior technology leader with extensive experience in enterprise and application architecture, software development and methodology (with an emphasis on agile), strategic planning, project and program management, offshoring in China and India, and change management. Former positions include VP, Technology for a $200M publishing company; VP, Technical Solutions for a J2EE consultancy; and Architect at General Motors. Have managed teams of up to 105 people and budgets of nearly $20M.
Faculty Mentor
Sindhu Kutty
Electrical Engineering and Computer Science
Dr. Kutty brings her enthusiasm about Computer Science to her teaching. She focuses on teaching math-based Computer Science courses like Machine Learning (EECS 445) and Foundations of Computer Science (EECS 376). She is also passionate about getting undergraduate students excited about venturing beyond the course curriculum, and works with them to channel that excitement into publishable undergraduate research. While she has published in highly selective conferences in her area of market mechanism design and its connections to statistical machine learning, Dr. Kutty is especially proud of the work she has published and presented with her undergraduate students. Her research work both with undergraduate students and other collaborators has been recognized by awards at conferences and symposia. She has also been recognized by the American Society for Engineering Education for her work as a Graduate Student Instructor and she has won numerous competitive faculty teaching grants.
Course Substitutions: Honors, ChE Elective, CS MDE/Capstone, CE MDE, Data Science Capstone, EE MDE, IOE Senior Design
Internship/Summer Opportunity: Students will be guaranteed an interview for a 2021 internship. The interviews will take place between January 1 and February 28, 2021.
Citizenship Requirements: This project is open to all students.
IP/NDA: Students will sign standard University of Michigan IP/NDA documents.