Automatic Recognition of Historical Newspaper Content

ProQuest has recently acquired 80 million page images of historical newspapers, with local and global news coverage reaching back through the eighteenth century. We now wish to automatically decompose those page images into their constituent headlines and article blocks to make them more useful for information retrieval. This team will create, train, and deploy a machine learning pipeline that performs this decomposition.

More Information: 2017-proquest-automatic-recognition-project



Students who successfully match to this project team will be required to sign the following two documents in January 2017:

Student IP Agreement for this Project Team
Student Non-Disclosure Agreement for this Project Team
How to Apply

Project Features

  • Skill level: All levels
  • Students: 5-7
  • Likely Majors: CS, DATA, ECE, EE, MICDE, MIDAS
  • Course Substitutions: Honors, CSE-G, MIDAS, MICDE, Data Science Capstone, ECE Cognate
  • IP & NDA Required? Yes
  • Summer Opportunity: Interview Guaranteed
  • Machine Learning / Machine Vision (2-3 Students)

    Machine vision / optical character recognition (OCR) skills, experience with image processing, experience with machine learning, and familiarity with text analysis methods for document clustering

    • Likely Majors: EE, ECE, Robotics, Data Science, MIDAS, MICDE
  • Programming (2-3 Students)

    Collaborative programming experience and software development skills. Solutions may be written in any language, though Python is preferred for the machine learning components of the task.

    • Likely Majors: CSE/CS-LSA

Faculty Mentor: Brent Griffin
Graduate Student Research Assistant, EECS
I have industry experience in many aspects of engineering. As an intern, I performed quality assurance at Kawasaki, reliability engineering at Spirit AeroSystems, and product engineering at Nebraska Boiler. As a full-time employee, I designed and developed hardware-in-the-loop flight simulators at Cessna Aircraft and biotechnology research instruments at LI-COR Biosciences.
Sponsor Mentor: Douglas Duhaime
Product Manager
Douglas Duhaime is a Text and Data Mining Product Manager at ProQuest. He came to ProQuest after spending several years as a doctoral researcher studying applications of data mining and machine learning within the domain of historical research.
Sponsor Mentor: Roger Valade
Vice President of Engineering
Roger Valade is the Vice President of Engineering at ProQuest. He came to ProQuest as a senior technology leader with extensive experience in enterprise and application architecture, software development and methodology (with an emphasis on agile), strategic planning, project and program management, offshoring in China and India, and change management.
For More Information About This Sponsor, Visit Their Website (ProQuest).