Historical newspapers are some of the richest longitudinal sociological data sources available, going back more than 250 years. Accessing content from older sources digitally is challenging. Students on the ProQuest team will use large language models and other ML techniques to develop a proof-of-concept system for reconstructing individual articles from full-page newspaper images.
Abstract:
ProQuest, part of Clarivate, is an educational technology company committed to empowering researchers and librarians around the world. The company’s portfolio of assets — including content, technologies, and deep expertise — drives better research outcomes for users, and greater efficiency for the libraries and organizations that serve them.
Historical newspapers are some of the richest longitudinal sociological data sources available, going back more than 250 years. Many questions could be asked of this data: What can we learn from newspaper data about the stock market crash of 1929, and could this be used to better understand future crashes? How do the public response and media messaging around the flu of 1918 compare with those of the recent COVID-19 pandemic? How can we use this historical data to give generative AI models a sense of history?
One core challenge of working with historical newspapers, however, is that the machine-readable text files are often created from page-level scans. This means that each file contains all the articles from a single newspaper page. The goal of this project is to build an NLP model that can accurately split this single text file into the correct individual articles. The recent performance gains of large language models afford new approaches to this newspaper segmentation problem. Can we, for instance, fine-tune a model that is expert at linking newspaper subtitles to their respective articles? Or another model that understands a typical front-page spatial layout and segments text with this prior knowledge?
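To make the segmentation task concrete, the following is a minimal sketch of a toy baseline, not the project's method. It assumes, purely for illustration, that a short line in all capitals marks the start of a new article; the names `Article`, `looks_like_headline`, and `split_page`, and the sample page text, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Article:
    headline: str
    body_lines: list[str]


def looks_like_headline(line: str) -> bool:
    """Toy heuristic (an assumption): a short all-capitals line starts an article."""
    text = line.strip()
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and len(text) <= 60 and all(c.isupper() for c in letters)


def split_page(page_text: str) -> list[Article]:
    """Split page-level text into articles at each headline-like line."""
    articles: list[Article] = []
    for line in page_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if looks_like_headline(line):
            # A new headline opens a new article and is linked to its body.
            articles.append(Article(headline=line.strip(), body_lines=[]))
        elif articles:
            articles[-1].body_lines.append(line.strip())
        # Text before the first headline is dropped in this toy version.
    return articles


# Made-up page text, for illustration only.
page = """STOCKS FALL SHARPLY
Prices dropped across the exchange on Tuesday.
Brokers reported heavy selling.
MAYOR OPENS NEW BRIDGE
The span was dedicated before a large crowd."""

articles = split_page(page)
for article in articles:
    print(article.headline, "->", len(article.body_lines), "body lines")
```

Real OCR output is far noisier than this sample (mixed-case headlines, interleaved columns, captions), which is why a learned model rather than a fixed rule is the goal of the project.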
In this project, we will leverage our existing page-level and article-level segmented newspaper content. ProQuest is one of the largest aggregators of newspaper content, with hundreds of millions of scanned and digitized newspaper articles. These newspapers come from varying geographic sources and time periods. These text assets will be valuable for training models, as well as evaluation.
Impact:
If we can accurately segment newspaper pages into newspaper articles, there are several benefits to researchers and students who use our products and platforms. For example, being able to search at the article level instead of the page level is valuable. In addition, researchers using Python or R to analyze the newspaper data need the text segmented at the article level for accurate topic modeling, sentiment analysis, entity recognition, or any other context-dependent task.
Scope:
Minimum Viable Product Deliverable (Minimum level of success)
- Complete a full literature review of current published techniques, patents, and trade knowledge internal to the sponsor’s organization
- Create (and clean) appropriate data sets to support the project
- Segment page-level text into article-level text from page-level newspaper OCR signal with a high degree of precision and recall (> 0.9) for 1-2 newspapers across all time periods
- Segment and identify article headlines, and link these headlines to the correct article for these same newspapers
Expected Final Deliverable (Expected level of success)
- Extend capabilities to scale across many newspapers, across all time periods.
- Identify and discard non-article text (e.g., images, captions, pagination materials) from page-level newspaper OCR signal with a high degree of precision and recall (> 0.9).
- Develop an evaluation system and metrics for the above tasks. Set baseline goals and targets. This may involve ground truthing specific sections of data for evaluation and testing.
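As one possible way to make the precision and recall targets above measurable, the sketch below scores predicted article boundaries against ground-truth boundaries. It assumes each boundary is the line index where an article starts and that only exact matches count; the function name `boundary_scores` and the example indices are hypothetical, and the project may well define its metrics differently.

```python
def boundary_scores(predicted: set[int], reference: set[int]) -> tuple[float, float]:
    """Return (precision, recall) for predicted article-start line indices.

    A predicted boundary counts as correct only if it exactly matches a
    reference boundary (an assumption; a tolerance window is another option).
    """
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Made-up boundaries for a single page, for illustration only.
reference_starts = {0, 12, 30, 47}
predicted_starts = {0, 12, 31, 47, 60}

precision, recall = boundary_scores(predicted_starts, reference_starts)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example three of the five predicted boundaries match the reference, so precision is 0.60 and recall is 0.75, both below the 0.9 target.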
Stretch Goal Opportunities: (High level of success)
- Successfully thread the first portion of an article from page X to the second portion of the same article on a different page Y.
Natural Language Processing (2 Students)
Specific Skills: Strong interest in natural language processing and statistical language modeling.
Please highlight your experience in your personal statement.
Likely Majors: CS, EE, MATH
Machine Learning (2 Students)
Specific Skills: Experience with or strong interest in machine learning
Likely Majors: CS, EE, MATH
General Programming (2-3 Students)
Specific Skills: Solid general programming experience
EECS 281 (or equivalent) is required.
Key Skills: Python
Likely Majors: CS
Additional Desired Skills/Knowledge/Experience
- Please include a description of your programming language experience in your Experience & Interest Form
- Python: this project will be run in Python; students should have experience in Python, or be prepared to quickly develop their skills
- Project experience with the following Python libraries is especially desirable: dask, pandas, scikit-learn, NLTK, Keras, Theano
- Group Development Experience – working within a team, coordinating, managing/integrating code, versioning, etc.
Sponsor Mentor
John Dillon
Text and Data Mining Manager
John Dillon is the Text and Data Mining Manager at Clarivate. He has experience in creating EdTech products which focus on data visualization, natural language processing, and data science. He also has research experience focusing on Sentiment Analysis, Machine Learning, and Learning Analytics. He has a PhD from the University of Notre Dame, and has worked previously as a postdoctoral researcher with the University of Notre Dame, USAID, and IBM Research.
Sponsor Mentor
Dan Hepp
Data Scientist Lead
Dan has thirty years of experience in research and production settings developing complex systems. He has a demonstrated track record of finding creative solutions to difficult technical problems, and making them effective in real-world situations. Dan has expertise in machine learning, data mining, information extraction, pattern recognition, information retrieval, natural language processing, computer vision, artificial intelligence, and optical character recognition.
Faculty Mentor
Sindhu Kutty
Electrical Engineering and Computer Science
Dr. Kutty is a faculty member in the Computer Science department at the University of Michigan where her primary focus is on undergraduate teaching and research. Her research interests are in the applications of Machine Learning (including in Economics), fairness in Machine Learning as well as in Computer Science Education. She is passionate about getting undergraduate students excited about venturing beyond the course curriculum, and works with them to channel that excitement into publishable research. Her research work both with undergraduate students and other collaborators has been recognized by awards at various conferences and competitions. Most recently, undergraduate work that she mentored has been recognized with first place awards at the international research competition at Project X and at the ACM undergraduate research competition at the Grace Hopper Conference.
Weekly Meetings: During the winter 2024 semester, the ProQuest team will meet on North Campus on Fridays from 1:30 – 3:30 PM.
Work Location: Most of the work will take place on campus in Ann Arbor.
Course Substitutions: CE MDE, ChE Elective, CS Capstone/MDE, DS Capstone, EE MDE, CoE Honors, SI Elective/Cognate
Citizenship Requirements: This project is open to all students. Note: International students on an F-1 visa will be required to declare part-time CPT during the Winter 2024 and Fall 2024 terms.
IP/NDA: Students will sign standard University of Michigan IP/NDA documents.
Summer Project Activities: No summer activity will take place on the project.