ProQuest, part of Clarivate, is an educational technology company committed to empowering researchers and librarians around the world. The company’s portfolio of assets, including content, technologies, and deep expertise, drives better research outcomes for users and greater efficiency for the libraries and organizations that serve them. Students on this team will use machine learning techniques to deliver a proof-of-concept system that predicts the future success of academic researchers from information available near the time of dissertation publication.
Abstract:
This project brings together two major research datasets, namely ProQuest’s Dissertations and Theses and Clarivate’s Web of Science. Each of these datasets is unique and valuable for understanding both the start and subsequent success of academic researchers. Can we predict a researcher’s success from their dissertation? When does a research career peak? Does this vary by field? Do superstar researchers come from within or from outside? What are some of the biases and pitfalls of different definitions of success?
To support the proof of concept, this project will:
- Model academic research success as a machine learning problem.
- Formulate predictive features from a variety of data sources.
- Leverage textual data to exploit word embeddings where available and appropriate.
- Measure and understand how prediction accuracy depends on the time elapsed since the researcher’s dissertation was published.
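As a rough illustration of the last goal, the sketch below measures how the accuracy of a simple baseline changes with the number of years observed after the dissertation. Everything in it is an assumption made for illustration, not part of the project definition: success is defined by a citation threshold over a fixed horizon, the baseline linearly extrapolates the citations seen so far, and the four citation histories are made-up toy data. The project itself is expected to explore richer features and several definitions of success.

```python
from typing import Dict, List

# Hypothetical definition of success: at least SUCCESS_THRESHOLD citations
# accumulated within HORIZON_YEARS of the dissertation. Both values are
# illustrative assumptions.
HORIZON_YEARS = 10
SUCCESS_THRESHOLD = 100


def is_successful(yearly_citations: List[int]) -> bool:
    """Ground-truth label under the assumed citation-threshold definition."""
    return sum(yearly_citations[:HORIZON_YEARS]) >= SUCCESS_THRESHOLD


def predict_success(yearly_citations: List[int], years_observed: int) -> bool:
    """Baseline: linearly extrapolate citations seen so far to the full horizon."""
    if years_observed <= 0:
        return False  # nothing observed yet, so default to "not successful"
    observed = sum(yearly_citations[:years_observed])
    projected = observed * HORIZON_YEARS / years_observed
    return projected >= SUCCESS_THRESHOLD


def accuracy_by_years_observed(researchers: List[List[int]]) -> Dict[int, float]:
    """Accuracy of the baseline for each number of observed years, 0..HORIZON_YEARS."""
    result = {}
    for years in range(HORIZON_YEARS + 1):
        correct = sum(
            predict_success(history, years) == is_successful(history)
            for history in researchers
        )
        result[years] = correct / len(researchers)
    return result


# Made-up citation counts per year after the dissertation (toy data only).
TOY_RESEARCHERS = [
    [0, 0, 1, 2, 5, 10, 20, 30, 40, 50],   # late bloomer
    [30, 20, 5, 2, 1, 0, 0, 0, 0, 0],      # early burst that fades
    [12] * 10,                             # steady output
    [1] * 10,                              # consistently low
]

accuracy = accuracy_by_years_observed(TOY_RESEARCHERS)
for years, value in accuracy.items():
    print(f"{years:2d} years observed: accuracy {value:.2f}")
```

Plotting such a curve for each candidate model and each definition of success would show how early a useful prediction can be made, and whether that point differs by field.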
Data sources to exploit will include, but not be limited to:
- ProQuest Dissertations – available data includes the author, advisor, committee members, subject terms, author-supplied keywords, university, department, references, and text of the dissertation abstract.
- Web of Science – available data includes author, title, abstract, publication, publication date, references, citation counts and usage counts.
- Historical information about searches and their frequencies on the ProQuest Platform.
- Outside data sources which can be freely obtained and have relevant time period information.
Impact:
The ability to predict up-and-coming academic superstars can be leveraged across ProQuest’s suite of products to support both librarians making acquisition decisions and the researchers who use the acquired assets.
Natural Language Processing (2 Students)
Specific Skills:
Strong interest in Natural Language Processing and Statistical Language Modeling. Please highlight your experience in your personal statement.
Likely Majors: CS, EE, Math, CE, DATA
Machine Learning (2 Students)
Specific Skills:
Experience with, or strong interest in, Machine Learning
Likely Majors: CS, EE, ROB, Math, CE, DATA
Programming (2-3 Students)
Specific Skills:
Solid programming experience — EECS 281 (or equivalent)
Key Skills: Python
Likely Majors: CS, DATA
Sponsor Mentor
John Dillon
Text and Data Mining Product Manager
John Dillon, Ph.D., is the Text and Data Mining Product Manager at ProQuest. His work focuses on pairing computational text analysis methods with traditional Humanities and Cultural Studies disciplines. He has published papers on Machine Learning and Sentiment Analysis, and worked previously as a postdoctoral researcher with the University of Notre Dame, USAID, and IBM Research.
Executive Mentor
Dan Hepp
Data Science Lead
Dan has thirty years of experience in research and production settings developing complex systems. He has a demonstrated track record of finding creative solutions to difficult technical problems and making them effective in real-world situations. Dan has expertise in machine learning, data mining, information extraction, pattern recognition, information retrieval, natural language processing, computer vision, artificial intelligence, and optical character recognition.
Faculty Mentor
Sindhu Kutty
Electrical Engineering and Computer Science
Dr. Kutty is a faculty member in the Computer Science department at the University of Michigan, where her primary focus is on undergraduate teaching and research. Her research interests are in applications of Machine Learning (including in Economics), fairness in Machine Learning, and Computer Science Education. She is passionate about getting undergraduate students excited about venturing beyond the course curriculum, and works with them to channel that excitement into publishable research. Her research with undergraduate students and other collaborators has been recognized with awards at various conferences and competitions. Most recently, undergraduate work that she mentored received first-place awards at the international research competition at Project X and at the ACM undergraduate research competition at the Grace Hopper Conference.
Citizenship Requirements:
- This project is open to all students.
- International students on an F-1 visa will be required to declare part-time CPT during the Winter 2023 and Fall 2023 terms.
IP/NDA: Students will sign standard University of Michigan IP/NDA documents.
Internship/Summer Opportunity: No summer activity will take place on the project.