ProQuest is an educational technology company that provides academic libraries with Rialto, a marketplace that supports evidence-based, data-driven decisions for intelligent book acquisition. Students on this team will use machine learning techniques to deliver a proof-of-concept system that reliably predicts, for each book published in a given year, the likelihood that it will win a book award.
Abstract:
ProQuest is committed to empowering researchers and librarians around the world. The company’s portfolio of assets (content, technologies and deep expertise) drives better research outcomes for users and greater efficiency for the libraries and organizations that serve them. With Rialto, librarians can create recommendation profiles, which combine analytic and predictive signals with book metadata to identify titles for effective acquisition. In the effort to obtain high-quality books, librarians would be well served by a measure of the likelihood that any given book will prove to be an award winner. Producing such a measure is complicated by the tremendous number of books published each year, of which relatively few win awards, and by how soon after publication librarians would like to have this information.
To support the proof of concept, this project will:
- Model book award prediction as a machine learning problem.
- Formulate predictive features from a variety of data sources.
- Leverage textual data to exploit word embeddings where available and appropriate.
- Measure and understand the relationship between prediction accuracy and the amount of time that has passed since each book’s publication date.
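As a rough illustration of the first goal, award prediction can be framed as binary classification with a rare positive class, scored by how well the model ranks winners near the top. The sketch below computes average precision in plain Python. The choice of metric, the function name `average_precision`, and the toy labels and scores are assumptions made for illustration; they are not part of the project specification.

```python
def average_precision(labels, scores):
    """Mean of the precision values at each rank where a true winner appears.

    labels: 1 for an award winner, 0 otherwise (hypothetical toy data).
    scores: model-estimated likelihood of winning, higher means more likely.
    """
    # Rank books from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    # With no winners in the sample the metric is undefined; return 0.0 by convention.
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy example: 10 books, 2 winners, ranked 1st and 4th by the model.
toy_labels = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
toy_scores = [0.10, 0.90, 0.80, 0.20, 0.30, 0.05, 0.40, 0.15, 0.12, 0.01]
toy_ap = average_precision(toy_labels, toy_scores)
```

A ranking metric of this kind is one reasonable option when winners are scarce, since a model that never predicts a winner can still score high on plain accuracy.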
Data sources to exploit will include but not be limited to:
- Metadata and textual descriptions for all books from the catalog of more than 30 million books available for purchase.
- Holdings and historical usage information for books held by roughly 200 academic libraries around the world.
- Predictive measures of holdings and usage for all books in the catalog.
- Book review metadata.
- Outside sources of information about scholarly research, authors and subjects, e.g. Microsoft Academic Graph.
- Historical information about searches and their frequencies from the ProQuest Platform.
The notion of time is a central element of this problem. To make a prediction at time t, we can only exploit information that would have been available at that time. Because so few books are award winners, we will need to work with many time periods, taking care to model all data correctly. Additionally, librarians are typically eager to purchase books as soon as possible after they become available. For this reason, backward-looking measures like breadth of holdings or demonstrated usage are unlikely to be useful for freshly published books; instead, we will rely more heavily on imperfect predictions of holdings and usage. Importantly, we will assess our ability to predict in simulated weekly increments to understand how quickly after publication our predictions become useful.
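The point-in-time constraint described above can be sketched as follows: features for a prediction at time t are built only from events dated on or before t, and one snapshot is taken per week after publication. The event log, the book identifiers, the event types, and the function names `features_as_of` and `weekly_snapshots` are all hypothetical and serve only to illustrate the idea.

```python
from datetime import date, timedelta

# Hypothetical event log: (book_id, event_date, event_type). Illustrative data only.
EVENTS = [
    ("book-a", date(2020, 1, 10), "review"),
    ("book-a", date(2020, 1, 20), "search"),
    ("book-a", date(2020, 2, 15), "holding"),
    ("book-b", date(2020, 1, 8), "search"),
]

# Hypothetical publication dates.
PUBLISHED = {"book-a": date(2020, 1, 6), "book-b": date(2020, 1, 6)}


def features_as_of(book_id, cutoff):
    """Count events per type for one book, using only events dated on or before cutoff."""
    counts = {}
    for bid, when, kind in EVENTS:
        if bid == book_id and when <= cutoff:
            counts[kind] = counts.get(kind, 0) + 1
    return counts


def weekly_snapshots(book_id, weeks):
    """Return (week_number, features) pairs, one per week after publication."""
    start = PUBLISHED[book_id]
    return [
        (week, features_as_of(book_id, start + timedelta(weeks=week)))
        for week in range(1, weeks + 1)
    ]


snapshots = weekly_snapshots("book-a", 6)
```

Training and scoring a model separately on each weekly snapshot would then show how prediction quality changes as more information accumulates, without letting later events leak into earlier predictions.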
Impact:
The ability to predict book awards will fold naturally into the process of building recommendation profiles, enhancing librarians’ ability to tease out the books most likely to strengthen the quality of their book collections.
Natural Language Processing (2 Students)
Specific Skills: Strong interest in Natural Language Processing and Statistical Language Modeling. Please highlight your experience in your personal statement.
Likely Majors: CS, EE, MATH, CE, DATA
Machine Learning (2 Students)
Specific Skills: Experience with, or strong interest in, Machine Learning
Likely Majors: CS, EE, ROB, MATH, CE, DATA
Programming (2 – 3 Students)
Specific Skills: Solid programming experience (i.e. EECS 281 or equivalent). Key Skills: Python
Likely Majors: CS, DATA
Sponsor Mentors
John Dillon
Text and Data Mining Product Manager
John Dillon, Ph.D., is the Text and Data Mining Product Manager at ProQuest. His work focuses on pairing computational text analysis methods with traditional Humanities and Cultural Studies disciplines. He has published papers on Machine Learning and Sentiment Analysis and has worked previously as a postdoctoral researcher with the University of Notre Dame, USAID, and IBM Research.
Dan Hepp
Data Scientist Lead
Dan has thirty years of experience in research and production settings developing complex systems. He has a demonstrated track record of finding creative solutions to difficult technical problems and making them effective in real-world situations. Dan has expertise in machine learning, data mining, information extraction, pattern recognition, information retrieval, natural language processing, computer vision, artificial intelligence, and optical character recognition.
Executive Mentor
Roger Valade
VP of Engineering, ProQuest
Roger is a senior technology leader with extensive experience in enterprise and application architecture, software development and methodology (with an emphasis on agile), strategic planning, project and program management, offshoring in China and India, and change management. His former positions include VP, Technology for a $200M publishing company; VP, Technical Solutions for a J2EE consultancy; and Architect at General Motors. He has managed teams of up to 105 people and budgets of nearly $20M.
Faculty Mentor
Sindhu Kutty
Electrical Engineering and Computer Science
Dr. Kutty brings her enthusiasm about Computer Science to her teaching. She focuses on teaching math-based Computer Science courses like Machine Learning (EECS 445) and Foundations of Computer Science (EECS 376). She is also passionate about getting undergraduate students excited about venturing beyond the course curriculum, and works with them to channel that excitement into publishable undergraduate research. While she has published in highly selective conferences in her area of market mechanism design and its connections to statistical machine learning, Dr. Kutty is especially proud of the work she has published and presented with her undergraduate students. Her research work both with undergraduate students and other collaborators has been recognized by awards at conferences and symposia. She has also been recognized by the American Society for Engineering Education for her work as a Graduate Student Instructor and she has won numerous competitive faculty teaching grants.
Course Substitutions: Honors, ChE Elective, CS MDE/Capstone, CE MDE, Data Science Capstone, EE MDE, IOE Grad Cognate, SI Elective, SI Grad Cognate
Citizenship Requirements: This project is open to all students on campus.
IP/NDA: Students will sign standard University of Michigan IP/NDA documents.
In Person/Remote Participation Options: Work will take place on campus with occasional visits to the ProQuest Ann Arbor office to hold meetings and give presentations. (MDP will provide transportation.)
In Person Only: This project requires in person participation on the Ann Arbor campus for the entire project period. Remotely based students may not participate in this project.
Internship/Summer Opportunity: No summer activity will take place on the project.