Paris, Texas or Paris, France? Entity Linking for Geographic Data Identification

The sponsor, ProQuest, is a content aggregator and research and learning hub for students, librarians, and instructors. New methods for exploring and analyzing large amounts of text data are changing the way our users access and analyze our content. Accurate place name identification would have an impact across many products within the ProQuest portfolio.

This project will focus on identifying place names in more than 80 million pages of ProQuest’s historic newspaper XML. Several challenges make high precision and recall difficult to achieve on this task:

  1. Ambiguity often exists between entities: Is the newspaper referring to Ann Arbor the place or Ann Arbor the person?
  2. Place name abbreviations and pseudonyms differ based on historical, political, and language-based naming conventions.
  3. OCR quality varies, and historical content contains spelling variations.
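To make challenges 1 and 3 concrete, here is a minimal sketch of linking a noisy mention to a gazetteer entry. It is an illustration only, not the project's method: the `GAZETTEER` contents, the cue words, the `cutoff` value, and the function names are all hypothetical, and the fuzzy matching uses Python's standard `difflib` rather than any ProQuest or CoreNLP component. It does not address the person-versus-place ambiguity, which would need a named entity recognizer.

```python
import re
from difflib import get_close_matches

# Hypothetical toy gazetteer: normalized name -> candidate places,
# each paired with context words that suggest that candidate.
GAZETTEER = {
    "paris": [
        ("Paris, France", {"france", "french", "seine"}),
        ("Paris, Texas", {"texas", "lamar", "dallas"}),
    ],
    "ann arbor": [
        ("Ann Arbor, Michigan", {"michigan", "university", "washtenaw"}),
    ],
}


def match_name(mention, cutoff=0.75):
    """Map a possibly OCR-damaged mention to a gazetteer key, or None."""
    key = " ".join(mention.lower().split())
    if key in GAZETTEER:
        return key
    # Fuzzy match tolerates small OCR errors such as "Parls" for "Paris".
    close = get_close_matches(key, list(GAZETTEER), n=1, cutoff=cutoff)
    return close[0] if close else None


def link_place(mention, context, cutoff=0.75):
    """Pick the candidate whose cue words overlap most with the context.

    Returns None when the name is unknown or the context gives no evidence.
    """
    key = match_name(mention, cutoff)
    if key is None:
        return None
    words = set(re.findall(r"[a-z]+", context.lower()))
    best_label, best_score = None, 0
    for label, cues in GAZETTEER[key]:
        score = len(cues & words)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    print(link_place("Parls", "The farmers of Lamar County, Texas met"))
    print(link_place("Ann Arhor", "a lecture at the University of Michigan"))
    print(link_place("Paris", "the weather was fine"))
```

Returning `None` when the context offers no cue words is a deliberate choice in this sketch: it favors precision by declining to guess between Paris, Texas and Paris, France.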

 

A high-quality entity-linking engine would enhance a number of ProQuest’s services and enable new products:

  • Location Browse: Location-based content indexing and search.
  • GIS Data Visualizations: For products such as Indian Tribe Claims or Historic Newspapers, provide a more useful way to explore content.
  • Topic-Product Development: We want to move beyond query-based, automatic product development. Rich and automated metadata production is critical for topic products.

The student team will deliver an end-to-end system/engine that labels geographical entities with increased precision and recall. The team will begin their investigation with Stanford’s CoreNLP NER for GIS data extraction.

More Information

 

Students who successfully match to this project team will be required to sign the following two documents in January 2018:

  • Student IP Agreement
  • NDA

Project Features

  • Skill level: All levels
  • Students: 5-7
  • Likely Majors: CE, CS, DATA, ECE, EE, MATH, MIDAS, SI
  • Course Substitutions: Honors, ECE Cognate, CS, CE MDE, Data Science, EE MDE, MIDAS
  • IP & NDA Required? Yes
  • Summer Opportunity: Interview Guaranteed
  • General Programming (2 Students)

    Solid programming experience: EECS 281 (or equivalent)

    • Likely Majors: CSE/CS-LSA, DA
  • Front-end Developer (1 Student)

    Front End Development. Web-App development in Java

    • Likely Majors: CSE/CS-LSA, SI, HCI
  • Natural Language Processing

    Experience / Strong interest in Natural Language Processing

    • Likely Majors: CSE/CS-LSA, EE, MATH
  • Machine Learning

    Experience / Strong interest in Machine Learning

    • Likely Majors: CSE/CS-LSA, EE, MATH
  • Data Visualization

    Experience / Strong interest in data visualization, especially D3 (Data-Driven Documents) and mapping visualizations

    • Likely Majors: CSE/CS-LSA, DS, MATH

Sponsor Mentor: John York
Director of Engineering, ProQuest Dialog at ProQuest
An excellent facilitator with the ability to communicate effectively with both technical and non-technical people. Passionate about great software that is simple, efficient, maintainable, and easily testable. Builder of team commitment by fostering an environment of trust, respect, responsibility, and common sense.
 
 
Faculty Mentor: Sugih Jamin 
Associate Professor, EECS
Sugih Jamin is an Associate Professor in the Department of Electrical Engineering and Computer Science at the University of Michigan. He received his Ph.D. in Computer Science from the University of Southern California, Los Angeles in 1996. He received the National Science Foundation (NSF) CAREER Award in 1998, the Presidential Early Career Award for Scientists and Engineers (PECASE) in 1999, and the Alfred P. Sloan Research Fellowship in 2001. He co-founded a peer-to-peer live streaming company, Zattoo, in 2005.