Tutorial Website https://www.microsoft.com/en-us/research/project/cikm-2021-tutorial-on-large-scale-information-extraction-with-privacy-aware-constraints/
Tutorial Description
In this digital age, people spend a significant portion of their lives online and this has led to an explosion of personal data from users and their activities. Typically, this data is private and nobody else, except the user, is allowed to look at it. This poses interesting and complex challenges from scalable information extraction point of view: extracting information under privacy aware constraints where there is little data to learn from but need highly accurate models to run on large amount of data across different users. Anonymization of data is typically used to convert private data into publicly accessible data. But this may not always be feasible and may require complex differential privacy guarantees in order to be safe from any potential negative consequences. Other techniques involve building models on a small amount of seen (eyes-on) data and a large amount of unseen (eyes-off) data. In this tutorial, we use emails as representative private data to explain the concepts of scalable IE under privacy-aware constraints.
To extract information from machine generated emails (flight confirmations, hotel reservations, package delivery, etc.), one needs to classify the emails, cluster them into possible templates, build models to extract information from them, and monitor the models to maintain a high precision and recall. How are the IE techniques for private eyes-off data different compared to that for eyes-on HTML data? How to get labeled data in a privacy preserving manner? What are the different techniques for generating semi-labeled data and learning from them? How to build scalable extraction models across a number of sender domains using different ways to represent the information? How to monitor these models with minimum human intervention? In this tutorial we address all these questions from various research to production perspectives.
Tutorial Organisers
Tutorial Abstract
In this digital age, people spend a significant portion of their lives online and this has led to an explosion of personal data from users and their activities. Typically, this data is private and nobody else, except the user, is allowed to look at it. This poses interesting and complex challenges from scalable information extraction point of view: extracting information under privacy aware constraints where there is little data to learn from but need highly accurate models to run on large amount of data across different users. Anonymization of data is typically used to convert private data into publicly accessible data. But this may not always be feasible and may require complex differential privacy guarantees in order to be safe from any potential negative consequences. Other techniques involve building models on a small amount of seen (eyes-on) data and a large amount of unseen (eyes-off) data. In this tutorial, we use emails as representative private data to explain the concepts of scalable IE under privacy-aware constraints.