
Processing Joins On Map-Reduce [slides]

Authors: Himanshu Gupta, L V Subramaniam, Sriram Raghavan
Length: 3 hours.
Abstract: This tutorial will present the current state of the art in join processing on map-reduce. The tutorial will explain how to design efficient map-reduce algorithms in general and efficient join processing algorithms on map-reduce in particular. The tutorial will discuss various aspects of join processing on map-reduce: (a) 2-way vs. multi-way joins, (b) equi-joins vs. theta-joins, (c) joins on a variety of data (real-valued data, interval data, spatial data, sets, etc.) and (d) different kinds of join predicates (equi-join, nearest-neighbor, co-location, set-similarity, etc.).
The tutorial will start with an introduction to the map-reduce framework. We then explain why processing joins on map-reduce is not a trivial task, what the key challenges are, and how it differs from join processing in an RDBMS. We then present an overview of the state-of-the-art algorithms for processing a number of classes of join queries. Each of these classes presents a unique set of challenges, and the algorithms developed for it showcase a unique facet of map-reduce algorithm design.

Personal Information Management Systems [slides]

Authors: Serge Abiteboul & Amélie Marian.
Length: 1 and 1/2 hours.
Abstract: Personal data is constantly collected and saved by users, either voluntarily in files, emails, social media interactions, multimedia objects, calendar items, contacts, etc., or passively by various applications such as GPS tracking of mobile devices, records of utility usage, financial transactions, or quantified self sensors. Everywhere users go, everything they do, they leave a digital trace that acts as a digital memory of their past actions, interactions, and whereabouts.

Digital storage acts as an archive of users’ memories, keeping records of data as well as of the context in which the data was acquired. However, personal information management is complicated by the sheer amount of data available, and by the fact that data is not stored in a centralized location. Users rarely own and store their personal data. Most personal information is stored in the cloud by commercial companies who may offer some limited access to a user’s personal data. Attempting to retrieve personal information then leads to a tedious, often maddening, process of individually accessing all the relevant sources of data and manually linking information.
We believe that it is essential that everyone be in a position to manage their personal information. This is the purpose of Personal Information Management Systems (PIMS for short), which should soon be available to and usable by everyone. The goal of this tutorial is to survey the state of the art in Personal Information Management Systems, present current systems, and discuss open research issues.

Entity Resolution in the Big Data Era: Probabilistic DB Support to Entity Resolution [slides]

Authors: Avigdor Gal & Benny Kimelfeld. 

Length: 1 and 1/2 hours. 
Abstract: Entity resolution is a fundamental problem in data integration, which deals with combining data from different sources into a unified view of the data. Entity resolution is inherently an uncertain process because the decision to map a set of records to the same entity cannot be made with certainty unless the records are identical in all of their attributes or have a common key. In light of recent advances in the data accumulation, management, and analytics landscape (known as big data), the tutorial re-evaluates the entity resolution process and, in particular, looks at the best ways to handle data veracity. The tutorial ties entity resolution to recent advances in probabilistic database research, focusing on sources of uncertainty in the entity resolution process. We shall discuss which types of uncertainty have been handled in the literature and suggest new methods for coping with various types of uncertainty, some of which are presented as future challenges.