Mining Knowledge from Databases: An Information Network Analysis Approach
Jiawei Han (UIUC), Yizhou Sun (UIUC), Xifeng Yan (UC Santa Barbara), and Philip S. Yu (UIC)
Tuesday, June 8, 10:30-12:00 and 13:30-15:00, Location: Theory
Abstract: Most people consider a database to be merely a data repository that supports data storage and retrieval. In fact, a database contains rich, inter-related, multi-typed data and information, forming one or a set of gigantic, interconnected, heterogeneous information networks. Much knowledge can be derived from such information networks if we systematically develop an effective and scalable database-oriented information network analysis technology. In this tutorial, we introduce database-oriented information network analysis methods and demonstrate how information networks can be used to improve data quality and consistency, facilitate data integration, and generate interesting knowledge. The tutorial presents an organized picture of how to turn a database into one or a set of organized heterogeneous information networks; how information networks can be used for data cleaning, data consolidation, and data quality improvement; how to discover various kinds of knowledge from information networks; how to perform OLAP in information networks; and how to transform database data into knowledge by information network analysis. Moreover, we present case studies on real datasets, including DBLP and Flickr, and show how interesting and organized knowledge can be generated from database-oriented information networks.
Database Systems Research on Data Mining
Carlos Ordonez and Javier Garcia-Garcia
Tuesday, June 8, 15:30-17:00, Location: Theory
Abstract: Data mining remains a broad and challenging problem in database systems. We present a review of processing alternatives, storage mechanisms, algorithms, data structures and optimizations that enable data mining on large data sets. We focus on the computation of several well-known multidimensional statistical and machine learning models. We pay particular attention to SQL, together with User-Defined Functions, and MapReduce as two competing and complementary technologies for large-scale processing. We conclude with a summary of solved major problems and open research issues.
Information Theory for Data Management
Suresh Venkatasubramanian (AT&T) and Divesh Srivastava (AT&T)
Wednesday, June 9, 8:30-10:00 and 10:30-12:00, Location: Theory
Abstract: We explore the use of information theory as a tool to express and quantify notions of information content and information transfer for representing and analyzing data, using examples from database design, data integration and data anonymization. We also examine the computational challenges associated with information-theoretic primitives, indicating how they might be computed efficiently.
Enterprise Information Extraction: Recent Developments and Open Challenges
Laura Chiticariu (IBM Research, Almaden), Yunyao Li (IBM Research, Almaden), Sriram Raghavan (IBM Research, Almaden), and Frederick Reiss (IBM Research, Almaden)
Thursday, June 10, 10:30-12:00 and 13:30-15:00, Location: Theory
Abstract: Information extraction (IE), the problem of extracting structured information from unstructured text, has become an increasingly important topic in recent years. A SIGMOD 2006 tutorial outlined challenges and opportunities for the database community to advance the state of the art in information extraction, and posed the following grand challenge: "Can we build a System R for information extraction?"
Our tutorial gives an overview of the progress the database community has made towards meeting this challenge. We start by discussing the design requirements for building an enterprise IE system. We then survey recent technological advances towards addressing these requirements, broadly categorized as: (1) languages for specifying extraction programs in a declarative way, thus allowing database-style performance optimizations; (2) infrastructure needed to ensure scalability; and (3) development support for enterprise IE systems. Finally, we outline several open challenges and opportunities for the database community to further advance the state of the art in enterprise IE systems. The tutorial is intended for students and researchers interested in information extraction.