Past DAIS Seminars

Summer 2018

Date Content Notes
06/15/2018 Speaker: Dr. Mao Ye, Associate Professor of Finance, University of Illinois at Urbana-Champaign

Title: Big Data in Finance

Abstract: Modern financial markets generate vast quantities of data. As the data environment has become increasingly “big” and analyses increasingly computerized, the information that different market participants extract and use has grown more varied and diverse. At one extreme, high-frequency traders (HFTs) implement ultra-minimalist algorithms optimized for speed. At the other extreme, some industry practitioners apply sophisticated machine-learning techniques that take minutes, hours, or days to run. The proposed project seeks to understand this full spectrum of machine-based trading, with the aim of informing public policy and augmenting theoretical studies of financial markets. Just as insights into human behavior from the psychology literature spawned the field of behavioral finance, insights into algorithmic behavior (or the psychology of machines) can result in an analogous blossoming of research in algorithmic behavioral finance. Most literature to date follows a simple dichotomy that pits HFTs against everyone else. We aim to explore the diversity among cyber-traders, with a particular emphasis on players who are slower than HFTs but faster than humans. This focus will fill the gap between the literature on HFTs, which focuses on milliseconds or nanoseconds, and the literature on institutional investors, which relies on quarterly holding data.

Short bio: Mao Ye is an associate professor of finance at the University of Illinois, Urbana-Champaign. His research focuses on market microstructure, machine learning, and big data. His papers have been published in the Journal of Finance, the Journal of Financial Economics, and the Review of Financial Studies. He is a fellow of the National Bureau of Economic Research (NBER) and the National Center for Supercomputing Applications (NCSA). In 2016, the University of Illinois, Urbana-Champaign named him the Educator of the Year after a campus-wide competition.
Mao Ye earned his Ph.D. degree from Cornell University. In 2006, he was elected to Cornell’s Board of Trustees, marking the first time an Ivy League institution had elected a trustee from Mainland China. In 2018, the New York Historical Society selected Mao Ye’s story as one of those featured in its book “Journeys: An American Story.”

Spring 2016

Date Content Notes
01/26/2016 Speaker: Dr. Meng Jiang, Postdoctoral researcher, Computer Science department, University of Illinois at Urbana-Champaign

Title: Little Is Much: Bridging Cross-Platform Behaviors through Overlapped Crowds

Abstract: People often use multiple platforms to fulfill their different information needs. With the ultimate goal of serving people intelligently, a fundamental step is to gain a comprehensive understanding of user needs. It is important to organically integrate and bridge cross-platform information in a human-centric way. Existing transfer learning methods assume that the users of different platforms are either fully overlapped or non-overlapped. In reality, however, the users of different platforms are partially overlapped. The number of overlapped users is often small, and the number of explicitly known overlapped users is even smaller because a user has no unified ID across different platforms. In this paper, we propose a novel semi-supervised transfer learning method, called XPTrans, to address the problem of cross-platform behavior prediction. To alleviate the sparsity issue, it fully exploits the small number of overlapped crowds to optimally bridge a user’s behaviors in different platforms. Extensive experiments across two real social networks show that XPTrans significantly outperforms the state-of-the-art. We demonstrate that by fully exploiting 26% overlapped users, XPTrans can predict the behaviors of non-overlapped users with the same accuracy as overlapped users, which means the small overlapped crowds can successfully bridge the information across different platforms.

Short bio: Dr. Meng Jiang is a postdoctoral researcher in Computer Science at the University of Illinois at Urbana-Champaign, working with Professor Jiawei Han. He completed his Ph.D. in the Department of Computer Science and Technology at Tsinghua University, Beijing, in 2015, and received his bachelor's degree from the same department in 2010. He visited the Database Group in the School of Computer Science at Carnegie Mellon University in 2012-2013. His research focuses on modeling user behaviors and mining social media. Problems he investigates vary from prediction and recommendation to suspicious behavior detection. He was an ACM SIGKDD 2014 Best Paper Finalist.



02/02/2016 Speaker: Jinfeng Xiao, PhD student in Biophysics and Quantitative Biology at UIUC

Title: Adapted machine learning methods for better prediction on prostate cancer survival and ALS progression

Abstract: Prostate cancer and amyotrophic lateral sclerosis (ALS) are fatal diseases for which effective treatment remains unclear. Metastatic castrate-resistant prostate cancer develops resistance against androgen deprivation therapy, the mainstay of treatment. The ALS progression rate can vary by an order of magnitude, indicating the underlying disease heterogeneity. More reliable prediction of prostate cancer survival and ALS progression, based on clinical data, can hopefully help us understand the disease mechanisms and improve treatment. In this talk Jinfeng will present work on such prediction methods, which led to his winning two DREAM Challenges last year. By reweighting the training data with Gaussian kernel functions, Jinfeng’s team implicitly avoided over-fitting and increased the integrated area under the ROC curve of predicted prostate cancer survival by 3.5% over the baseline method. Incorporating a novel feature selection criterion into random forests has proved to be effective in predicting ALS progression in patients from two national registries.

Short bio: Jinfeng is a second-year PhD student in Biophysics and Quantitative Biology at UIUC. Before coming to Champaign, Jinfeng earned his bachelor's degree with first-class honors in Physics and Math from Hong Kong University of Science and Technology. In 2015 he took part in two DREAM Challenges and won both. Currently Jinfeng is actively involved in several projects with UIUC faculty Jian Peng, Saurabh Sinha and Jiawei Han at the intersection of computational biology and machine learning.



02/09/2016 Speaker: Prof. Bertram Ludäscher, Graduate School of Library and Information Science, UIUC

Title: The Many Faces of Provenance in Databases and Workflows

Abstract: In computer science, data provenance describes the lineage and processing history of data as it is transformed through queries and/or workflows. Many CS sub-disciplines have studied approaches to capture and exploit provenance, e.g., the systems and programming languages communities. In this talk, I will give an overview of basic research questions and results provided by the database community. Research in this area ranges from deep technical studies in database theory to very applied techniques and various engineering-level questions in-between. Provenance capture and querying capabilities are also playing an increasing role in the computational reproducibility of science.

Short bio: Bertram Ludäscher is a professor at GSLIS where he directs the Center for Informatics Research in Science and Scholarship (CIRSS). He is also a faculty affiliate with NCSA and the Department of Computer Science. From 2005 to 2014 he was a computer science professor at the University of California, Davis. Until 2004 Ludäscher was a research scientist at the San Diego Supercomputer Center (SDSC) and an adjunct faculty member in the CSE Department at UCSD. He received his M.S. in computer science from the Technical University of Karlsruhe (now: KIT) and his PhD from the University of Freiburg, Germany.



02/16/2016 Speaker: Dr. Liangliang Cao, Yahoo Labs

Title: Data-Driven Artificial Intelligence with Applications

Abstract: Classical Artificial Intelligence research has been conducted based on knowledge representation, reasoning, and planning. In this talk, I would like to discuss a different strategy, which builds intelligence purely from data. Such a data-driven strategy is strongly motivated by the growth of Internet data as well as recent advances in Graphics Processing Units (GPUs). By learning from large-scale data, we can now train models with millions or billions of parameters, which can potentially improve themselves by adapting to different applications. Two applications following this strategy will be discussed: (1) playing poker and (2) recognizing images. At the end of this talk, I would also like to show the potential of the data-driven approach in industry by demonstrating a clothes retrieval app, which was recently deployed at Yahoo Taiwan and has received considerable media coverage.

Short bio: Liangliang Cao received his Ph.D. from UIUC in 2011, supervised by Prof. Thomas S. Huang. He was also a UIUC CSE fellow from 2009 to 2010, working with Prof. Thomas S. Huang and Prof. Jiawei Han. After graduation, he worked at IBM Watson Research Center as a research staff member for four years. He is now a senior research scientist at Yahoo Labs, New York City. He has received several industry recognitions, including the IBM Research Division Award and the IBM Outstanding Technical Accomplishment (with the multimedia group). He was the winner of ImageNet LSVRC 2010. He was an area chair of ACM Multimedia 2012 and WACV 2014, a founding co-chair of the ACM workshop on Geo-Multimedia from 2012 to 2014, and a general co-chair of GNYMV 2013 and 2014. He gave tutorials at ACM Multimedia 2013 and CVPR 2014. He is now organizing the YFCC100M image annotation grand challenge at ICMR 2016.



02/23/2016 Canceled. The department holds many relevant talks:
Speaker: Prof. ChengXiang Zhai, Professor of Computer Science and a Willett Faculty Scholar at the University of Illinois at Urbana-Champaign
Title: Statistical Methods for Integration and Analysis of Opinionated Text Data
Abstract: Opinionated text data such as blogs, forum posts, product reviews and online comments are increasingly available on the Web. They are very useful sources of public opinion on virtually any topic. However, because the opinions are scattered and abundant, it is a significant challenge for users to collect all the opinions about a topic and digest them efficiently. In this talk, I will present a suite of general statistical text mining methods developed by the Text Information Management and Analysis (TIMAN) group at UIUC that can help users integrate, summarize and analyze scattered online opinions to obtain actionable knowledge for decision making. Specifically, I will first present approaches to integration of scattered opinions by aligning them to a well-structured article or relevant ontology. Second, I will discuss several techniques for generating a concise opinion summary that can reveal the major sentiments and opinion points buried in large amounts of opinionated text data. Finally, I will present probabilistic generative models for analyzing review data in depth to discover latent aspect ratings and relative weights placed by reviewers on different aspects. These methods are general and can thus potentially help users integrate and analyze large amounts of online opinionated text data on any topic in any natural language.
Short bio: ChengXiang Zhai is a Professor of Computer Science and a Willett Faculty Scholar at the University of Illinois at Urbana-Champaign, where he is also affiliated with the Graduate School of Library and Information Science, Institute for Genomic Biology, and Department of Statistics. His research interests include Information Retrieval, Data Mining, Natural Language Processing, Machine Learning, Biomedical and Health Informatics, and Intelligent Education Systems. More information about him and his work can be found at
03/08/2016 Canceled. The department holds many relevant talks, particularly the one given by Dr. Jure Leskovec:
03/15/2016 Speaker: Prof. Shaowen Wang, Professor of Geography and Geographic Information Science at the University of Illinois at Urbana-Champaign

Title: CyberGIS and Geospatial Data Science

Abstract: CyberGIS represents an interdisciplinary field combining advanced cyberinfrastructure, geographic information science and systems (GIS), spatial analysis and modeling, and a number of geospatial domains (e.g., emergency management, public health, and smart cities) to enable broad scientific and technological advances. It has also emerged as new-generation GIS based on holistic integration of high-performance and distributed computing, data-driven knowledge discovery, visualization and visual analytics, and collaborative problem-solving and decision-making capabilities. The growing importance of cyberGIS is reflected by increasing calls for solutions to bridge the significant digital divide between advanced cyberinfrastructure and geospatial communities in the big data era. This presentation discusses challenges and opportunities for cyberGIS and geospatial data science to empower geospatial discovery and innovation through interdisciplinary approaches.

Short bio: Shaowen Wang is a Professor of Geography and Geographic Information Science with affiliate appointments in Computer Science, Library and Information Science, and Urban and Regional Planning at UIUC, where he is named a Centennial Scholar. He is also an Associate Director of NCSA and Founding Director of UIUC's CyberGIS Center for Advanced Digital and Spatial Studies. He was a visiting scholar at Lund University sponsored by NSF in 2006 and NCSA Fellow in 2007, and received the NSF CAREER Award in 2009. He received his BS in Computer Engineering from Tianjin University in 1995, MS in Geography from Peking University in 1998, and MS of Computer Science and PhD in Geography from the University of Iowa in 2002 and 2004 respectively. His research and teaching interests primarily include advanced cyberinfrastructure and cyberGIS, complex environmental and geospatial problems, computational and data sciences, high-performance parallel and distributed computing, and spatial analysis and modeling. He has received research funding from the US CDC, DOE, EPA, NSF, USDA, USGS, and industry; and served as Principal Investigator (PI) for more than $13 million competitive research grants, PI for tens of millions of normalized computing hours to utilize NSF supercomputing resources, and co-PI and investigator for participating in sponsored research supported with tens of millions of US dollars. He has published many peer-reviewed papers including articles in more than 20 journals. He founded the cyberGIS international conference series, and chaired CyberGIS'12 and CyberGIS'14. He has served as an Action Editor of GeoInformatica, Associate Editor of SoftwareX, and guest editor or editorial board member for multiple other journals, book series and proceedings. He is President-Elect of the University Consortium for Geographic Information Science, and a current member of the Board on Earth Sciences and Resources of the US National Academies.

03/22/2016 Spring Break!
03/29/2016 Canceled.
04/05/2016 Title: Mobile Query Auto-Completion: Analyzing Voluminous and Noisy Signals from Mobile Applications

Speaker: Aston Zhang, Ph.D. candidate in Computer Science at UIUC

Abstract: In recent years, users of search engines on mobile devices saved more than 60% of the keystrokes when submitting English queries by selecting suggestions from query auto-completion (QAC). In fact, people use mobile devices to perform many different activities: on average they install 95 applications (apps) and open 35 unique apps 100 times per day. If a user opens the Spotify Music app then types “sugar” on the search bar, is the user more likely looking for sugar cookie recipes or Sugar lyrics by Maroon 5? Indeed, a more accurate inference of users’ query intents via QAC may further save their typing efforts on mobile devices.

In this talk we will discuss the new mobile QAC problem to exploit mobile devices’ exclusive signals, such as those related to mobile apps. We propose AppAware, a novel QAC model using installed app and recently opened app signals to suggest queries for matching input prefixes on mobile devices. To overcome the challenge of noisy and voluminous signals, AppAware optimizes composite objectives with a lighter processing cost at a linear rate of convergence. We conduct experiments on a large commercial data set of mobile queries and apps. Installed app and recently opened app signals consistently and significantly boost the accuracy of various baseline QAC models on mobile devices.

Short Bio: Aston Zhang is a Ph.D. candidate in Computer Science at the University of Illinois at Urbana-Champaign (UIUC). His research interests are data mining, machine learning, and privacy, and he is supervised by Prof. Carl A. Gunter and Prof. Jiawei Han. He received M.S. degrees in Statistics and in Computer Science from UIUC in 2015.

04/12/2016 Canceled! Please attend Dr. Wei Wang’s talk on Monday 04/11:
04/19/2016 Location this week is changed to: room 3405 Siebel Center.

Speaker: Prof. Lav R. Varshney, Assistant Professor in the Department of Electrical and Computer Engineering, UIUC

Title: Toward an Information Theory of Information Overload

Abstract: Engineering successes of past centuries have given rise to new engineering challenges that are not just technical but sociotechnical in scope. These new challenges are problems of excess rather than of scarcity: problems such as obesity, information overload, and climate change. People’s behaviors are critical in large-scale sociotechnical systems and engineers must necessarily consider interactions between people and technical systems when considering these problems. In this talk, I will develop mathematical models of sociotechnical information systems in the information overload regime. Then I will discuss fundamental limit theorems and optimal designs that may be developed from information-theoretic characterizations. In particular, I will focus on the key resource of human attention, and how it can be gained, maintained, and prioritized.

Short bio: Lav R. Varshney is an assistant professor in the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign, where his research interests include information and coding theory, signal processing and data analytics, collective intelligence in sociotechnical systems, neuroscience, and creativity. He received the B.S. degree with honors in electrical and computer engineering (magna cum laude) from Cornell University, Ithaca, New York in 2004. He received the S.M., E.E., and Ph.D. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge in 2006, 2008, and 2010, respectively, where he received the J.-A. Kong Award Honorable Mention for Electrical Engineering doctoral thesis and the E. A. Guillemin Thesis Award for Outstanding Electrical Engineering S.M. Thesis. He was a research staff member at the IBM Thomas J. Watson Research Center, Yorktown Heights, NY from 2010 until 2013, where his work on computational creativity received widespread acclaim. He recently received a 2015 NYC Media Lab – Bloomberg Data for Good Exchange Paper Award, was a finalist for the 2014 Bell Labs Prize, and his work appears in the anthology, The Best Writing on Mathematics 2014 (Princeton University Press).

Speaker: Prof. Sewoong Oh, Department of Industrial and Enterprise Systems Engineering, UIUC
Title: Hiding the source of a rumor in anonymous messaging
Abstract: Anonymous messaging platforms, such as Secret, Whisper and Yik Yak, have emerged as important social media for sharing one’s thoughts without the fear of being judged by friends, family, or the public. Further, such anonymous platforms are crucial in nations with authoritarian governments, where the right to free expression and sometimes the personal safety of the message author depend on anonymity. Whether for fear of judgment or personal endangerment, it is crucial to keep anonymous the identity of the users who initially posted sensitive messages on these platforms.
In this talk, we consider two types of adversaries, one who has a snapshot of the spread of the messages at a certain time and another with collaborating spies among the users tracking all the messages that they receive. We pose the problem of designing a messaging protocol that spreads the message fast while keeping the identity of the source hidden from the adversary. We present an anonymous messaging protocol, which we call adaptive diffusion, and show that it spreads fast and achieves (near) optimal performance in obfuscating the source. In the process, we discover interesting properties of Pólya’s urn processes for enhancing privacy, and prove a concentration result for Galton-Watson processes to analyze the performance of the proposed protocol.
Short bio: Sewoong Oh is an Assistant Professor of Industrial and Enterprise Systems Engineering at UIUC. He received his PhD from the Department of Electrical Engineering at Stanford University in 2011. Following his PhD, he worked as a postdoctoral researcher at the Laboratory for Information and Decision Systems (LIDS) at MIT. He was co-awarded the Kenneth C. Sevcik outstanding student paper award at SIGMETRICS 2010 and the best paper award at SIGMETRICS 2015. He was awarded the NSF CAREER award in 2016.

(11AM-12PM Wed)

Speaker: Prof. Alexander Kotov in the Department of Computer Science at Wayne State University

Title: Leveraging User Meta-Data in Information Retrieval and Textual Data Mining 

Abstract: Over the past decade, the World Wide Web has undergone a fundamental transformation from an informational resource into a social phenomenon, known as Web 2.0. One of the major differences between WWW and Web 2.0 is the availability of diverse meta-data about content authors, such as age, gender, geographical location and social network. However, it is not yet known how to best utilize these meta-data in information retrieval and textual data mining. In this talk, I will present our recent work aiming to address this question.
In the first part of this talk, I will present the methods for searching the content of social media platforms, such as Twitter, which utilize information about the users of those platforms. The first method utilizes latent variable models incorporating geographical locations of users to mine word associations specific to different geographical locations and uses these associations for focused expansion of documents and queries. The second method utilizes social networks of the authors of social media posts to improve the accuracy of retrieval from collections of such posts.
In the second part of my talk, I will demonstrate how meta-data about the authors of on-line consumer reviews can be utilized to identify major themes as well as positive, negative and neutral aspects of those themes discussed in reviews by different demographic groups of users.

Short bio: Dr. Alexander Kotov is an Assistant Professor in the Department of Computer Science at Wayne State University, where he heads the Textual Data Analytics (TEANA) Lab. Dr. Kotov’s general research interests lie at the intersection of information retrieval, health informatics and textual data mining. Dr. Kotov received his PhD from the University of Illinois at Urbana-Champaign in 2011, under the supervision of Professor ChengXiang Zhai. After that, he was a Post-Doctoral Fellow at Emory University for 2 years working with Professor Eugene Agichtein. Dr. Kotov and his students won the Best Short Paper award at the 2015 Asia Information Retrieval Symposium and took first place at the Clinical Decision Support track in TREC 2015.

Fall 2015

Date Content
9/11/2015 Speaker
Marianne Winslett
Professor, Department of Computer Science, University of Illinois at Urbana-Champaign
Marianne Winslett has been a professor in the Department of Computer Science at the University of Illinois since 1987. She returned to the US in 2013 after four years as the director of Illinois’s research center in Singapore, the Advanced Digital Sciences Center. She is an ACM Fellow and the recipient of a Presidential Young Investigator Award from the US National Science Foundation. She is the former vice-chair of ACM SIGMOD and the co-editor-in-chief of ACM Transactions on the Web, and has served on the editorial boards of ACM Transactions on Database Systems, IEEE Transactions on Knowledge and Data Engineering, ACM Transactions on Information and Systems Security, the Very Large Data Bases Journal, and ACM Transactions on the Web. She has received two best paper awards for research on managing regulatory compliance data (VLDB, SSS), one best paper award for research on analyzing browser extensions to detect security vulnerabilities (USENIX Security), and one for keyword search (ICDE). Her PhD is from Stanford University.
Professor Winslett’s research interests are in information management and security. Sometimes they overlap, and she gets to work on information security. But her favorite part of research is seeing young people figure out what they really want to do, and helping them remove the obstacles in their way. In this talk, she’ll give an overview of the main projects currently going on in her group: representation-independent data mining, real-time stream processing in the cloud, a data-science project on mining supercomputer logs, and a new project on security for digital manufacturing.
Bertram Ludäscher
Professor, Graduate School of Library and Information Science (GSLIS)
Scientific Data & Knowledge Management: A Research Sampler
Researchers and scientists across many disciplines increasingly need to manage and analyze large and/or complex datasets. The computational and engineering challenges arising in the burgeoning areas of data science and big data provide a fruitful ground for computer scientists and database researchers to tackle interesting problems. In this talk, I will provide an overview of a number of different research areas and associated projects in scientific workflow automation, data curation, and knowledge-representation & reasoning. In addition to application-oriented topics, I will also take a detour through “database theory-land”, showing some of the deep connections that often hide under the surface of apparently unrelated topics.
Bertram Ludäscher is a professor at the Graduate School of Library and Information Science (GSLIS) where he directs the Center for Informatics Research in Science and Scholarship (CIRSS). From 2005 to 2014 he was a computer science professor at the University of California, Davis. His research interests span the whole data-to-knowledge life-cycle, from modeling and design of databases and workflows, to knowledge representation and reasoning. His current research focus includes both theoretical foundations of provenance and practical applications, in particular to support automated data quality control and workflow-supported data curation. He is one of the founders of the open source Kepler scientific workflow system, and a member of the DataONE leadership team, focusing on data- and workflow-provenance. Until 2004 Ludäscher was a research scientist at the San Diego Supercomputer Center (SDSC) and an adjunct faculty at the CSE Department at UC San Diego. He received his M.S. (Dipl.-Inform.) in computer science from the Technical University of Karlsruhe (now: KIT) and his PhD (Dr.rer.nat.) from the University of Freiburg, both in Germany.
9/25/2015 Speaker
Prof Hari Sundaram
Associate professor in Computer Science and in Advertising at the University of Illinois
Data analysis on the continuum and other stories
Hari Sundaram is an associate professor in Computer Science and in Advertising at the University of Illinois. His primary research focus is to develop algorithms and systems for analyzing and shaping collective behavior in social networks. Prior to joining the University of Illinois in 2014, he was at Arizona State University, where he helped found the School of Arts, Media and Engineering. He received his Ph.D. from the Department of Electrical Engineering at Columbia University in 2002. His research has won several best paper awards from the IEEE and the ACM. He also received the Eliahu I. Jury Award for best Ph.D. dissertation in 2002. He is an associate editor for ACM Transactions on Multimedia Computing, Communications and Applications.
In this talk, I shall present a brief overview of some of the research problems of interest to my group. Two current projects, the design of a social network testbed and a novel continuum representation framework for massive graphs, shall be discussed in more detail. In the former, the goal is to make it possible to rapidly prototype bespoke social networks with the aim of running different experiments on them. In the latter, the aim is to develop fast, effective approximations of social network processes such as information diffusion through a continuum representation. The continuum project is at an early stage and I am hoping for some useful feedback.
Kevin Chen-Chuan Chang
Associate professor in Computer Science at the University of Illinois
I will present my current research projects at the Forward Data Lab group: 1) Interactive Big Data, marrying spreadsheets with databases, 2) WISDM: Web Indexing and Search for Data Mining, and 3) BigSocial: Towards Big Social Data Platform for Entity-Centric and User-Aware Analytics.
Kevin C. Chang is an Associate Professor in Computer Science, University of Illinois at Urbana-Champaign. He received a BS from National Taiwan University and PhD from Stanford University, in Electrical Engineering. His research addresses large scale information access, for search, mining, and integration across structured and unstructured big data, with current focuses on “entity-centric” Web search/mining and social media analytics. He received two Best Paper Selections in VLDB 2000 and 2013, an NSF CAREER Award in 2002, an NCSA Faculty Fellow Award in 2003, IBM Faculty Awards in 2004 and 2005, and an Academy for Entrepreneurial Leadership Faculty Fellow Award in 2008, and was named to the Incomplete List of Excellent Teachers at the University of Illinois in 2001, 2004, 2005, 2006, 2010, and 2011. He is passionate about bringing research results to the real world and, with his students, co-founded Cazoodle, a startup from the University of Illinois, for deepening vertical “data-aware” search over the web.
10/2/2015 Speaker
Aditya Parameswaran
Assistant Professor in Computer Science at UIUC
Supporting Visual Analytics with Scalable Visualization Recommendations
Aditya Parameswaran is an Assistant Professor in Computer Science at the University of Illinois (UIUC). He spent the 2013-14 year visiting MIT CSAIL and Microsoft Research New England, after completing his Ph.D. at Stanford University, advised by Prof. Hector Garcia-Molina. He is broadly interested in data analytics, with research results in human computation, visual analytics, information extraction and integration, and recommender systems. Aditya is a recipient of the Arthur Samuel award for the best dissertation in CS at Stanford (2014), the SIGMOD Jim Gray dissertation award (2014), the SIGKDD dissertation award runner up (2014), a Google Faculty Research Award (2015), the Key Scientific Challenges Award from Yahoo! Research (2010), three best-of-conference citations (VLDB 2010, KDD 2012 and ICDE 2014), the Terry Groswith graduate fellowship at Stanford (2007), and the Gold Medal in Computer Science at IIT Bombay (2007). His research group is supported with funding from the NIH, the NSF, and Google.
Data scientists rely on visualizations to interpret the data returned by queries. However, when working on large datasets, identifying and generating visualizations that show relevant or desired trends in data can be tedious and time-consuming. We present a system, SeeDB, that intelligently explores the space of visualizations, evaluates promising visualizations, and recommends those that it deems “interesting” or “useful”. As part of this system, we are designing sampling-based algorithms for generating visualizations on very large datasets rapidly, while preserving visual properties essential for drawing correct insights. My talk will cover both our initial design for SeeDB, as well as one of our scalable visualization generation algorithms.
Mangesh Bendre
PhD Candidate in Computer Science at UIUC
DATASPREAD: Unifying Databases and Spreadsheets
Spreadsheet software is often the tool of choice for ad-hoc tabular data management, processing, and visualization, especially on tiny data sets. On the other hand, relational database systems offer significant power, expressivity, and efficiency over spreadsheet software for data management, while lacking in the ease of use and ad-hoc analysis capabilities. We demonstrate DataSpread, a data exploration tool that holistically unifies databases and spreadsheets. It continues to offer a Microsoft Excel-based spreadsheet front-end, while in parallel managing all the data in a back-end database, specifically, PostgreSQL. DataSpread retains all the advantages of spreadsheets, including ease of use, ad-hoc analysis and visualization capabilities, and a schema-free nature, while also adding the advantages of traditional relational databases, such as scalability and the ability to use arbitrary SQL to import, filter, or join external or internal tables and have the results appear in the spreadsheet. DataSpread needs to reason about and reconcile differences in the notions of schema, addressing of cells and tuples, and the current “pane” (which exists in spreadsheets but not in traditional databases), and support data modifications at both the front-end and the back-end. Our demonstration will center on our first, early prototype of DataSpread, and will give the attendees a sense of the enormous data exploration capabilities offered by unifying spreadsheets and databases.
10/9/2015 Speaker
Jialu Liu
PhD Candidate, Computer Science, UIUC
Representing Documents via Latent Keyphrase Inference
Many text mining approaches adopt bag-of-words or n-gram models to represent documents. Looking beyond just the words on the surface of a document can improve a computer’s understanding of text. Being aware of this, researchers have proposed concept-based models relying on a human-curated knowledge base to identify related concepts as document representation. But these methods are not desirable when applied to closed domains (e.g., literature, enterprise, etc.) due to the low concept coverage in general knowledge bases and interference from out-of-domain concepts. In this paper, we propose a data-driven model named Latent Keyphrase Inference (LAKI) that represents a document with a vector of closely related keyphrases instead of single words or concepts in the knowledge base. We show that with an auxiliary corpus from the target domain, a high-quality mapping can be learned between domain keyphrases and their topical content units. Such a mapping enables a computer to do smart inference for latent keyphrases in the document without explicit mentions. Compared with the state-of-the-art document representation approaches, LAKI fills the gap between bag-of-words and concept-based models by using domain keyphrases as the basic representation unit. It removes the dependency on a knowledge base but still retains strong interpretability in the representation.
Jialu Liu is a fifth-year Ph.D. student in the Department of Computer Science at UIUC, supervised by Prof. Jiawei Han. Before he joined UIUC, he received the B.S. degree from Zhejiang University, China, in 2011. Currently his research focuses on text-rich information networks.
10/27/2015 Speaker
Qiaozhu Mei
Associate Professor, University of Michigan
The Foreseer: data mining of the people, by the people, and for the people
With the growth of online communities, the Web has evolved from networks of shared documents into networks of knowledge-sharing groups and individuals. A vast amount of heterogeneous yet interrelated information is being generated, making existing information analysis techniques inadequate. Current data mining tools often neglect the actual context, creators, and consumers of information. Foreseer is a user-centric framework for the next generation of information retrieval and mining for online communities. It represents a new paradigm of information analysis through the integration of the four “C’s”: content, context, crowd, and cloud.
In this talk, I will introduce our recent effort of mining the data generated in online communities for social good, including the real world problems to which the Foreseer techniques have been successfully applied. I will highlight our recent studies on rumor detection and on finding causal factors that increase user engagement, using and Twitter as examples.
Qiaozhu Mei is an associate professor at the School of Information, the University of Michigan. He is widely interested in text mining, information retrieval, network analysis and their applications in Web search, social computing, and health informatics. He is a recipient of the NSF CAREER Award and multiple best paper awards at ICML, KDD, and other related venues.
10/30/2015 Title
Discovering Negative Links on Social Networking Sites
Huan Liu
Data Mining and Machine Learning Lab
Arizona State University, Tempe, Arizona
Social networking sites make it easy for users to connect with, follow, or “like” each other. Such a mechanism promotes positive connections and helps a social networking site to grow without direct belligerent or negative encounters. This type of one-way connection makes no distinction between indifference and dislike; in other words, two users have only, by default, positive connections. However, it is apparent that as one’s network grows, some users might not be benevolent toward each other, or negative links could form, though not explicitly stated. In this talk, we assess the need for discovering such hidden negative links, explore ways of finding negative links, and show the significance of negative links in social media applications like data classification and clustering, recommendation systems, link prediction, and tie-strength estimation.
Dr. Huan Liu is a professor of Computer Science and Engineering at Arizona State University. He obtained his Ph.D. in Computer Science at University of Southern California and B.Eng. in EECS at Shanghai JiaoTong University. He was recognized for excellence in teaching and research in Computer Science and Engineering at Arizona State University. His research interests are in data mining, machine learning, social computing, social media mining, and artificial intelligence, investigating problems that arise in real-world applications with high-dimensional data of disparate forms. His well-cited publications include books, book chapters, encyclopedia entries as well as conference and journal papers. He serves on journal editorial/advisory boards and numerous conference program committees. He is a Fellow of IEEE.
11/6/2015 Speaker
Chao Zhang
PhD Candidate, CS Department, UIUC
RLED: Real-time Local Event Detection in Geo-tagged Tweet Stream
The real-time discovery of local events (e.g., protests, crimes, disasters, sport games) is of great importance to various applications, such as crime monitoring, disaster alarming, and activity recommendation. While this task seemed nearly impossible years ago due to the lack of timely and reliable data sources, the recent explosive growth in geo-tagged tweet data brings new opportunities to it. Nevertheless, how to extract quality local events from the geo-tagged tweet stream in real time is a challenging task that remains largely unsolved. We propose RLED, a two-step method that achieves effective and real-time local event detection in the geo-tagged tweet stream. The first step of RLED identifies several pivot tweets in the query window to form candidate events. The pivot tweets are identified based on: (1) a novel authority concept that captures the geo-topic correlations among tweets; and (2) an authority ascent process that finds authority maxima. In the second step, RLED ranks all the candidates by spatiotemporal burstiness. Specifically, it continuously summarizes the tweet stream, and compares each candidate against the summaries in a reference window to quantify its spatiotemporal burstiness. Finally, RLED features an updating module that finds new pivots with little time cost when the query window shifts. As such, RLED is capable of monitoring the continuous stream in real time. We used crowdsourcing to evaluate RLED on a real-life data set that contains millions of geo-tagged tweets. The results show that RLED significantly outperforms state-of-the-art methods in precision, and is orders of magnitude faster.
11/12/2015 Speaker
Douglas W. Oard, University of Maryland
Information Retrieval Research for E-Discovery
Civil litigation in this country relies on each side making relevant evidence available to the other, a process known as “discovery”. The explosive growth of information in digital form has led to an increasing focus on how search technology can best be applied to balance costs and responsiveness in what has come to be known as “e-discovery”. This is now a multi-billion dollar business, one in which new vendors are entering the market frequently, usually with impressive claims about the efficacy of their products or services. Courts, attorneys, and companies are actively looking to understand what should constitute best practice, both in the design of search technology and in how that technology is employed. In this talk I will begin with an overview of the e-discovery process. I’ll then use that background to motivate a discussion of which aspects of that process the TREC Legal Track sought to model, with a particular focus on two novel aspects of evaluation design: (1) recall-focused evaluation in large collections, and (2) modeling an interactive process for “responsive review” with fairly high fidelity. I’ll finish up by talking about some of our most recent work on e-discovery, including work on cost-sensitive design and evaluation of classifiers for responsiveness, development of an interactive tool to support review for privilege, and creation of a new email test collection.
Douglas Oard is a Professor at the University of Maryland, College Park, with joint appointments in the College of Information Studies and the Institute for Advanced Computer Studies (UMIACS). He was previously Associate Dean for Research in the College of Information Studies, and Director of the UMIACS CLIP Lab. Dr. Oard earned his Ph.D. in Electrical Engineering from the University of Maryland. His research interests center around the use of emerging technologies to support information seeking by end users. His professional service includes a variety of positions, including General Co-Chair of NTCIR, Program Co-Chair of ACM SIGIR, and Editor-in-Chief of the Foundations and Trends in IR. Additional information is available at
11/20/2015 Speaker
Boris Glavic
Assistant Professor of Computer Science at the Illinois Institute of Technology
Reenacting Transactional Histories to Compute Their Provenance
Provenance for database queries, information about how the outputs of a query were derived from its inputs, has recently gained traction in the database community resulting in the development of several models and their implementation in prototype systems. However, currently there is no system or model that supports transactional updates, limiting the applicability of provenance to databases which are never updated. In this talk, I introduce reenactment, a novel declarative replay technique for transactional histories, and demonstrate how reenactment can be used to retroactively compute the provenance of past updates, transactions, and histories. The foundation of this research is MV-semirings, our extension of the well-established semiring provenance model for queries to updates and transactions running under multi-versioning concurrency control protocols. In this model, any transactional history (or part thereof) can be simulated through a query, i.e., any state of a relation R produced by a history can be reconstructed by a query. We call this process reenactment. These formal underpinnings are the basis of an efficient approach for computing provenance of past transactions using a standard relational DBMS. I will show how reenactment queries can be constructed from an audit log, a log of past SQL operations, and how queries with MV-semiring semantics can be encoded as standard relational queries. A naive implementation would either require replay of the complete history from the beginning or proactive materialization of provenance while transactions are run. However, as long as a transaction time history is available, reenactment can be started from any past history state.
Since most modern DBMS support audit logs and time travel (querying transaction time histories) out of the box and these features incur only moderate overhead on transaction execution, this approach enables efficient provenance computation for transactions on-top of standard database systems. I present encouraging experimental results based on our implementation of these techniques in our GProM (Generic Provenance Middleware) provenance database middleware.
Boris Glavic is an Assistant Professor of Computer Science at the Illinois Institute of Technology where he leads the IIT database group. Before coming to IIT, Boris spent two years as a PostDoc in the Department of Computer Science at the University of Toronto working at the Database Research Group under Renée J. Miller. He received a Diploma (Master) in Computer Science from the RWTH Aachen in Germany, and a PhD in Computer Science from the University of Zurich in Switzerland being advised by Michael Böhlen and Gustavo Alonso. Boris is a professed database guy enjoying systems research based on solid theoretical foundations. His main research interests are provenance and information integration. He has built several provenance-aware systems including Perm (relational databases), Ariadne (stream processing), GProM (database provenance middleware), Vagabond, and LDV (database virtualization and repeatability). For more info of his projects see
12/4/2015 Speaker
Prof Jana Diesner
Assistant professor at the iSchool at the University of Illinois Urbana-Champaign (UIUC)
Affiliate at the Department of Computer Science.
The impact of the accuracy of social interaction data on network analysis
I will present our work on two topics:
First, the impact of the accuracy of social interaction data on network analysis results. “Preparing big social data for analysis and conducting analytics involves a plethora of decisions, some of which are already embedded in previously collected data and built tools. These decisions refer to the recording, indexing and representation of data and the settings for analysis methods. While these choices can have tremendous impact on research outcomes, they are not often obvious, not considered or not being made explicit. Consequently, our awareness and understanding of the impact of these decisions on analysis results and derived implications are highly underdeveloped.” I provide empirical examples for the impact of node disambiguation in terms of merging, splitting and attribution on different types of social network data, and show to what extent our understanding of network properties, topologies and underlying link formation mechanisms can get biased due to inaccurate data. (Full paper: Diesner J (2015) Small Decisions with Big Impact on Data Analytics. Journal of Big Data & Society, special issue on Assumptions of Sociality. Link:
Second, I will speak about our research on using natural language processing techniques to enhance network data with the ultimate goal of testing network theories in unprecedented ways. I give an example where we leveraged sentiment analysis to assign valence values to links in unsigned graphs in order to enable triadic balance assessment in communication networks. Our method enables fast and systematic sign detection (we labeled 166,670 triads in one dataset), eliminates the need for surveys or manual sign labeling, and reduces issues with leveraging user-generated (meta)-data for this purpose. We applied our method to corporate email data, finding a ratio of balanced triads (on average about 88%) to unbalanced ones (12%). This ratio was relatively stable despite drastic changes in corporate performance. We also observed that people actively use a smaller set of positive terms more frequently than their larger vocabulary of negative words. (Full paper: Diesner J, Evans C (2015) Little Bad Concerns: Using Sentiment Analysis to Assess Structural Balance in Communication Networks. IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France. Diesner J, Evans C, Kim J (2015) Impact of entity disambiguation errors on social network properties. International AAAI Conference on Web and Social Media (ICWSM), Oxford, UK. Link:
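For readers unfamiliar with structural balance, the check applied to each triad can be sketched as follows. This is a minimal illustration, not the speaker's implementation: it assumes link signs (+1 or -1) have already been assigned, for example by a sentiment step, and uses the standard rule that a triad is balanced when the product of its three link signs is positive. The function name `triad_balance` and the toy network are hypothetical.

```python
from itertools import combinations


def triad_balance(signs):
    """Count balanced and unbalanced triads in a signed, undirected graph.

    `signs` maps frozenset({u, v}) to +1 (positive link) or -1 (negative link).
    Only fully connected triples of nodes are counted as triads.
    """
    nodes = sorted({node for edge in signs for node in edge})
    balanced = unbalanced = 0
    for a, b, c in combinations(nodes, 3):
        edges = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if not all(edge in signs for edge in edges):
            continue  # not a closed triad
        product = signs[edges[0]] * signs[edges[1]] * signs[edges[2]]
        if product > 0:
            balanced += 1
        else:
            unbalanced += 1
    return balanced, unbalanced


# Hypothetical toy network with four people and six signed links.
signs = {
    frozenset(("ann", "bob")): 1,
    frozenset(("bob", "cat")): 1,
    frozenset(("ann", "cat")): 1,
    frozenset(("ann", "dan")): -1,
    frozenset(("bob", "dan")): 1,
    frozenset(("cat", "dan")): -1,
}
balanced, unbalanced = triad_balance(signs)
print(balanced, unbalanced)
```

Enumerating all node triples is cubic in the number of nodes, so this form only suits small examples; labeling hundreds of thousands of triads would need a neighbor-based enumeration instead.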
Short bio
Jana Diesner is an Assistant Professor at the iSchool at UIUC. She is also an affiliate at the Department of Computer Science and a 2015 Faculty Fellow at the National Center for Supercomputing Applications (NCSA). Jana’s research in human-centered data science is at the intersection of natural language processing, social network analysis and machine learning. In her lab, they develop and advance computational solutions that help people to measure and understand the interplay of information and socio-technical networks. They also bring these solutions into various application contexts, currently mainly in the domain of impact assessment. For more information about her work see

Spring 2015

Date Content
Friday, Jan. 23, 2015
Place: 0216SC
Title: Towards a Game-Theoretic Framework for Information Retrieval
Speaker: Prof. Chengxiang Zhai
Abstract: The task of information retrieval (IR) has traditionally been defined as to rank a collection of documents in response to a query. While this definition has enabled most research progress in IR so far, it does not model accurately the actual retrieval task in a real IR application, where users tend to be engaged in an interactive process with multiple queries, and optimizing the overall performance of an IR system on an entire search session is far more important than its performance on an individual query. In this talk, I will present a new game-theoretic formulation of the IR problem where the key idea is to model information retrieval as a process of a search engine and a user playing a cooperative game, with a shared goal of satisfying the user’s information need while minimizing the user’s effort and the resource overhead on the retrieval system. Such a game-theoretic framework offers several benefits. First, it naturally suggests optimization of the overall utility of an interactive retrieval system over a whole search session, thus breaking the limitation of the traditional formulation that optimizes ranking of documents for a single query. Second, it models the interactions between users and a search engine, and thus can optimize the collaboration of a search engine and its users, maximizing the “combined intelligence” of a system and users. Finally, it can potentially serve as a unified framework for optimizing both interactive information retrieval and active relevance judgment acquisition through crowdsourcing. I will discuss how the new framework can not only cover several emerging directions in current IR research as special cases, but also open up many interesting new research directions in IR.
Link: slides
Friday, Jan. 30, 2015
Place: 0216SC
Title: DataHub: Collaborative Data Science & Dataset Version Management at Scale
Speaker: Prof. Aditya Parameswaran
Abstract: Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DATAHUB, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale.
Bio: Aditya Parameswaran is an Assistant Professor in Computer Science at the University of Illinois (UIUC). He spent the 2013-14 year visiting MIT CSAIL and Microsoft Research New England, after completing his Ph.D. from Stanford University, advised by Prof. Hector Garcia-Molina. He is broadly interested in data analytics, with research results in human computation, visual analytics, information extraction and integration, and recommender systems. Aditya is a recipient of the Arthur Samuel award for the best dissertation in CS at Stanford (2014), the SIGMOD Jim Gray dissertation award (2014), the SIGKDD dissertation award runner up (2014), the Key Scientific Challenges Award from Yahoo! Research (2010), three best-of-conference citations (VLDB 2010, KDD 2012 and ICDE 2014), the Terry Groswith graduate fellowship at Stanford (2007), and the Gold Medal in Computer Science at IIT Bombay (2007).
Link: video
Friday, Feb. 6, 2015
Place: 0216SC
Title: Machine Learning with World Knowledge
Speaker: Yangqiu Song
Abstract: Machine learning algorithms have become pervasive in multiple domains and have started to have impact in applications. However, a key obstacle in making learning protocol realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. However, while annotated data is difficult to get, we have available large amounts of data from the Web. In this talk, I will introduce new learning paradigms which use existing world knowledge to “supervise” machine learning algorithms. By “world knowledge” we refer to general-purpose knowledge collected from the Web, and that can be used to extract both common sense knowledge and diverse domain specific knowledge and thus help supervise machine learning algorithms. I will discuss two projects, demonstrating that we can perform better machine learning and text data analytics by adapting general-purpose knowledge to domain specific tasks. For the first project, I will introduce the dataless classification algorithm which requires no labeled data to perform completely unsupervised text classification. In this case, the knowledge is used to embed the text documents and the category labels into the same semantic space. For the second project, I will discuss how to perform hierarchical clustering of short texts, e.g., Web queries and tweets, using a probabilistic concept based knowledge base, Probase. In both cases, we provide realistic and scalable algorithms to address large scale and fundamental text analytics problems.
Bio: Dr. Yangqiu Song is a post-doctoral researcher at the Cognitive Computation Group at the University of Illinois at Urbana-Champaign. Before that, he was a post-doctoral fellow at Hong Kong University of Science and Technology and visiting researcher at Huawei Noah’s Ark Lab, Hong Kong (2012-2013), an associate researcher at Microsoft Research Asia (2010-2012) and a staff researcher at IBM Research China (2009-2010) respectively. He received his B.E. and PhD degrees from Tsinghua University, China, in July 2003 and January 2009, respectively. His current research focuses on using machine learning and data mining to extract and infer insightful knowledge from big data. The knowledge helps users better enjoy their daily living and social activities, or helps data scientists do better data analytics. He is particularly interested in working on large scale learning algorithms, on natural language understanding, text mining and visual analytics, and on knowledge engineering for domain applications.
Link: video
Friday, Feb. 13, 2015
Place: 0216SC
Title: Network A/B Testing: From Sampling to Estimation
Speaker: Huan Gui
Abstract: A/B testing, also known as bucket testing, split testing, or controlled experiment, is a standard way to evaluate user engagement or satisfaction from a new service, feature, or product. It is widely used in online websites, including social network sites such as Facebook, LinkedIn, and Twitter, to make data-driven decisions. The goal of A/B testing is to estimate the treatment effect of a new change, which becomes intricate when users are interacting, i.e., the treatment effect of a user may spill over to other users via underlying social connections. When conducting these online controlled experiments, it is a common practice to make the Stable Unit Treatment Value Assumption (SUTVA). Though this assumption simplifies the estimation of treatment effect, it does not hold when network interference is present, and may even lead to wrong conclusions.
In this paper, we study the problem of network A/B testing in real networks, which have substantially different characteristics from the simulated random networks studied in previous works. First, we examine the existence of a network effect in a recent online experiment conducted at LinkedIn. Second, we propose an efficient and effective estimator for Average Treatment Effect (ATE) that considers the interference between users in real online experiments. Finally, we apply our method in both simulations and a real world online experiment. The simulation results show that our estimator achieves better performance with respect to both bias and variance reduction. The real world online experiment not only demonstrates that large-scale network A/B testing is feasible but also further validates many of our observations in the simulation studies.
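To make the role of SUTVA concrete, the sketch below shows the standard difference-in-means estimate of ATE that the assumption justifies. It is only the conventional baseline, not the estimator proposed in the talk, and the function name and numbers are hypothetical. Because it treats each user's outcome as depending on that user's assignment alone, any spillover through social connections goes straight into the estimate as bias.

```python
def difference_in_means(outcomes, treated):
    """Baseline ATE estimate under SUTVA: mean(treated) - mean(control).

    `outcomes` is a list of per-user metric values and `treated` a parallel
    list of booleans. Interference between users is ignored entirely.
    """
    treat = [y for y, t in zip(outcomes, treated) if t]
    control = [y for y, t in zip(outcomes, treated) if not t]
    if not treat or not control:
        raise ValueError("need at least one treated and one control user")
    return sum(treat) / len(treat) - sum(control) / len(control)


# Hypothetical engagement scores for six users, three per bucket.
outcomes = [5.0, 7.0, 6.0, 3.0, 4.0, 2.0]
treated = [True, True, True, False, False, False]
ate_hat = difference_in_means(outcomes, treated)
print(ate_hat)
```

If treated users raise the engagement of their control-group friends, the control mean rises and this estimate understates the true effect, which is the kind of failure the abstract attributes to network interference.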
Bio: Huan Gui is a doctoral candidate advised by Prof. Jiawei Han in the Department of Computer Science, University of Illinois at Urbana-Champaign. She has been working on various topics in data mining and machine learning, with a focus on information network analysis. Huan has published in several major data mining and machine learning conferences, such as CIKM, NIPS and WWW. Before entering UIUC, she obtained a Bachelor's degree in Computer Science (Major) and Economics (Minor) at Peking University.
Friday, Feb. 20, 2015
Place: 0216SC
Title: Random Walks on Adjacency Graphs for Mining Paradigmatic and Syntagmatic Relationships
Speaker: Shan Jiang
Abstract: Paradigmatic and syntagmatic relations are two complementary types of relationships between elements in sequence data. The first is the relation between two elements that tend to occur in similar contexts, and the second holds between elements that usually co-occur. In this talk, we will introduce a possible way to discover these two relations based on the occurrence and co-occurrence patterns between elements, using random walks on an adjacency graph. We’ll start by representing sequence data as an adjacency graph. Then several types of random walk patterns are introduced to mine different relationships. Next, we’ll show some interesting results obtained in our experiments, and finally we’ll conclude our work and discuss possible future directions.
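As a rough illustration of the idea, the sketch below builds an adjacency graph from token sequences and computes one-step random walk probabilities. It is an assumption-laden reading of the abstract rather than the speaker's method: edges here link each element to its immediate successor, a high one-step probability stands in for a syntagmatic relation, and matching next-step distributions stand in for a paradigmatic one. The function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def adjacency_graph(sequences):
    """Directed graph: graph[a][b] counts how often b immediately follows a."""
    graph = defaultdict(Counter)
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            graph[left][right] += 1
    return graph


def step_probabilities(graph, node):
    """One-step random walk distribution over the successors of `node`."""
    successors = graph.get(node, Counter())
    total = sum(successors.values())
    if total == 0:
        return {}
    return {nxt: count / total for nxt, count in successors.items()}


# Hypothetical toy corpus of two-word sequences.
sequences = [
    ["drink", "tea"],
    ["drink", "coffee"],
    ["brew", "tea"],
    ["brew", "coffee"],
]
graph = adjacency_graph(sequences)
from_drink = step_probabilities(graph, "drink")
from_brew = step_probabilities(graph, "brew")
print(from_drink)               # co-occurrence: candidate syntagmatic pairs
print(from_drink == from_brew)  # same contexts: candidate paradigmatic pair
```

In this toy corpus "drink" and "brew" never co-occur, yet they lead to the same successors with the same probabilities, which is the kind of shared-context signal a paradigmatic relation is meant to capture.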
Bio: Shan Jiang is a second-year PhD student advised by Professor Chengxiang Zhai in the Department of Computer Science, University of Illinois at Urbana-Champaign. Her interests are text mining and information retrieval.
Friday, Feb. 27, 2015
Place: 0216SC
Title: Leveraging Pattern Semantics for Extracting Entities in Enterprises
Speaker: Fangbo Tao
Abstract: Entity Extraction is a process of identifying meaningful entities from text documents. In enterprises, extracting entities improves enterprise efficiency by facilitating numerous applications, including search, recommendation, etc. However, the problem is particularly challenging on enterprise domains due to several reasons. First, the lack of redundancy of enterprise entities makes previous web-based systems like NELL and OpenIE not effective, since using only high-precision/low-recall patterns like those systems would miss the majority of sparse enterprise entities, while using more low-precision patterns in sparse setting also introduces noise drastically. Second, semantic drift is common in enterprises (“Blue” refers to “Windows Blue”), such that public signals from the web cannot be directly applied on entities. Moreover, many internal entities never appear on the web. Sparse internal signals are the only source for discovering them. To address these challenges, we propose an end-to-end framework for extracting entities in enterprises, taking the input of enterprise corpus and limited seeds to generate a high-quality entity collection as output. We introduce the novel concept of Semantic Pattern Graph to leverage public signals to understand the underlying semantics of lexical patterns, reinforce pattern evaluation using mined semantics, and yield more accurate and complete entities. Experiments on Microsoft enterprise data show the effectiveness of our approach.
Bio: Fangbo Tao is a 3rd year Ph.D. student of Computer Science at University of Illinois at Urbana-Champaign. His advisor is Dr. Jiawei Han from Data Mining Group.
Friday, Mar. 6, 2015
Place: 0216SC
Title: TBA
Speaker: Professor Guozhu Dong
Abstract: Constructing accurate numerical prediction models is fundamental for many modeling and forecasting applications, including scientific modeling, medical modeling, economic forecasting, and severe weather forecasting. In this talk I will first introduce a new type of regression models, namely pattern aided regression (PXR) models. PXR models were motivated by two observations: (1) Regression modeling applications often involve complex diverse predictor-response relationships, which occur when the optimal regression models (of existing regression model types) fitting distinct subgroups of data of given application are highly different. (2) State-of-the-art regression methods are often unable to adequately model such relationships. Roughly speaking, a PXR model relies on several pattern and local regression model pairs, which respectively serve as logical and behavioral characterizations of distinct predictor-response relationships for several key subgroups of data of the application. I will also present a contrast pattern aided regression (CPXR) method, to build accurate PXR models. In experiments on 50 real datasets that were examined in previous regression studies, the PXR models built by CPXR are very accurate in general, often outperforming state-of-the-art regression methods by big margins. Specifically, CPXR is about 42% better (relatively) than the linear regression method on average, and it is about 22% better than the best competing regression method. It reduced prediction error of the linear regression method by 60–87% in 10 of the 50 datasets. Using around seven simple patterns on average and linear local regression models, those PXR models are easy to interpret. CPXR is especially effective for high-dimensional data. I will also discuss how to use CPXR methodology for analyzing prediction models and correcting their prediction errors. 
Finally I will discuss how to use CPXR for classification, including results on medical risk prediction of traumatic brain injury and heart failure.
Bio: Guozhu Dong is a full professor at Wright State University. He earned a PhD in Computer Science from the University of Southern California. His main research interests are data mining and machine learning, data science, bioinformatics, and databases. He has published over 150 articles and two books entitled “Sequence Data Mining” and “Contrast Data Mining,” and he holds 4 US patents. He is widely known for his work on contrast/emerging pattern mining and applications, and for his work on first-order maintenance of recursive and transitive closure queries/views. His papers have received 5900+ citations and his h-index is 36. He is a recipient of the Best Paper Awards from the 2005 IEEE ICDM and the 2014 PAKDD, and a recipient of the Research Excellence Award at College of CECS of WSU. He is a senior member of both IEEE and ACM.
Link: video
Friday, Mar. 13, 2015
Place: 0216SC
Title ResearchNet and NewsNet: Two Nets that May Excite Everyone
Speaker: Professor Jiawei Han
Abstract: This talk introduces two currently on-going projects in the Data Mining research group in DAIS. Our observation is that massive amounts of data are unstructured, noisy, untrustworthy, but interconnected, implicitly forming gigantic heterogeneous information networks. Methods can be developed to organize such unstructured data and construct multiple semi-structured heterogeneous information networks. One important reason that we would like to promote these projects is that research data and news data are widely available, accessible, and comprehensible by ourselves, forming ideal cases for research on construction and mining of heterogeneous information networks. In this talk, I am going to introduce the progress we have made so far in this direction and will outline the major challenges and work-plans at construction and mining of these networks. We believe these two networks may attract many smart minds. We hope more researchers will be able to contribute to these projects.
Bio: Jiawei Han is an Abel Bliss professor in engineering, in the Department of Computer Science at the University of Illinois. He has been researching into data mining, information network analysis, and database systems, with over 600 publications. He served as the founding editor-in-chief of the ACM Transactions on Knowledge Discovery from Data (TKDD) and on the editorial boards of several other journals. He has received ACM SIGKDD Innovation Award (2004), IEEE Computer Society Technical Achievement Award (2005), IEEE Computer Society W. Wallace McDowell Award (2009), and Daniel C. Drucker Eminent Faculty Award at UIUC (2011). He is currently the director of Information Network Academic Research Center (INARC) supported by the Network Science-Collaborative Technology Alliance (NS-CTA) program of US Army Research Lab. His book Data Mining: Concepts and Techniques (Morgan Kaufmann) has been used worldwide as a textbook. He is a fellow of the ACM and IEEE.
Link: slides video
Friday, Mar. 20, 2015
Place: 0216SC
Title Coreference Resolution with Knowledge
Speaker: Haoruo Peng
Abstract: Coreference resolution is a key problem in natural language understanding that still escapes reliable solutions. One fundamental difficulty has been that of resolving instances involving pronouns since they often require deep language understanding and use of background knowledge. In this talk, we propose an algorithmic solution that involves a new representation for the knowledge required to address hard coreference problems, along with a constrained optimization framework that uses this knowledge in coreference decision making. Our representation, Predicate Schema, is instantiated with knowledge acquired in an unsupervised way, and is compiled automatically into constraints that impact the coreference decision. We present a general coreference resolution system that significantly improves state-of-the-art performance on hard, Winograd-style, pronoun resolution cases, while still performing at the state-of-the-art level on standard coreference resolution datasets. In this talk, I will also introduce several other recent advances in coreference resolution.
Bio: Haoruo Peng is a second-year PhD student advised by Prof. Dan Roth in the Department of Computer Science, University of Illinois at Urbana-Champaign. He has been working on various topics in natural language processing, with a focus on coreference resolution. Haoruo has published in several ML and NLP conferences, such as ECML, IJCAI and NAACL. Before entering UIUC, he obtained his Bachelor's degree in Computer Science at Tsinghua University.
Friday, Apr. 3, 2015
Place: 0216SC
Title Mining Quality Phrases from Massive Text Corpora
Speaker: Jingbo Shang
Abstract: Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) will substantially reduce semantic ambiguity and enhance the power and efficiency at manipulating such data using database technology. Thus mining quality phrases is a critical research problem in the field of databases. In this paper, we propose a new framework that extracts quality phrases from text corpora integrated with phrasal segmentation. The framework requires only limited training but the quality of phrases so generated is close to human judgement. Moreover, the method is scalable: both computation time and required space grow linearly as corpus size increases. Our experiments on large text corpora demonstrate the quality and efficiency of the new method.
Bio: Jingbo Shang is a first-year Ph.D. student at UIUC supervised by Prof. Jiawei Han. His research interests lie in large scale data mining and machine learning problems. More specifically, he is now working on constructing heterogeneous information networks from massive corpora. He received a B.E. from Shanghai Jiao Tong University (SJTU) in 2014. As a leader of the team representing SJTU, he won 2nd place in ACM/ICPC World Finals 2013. He is also a recipient of the Computer Science Excellence Fellowship (UIUC) and the National Scholarship of China.
Friday, Apr. 10, 2015
Place: 0216SC
Title Structured Learning for Spatial Information Extraction from Biomedical Text
Speaker: Dr. Parisa Kordjamshidi
Abstract: The aim is to automatically extract species names of bacteria and their locations from webpages. This task is important for exploiting the vast amount of biological knowledge which is expressed in diverse natural language texts and putting this knowledge in databases for easy access by biologists. The task is challenging and the previous results are far below an acceptable level of performance, particularly for extraction of localization relationships. I design a new structured output prediction model for joint extraction of biomedical entities and the localization relationship. My model is based on a spatial role labeling (SpRL) model designed for spatial understanding of unrestricted text. I extend SpRL to extract discourse level spatial relations in the biomedical domain and apply it on the BioNLP-ST 2013, BB-shared task. I highlight the main differences between general spatial language understanding and spatial information extraction from the scientific text which is the focus of this work. I exploit the text’s structure and discourse level global features. The experimental results indicate that a joint learning model over all entities and relationships in a document outperforms a model which extracts entities and relationships independently. My global learning model significantly improves the state-of-the-art results on this task and has a high potential to be adopted in other natural language processing (NLP) tasks in the biomedical domain.
Bio: Parisa Kordjamshidi is a postdoctoral researcher at the University of Illinois at Urbana-Champaign, working in the Cognitive Computation Group. She obtained her PhD degree from KULeuven in July 2013. Her main research interests are machine learning and natural language understanding. During her PhD research she introduced the first Semantic Evaluation task and benchmark for Spatial Role Labeling. She has worked on structured output prediction and relational learning models to map natural language onto formal spatial representations, appropriate for spatial reasoning as well as to extract knowledge from biomedical text. Currently, she is involved in an NIH (National Institutes of Health) project, extending her research experience on structured and relational learning to Learning Based Programming (LBP) for biological data analysis. The results of her research have been published in several international peer-reviewed conferences and journals including ACM-TSLP, JWS, BMC-Bioinformatics.
Friday, Apr. 17 (2:00pm-3:15pm), 2015
Place: 0216SC
Title Democratizing Data Science
Speaker: Professor Bill Howe
Abstract: Advances from data science (and data-intensive science) appear to be derived primarily from the composition, integration, and broad application of existing techniques and technologies rather than (solely) the development of new techniques. But this problem of technology “delivery” receives relatively little research attention. At the UW eScience Institute and in the UW Database Group, we are building platforms to democratize advanced data management, curation, and analytics across all fields of science and across all levels of expertise. In this talk, I’ll describe our findings from a multi-year deployment of a database-as-a-service system called SQLShare, and recent results in the context of the Myria project, a federated data management and analytics system that supports multiple backend engines, iteration as a first-class citizen, new algorithms, built-in visualization and performance profiling, and a language interface that balances imperative and declarative features. I’ll wrap up with a tour of our efforts to develop organizational infrastructure to complement the software infrastructure, including an incubator program for interdisciplinary projects, new educational initiatives, and cross-campus collaborations in data-intensive science.
Bio: Bill Howe is the Associate Director of the UW eScience Institute and holds an Affiliate Faculty appointment in Computer Science & Engineering. His research interests are in data management, curation, analytics, and visualization in the sciences. Howe has received two Jim Gray Seed Grant awards from Microsoft Research for work on managing environmental data, has had two papers selected for VLDB Journal’s “Best of Conference” issues (2004 and 2010), and co-authored what are currently the most-cited papers from both VLDB 2010 and SIGMOD 2012. Howe serves on the program and organizing committees for a number of conferences in the area of databases and scientific data management, and developed a first MOOC on data science that attracted over 200,000 students across two offerings. He has a Ph.D. in Computer Science from Portland State University and a Bachelor’s degree in Industrial & Systems Engineering from Georgia Tech.
Friday, Apr. 24, 2015
Place: 0216SC
Title Exploring linguistic complexity of proteins
Speaker: Professor Jian Peng
Abstract: In this talk, I will present my understanding of the analogy between protein sequence and natural language. Inspired by this analogy, many machine learning techniques used in NLP can be applied to proteins. In particular, probabilistic graphical models provide a natural representation for both protein sequence and structure. I will introduce several graphical models for protein sequence modeling, structure prediction and function prediction. I will also discuss their analogies to the tasks in NLP. Finally, I will discuss some recent progress and potential future directions for graphical models.
Bio: Jian Peng is an Assistant Professor in Computer Science at UIUC. Before joining UIUC, he worked as a postdoctoral researcher at MIT CSAIL. He received his PhD from TTI-Chicago in 2013. His current research interests include network biology, large-scale genomics, approximate inference and probabilistic graphical models.
Jian is a recipient of a Microsoft Research Fellowship (2010), a Young Investigator Award in CROI (2011) and several best poster awards. His algorithms won the Crowdscale Challenge (2011) and the Breast cancer cell line pharmacogenomics challenge (2011), took 2nd place in several CASP protein structure prediction experiments (2008, 2010, 2012), and were selected as the most innovative method at the CASP 2009 meeting.
Friday, May 1, 2015
Place: 0216SC
Title TBA
Speaker: Professor Ranjitha Kumar

Fall 2014

Date Content
Tuesday, Aug. 26, 2014
Place: 0216SC
Title Overview of DAIS Research
Speaker: Prof. Kevin Chang, Prof. Aditya Parameswaran, Prof. Hari Sundaram, Prof. Jian Peng
Abstract: DAIS faculty will give a brief overview of their research.
Link video slides
Tuesday, Sept. 9, 2014
Place: 0216SC
Title Overview of DAIS Research (cont.)
Speaker: Prof. Jiawei Han, Prof. Chengxiang Zhai, Prof. Tandy Warnow, Prof. Saurabh Sinha
Abstract: DAIS faculty will give a brief overview of their research.
Link video slides
Tuesday, Sept. 16, 2014
Place: 0216SC
Title User Search Behaviors within the UIUC Gateway: A Transaction Log Analysis
Speaker: Prof. William H. Mischo
Abstract: Our knowledge of user searching patterns within library-based online systems, particularly within online catalogs, is incomplete and sometimes contradictory. Likewise, there is evidence that user searching behaviors in online bibliographic retrieval systems are different from user information seeking patterns in web search engines. To address these questions, the UIUC Library has been collecting and analyzing custom transaction log data over the last seven years from the Library gateway. These analyses have informed the development and implementation of search assistance mechanisms designed to facilitate search strategy modification and assist in search navigation. These search assistance mechanisms are deployed within the Library gateway in a locally developed federated search and recommender system called Easy Search. The transaction log analysis has generated data on terms per search query, queries per session, the use of search assistance, and the importance of supporting known-item searching.
Bio: William Mischo is Head, Grainger Engineering Library Information Center; Information Systems Research & Development Librarian; and Professor at the UIUC Library. He has been involved in numerous grant funded digital library projects, including the NSF Digital Library Initiative, several IMLS National Leadership grants, and two NSF National Science Digital Library grants. Bill has published some 70 articles and conference papers in library and information science and presented at more than 80 national and international conferences. He was the recipient of the 2009 Frederick G. Kilgour Award for Research in Library and Information Technology from the American Library Association and OCLC and the 2001 Homer I. Bernhardt Distinguished Service Award from the American Society for Engineering Education Engineering Libraries Division.
Link video
Tuesday, Sept. 23, 2014
Place: 0216SC
Title Generating a Billion Personalized Newspapers: How Facebook Ranks News Feed Stories and News Feed Ads
Speaker: Yintao Yu
Abstract: Facebook news feed ranking’s goal is to provide our users with over a billion personalized newspapers. We strive to provide the most compelling content to each user, personalized to them so that they are most likely to see the content that is most interesting to them. Carrying on the newspaper analogy, putting the right stories above the fold has always been critical to engaging customers and interesting them in the rest of the paper. In feed ranking, we face a similar challenge, but on a grander scale. Each time a user visits, we need to find the best piece of content out of all the available stories and put that at the top of feed where people are most likely to see it. To accomplish this, we do large-scale machine learning to model each user, figure out which friends, pages and topics they care about, and use whatever signals we can come up with to pick the stories each particular user is interested in. The typical user has well over 1500 stories available to them each day, but only has time to consume a small fraction of those, so it’s important that we separate the best stories from the rest. Ads, aka sponsored stories, are ranked in a similar way and we control the quality of ads carefully to ensure that Facebook users still have a high-quality experience with ads inserted in their news feed, while advertisers can drive their value. Then, we run auctions to select the best combinations of organic news feed stories and ads, and set a price per billing event for each selected ad. At the end of the talk, I will also talk about some findings and lessons we have learned when we built the large-scale machine learning systems used for Facebook news feed ranking and ads ranking.
Bio: Yintao Yu is a senior research scientist and tech lead at Facebook. At Facebook, his main contributions include building the current production machine learning system for Facebook news feed ads prediction models (click-through rate, conversion rate, quality, position discount models) from scratch, growing the project from himself to a team, replacing the legacy system in 2013, and drastically improving Facebook news feed ads revenue, quality and engagement metrics. Previously he has worked on Facebook DSP (Demand Side Platform), in which he led the real time bidding modeling, organic news feed ranking, ads auction mechanism and pricing, and right-hand side ads feature engineering. He received his B.S. in Electrical Engineering from Shanghai Jiao Tong University, and M.S. in Computer Science from University of Illinois at Urbana-Champaign, where he was also a Ph.D. candidate in data mining. He co-authored research papers that have received over 500 citations, and co-invented 4 U.S. patents.
Link video
Tuesday, Sept. 30, 2014
Place: 0216SC
Title Data Mining for Software Engineering: Achievements and Challenges
Speaker: Prof. Tao Xie
Abstract: A huge wealth of various data exists in the software life cycle, including source code, feature specifications, bug reports, test cases, execution traces/logs, and real-world user feedback. Data plays an essential role in modern software development, because hidden in the data is information about the quality of software and services as well as the dynamics of software development. In the past decade, data mining for software engineering has become a popular and high-impact subfield in software engineering. For example, the International Working Conference on Mining Software Repositories (MSR) has become the most attended co-located event with ICSE. In recent years, software analytics has emerged, attracting researchers to make a high impact on the software industry with data mining and other analytics techniques. In this talk, the speaker will present an overview of achievements and challenges in data mining for software engineering.
Bio: Tao Xie is an Associate Professor in the Department of Computer Science at University of Illinois at Urbana-Champaign, USA. He received his Ph.D. in Computer Science from the University of Washington in 2005. His research interests are in Software Engineering, focusing on software testing, program analysis, and software analytics. He was a Program Co-Chair of 2011 and 2012 International Working Conference on Mining Software Repositories (MSR). He is the Program Chair of 2015 International Symposium on Software Testing and Analysis (ISSTA).
Link slides
Tuesday, Oct. 7, 2014
Place: 0216SC
Title Unifying Why and Why-Not Explanations using First-Order Query Evaluation Games
Speaker: Prof. Bertram Ludaescher
Abstract: After a brief, high-level overview of my current research areas and projects, I will focus on a foundational problem in database theory, i.e., how to explain the presence or absence of a tuple in a query result. To this end, I will present a new model of provenance, based on a game-theoretic approach to query evaluation: First, we’ll consider graph-based games G in their own right, and ask how to explain that a position X in a game graph G is won, lost, or drawn. The resulting notion of game provenance is closely related to winning strategies, and excludes from provenance all “bad moves”, i.e., those which unnecessarily allow the opponent to improve the outcome of a play. In this way, the value of a position is determined solely by its game provenance. We then define provenance games by viewing the evaluation of a first-order query as a game between two players who argue whether a tuple is in the query answer or not. For positive relational algebra (RA+) queries, game provenance is equivalent to the most general semiring of provenance polynomials NX. Variants of our game yield other known semirings. However, unlike semiring provenance, game provenance also provides a very natural, “built-in” way to handle negation and thus to answer Why-Not questions: In (provenance) games, the reason why X is not won is the same as why X is lost or drawn (the latter is possible for games with draws). Since first-order provenance games are draw-free, they yield a new provenance model that elegantly combines how- and why-not provenance.
Bio: Bertram Ludäscher is a professor at the Graduate School of Library and Information Science (GSLIS). Prior to joining the iSchool at Illinois he was a professor at the Department of Computer Science and the Genome Center at the University of California, Davis. His research interests span the whole data to knowledge life-cycle, from modeling and design of databases and workflows, to knowledge representation and reasoning. His current research focus includes both theoretical foundations of provenance and practical applications, in particular to support automated data quality control and workflow-supported data curation. He is one of the founders of the open source Kepler scientific workflow system, and a member of the DataONE leadership team, focusing on data and workflow provenance. Until 2004 Ludäscher was a research scientist at the San Diego Supercomputer Center (SDSC) and an adjunct faculty member at the CSE Department at UC San Diego. He received his M.S. (Dipl.-Inform.) in computer science from the Technical University of Karlsruhe (now: K.I.T.) and his PhD (Dr.rer.nat.) from the University of Freiburg, both in Germany.
Link video
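A minimal sketch of the won/lost/drawn labeling of game-graph positions mentioned in the Oct. 7 abstract. It assumes the usual convention that a player who cannot move loses, which the abstract does not state. The function name solve_game and the example graph are hypothetical illustrations, not the speaker's implementation, and the sketch covers only position values, not game provenance itself.

```python
def solve_game(moves):
    """Label each position 'won', 'lost', or 'drawn' for the player to move.

    `moves` maps a position to the list of positions reachable in one move.
    Assumed convention: a position with no outgoing moves is lost.
    """
    positions = set(moves)
    for targets in moves.values():
        positions.update(targets)

    label = {}
    changed = True
    while changed:  # iterate until no further position can be decided
        changed = False
        for p in positions:
            if p in label:
                continue
            targets = moves.get(p, [])
            if any(label.get(t) == "lost" for t in targets):
                # Some move leaves the opponent in a lost position.
                label[p] = "won"
                changed = True
            elif all(label.get(t) == "won" for t in targets):
                # Every move (possibly none) leaves the opponent winning.
                label[p] = "lost"
                changed = True

    # Positions that were never decided are drawn (e.g. cycles).
    return {p: label.get(p, "drawn") for p in positions}


if __name__ == "__main__":
    example = {"a": ["b"], "b": ["c"], "c": [], "d": ["e"], "e": ["d"]}
    for position, value in sorted(solve_game(example).items()):
        print(position, value)
```

In the example graph, position c has no moves, so it is lost; b can move to c, so it is won; a can only move to b, so it is lost; d and e form a cycle and stay drawn.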
Tuesday, Oct. 14, 2014
Place: 0216SC
Title Fast Topic Discovery in Large Document Collections by Nonnegative Matrix Factorization
Speaker: Prof. Haesun Park
Abstract: Nonnegative matrix factorization (NMF) has been an important tool of choice for numerous data analytic problems in text analysis, image analysis, computer vision, and other areas. A distinguishing feature of NMF is the requirement of nonnegativity in the factors that represent the matrix in a lower rank, which enhances the interpretability and modeling capability for many applications. In this talk, we show some foundational properties of NMF and offer new methods using the NMF framework for efficient and effective hierarchical clustering and topic modeling of large scale text data for multi-scale analysis. In addition, we present an interactive visual analytics system that allows interactive topic discovery for producing more relevant solutions than a completely automated method. Our substantial experimental results show that rank-2 NMF based hierarchical and flat topic discovery methods called HierNMF2 and FlatNMF2 are far superior to other existing methods such as LDA (Latent Dirichlet Allocation) and k-means in terms of both scalability and solution quality, and are more amenable to interactive visual analytics due to the more consistent results they produce. This work is supported in part by DARPA XDATA and NSF/DHS FODAVA programs.
Bio: Dr. Haesun Park is a SIAM Fellow and professor in the School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, U.S.A. Her research areas include numerical algorithms, data analysis, visual analytics, text mining, and parallel computing. She has played major leadership roles in these areas as the Executive Director of the Center for Data Analytics, Georgia Tech, Director of NSF/DHS funded FODAVA (Foundations of Data and Visual Analytics) Center, general chair for the SIAM Conference on Data Mining, and editorial board member of SIAM and IEEE journals, and as a plenary keynote speaker at major conferences. She received a Ph.D. and M.S. in Computer Science from Cornell University in 1987 and 1985, and B.S. in Mathematics from Seoul National University with the University President’s Medal for the top graduate.
Link video
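To illustrate the nonnegativity requirement described in the Oct. 14 abstract, here is a minimal pure-Python sketch of NMF using the classic multiplicative update rules. This is an assumption made for illustration: it is not the HierNMF2 or FlatNMF2 method from the talk, and the function names and the small example matrix are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    ) ** 0.5


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate a nonnegative matrix V (n x m) as W (n x rank) times H (rank x m).

    Uses multiplicative updates, which keep every entry of W and H nonnegative
    because each step multiplies by a ratio of nonnegative quantities.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


if __name__ == "__main__":
    # Hypothetical term-document counts with two visible "topics".
    v = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 4, 1], [0, 0, 1, 4]]
    w, h = nmf(v, rank=2)
    print("reconstruction error:", frobenius_error(v, w, h))
```

Each column of W can be read as a topic (a nonnegative weighting over terms) and each column of H as the topic mixture of one document, which is what makes the factors interpretable.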
Tuesday, Oct. 21, 2014
Place: 0216SC
Title Learning and Mining in Large-scale Time Series Data
Speaker: Prof. Yan Liu
Abstract: Many emerging applications of machine learning involve time series and spatio-temporal data. In this talk, I will discuss a collection of machine learning approaches to effectively analyze and model large-scale time series and spatio-temporal data, including temporal causal models, sparse extreme-value models, and fast tensor-based forecasting models. Experiment results will be shown to demonstrate the effectiveness of our models in practical applications, such as climate science, social media and biology.
Bio: Yan Liu has been an assistant professor in the Computer Science Department at the University of Southern California since 2010. Before that, she was a Research Staff Member at IBM Research. She received her M.Sc. and Ph.D. degrees from Carnegie Mellon University in 2004 and 2006. Her research interests include developing scalable machine learning and data mining algorithms with applications to social media analysis, computational biology, climate modeling and business analytics. She has received several awards, including the NSF CAREER Award, Okawa Foundation Research Award, ACM Dissertation Award Honorable Mention, Best Paper Award at the SIAM Data Mining Conference, and Yahoo! Faculty Award, and she has won several data mining competitions, such as KDD Cup and the INFORMS data mining competition.
Link video
Tuesday, Oct. 28, 2014
Place: 0216SC
Title Natural Language Processing Tools from the Cognitive Computation Group
Speaker: Mark Sammons
Abstract: The Cognitive Computation Group (CCG) at the University of Illinois has developed and released a number of state-of-the-art Natural Language Processing (NLP) tools. These include the best available named entity recognizer, co-reference resolution tool, semantic role labeler, Wikifier, and others. These tools support deeper analysis of natural language English text for text analytics, data mining and other practical applications and are being used broadly in the research community and commercially. This presentation gives an overview of CCG’s NLP software, with an emphasis on programmatic use for new applications. It will also introduce Learning-Based Java, which enables direct programmatic use of supervised learning algorithms as an integral part of new applications.
Bio: Mark Sammons is a Principal Research Scientist working with the Cognitive Computation Group at the University of Illinois at Urbana-Champaign. His primary interests are in Natural Language Processing and Machine Learning. Mark has published on several topics in Natural Language Processing and has co-authored a book on Textual Entailment. In addition to his research into Recognizing Textual Entailment, Co-reference, and Spelling and Grammar Correction, he coordinates and participates in the Cognitive Computation Group’s development and release of NLP software. Mark received his MSC in Computer Science from the University of Illinois in 2004, and his PhD in Mechanical Engineering from the University of Leeds, England, in 2000.
Link video
Tuesday, Nov. 4, 2014
Place: 0216SC
Title New HMM-based Methods in Sequence Alignment, Phylogenetics, and Metagenomics
Speaker: Prof. Tandy Warnow
Abstract: Multiple sequence alignment of datasets containing many thousands of sequences is a challenging problem with applications in phylogeny estimation, protein structure and function prediction, taxon identification of metagenomic data, etc. However, few methods can analyze large datasets, and none have been shown to have good accuracy on datasets with more than about 10,000 sequences, especially if the sequence datasets have evolved with high rates of evolution.
In this talk, I will present a new method to obtain highly accurate estimations of large-scale multiple sequence alignments and phylogenies. The basic idea is to use an ensemble of Hidden Markov Models (HMMs) to represent a “seed alignment”, and then align all the remaining sequences to the seed alignment. Our method, UPP, returns very accurate alignments, and trees on these alignments are also very accurate, even on datasets with as many as 1,000,000 sequences, or datasets that contain many fragmentary sequences. Furthermore, UPP is both fast and very scalable, so that analysis of the 1-million taxon dataset took only 24 hours using 12 cores and small amounts of memory. Finally, this Ensemble of HMMs technique improves the accuracy of methods for other bioinformatics problems, including phylogenetic placement and taxon identification of metagenomic data.
This is joint work with Nam-phuong Nguyen and Siavash Mirarab.
Bio: Tandy Warnow is the Founder Professor of Bioengineering and Computer Science at the University of Illinois at Urbana-Champaign; she is also an affiliate in the Departments of Mathematics, Entomology, and Statistics, and a member of the Institute for Genomic Biology. Tandy’s research combines mathematics, computer science, and statistics to develop improved models and algorithms for reconstructing complex and large-scale evolutionary histories in both biology and historical linguistics. Tandy received her PhD in Mathematics at UC Berkeley under the direction of Gene Lawler, and did postdoctoral training with Simon Tavare and Michael Waterman at USC. She received the National Science Foundation Young Investigator Award in 1994, the David and Lucile Packard Foundation Award in Science and Engineering in 1996, a Radcliffe Institute Fellowship in 2006, and a Guggenheim Foundation Fellowship for 2011. Her current research focuses on phylogeny and alignment estimation for very large datasets (10,000 to 1,000,000 sequences), estimating species trees and phylogenetic networks from collections of gene trees, and metagenomics. http://tandy.cs.illinois.edu
Link video
Tuesday, Nov. 11, 2014
Place: SC2016
Title Scalable learning algorithms for structured prediction
Speaker: Kai-Wei Chang
Abstract: In many machine learning tasks, high accuracy requires training on a lot of data, adding more expressive features and/or exploring complex input and output structures, often resulting in scalability problems. Nevertheless, we observe that by carefully selecting and caching samples, structures, or latent items, we can reduce the problem size and improve the training speed and eventually improve performance. Based on this observation, we develop efficient algorithms for learning structured prediction models and online clustering models. We show that our selective algorithms and caching techniques are able to learn expressive models from large amounts of annotated data and achieve state-of-the-art performance on several natural language processing tasks.
Bio: Kai-Wei Chang is a doctoral candidate advised by Prof. Dan Roth in the Department of Computer Science, University of Illinois at Urbana-Champaign. He has been working on various topics in Machine learning and Natural Language Processing, including large-scale learning, structured learning, coreference resolution, and relation extraction. Kai-Wei was awarded the KDD Best Paper Award in 2010 and won the Yahoo! Key Scientific Challenges Award in 2011. He was also involved in developing machine learning packages such as LIBLINEAR and Illinois-SL.
Link video
Tuesday, Nov. 18, 2014
Place: 0216SC
Title TBA
Speaker: Xiang Ren
Abstract: TBA
Bio: TBA
Link video
Tuesday, Dec. 2, 2014
Place: 0216SC
Title TBA
Speaker: Sean Massung
Abstract: TBA
Bio: TBA
Link video

Spring 2014

Date Content
Wednesday, Jan. 22, 2014
Place: SC3405
Title Big graph search and analytics: a journey of usability and scalability
Speaker: Yinghui Wu
Abstract: Real-life graphs are messy and huge. These bring two challenges to the applications of graph data analytics: how to make real-life graphs usable and useful, and how to scale graph data analytics to the growth of data? In this talk, I will share our experience on the journey of improving usability and scalability for big graph analytics, and in particular, for the general graph search problem. (1) Query writing and result understanding are among the first daunting tasks for end users. We proposed summarization techniques to help users understand complex results and refine their search, without inspecting answers one by one. (2) This said, potential matches are hard to capture using conventional similarity metrics even for refined search. We developed transformation-based graph search to identify these matches. Specifically, we propose efficient ontology-based graph search to harvest external ontologies for interpreting query semantics. (3) We further examine the challenge of automatically learning a proper ranking model, integrating a set of transformations that leads to top-ranked matches. Putting these together, we propose a user-friendly graph search system that enables easy graph data access, search and exploration. Finally, I will briefly introduce our ongoing work on network causality analysis, a real-life application of graph analytics.
Short Bio: Yinghui Wu is a research scientist at the Department of Computer Science, University of California Santa Barbara, and a member of the Network Science Collaborative Technology Alliance (NS-CTA). His research interests mainly focus on graph databases, graph analytics and network science, with applications in social/information network analytics and network security. He received his Ph.D. from the University of Edinburgh in 2010.
Link video
Wednesday, Jan. 29, 2014
Place: SC3405
Title Entity Recommendation in Heterogeneous Information Networks
Speaker: Xiao Yu
Abstract: Recommender systems, which provide users with recommendations for products or services, have seen widespread implementation in various domains. In many scenarios, the entity recommendation problem exists in a heterogeneous information network environment with multi-typed relationships between users and entities. In this talk, Xiao will first explore the relationship heterogeneity in information networks and introduce an entity recommendation approach which generates personalized recommendation models for different users. Motivated by this study, he will then introduce a large-scale real-world application: a personalized entity recommendation system for search engine users, which uses search engine user logs and the Freebase knowledge graph to integrate entity recommendation into users’ search experience. A scalable, robust and time-aware recommendation framework is proposed for this application. Experiments demonstrate the effectiveness of the proposed approaches in both studies.
Short Bio: Xiao Yu is a Ph.D. candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He is advised by Prof. Jiawei Han. Xiao is broadly interested in data mining, information retrieval and machine learning, with a focus on entity search and recommendation in information networks, cyber-physical network analysis and large-scale data mining algorithms and applications. Xiao has over 20 publications in major data mining and information retrieval journals and conferences, such as KDD, WSDM, SDM and ICDE.
Link video
Wednesday, Feb. 5, 2014
Place: SC3405
Title Towards Large Scale Open Domain Natural Language Processing
Speaker: Gourab Kundu
Abstract: Machine Learning and Inference methods are becoming ubiquitous: a broad range of scientific advances and technologies rely on machine learning techniques. In particular, the big data revolution heavily depends on our ability to use statistical machine learning methods to make sense of the large amounts of data we have.
Research in Natural Language Processing has both benefited from and contributed to the advancement of machine learning and inference methods. However, multiple problems still hinder the broad application of some of these methods. Domain adaptation is one of the key problems hindering widespread deployment of natural language processing tools. In this talk, I will present techniques for domain adaptation “on the fly” that allow adaptation to test domains using the same model from the training domain, thus saving time and making it possible to adapt complex pipeline systems as black boxes. For this, we formulate the prediction problem as an integer program where task- and domain-specific knowledge is incorporated as constraints. Formulating the prediction problem as an integer program is currently widespread in NLP, in tasks such as semantic role labeling, sentiment analysis, and dependency parsing. The latter part of the talk will focus on improving the scalability of all these tools with a complex prediction stage to meet the challenges of big data. I will show how we can amortize the cost of prediction over the lifetime of any NLP tool if the prediction problem can be represented as an integer linear program. I will present exact and approximate theorems for reusing solutions of integer programs from the past to speed up the solution time of future integer programs.
Short Bio: Gourab Kundu is a doctoral candidate in the Computer Science Department of the University of Illinois at Urbana-Champaign. He is supervised by Professor Dan Roth. He has also worked at IBM Research and Google for summer internships. He has worked on a range of NLP problems such as semantic role labeling, named entity recognition, and entity relation extraction. He is broadly interested in transfer learning and large scale inference. He has published in top tier NLP conferences along with a best student paper in CoNLL 2011.
Link video
Wednesday, Feb. 12, 2014
Place: SC3405
Title Big Network Analytics: Online and Active learning Approaches
Speaker: Quanquan Gu
Abstract: We are living in the Internet Age, in which information entities and objects are interconnected, thereby forming gigantic information networks. Examples of real-world information networks include social networks, bibliographic networks, gene regulation and protein interaction networks, knowledge graphs, and the World Wide Web. It is critical to quickly process and understand these networks in order to enable data-driven applications. However, there are two main challenges in analyzing big networks. First, modern networks grow and evolve over time, so we require learning algorithms that are able to work on the fly and adapt to the variation of the networks. Second, the labels of the nodes or edges in big networks are scarce, so it is urgent to optimize the process by which the labels are collected. In this talk, to address the above challenges, I will present several online and active learning algorithms for big network analytics, which are both statistically and computationally efficient, with provable guarantees on their performance. Empirical studies on real-world networked data validate the effectiveness of the proposed algorithms.
Short Bio: Quanquan Gu is a Ph.D. candidate in the Department of Computer Science, University of Illinois at Urbana-Champaign, supervised by Prof. Jiawei Han. He received his MS and BS degrees from Tsinghua University, China. He is the recipient of an IBM PhD Fellowship for 2013-2014. His main research interests include theory and algorithms for data mining and machine learning, with a focus on networked data.
Link video
Wednesday, Feb. 19, 2014
Place: SC3405
Title Distributed Optimization over Graphs
Speaker: Angelia Nedich
Abstract: Recent advances in wired and wireless technology necessitate the development of theory, models and tools to cope with new challenges posed by large-scale networks and various problems arising in current and anticipated applications over such networks. In this talk, optimization problems and algorithms for distributed multi-agent networked systems will be discussed. The distributed nature of the problem is reflected in agents having their own local (private) information while they have a common goal to optimize the sum of their objectives through some limited information exchange. The inherent lack of a central coordinator is compensated through the use of the network to communicate certain estimates and the use of appropriate local-aggregation schemes. The overall approach allows agents to achieve the desired optimization goal without sharing the explicit form of their locally known objective functions. However, the agents are willing to cooperate with each other locally to solve the problem by exchanging some estimates of relevant information. Distributed algorithms will be discussed for synchronous and asynchronous implementations together with their basic convergence properties. Special attention will be devoted to directed graphs.
Short Bio: Angelia Nedich received her B.S. degree from the University of Montenegro (1987) and M.S. degree from the University of Belgrade (1990), both in Mathematics. She received her Ph.D. degrees from Moscow State University (1994) in Mathematics and Mathematical Physics, and from Massachusetts Institute of Technology in Electrical Engineering and Computer Science (2002). She was at BAE Systems Advanced Information Technology from 2002 to 2006. In Fall 2006, she joined the Department of Industrial and Enterprise Systems Engineering at the University of Illinois at Urbana-Champaign, USA, as an Assistant Professor. Her general interest is in optimization including fundamental theory, models, algorithms, and applications. Her current research interest is focused on large scale convex optimization, distributed multi-agent optimization, and duality theory with applications in decentralized optimization. She received an NSF Faculty Early Career Development (CAREER) Award in 2008 in Operations Research.
Link video
Monday, Feb. 24, 2014
Place: SC3405
Title Toward Multi-level Query Understanding – From Query Lexicon to Query Semantics
Speaker: Yanen Li
Abstract: Search technologies have significantly transformed the way people seek information and acquire knowledge from the internet. To further improve the search accuracy and usability of the current-generation search engines, one of the most important research challenges is to understand a user’s intent or information need underlying the query. However, understanding a query in the form of plain text is a non-trivial task. In this talk I will first introduce a framework in which a query is interpreted and represented at multiple levels. Then I will briefly overview our efforts on addressing key research questions from query lexicon and query syntax to query semantic understanding. In the rest of the talk I will present our recent work on query auto-completion, in which we aim at predicting query representation given only a short prefix.
Short Bio: Yanen Li is a 5th-year Ph.D. student in the Department of Computer Science at the University of Illinois at Urbana-Champaign; his Ph.D. advisor is Prof. ChengXiang Zhai. His research interests include information retrieval, data mining and medical informatics, with special focus on systematic query understanding in web search by mining query logs. He is a winner of the Microsoft Speller Challenge 2011. Before entering UIUC, he obtained both his Bachelor's and Master's degrees at the Department of Computer Science at Huazhong University of Science and Technology, China.
Link video
Wednesday, Mar. 5, 2014
Place: SC3405
Title MedSafe: Measurement-driven Accident Analysis for Safety-critical Medical Devices
Speaker: Homa Alemzadeh
Abstract: Medical device incidents are one of the major causes of serious injury and death in the United States. In 2011, about 1,190 recalls, 92,600 patient injuries, and 4,590 deaths were reported to the US Food and Drug Administration (FDA). The FDA recalls and adverse event reports provide valuable insights on the past failures and safety issues of medical devices and how the designs could be improved to prevent catastrophic patient impacts in the future. However, those reports are mainly composed of unstructured natural language text written by the manufacturers and volunteer reporters and are often difficult to analyze without considering domain-specific semantics and contextual factors. We present MedSafe, a framework for automated analysis of medical device reports to identify the causes of device failures and their impact on patients. We propose an ontology model based on the control-system structures that involve humans in the loop, to formalize the semantic interpretation of the reports and facilitate causal analysis of accidents. We demonstrate the effectiveness of MedSafe by showing sample results on analysis of 18,200 recall records reported for various types of medical devices during 2006-2013, and about 5,400 adverse events reported for robotic surgical systems, over the 13-year period of 2000-2012.
Short Bio: Homa Alemzadeh is a PhD candidate in electrical and computer engineering and a graduate research assistant at the Coordinated Science Laboratory at UIUC. She received her BSc and MSc degrees in computer engineering from the University of Tehran, Iran. Her research interests include measurement-based dependability evaluation and accident analysis, hardware-based techniques for improving safety and reliability, and design of medical monitoring systems.
Link video
Wednesday, Mar. 12, 2014
Place: SC3405
Title Lost in Publications? Let Text Mining Help!
Speaker: Zhiyong Lu
Abstract: The explosion of biomedical information in the past decade or so has created new opportunities for discoveries to improve the treatment and prevention of human diseases. But the large body of knowledge mostly captured as free text in journal articles and the interdisciplinary nature of biomedical research also present a grand new challenge: how can scientists and health care professionals find and assimilate all the publications relevant to their research and practice? In this regard, in the first part of the talk, I will present our research on text mining and its application for improved information access for the worldwide scientific community. Real-world use cases of text mining research in PubMed will be demonstrated. Next, I will present our effort on computer-assisted literature curation, with a focus on our recent experience in BioCreative, a community-based worldwide challenge event in biomedical text mining.
Short Bio: Dr. Lu is a Stadtman investigator at the National Institutes of Health, which he joined immediately after earning a PhD in Bioinformatics at the University of Colorado School of Medicine. His research group is developing computational methods for analyzing and making sense of natural language data in biomedical literature and clinical text. Several of his recent research results have been successfully integrated into and widely used in PubMed and other NCBI databases. Dr. Lu is an Associate Editor for BMC Bioinformatics and serves on the editorial board of the journal Database. He is also involved in the organization of several international scientific meetings such as the BioCreative challenge series, PSB sessions on computational drug repurposing, and the IEEE conference on health informatics.
Link video
Wednesday, Mar. 19, 2014
Place: SC3405
Title Similarity Query Processing Techniques for Text Data
Speaker: Younghoon Kim
Abstract: With the widespread use of the internet, text-based data sources have become ubiquitous and the demand for effective support of similarity matching queries in text data continues to increase. While the applications for text similarity queries are diverse, similarity queries are essential and useful in many applications. In this talk, I will first introduce the optimal and approximate exact substring matching algorithms to find the best query plan utilizing inverted variable-length gram indexes. Then, I will present efficient algorithms for top-k approximate substring matching utilizing our novel lower bounds for substring edit distance. Furthermore, I will briefly introduce a parallel algorithm developed for top-k approximate string joins.
Short Bio: Younghoon Kim is a postdoctoral researcher at the Department of Computer Science of UIUC hosted by Professor Jiawei Han. He received a Ph.D. under the supervision of Professor Kyuseok Shim from Seoul National University in 2013 and a B.S. degree in Computer Science from Seoul National University in 2006. He has been working in the area of substring query processing in databases and text mining using probabilistic modeling in social network services.
Link video
Friday, Apr. 4, 2014
Place: SC0216
Title Big Trajectory Data: from fundamentals to performance
Speaker: Xiaofang Zhou
Abstract: Spatial trajectory data record movement history of objects in the geographical space. They can be used to find behaviours and patterns and make predictions for individual objects as well as a group of objects. Spatiotemporal data management and query processing have been an active research topic over the last three decades, spanning a wide range of areas including databases, geographical information systems and data mining. With more and more trajectory data available and an increasing amount of interest from business communities, we now need to revisit trajectory database research from some basic questions such as trajectory data representation in databases and trajectory similarity measures, to more advanced questions such as how we can take advantage of modern hardware platforms to support TB level trajectory data processing. In this talk we will share our thoughts on these issues, and discuss some recent work at the University of Queensland.
Short Bio: Xiaofang Zhou is a Professor of Computer Science at the University of Queensland. He received his BSc and MSc degrees in Computer Science from Nanjing University, China, and PhD in Computer Science from the University of Queensland. Before joining UQ in 1999, he worked as a researcher at the Commonwealth Scientific and Industrial Research Organisation (CSIRO) in Australia, leading its Spatial Information Systems group. He has been working in the area of spatial and multimedia databases, data quality, high performance query processing, Web information systems and bioinformatics, and has co-authored over 250 research papers, with many published in top journals and conferences such as SIGMOD, VLDB, ICDE, ACM Multimedia, The VLDB Journal, ACM Transactions and IEEE Transactions. He was a Program Committee Co-chair of the 29th International Conference on Data Engineering (ICDE 2013), and a General Co-chair of the ACM Multimedia conference in 2015. He has been on the program committees of numerous international conferences, including SIGMOD, VLDB, ICDE, WWW and ACM Multimedia. Currently he is an Associate Editor of The VLDB Journal, IEEE Transactions on Cloud Computing, World Wide Web Journal, and Distributed and Parallel Databases. He is a current member of the IEEE Technical Committee on Data Engineering (TCDE) Executive Committee, the IEEE TCDE Award Committee, and the Steering Committees of DASFAA, WISE, APWeb and Australasian Database Conferences. In the past he was an Associate Editor of IEEE Transactions on Knowledge and Data Engineering (2009-2013) and Information Processing Letters. Xiaofang is a specially appointed Adjunct Professor under the Chinese National Qianren Scheme hosted by Renmin University of China (2010-2013), and by Soochow University since July 2013 where he leads the Research Center on Advanced Data Analytics (ADA).
Link video