Spring 2020 DAIS PhD Qualification Exam Reading List

Data Mining

  • Basic Concepts
    • Data mining: frequent pattern mining (be able to list half a dozen typical methods; a minimal Apriori-style sketch appears at the end of this section), sequential pattern mining (be able to list at least four or five typical methods), correlation analysis, classification (be able to list half a dozen typical methods), clustering (be able to list half a dozen typical methods)
    • Basic machine learning and deep learning concepts
    • Data warehousing: star schema, data cube (be able to list half a dozen typical data cube computation methods; a toy full-cube computation sketch appears at the end of this section), multi-dimensional analysis (OLAP)
  • Background
    • J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd edition, Chapters 2-10, Morgan Kaufmann, 2011.
  • More advanced topics
    • Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss, Jiawei Han, “Automated Phrase Mining from Massive Text Corpora”, IEEE Transactions on Knowledge and Data Engineering, 30(10):1825-1837 (2018)
    • Fangbo Tao, Chao Zhang, Xiusi Chen, Meng Jiang, Tim Hanratty, Lance Kaplan, and Jiawei Han, “Doc2Cube: Allocating Documents to Text Cube without Labeled Data”, in Proc. of 2018 IEEE Int. Conf. on Data Mining (ICDM’18), Singapore, Nov. 2018
    • Leman Akoglu, Hanghang Tong, Danai Koutra: Graph based anomaly detection and description: a survey. Data Min. Knowl. Discov. 29(3): 626-688 (2015)
    • Chen Chen, Ruiyue Peng, Lei Ying, Hanghang Tong: Network Connectivity Optimization: Fundamental Limits and Effective Algorithms. KDD 2018: 1167-1176
    • Harshay Shah, Suhansanu Kumar, and Hari Sundaram, Growing attributed networks through local processes, The World Wide Web Conference – WWW ’19, ACM Press, May 2019, pp. 3208–3214. https://doi.org/10.1145/3308558.3313640
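
A minimal, level-wise (Apriori-style) sketch of frequent itemset mining, referenced from the Basic Concepts list above. The transaction database and the min_support threshold are made-up toy values, and the code illustrates only the candidate-generate-and-prune idea; other methods on your list (e.g., FP-growth) avoid repeated candidate generation and database scans.

```python
# A toy level-wise (Apriori-style) frequent itemset miner; the transaction
# database and min_support threshold in __main__ are made-up values.
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count."""
    transactions = [set(t) for t in transactions]
    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune any candidate with an infrequent (k-1)-subset (Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Count support with one scan over the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result

if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]]
    for itemset, count in sorted(apriori(db, 3).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
        print(sorted(itemset), count)
```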
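
In the same spirit, a toy sketch of full data cube computation: one SUM aggregate per group-by (cuboid) over every subset of the dimensions. The fact table, dimensions, and measure below are invented; real cube computation methods on your list (e.g., BUC, Star-Cubing) share work across cuboids and prune rather than scanning the table once per cuboid.

```python
# A toy full data cube: one SUM aggregate per group-by (cuboid) over every
# subset of the dimensions. The fact table, dimensions, and measure are invented.
from collections import defaultdict
from itertools import combinations

def full_cube(rows, dimensions, measure):
    """Map each dimension subset (a cuboid) to its group-by SUM aggregates."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):           # one cuboid per subset
            agg = defaultdict(float)
            for row in rows:
                agg[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(agg)
    return cube

if __name__ == "__main__":
    fact_table = [
        {"time": "2020-Q1", "location": "US", "product": "A", "sales": 10.0},
        {"time": "2020-Q1", "location": "EU", "product": "A", "sales": 7.0},
        {"time": "2020-Q2", "location": "US", "product": "B", "sales": 5.0},
    ]
    cube = full_cube(fact_table, ["time", "location", "product"], "sales")
    print(cube[()])                    # apex cuboid: total sales
    print(cube[("time",)])             # roll-up: sales by time only
    print(cube[("time", "location")])  # drill-down: sales by time and location
```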

Database Management Systems

  • Basic concepts
    • Hardware: disk sector, track, block, seek, latency, how to lay out a database page (a minimal slotted-page sketch appears at the end of this section)
    • Data modeling: ER, OO, and Object-Relational approaches
    • Concurrency control and recovery: ACID, serializability, two-phase locking, two-phase commit, logging and recovery, the impact of data replication
    • Theory: normalization, dependencies (an attribute-closure sketch appears at the end of this section)
    • Queries: access methods (hashing, B-trees, multidimensional access methods), how to optimize a query, SQL
    • Benchmarks: TPC-C and TPC-H, OLAP, Data Cubes
  • Background
    • You can use any database textbook you like to study the basic concepts listed above; for example, CS411 covers them. Note that you will be expected to demonstrate your understanding of these concepts by applying them, not simply by defining them. A good reference for the background material is the “Database Systems” textbook by Garcia-Molina, Widom, and Ullman. In addition, we would like you to study the following papers:
      • E. F. Codd: A Relational Model of Data for Large Shared Data Banks. CACM 13(6): 377-387. 1970
      • J. M. Hellerstein, M. Stonebraker, J. Hamilton: Architecture of a Database System. Foundations and Trends in Databases 1(2), 2007
  • More advanced topics
    Please note that databases are a very broad field. The papers listed here will be changed frequently to reflect this breadth.

    • Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Conference 1984: 47-57.
    • Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. “Open Language Learning for Information Extraction.” In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 523–534. EMNLP-CoNLL ’12
    • M. Zhou, H. Wang, and K. C.-C. Chang: Learning to Rank from Distant Supervision: Exploiting Noisy Redundancy for Relational Entity Search. In ICDE 2013, pages 829-840, 2013
    • Wanyun Cui, Yanghua Xiao, Haixun Wang, Yangqiu Song, Seung-won Hwang, Wei Wang: KBQA: Learning Question Answering over QA Corpora and Knowledge Bases. PVLDB 10(5): 565-576, 2017
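
A minimal sketch of one common way to lay out a database page, as referenced under Basic concepts: a slotted page whose header and slot directory grow from the front while variable-length records grow backward from the end. The page size, field widths, and record encoding are made-up choices, not any particular system's format.

```python
# A toy slotted page: a fixed-size byte buffer whose header and slot directory
# grow from the front while records grow backward from the end. The page size,
# field widths, and record encoding are made-up choices.
import struct

PAGE_SIZE = 4096
HEADER = struct.Struct("<HH")   # (number of slots, offset where free space ends)
SLOT = struct.Struct("<HH")     # (record offset, record length)

class SlottedPage:
    def __init__(self):
        self.data = bytearray(PAGE_SIZE)
        self._write_header(0, PAGE_SIZE)

    def _write_header(self, num_slots, free_end):
        HEADER.pack_into(self.data, 0, num_slots, free_end)

    def _read_header(self):
        return HEADER.unpack_from(self.data, 0)

    def insert(self, record: bytes) -> int:
        """Store a record and return its slot number (its page-local id)."""
        num_slots, free_end = self._read_header()
        dir_end = HEADER.size + (num_slots + 1) * SLOT.size
        if free_end - len(record) < dir_end:
            raise ValueError("page full")
        offset = free_end - len(record)
        self.data[offset:offset + len(record)] = record               # record bytes at the back
        SLOT.pack_into(self.data, HEADER.size + num_slots * SLOT.size,
                       offset, len(record))                           # slot entry at the front
        self._write_header(num_slots + 1, offset)
        return num_slots

    def read(self, slot_no: int) -> bytes:
        offset, length = SLOT.unpack_from(self.data, HEADER.size + slot_no * SLOT.size)
        return bytes(self.data[offset:offset + length])

if __name__ == "__main__":
    page = SlottedPage()
    sid = page.insert(b"alice|25")
    page.insert(b"bob|31")
    print(sid, page.read(sid))   # 0 b'alice|25'
```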
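
Also referenced under Basic concepts, a sketch of the attribute-closure computation that underlies normalization: X is a superkey of R exactly when X+ = R, and a functional dependency X -> Y violates BCNF when X is not a superkey. The relation and dependencies below are toy examples.

```python
# Attribute closure under a set of functional dependencies; the relation R and
# the FDs in __main__ are toy examples.
def closure(attrs, fds):
    """Return the closure X+ of attrs; fds is a list of (lhs, rhs) attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

if __name__ == "__main__":
    R = {"A", "B", "C", "D"}
    fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
    print(sorted(closure({"A"}, fds)))     # ['A', 'B', 'C'] -> A is not a superkey
    print(closure({"A", "D"}, fds) == R)   # True -> AD is a superkey
    # Hence A -> B violates BCNF for R: its left-hand side is not a superkey.
```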

Information Retrieval

  • Basic concepts
    • Vector-space retrieval model, TF-IDF weighting, relevance/pseudo feedback, query expansion, mean average precision (MAP), normalized discounted cumulative gain (NDCG), query-likelihood retrieval model, language model smoothing, PageRank, inverted index, probabilistic topic model (i.e., Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation), learning to rank. (A toy TF-IDF vector-space ranking sketch appears at the end of this section.)
  • Background
  • More advanced topics
    • Huazheng Wang, Sonwoo Kim, Eric McCord-Snook, Qingyun Wu, and Hongning Wang. 2019. Variance Reduction in Gradient Exploration for Online Learning to Rank. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’19). ACM, New York, NY, USA, 835-844. DOI: https://doi.org/10.1145/3331184.3331264
    • Tetsuya Sakai and Zhaohao Zeng. 2019. Which Diversity Evaluation Measures Are “Good”?. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’19). ACM, New York, NY, USA, 595-604. DOI: https://doi.org/10.1145/3331184.3331215
    • Rolf Jagerman, Harrie Oosterhuis, and Maarten de Rijke. 2019. To Model or to Intervene: A Comparison of Counterfactual and Online Learning to Rank from User Interactions. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’19). ACM, New York, NY, USA, 15-24. DOI: https://doi.org/10.1145/3331184.3331269
    • Krisztian Balog, Filip Radlinski, and Shushan Arakelyan. 2019. Transparent, Scrutable and Explainable User Models for Personalized Recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’19). ACM, New York, NY, USA, 265-274. DOI: https://doi.org/10.1145/3331184.3331211
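
A toy sketch of vector-space retrieval with TF-IDF weighting and cosine scoring, as referenced under Basic concepts. The corpus, the query, and the particular smoothed-IDF variant used here are illustrative assumptions only; in practice this scoring loop is what an inverted index accelerates.

```python
# Toy vector-space retrieval: TF-IDF weighting plus cosine scoring. The corpus,
# query, and the smoothed-IDF variant are illustrative choices only.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_index(docs):
    """Return per-document term frequencies and smoothed IDF weights."""
    tfs = [Counter(tokenize(d)) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())
    n = len(docs)
    idf = {t: math.log((n + 1) / (df_t + 1)) + 1 for t, df_t in df.items()}
    return tfs, idf

def rank(query, tfs, idf):
    """Score every document by cosine similarity of TF-IDF vectors to the query."""
    q_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values())) or 1.0
    scores = []
    for doc_id, tf in enumerate(tfs):
        d_vec = {t: c * idf[t] for t, c in tf.items()}
        d_norm = math.sqrt(sum(w * w for w in d_vec.values())) or 1.0
        dot = sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
        scores.append((doc_id, dot / (q_norm * d_norm)))
    return sorted(scores, key=lambda x: -x[1])

if __name__ == "__main__":
    docs = ["information retrieval ranks documents for a query",
            "frequent pattern mining finds itemsets",
            "language models smooth term probabilities for retrieval"]
    tfs, idf = build_index(docs)
    print(rank("retrieval ranking of documents", tfs, idf))
```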