Fall 2019 Reading List for the DAIS Qualifying Exam

Data Mining and Data Warehousing

  • Basic concepts
    • Data warehousing: star schema, data cube (be able to list half a dozen typical data cube computation methods), multi-dimensional analysis (OLAP)
    • Data mining: frequent pattern mining (be able to list half a dozen typical methods), sequential pattern mining (be able to list four or five typical methods), correlation analysis, classification (be able to list half a dozen typical methods), clustering (be able to list half a dozen typical methods)
  • Background
    • J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd edition, Chapters 2-10. Morgan Kaufmann, 2011.
  • More advanced topics
    • J. Yang, J. McAuley, and J. Leskovec. Community detection in networks with node attributes. In 2013 IEEE 13th International Conference on Data Mining (ICDM), pages 1151–1156. IEEE, 2013.
    • Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, Jiawei Han, “Automated Phrase Mining from Massive Text Corpora”, IEEE Transactions on Knowledge and Data Engineering, 30(10):1825-1837 (2018)
    • Alon Halevy, Natalya Noy, Sunita Sarawagi, Steven Euijong Whang, Xiao Yu, “Discovering Structure in the Universe of Attribute Names”, Proc. 25th Int. Conf. on World Wide Web (WWW) (2016)
    • Manish Gupta, Jing Gao, Charu C. Aggarwal, Jiawei Han: Outlier Detection for Temporal Data: A Survey. IEEE Trans. Knowl. Data Eng. 26(9): 2250-2267 (2014)
    • Leman Akoglu, Hanghang Tong, Danai Koutra: Graph based anomaly detection and description: a survey. Data Min. Knowl. Discov. 29(3): 626-688 (2015)

Database Management Systems

  • Basic concepts
    • Hardware: disk sector, track, block, seek, latency, how to lay out a database page
    • Data modeling: ER, OO, and Object-Relational approaches
    • Concurrency control and recovery: ACID, serializability, two-phase locking, two-phase commit, logging and recovery, the impact of data replication
    • Theory: normalization, dependencies
    • Queries: access methods (hashing, B-trees, multidimensional access methods), how to optimize a query, SQL
    • Benchmarks: TPC-C and TPC-H, OLAP, Data Cubes
  • Background
    • You can use any database textbook you like to study the most basic of the concepts listed above; for example, CS411 teaches these concepts. Note that you will be expected to demonstrate your understanding of the concepts by applying them, not simply by defining them. A good reference for the background content is the “Database Systems” textbook by Garcia-Molina, Widom, and Ullman. In addition, we would like you to study the following papers:
      • Generalized Search Trees for Database Systems. Hellerstein et al. VLDB 1995.
      • Implementing Data Cubes Efficiently. Harinarayan et al. SIGMOD 1996.
      • Architecture of a DBMS. Stonebraker et al. Foundations and Trends in Databases, 2007.
  • More advanced topics
    Please note that the database field is very broad. The papers listed here will change frequently to reflect this breadth.

      • E. F. Codd: A Relational Model of Data for Large Shared Data Banks. CACM 13(6): 377-387 (1970).
      • Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Conference 1984: 47-57.
      • Longbin Lai et al.: Scalable Subgraph Enumeration in MapReduce. PVLDB 8(10): 974-985 (2015).
      • Olston et al.: Pig Latin: a not-so-foreign language for data processing. SIGMOD Conference 2008: 1099-1110.

Information Retrieval

  • Basic concepts
    • Vector-space retrieval model, TF-IDF weighting, relevance and pseudo-relevance feedback, query expansion, mean average precision (MAP), normalized discounted cumulative gain (NDCG), query-likelihood retrieval model, language model smoothing, PageRank, inverted index, probabilistic topic models (e.g., Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation), learning to rank.
  • Background
  • More advanced topics
    • Huazheng Wang, Sonwoo Kim, Eric McCord-Snook, Qingyun Wu, and Hongning Wang. 2019. Variance Reduction in Gradient Exploration for Online Learning to Rank. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’19). ACM, New York, NY, USA, 835-844. URL: http://www.library.illinois.edu/proxy/go.php?url=https://doi.org/10.1145/3331184.3331264
    • Tetsuya Sakai and Zhaohao Zeng. 2019. Which Diversity Evaluation Measures Are “Good”? In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’19). ACM, New York, NY, USA, 595-604. URL: http://www.library.illinois.edu/proxy/go.php?url=https://doi.org/10.1145/3331184.3331215
    • Rolf Jagerman, Harrie Oosterhuis, and Maarten de Rijke. 2019. To Model or to Intervene: A Comparison of Counterfactual and Online Learning to Rank from User Interactions. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’19). ACM, New York, NY, USA, 15-24. DOI: https://doi-org.proxy2.library.illinois.edu/10.1145/3331184.3331269
    • Krisztian Balog, Filip Radlinski, and Shushan Arakelyan. 2019. Transparent, Scrutable and Explainable User Models for Personalized Recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’19). ACM, New York, NY, USA, 265-274. DOI: https://doi-org.proxy2.library.illinois.edu/10.1145/3331184.3331211