Spring 2018 Qualifying Exam Reading List

Data Mining and Data Warehousing

  • Basic concepts
    • Data warehousing: star schema, data cube (be able to list half a dozen typical data cube computation methods), multi-dimensional analysis (OLAP)
    • Data mining: frequent pattern mining (be able to list half a dozen typical methods), sequential pattern mining (be able to list at least four or five typical methods), correlation analysis, classification (be able to list half a dozen typical methods), clustering (be able to list half a dozen typical methods)
  • Background
    • J. Han, M. Kamber, J. Pei. Data Mining: Concepts and Techniques, 3rd edition, Chapters 2-10. Morgan Kaufmann, 2011.
  • More advanced topics

Database Management Systems

  • Basic concepts
    • Hardware: disk sector, track, block, seek, latency, how to lay out a database page
    • Data modeling: ER, OO, and Object-Relational approaches
    • Concurrency control and recovery: ACID, serializability, two-phase locking, two-phase commit, logging and recovery, the impact of data replication
    • Theory: normalization, dependencies
    • Queries: access methods (hashing, B-trees, multidimensional access methods), how to optimize a query, SQL
    • Benchmarks: TPC-C and TPC-H, OLAP, Data Cubes
  • Background
    • You can use any database textbook you like to study the most basic of the concepts listed above; CS411, for example, teaches these concepts. Note that you will be expected to demonstrate your understanding of the concepts by applying them, not simply by defining them. A good reference for the background content is the textbook “Database Systems: The Complete Book” by Garcia-Molina, Widom, and Ullman. In addition, we would like you to study the following papers:
      • Generalized Search Trees for Database Systems. Hellerstein et al. VLDB 1995.
      • Implementing Data Cubes Efficiently. Harinarayan et al. SIGMOD 1996.
      • Architecture of a Database System. Hellerstein et al. Foundations and Trends in Databases, 2007.
  • More advanced topics
    Please note that databases are a very broad field, and the papers listed here will change frequently to reflect this breadth. While the following papers will guide many of the questions, a basic understanding of databases (CS411/511 or, equivalently, the Complete Book by Garcia-Molina, Widom, and Ullman) will be necessary both to make sense of these papers and to contextualize them with respect to prior work.

      • Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD 2009.
      • Stonebraker et al. C-Store: A Column-Oriented DBMS. VLDB 2005.
      • Agarwal et al. BlinkDB: Queries with Bounded Error and Bounded Response Times on Very Large Data. EuroSys 2013.
      • Malewicz et al. Pregel: A System for Large-Scale Graph Processing. SIGMOD 2010.

Information Retrieval

  • Basic concepts
    • Vector-space retrieval model, TF-IDF weighting, relevance feedback and pseudo-relevance feedback, query expansion, mean average precision (MAP), normalized discounted cumulative gain (NDCG), query-likelihood retrieval model, language model smoothing, PageRank, inverted index, probabilistic topic models (e.g., Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation).
  • Background
  • More advanced topics
    • Mostafa Dehghani, Hamed Zamani, Aliaksei Severyn, Jaap Kamps, and W. Bruce Croft. 2017. Neural Ranking Models with Weak Supervision. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17). ACM, New York, NY, USA, 65-74. DOI: https://doi.org/10.1145/3077136.3080832
    • Fan Zhang, Yiqun Liu, Xin Li, Min Zhang, Yinghui Xu, and Shaoping Ma. 2017. Evaluating Web Search with a Bejeweled Player Model. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17). ACM, New York, NY, USA, 425-434. DOI: https://doi.org/10.1145/3077136.3080841