Spring 2017 Reading List for the DAIS Qualifying Exam

Data Mining and Data Warehousing

  • Basic concepts
    • Data warehousing: star schema, data cube (be able to list half a dozen typical data cube computation methods), multi-dimensional analysis (OLAP)
    • Data mining: frequent pattern mining (be able to list half a dozen typical methods), sequential pattern mining (be able to list four or five typical methods), correlation analysis, classification (be able to list half a dozen typical methods), clustering (be able to list half a dozen typical methods)
  • Background
    • J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann, 2011. Chapters 2-10.
  • More advanced topics

Database Management Systems

  • Basic concepts
    • Hardware: disk sector, track, block, seek, latency, how to lay out a database page
    • Data modeling: ER, OO, and Object-Relational approaches
    • Concurrency control and recovery: ACID, serializability, two-phase locking, two-phase commit, logging and recovery, the impact of data replication
    • Theory: normalization, dependencies
    • Queries: access methods (hashing, B-trees, multidimensional access methods), how to optimize a query, SQL
    • Benchmarks: TPC-C and TPC-H, OLAP, Data Cubes
  • Background
    • You can use any database textbook you like to study the most basic of the concepts listed above; for example, CS411 teaches these concepts. Note that you will be expected to demonstrate your understanding of the concepts by applying them, not simply by defining them. A good reference for the background content is the “Database Systems” textbook by Garcia-Molina, Widom, and Ullman. In addition, we would like you to study the following papers:
      • Generalized Search Trees for Database Systems. Hellerstein et al. VLDB 1995.
      • Implementing Data Cubes Efficiently. Harinarayan et al. SIGMOD 1996.
      • Architecture of a Database System. Hellerstein, Stonebraker, and Hamilton. Foundations and Trends in Databases, 2007.
  • More advanced topics
    Please note that databases are a very broad field. The papers listed here will change frequently to reflect this breadth.

      • E. F. Codd: A Relational Model of Data for Large Shared Data Banks. CACM 13(6): 377-387 (1970).
      • Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Conference 1984: 47-57.
      • Longbin Lai et al.: Scalable Subgraph Enumeration in MapReduce. PVLDB 8(10): 974-985 (2015).
      • Olston et al.: Pig Latin: a not-so-foreign language for data processing. SIGMOD Conference 2008: 1099-1110.

Information Retrieval

  • Basic concepts
    • Vector-space retrieval model, TF-IDF weighting, relevance feedback and pseudo-relevance feedback, query expansion, mean average precision (MAP), normalized discounted cumulative gain (NDCG), query-likelihood retrieval model, language model smoothing, PageRank, inverted index, probabilistic topic models (e.g., Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation).
  • Background
  • More advanced topics