top of page

CSC2525

Research Topics in Database Management

Instructor: Niv Dayan

Lectures: Wednesday 14:00-16:00 (HS 618)

Office Hours: after each class

Abstract Bridge
image.png

Bigger, Faster, and Stronger Systems

This is a research seminar course on database systems and their inner data structures. We will read exciting papers to aquaint ourselves with the state-of-the-art constructions to understand where the field is and where it's going. You will also pursue a fun research project of your choice, in which you will impleemnt and evaluate some of the methods we learn about. 

Reading & Presenting Research Papers

A core part of this course is reading and digesting research papers from the database research community. Throughout the course, we will cover approximately 20 research papers. Each student will present two papers to the class along with a classmate. Each presentation will be followed by a Q&A session with the class. 

Research Project

Throughout the course, you will propose and pursue a research project that involves implementing and benchmarking some technqiue or method that we covered in class. Project proposals will be due approximately half-way through the course. These projects will be done in pairs. A project report will be due in the end of the course. 

Office Hours

There will be office hours immediately after each class. In addition, students are encouraged to book office hours with the instructor, especially before class presentations to ensure high quality. 

Student Participation

As this is a course with a small class, students are required to address classes in person. Students are strongly encouraged to particpate in the Q&A session after each paper presentation. 

Class Structure

Each class will begin with a 30-40 minute lecture by the instructor covering core techniques. The rest of each class will consist of two paper presentations by your peers. 

Oral Exam

At the end of the course, the instructor will hold an oral exam to test students about the course content and project. 

Grade Components 

The final grade for each student will derived based on their presentations, project, and class participation. The precise grade breakdown will be announced in the first few weeks of the course. 

Prerequisites

Students should have taken the courses listed in here, or have equivalent knowledge in algorithms, data structures, relational algebra, SQL, and operating systems. Hands-on experience with C/C++, Java or Python is also required. Ideally, students should also have taken a course similar to CSC2525H/CSC443H

Academic Integrity

The project hand-ins must be the group’s own work. It is an academic violation to copy code or experimental results from other groups, whether you copy yourself or let someone else copy. That said, we encourage you to discuss course material widely with your fellow students within and across groups. 

Piazza

We will be using Piazza as our main discussion board. You are responsible for reading all postings made by me or the TAs. Please use Piazza to ask questions about assignments and course lecture materials so that everyone can benefit.

Contact

Course announcements will arrive through Piazza. Aside to that, this course website is required reading. It contains essential material and will be updated throughout the semester. Please use Piazza whenever possible to ask questions about course material. For personal questions, email me to nivdayan@gmail.com. Please include "CSC2525" in the subject line along with your full name. If you do not hear back quickly, we are always available during office hours to help.

Accessibility

The University of Toronto is committed to accessibility. If you require accommodations or have any accessibility concerns, please visit Accessibility Services as soon as possible.

Course schedule

  • Week 1: Introduction to the Course
    This week we'll introduce the course and what's to come. Please read Chapter 1 in the textbook. The lecture slides are here. The lecture video is here. Passcode: Y3%pm*76fD
  • Week 2: Storage Devices
    This week will take a deep look at the memory hierarchy. As we will see throughout the course, the properties of the memory hierarchy give rise to how modern databases are architected. For information on storage, read Chapter 8 Part 8.1, and Chapter 9 Parts 9.1 and 9.2. As you will notice, the textbook was written before SSDs became popular. I have written some background on SSDs here. Please study this carefully. You may also skim the following article. The slides on storage are here. The slides on RAID are here.
  • Week 3: Table and Buffer Management
    A table in a relational database contains data, but how do we store tables in storage devices such that we can scan them quickly and also insert, update and delete elements efficiently? This week we’ll learn how to efficiently store tables. We’ll also learn how to bring this data in and out of main memory efficiently for processing using buffer management techniques. In the textbook, read all of Chapter 9. The slides on table management are here. The slides on buffer management are here. The lecture video is here. Passcode: 7N0DzjO6@x
  • Week 4: Indexing with Hash Tables and B-Trees
    Some queries only need to access a small amount of data in a given table. To avoid having to scan the entire table, indexes are used to navigate to a given data entry quickly. This week, we'll learn about the numerous indexing options of database systems. There is a lot of textbook reading for this week, so start early. Please read all of Chapter 8 as an introduction to indexing. Chapter 10 will provide an in-depth look at B-trees (you can pay less attention to Part 10.2 as this technology is largely obsolete, though it's still worth reading about to understand how technology evolves). Please also read Chapter 11 Parts 11.1 and 11.2. You are encouraged to read the rest of Chapter 11 as well, but we won't cover it in the course. The lecture slides are here. The lecture video is here. Passcode: *$f0Za5Ukq
  • Week 5: Write-Optimized Indexing
    Traditional indexing techniques are optimized for reading but slow at ingesting new data. This week, we'll examine other types of database indexes used to optimize for data ingestion. Write-optimized indexes became popular over the past 15 years, so they do not appear in your textbook. Please read sections 1, 2, and 3 in the following: https://nivdayan.github.io/monkeykeyvaluestore.pdf The lecture slides are here. The tutorial slides are here. The lecture video is here. Passcode: d%Wd7dH$1k I'm not certain why but the tutorial recording failed. I am therefore adding the tutorial video from last year. It's the same questions and answers, but an older version of the slides. The video is here. Passcode: 710Q7B^qUQ
  • Week 6: Research Lecture on Scaling LSM-Trees
    In recent years, the log-structured merge-tree (LSM-tree) has become the mainstream core data structure used by key-value stores to ingest and persist data quickly. LSM-tree enables fast writes by buffering incoming data in memory and flushing it as independent sorted batches to storage whenever the buffer is full. To enable fast reads, LSM-tree sort-merges batches in storage to restrict the number that reads have to search, and it also uses in-memory Bloom filters to enable point reads to probabilistically skip accessing batches that do not contain a target entry. In this talk, we show that such LSM-tree based designs do not scale well: as the data size increases, both reads and writes take increasingly long to execute. We pinpoint the problem to suboptimal core design: the Bloom filters have been attached to LSM-tree as an afterthought and are therefore not optimized to minimize the overall probability of access to storage. Point reads are therefore unnecessarily expensive. To compensate, more merging than necessary has to take place thereby making writes unnecessarily expensive. In this talk, we will show how to properly co-design LSM-tree with filters to achieve substantially superior performance properties. Required reading: Sections 1-4 in https://nivdayan.github.io/monkeykeyvaluestore.pdf and Sections 1-4 in: https://nivdayan.github.io/dostoevsky.pdf If you find this material to be interesting, you are also welcome to read the following papers, though we won't cover them in the course. https://nivdayan.github.io/LSM-bush.pdf https://nivdayan.github.io/chucky.pdf https://nivdayan.github.io/spooky.pdf The slides are here. The video lecture is here. Passcode: c345XV5$?$ The tutorial slides are here. The tutorial video is here. Passcode: n5!fwAk6!i
  • Week 7: External Sorting
    Sorting is a basic operation in database systems to construct indices or return sorted output for queries. You've probably encountered sorting algorithms like merge-sort and quick-sort in the past. However, it turns out these algorithms are no longer the best choice when most of our data reside in storage. This week, we'll learn about multi-way sort-merge, a sorting algorithm for vast amounts of data. Read Chapter 13 in the textbook. The lecture slides are here. The lecture video is here. Passcode: gQ3&2%^Gwv The tutorial slides are here. The tutorial video is here. Passcode: aU#w1kuDab
  • Week 8: Midterm
    The midterm will span two hours. It will cover all the material covered so far in the course. Be prepared and ask for help if you need it! See the Quercus front page for the tutorial video where we go through the exam questions and answers.
  • Week 9: Circular Logs & Cuckoo Filters
    This week, we'll take a look at an emerging KV-stores paradigm that involves logging data in storage and indexing it from memory. We’ll study how to implement this index efficiently using Cuckoo hashing and Cuckoo filters. We’ll also examine the importance of separating hot and cold data to lower garbage-collection overheads, and we’ll introduce the count-min sketch to help us achieve this. We will also discuss how to recover the index when power fails. First, have a look at Bitcask, a very simple and easy to understand KV-store that indexes the full keys of data entries in storage Then, have a look at Cuckoo hashing, a hash table design that’s able to achieve very good hash table utilization and CPU efficiency at the same time. Then, read up on Cuckoo filters, which store fingerprints rather than full keys within a cuckoo hash table to further reduce memory requirements. The wikipedia articles on these topics are also recommended. Finally, have a look at FlashStore, a more memory-efficient KV-strore design than bitcask that stores fingerprints rather than full keys in memory and employs a Cuckoo filter variant to index them (since Cuckoo filters had still not been invented at the time). Two KV-stores out in the wild that employ this paradigm are FASTER and ForestDB. Feel free to browse them. The notion of hot/cold data identification and separation has been explored more in the context of SSDs and flash translation layers. The following paper, for instance, employs a similar construction to count-min. The slides are here. The lecture video is here. Passcode: WmY&p5fA*^
  • Week 10: Reading Week - Rest & Study
    A good time to finish up the project :)
  • Week 11: Relational Operators & Query Optimization
    This week, we'll take a deeper look at how queries in a relational database are executed and planned. Please read Chapters 12, 14 and 15 in the textbook. The slides are here. The lecture video is here. Passcode: 1Ln8W=h7W0 The tutorial slides are here. The tutorial video is here. Passcode: H!dO*jm5Gf
  • Week 12: Column-Stores
    While traditional databases store data in each table row by row, a newer breed of databases is storing the data column by column to optimize more for analytical queries and statistical calculations. This week, we'll learn about column-stores, the cornerstone of modern data warehousing. As Column-Stores rose to prominence over the past 15 years, they do not appear in your textbook. Therefore, please have a look at "The Design and Implementation of Modern Column-Oriented Database Systems" by Abadi et al: https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf Read Parts 1, 2, 3, 4.1, 4.2, 4.3, 4.4, 4.7, 4.8, and 5. The lecture slides are here. The lecture video is here. Passcode: P!Pf&5xa67 The tutorial slides are here. The tutorial lecture is here. Passcode: hJcUM&*426
  • Week 13: Transaction Management & Concurrency Control
    We'll introduce Transaction Management and Concurrency Control this week, the part of the database that ensures different operations can execute in parallel yet correctly, even if the system might fail at any moment. Please read Chapters 16 and 17 in the textbook. The lecture slides are here. The lecture video is here. Passcode: d6!*YPF%j8 The tutorial slides are here. The tutorial video is here. Passcode: f=9d!EjEq& Note that the tutorial video is from last semester as I forgot to record it this time. However, it's the same questions.
  • Week 14: Recovery
    This week, we'll look at how to recover the database into a consistent state if, say, power suddenly fails. Please read Chapter 18 in the textbook. The lecture slides are here. The tutorial slides are here. The lecture and tutorial video is here. Passcode: ^P1Qu#UT+x

We hope this course will get you excited about research. For students who excel in this course and seek research opportunites, check out my home page and get in touch. 

Contact
bottom of page