Niv Dayan
Assistant Professor in Computer Science
University of Toronto
CSC2525
Research Topics in Database Management
Instructor: Niv Dayan
Lectures: Wednesday 14:00-16:00 (HS 618)
Office Hours: after each class
Bigger, Faster, and Stronger Systems
This is a research seminar course on database systems and their inner data structures. We will read exciting papers to aquaint ourselves with the state-of-the-art constructions to understand where the field is and where it's going. You will also pursue a fun research project of your choice, in which you will impleemnt and evaluate some of the methods we learn about.
Reading & Presenting Research Papers
A core part of this course is reading and digesting research papers from the database research community. Throughout the course, we will cover approximately 20 research papers. Each student will present two papers to the class along with a classmate. Each presentation will be followed by a Q&A session with the class.
Research Project
Throughout the course, you will propose and pursue a research project that involves implementing and benchmarking some technqiue or method that we covered in class. Project proposals will be due approximately half-way through the course. These projects will be done in pairs. A project report will be due in the end of the course.
Office Hours
There will be office hours immediately after each class. In addition, students are encouraged to book office hours with the instructor, especially before class presentations to ensure high quality.
Student Participation
As this is a course with a small class, students are required to address classes in person. Students are strongly encouraged to particpate in the Q&A session after each paper presentation.
Class Structure
Each class will begin with a 30-40 minute lecture by the instructor covering core techniques. The rest of each class will consist of two paper presentations by your peers.
Oral Exam
At the end of the course, the instructor will hold an oral exam to test students about the course content and project.
Grade Components
The final grade for each student will derived based on their presentations, project, and class participation. The precise grade breakdown will be announced in the first few weeks of the course.
Prerequisites
Students should have taken the courses listed in here, or have equivalent knowledge in algorithms, data structures, relational algebra, SQL, and operating systems. Hands-on experience with C/C++, Java or Python is also required. Ideally, students should also have taken a course similar to CSC2525H/CSC443H.
Academic Integrity
The project hand-ins must be the group’s own work. It is an academic violation to copy code or experimental results from other groups, whether you copy yourself or let someone else copy. That said, we encourage you to discuss course material widely with your fellow students within and across groups.
Piazza
We will be using Piazza as our main discussion board. You are responsible for reading all postings made by me or the TAs. Please use Piazza to ask questions about assignments and course lecture materials so that everyone can benefit.
Contact
Course announcements will arrive through Piazza. Aside to that, this course website is required reading. It contains essential material and will be updated throughout the semester. Please use Piazza whenever possible to ask questions about course material. For personal questions, email me to nivdayan@gmail.com. Please include "CSC2525" in the subject line along with your full name. If you do not hear back quickly, we are always available during office hours to help.
Accessibility
The University of Toronto is committed to accessibility. If you require accommodations or have any accessibility concerns, please visit Accessibility Services as soon as possible.
Course schedule
-
Week 1: Introduction & StorageWe'll first introduce the course and what's to come. Please read Chapter 1 in the textbook. We'll then take a deep look at the memory hierarchy. As we will see throughout the course, the properties of the memory hierarchy give rise to how modern databases are architected. For information on storage, read Chapter 8 Part 8.1, and Chapter 9 Parts 9.1 and 9.2. As you will notice, the textbook was written before SSDs became popular. I have written some background on SSDs here. Please study this carefully. This will also be useful when we study circular logs. You may also skim the following article. The intro slides are here. The storage slides are here. The intro lecture video is here. Passcode: RL.u2$x*rG The slides on RAID are here. The RAID lecture video is here. Passcode: LM&t7QzG7J
-
Week 2: Table and Buffer ManagementA table in a relational database contains data, but how do we store tables in storage devices such that we can scan them quickly and also insert, update and delete elements efficiently? This week we’ll learn how to efficiently store tables. We’ll also learn how to bring this data in and out of main memory efficiently for processing using buffer management techniques. In the textbook, read all of Chapter 9. The lecture slides are here. The lecture video is here. Passcode: Y4iiB^P#88 The tutorial slides are here. The tutorial video is here. Passcode: d^dDULY58b
-
Week 3: Indexing with Hash Tables and B-TreesSome queries only need to access a small amount of data in a given table. To avoid having to scan the entire table, indexes are used to navigate to a given data entry quickly. This week, we'll learn about the numerous indexing options of database systems. There is a lot of textbook reading for this week, so start early. Please read all of Chapter 8 as an introduction to indexing. Chapter 10 will provide an in-depth look at B-trees (you can pay less attention to Part 10.2 as this technology is largely obsolete, though it's still worth reading about to understand how technology evolves). Please also read Chapter 11 Parts 11.1 and 11.2. You are encouraged to read the rest of Chapter 11 as well, but we won't cover it in the course. The slides are here. The lecture video is here. Passcode: 3i3!P14?XQ The tutorial slides are here. The tutorial video is here. Passcode: vrkp^2TSBH
-
Week 4: Write-Optimized IndexingTraditional indexing techniques are optimized for reading but slow at ingesting new data. This week, we'll examine other types of database indexes used to optimize for data ingestion. Write-optimized indexes became popular over the past 15 years, so they do not appear in your textbook. Please read sections 1, 2, and 3 in the following: https://nivdayan.github.io/monkeykeyvaluestore.pdf The lecture slides are here. The lecture video is here. Passcode: A@@7a*Ek8Y The tutorial slides are here. The tutorial video is here. Passcode: 6*02DrmQV%
-
Week 5: Bloom Filters & Research Lecture on LSM-TreesWe will begin this lecture by studying Bloom filters and their use in storage systems - especially with LSM-trees. We will then have a one hour research lecture covering advanced topics the co-optimization of LSM-trees and Bloom filters. Required reading: Section 1 in: https://www.eecs.harvard.edu/~michaelm/postscripts/tempim3.pdf (you may also skim section 2 for the mathematical analysis) Sections 1-4 in https://nivdayan.github.io/monkeykeyvaluestore.pdf Sections 1-4 in: https://nivdayan.github.io/dostoevsky.pdf If you find this material to be interesting, you are also welcome to read the following papers, though we won't cover them in the course. https://nivdayan.github.io/LSM-bush.pdf https://nivdayan.github.io/chucky.pdf https://nivdayan.github.io/spooky.pdf The lecture slides are here. The lecture video is here. Passcode: Q!#8Qx0$pv The tutorial slides are here. The tutorial video is here. Passcode: d0M+T7?dn?
-
Week 6: External SortingSorting is a basic operation in database systems to construct indices or return sorted output for queries. You've probably encountered sorting algorithms like merge-sort and quick-sort in the past. However, it turns out these algorithms are no longer the best choice when most of our data reside in storage. This week, we'll learn about multi-way sort-merge, a sorting algorithm for vast amounts of data. Read Chapter 13 in the textbook. The lecture slides are here. The lecture video is here. Passcode: A87dFt3M+A The tutorial slides are here. The tutorial video is here. Passcode: 7&cGLetZ.f
-
Week 7: MidtermThe midterm will span two hours. It will cover all the material covered so far in the course. Be prepared and ask for help if you need it! See the Quercus front page for the tutorial video where we go through the exam questions and answers.
-
Week 8: Circular Logs & Cuckoo FiltersThis week, we'll take a look at an emerging KV-stores paradigm that involves logging data in storage and indexing it from memory. We’ll study how to implement this index efficiently using Cuckoo hashing and Cuckoo filters. We’ll also examine the importance of separating hot and cold data to lower garbage-collection overheads, and we’ll introduce the count-min sketch to help us achieve this. We will also discuss how to recover the index when power fails. First, have a look at Bitcask, a very simple and easy to understand KV-store that indexes the full keys of data entries in storage Then, have a look at Cuckoo hashing, a hash table design that’s able to achieve very good hash table utilization and CPU efficiency at the same time. Then, read up on Cuckoo filters, which store fingerprints rather than full keys within a cuckoo hash table to further reduce memory requirements. The wikipedia articles on these topics are also recommended. Finally, have a look at FlashStore, a more memory-efficient KV-strore design than bitcask that stores fingerprints rather than full keys in memory and employs a Cuckoo filter variant to index them (since Cuckoo filters had still not been invented at the time). Two KV-stores out in the wild that employ this paradigm are FASTER and ForestDB. Feel free to browse them. The notion of hot/cold data identification and separation has been explored more in the context of SSDs and flash translation layers. The following paper, for instance, employs a similar construction to count-min.
-
Week 9: Reading Week - Rest & StudyA good time to finish up the project :)
-
Week 10: Relational Operators & Query OptimizationThis week, we'll take a deeper look at how queries in a relational database are executed and planned. Please read Chapters 12, 14 and 15 in the textbook.
-
Week 11: Column-StoresWhile traditional databases store data in each table row by row, a newer breed of databases is storing the data column by column to optimize more for analytical queries and statistical calculations. This week, we'll learn about column-stores, the cornerstone of modern data warehousing. As Column-Stores rose to prominence over the past 15 years, they do not appear in your textbook. Therefore, please have a look at "The Design and Implementation of Modern Column-Oriented Database Systems" by Abadi et al: https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf Read Parts 1, 2, 3, 4.1, 4.2, 4.3, 4.4, 4.7, 4.8, and 5.
-
Week 12: Transaction Management & Concurrency ControlWe'll introduce Transaction Management and Concurrency Control this week, the part of the database that ensures different operations can execute in parallel yet correctly, even if the system might fail at any moment. Please read Chapters 16 and 17 in the textbook.
-
Week 13: RecoveryThis week, we'll look at how to recover the database into a consistent state if, say, power suddenly fails. Please read Chapter 18 in the textbook.