Niv Dayan
Assistant Professor in Computer Science
University of Toronto
CSC443/CSC2234
Database System Technology
Instructor: Niv Dayan
Lectures: Wedenday 13:00-15:00 (MP 137)
Tutorials: Friday 13:00-14:00 (MP 137)
Office Hours: BA5230 after each class
Course Description
Database management systems are the bookkeepers of modern civilization. This course covers the algorithms and data structures that constitute the guts, bones, and arteries of these systems. Much of this course is about understanding the properties of modern hardware devices and designing data structures and access methods that optimize for them. We will also see that there is usually never one "correct" way for storing data: it is a question of managing trade-offs to optimize for what the user is trying to do. This course is important for anyone who intends to work with, tune, extend, or build data management systems.
Piazza
We will be using Piazza as our main discussion board. You are responsible for reading all postings made by me or the TAs. Please use Piazza to ask questions about assignments and course lecture materials so that everyone can benefit.
Contact
Course announcements will arrive through Quercus. Aside to that, this course website is required reading. It contains essential material and will be updated throughout the semester. Please use Piazza whenever possible to ask questions about course material. For personal questions, email me to nivdayan@gmail.com. Please include "csc443" in the subject line along with your full name. If you do not hear back quickly, we are always available during office hours to help.
Project
The course involves a major group programming assignment. It is described in the following document. A few experimental results examples are given here.
Midterm & Final
The midterm will take place during normal lecture hours. The final exam will be scheduled by Arts & Science and take place during the examination period.
Accessibility
The University of Toronto is committed to accessibility. If you require accommodations or have any accessibility concerns, please visit Accessibility Services as soon as possible.
Prerequisites
Students should have taken the courses listed in here, or have equivalent knowledge in algorithms, data structures, relational algebra, SQL, and operating systems. Hands-on experience with C/C++, Java or Python is also required.
Reading Material
The course textbook is "Database Management Systems" by Raghu Ramakrishnan and Johannes Gehrke, 3rd Edition. It is available new and used at the UofT bookstore and in the library. We are interested in Parts III to V. While this textbook is classic, it is also dated (20 years old). Therefore, the reading material will also draw from various sources, including research papers and articles, about state-of-the-art advances in the field.
Marking Scheme
There will be a midterm (20%), a final (40%) and a group project (40%) for this course.
Academic Integrity
The project hand-ins must be the group’s own work. It is an academic violation to copy code or experimental results from other groups, whether you copy yourself or let someone else copy. That said, we encourage you to discuss course material widely with your fellow students within and across groups.
Course schedule
-
Week 1: Introduction & StorageWe'll first introduce the course and what's to come. Please read Chapter 1 in the textbook. We'll then take a deep look at the memory hierarchy. As we will see throughout the course, the properties of the memory hierarchy give rise to how modern databases are architected. For information on storage, read Chapter 8 Part 8.1, and Chapter 9 Parts 9.1 and 9.2. As you will notice, the textbook was written before SSDs became popular. I have written some background on SSDs here. Please study this carefully. This will also be useful when we study circular logs. You may also skim the following article. The intro slides are here. The storage slides are here. The intro lecture video is here. Passcode: RL.u2$x*rG The slides on RAID are here. The RAID lecture video is here. Passcode: LM&t7QzG7J
-
Week 2: Table and Buffer ManagementA table in a relational database contains data, but how do we store tables in storage devices such that we can scan them quickly and also insert, update and delete elements efficiently? This week we’ll learn how to efficiently store tables. We’ll also learn how to bring this data in and out of main memory efficiently for processing using buffer management techniques. In the textbook, read all of Chapter 9. The lecture slides are here. The lecture video is here. Passcode: Y4iiB^P#88 The tutorial slides are here. The tutorial video is here. Passcode: d^dDULY58b
-
Week 3: Indexing with Hash Tables and B-TreesSome queries only need to access a small amount of data in a given table. To avoid having to scan the entire table, indexes are used to navigate to a given data entry quickly. This week, we'll learn about the numerous indexing options of database systems. There is a lot of textbook reading for this week, so start early. Please read all of Chapter 8 as an introduction to indexing. Chapter 10 will provide an in-depth look at B-trees (you can pay less attention to Part 10.2 as this technology is largely obsolete, though it's still worth reading about to understand how technology evolves). Please also read Chapter 11 Parts 11.1 and 11.2. You are encouraged to read the rest of Chapter 11 as well, but we won't cover it in the course.
-
Week 4: Write-Optimized IndexingTraditional indexing techniques are optimized for reading but slow at ingesting new data. This week, we'll examine other types of database indexes used to optimize for data ingestion. Write-optimized indexes became popular over the past 15 years, so they do not appear in your textbook. Please read sections 1, 2, and 3 in the following: https://nivdayan.github.io/monkeykeyvaluestore.pdf
-
Week 5: Research Lecture on Scaling LSM-TreesIn recent years, the log-structured merge-tree (LSM-tree) has become the mainstream core data structure used by key-value stores to ingest and persist data quickly. LSM-tree enables fast writes by buffering incoming data in memory and flushing it as independent sorted batches to storage whenever the buffer is full. To enable fast reads, LSM-tree sort-merges batches in storage to restrict the number that reads have to search, and it also uses in-memory Bloom filters to enable point reads to probabilistically skip accessing batches that do not contain a target entry. In this talk, we show that such LSM-tree based designs do not scale well: as the data size increases, both reads and writes take increasingly long to execute. We pinpoint the problem to suboptimal core design: the Bloom filters have been attached to LSM-tree as an afterthought and are therefore not optimized to minimize the overall probability of access to storage. Point reads are therefore unnecessarily expensive. To compensate, more merging than necessary has to take place thereby making writes unnecessarily expensive. In this talk, we will show how to properly co-design LSM-tree with filters to achieve substantially superior performance properties. Required reading: Sections 1-4 in https://nivdayan.github.io/monkeykeyvaluestore.pdf and Sections 1-4 in: https://nivdayan.github.io/dostoevsky.pdf If you find this material to be interesting, you are also welcome to read the following papers, though we won't cover them in the course. https://nivdayan.github.io/LSM-bush.pdf https://nivdayan.github.io/chucky.pdf https://nivdayan.github.io/spooky.pdf
-
Week 6: External SortingSorting is a basic operation in database systems to construct indices or return sorted output for queries. You've probably encountered sorting algorithms like merge-sort and quick-sort in the past. However, it turns out these algorithms are no longer the best choice when most of our data reside in storage. This week, we'll learn about multi-way sort-merge, a sorting algorithm for vast amounts of data. Read Chapter 13 in the textbook.
-
Week 7: MidtermThe midterm will span two hours. It will cover all the material covered so far in the course. Be prepared and ask for help if you need it! See the Quercus front page for the tutorial video where we go through the exam questions and answers.
-
Week 8: Circular Logs & Cuckoo FiltersThis week, we'll take a look at an emerging KV-stores paradigm that involves logging data in storage and indexing it from memory. We’ll study how to implement this index efficiently using Cuckoo hashing and Cuckoo filters. We’ll also examine the importance of separating hot and cold data to lower garbage-collection overheads, and we’ll introduce the count-min sketch to help us achieve this. We will also discuss how to recover the index when power fails. First, have a look at Bitcask, a very simple and easy to understand KV-store that indexes the full keys of data entries in storage Then, have a look at Cuckoo hashing, a hash table design that’s able to achieve very good hash table utilization and CPU efficiency at the same time. Then, read up on Cuckoo filters, which store fingerprints rather than full keys within a cuckoo hash table to further reduce memory requirements. The wikipedia articles on these topics are also recommended. Finally, have a look at FlashStore, a more memory-efficient KV-strore design than bitcask that stores fingerprints rather than full keys in memory and employs a Cuckoo filter variant to index them (since Cuckoo filters had still not been invented at the time). Two KV-stores out in the wild that employ this paradigm are FASTER and ForestDB. Feel free to browse them. The notion of hot/cold data identification and separation has been explored more in the context of SSDs and flash translation layers. The following paper, for instance, employs a similar construction to count-min.
-
Week 9: Reading Week - Rest & StudyA good time to finish up the project :)
-
Week 10: Relational Operators & Query OptimizationThis week, we'll take a deeper look at how queries in a relational database are executed and planned. Please read Chapters 12, 14 and 15 in the textbook.
-
Week 11: Column-StoresWhile traditional databases store data in each table row by row, a newer breed of databases is storing the data column by column to optimize more for analytical queries and statistical calculations. This week, we'll learn about column-stores, the cornerstone of modern data warehousing. As Column-Stores rose to prominence over the past 15 years, they do not appear in your textbook. Therefore, please have a look at "The Design and Implementation of Modern Column-Oriented Database Systems" by Abadi et al: https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf Read Parts 1, 2, 3, 4.1, 4.2, 4.3, 4.4, 4.7, 4.8, and 5.
-
Week 12: Transaction Management & Concurrency ControlWe'll introduce Transaction Management and Concurrency Control this week, the part of the database that ensures different operations can execute in parallel yet correctly, even if the system might fail at any moment. Please read Chapters 16 and 17 in the textbook.
-
Week 13: RecoveryThis week, we'll look at how to recover the database into a consistent state if, say, power suddenly fails. Please read Chapter 18 in the textbook.