Database System Technology
Instructor: Niv Dayan
Lectures: Tuesday 2-4 pm, MS2170
Tutorials: Thursday, 3-4 pm, SS2110
TAs: Akshay Bapat & Pooyan Habibi
Office Hours: Tuesday & Thursday at 4 pm
Database management systems are the bookkeepers of modern civilization. This course covers the algorithms and data structures that constitute the guts, bones, and arteries of these systems. Much of this course is about understanding the properties of modern hardware devices and designing data structures and access methods that optimize for them. We will also see that there is usually never one "correct" way for storing data: it is a question of managing trade-offs to optimize for what the user is trying to do. This course is important for anyone who intends to work with, tune, extend, or build data management systems.
We will be using Piazza as our main discussion board. You are responsible for reading all postings made by me or the TAs. Please use Piazza to ask questions about assignments and course lecture materials so that everyone can benefit.
Course announcements will arrive through Quercus. Aside to that, this course website is required reading. It contains essential material and will be updated throughout the semester. Please use Piazza whenever possible to ask questions about course material. For personal questions, email me to email@example.com. Please include "csc443" in the subject line along with your full name. If you do not hear back quickly, we are always available during office hours to help.
the final exam will be scheduled by Arts & Science and take place during the examination period.
The University of Toronto is committed to accessibility. If you require accommodations or have any accessibility concerns, please visit Accessibility Services as soon as possible.
Students should have taken the courses listed in this calendar entry or have equivalent knowledge in algorithms, data structures, relational algebra, SQL, and operating systems. Hands-on experience with C/C++, Java or Python is also required.
The course textbook is "Database Management Systems" by Raghu Ramakrishnan and Johannes Gehrke, 3rd Edition. It is available new and used at the UofT bookstore and in the library. We are interested in Parts III to V. While this textbook is classic, it is also dated (20 years old). Therefore, the reading material will also draw from various sources, including research papers and articles, about state-of-the-art advances in the field.
Homeworks should be done individually and the work you submit must be your own. It is an academic offence to copy someone else's work. Whether you copy or let someone else copy, it is an offence. Academic offences are taken very seriously. At the same time, we want you to benefit from working with other students. It is appropriate to discuss course material related to homeworks, and we encourage you to do so.
There will be a midterm (20%), a final (40%) and a group project (40%) for this course.
The project hand-ins must be the group’s own work. It is an academic violation to copy code or experimental results from other groups, whether you copy yourself or let someone else copy. That said, we encourage you to discuss course material widely with your fellow students within and across groups.
Week 1: Introduction to the CourseThis week we'll introduce the course and what's to come. Please read Chapter 1 in the textbook. The lecture slides are here. The lecture video is here. Passcode: Y3%pm*76fD
Week 2: Storage DevicesThis week will take a deep look at the memory hierarchy. As we will see throughout the course, the properties of the memory hierarchy give rise to how modern databases are architected. For information on storage, read Chapter 8 Part 8.1, and Chapter 9 Parts 9.1 and 9.2. As you will notice, the textbook was written before SSDs became popular. I have written some background on SSDs here. Please study this carefully. You may also skim the following article. The slides on storage are here. The slides on RAID are here. The tutorial slides are here. I was not able to record the lecture on Sep 12, and so I am adding links to the recordings from last year. Please ignore anything that's stated in these recordings about course logistics as this there is a risk of hearing information that is now out-of-date. These recordings also do not cover RAID 4 and 5, though these are a part of the course. Please refer to the slides from this year for the most up-to-date content for this iteration course. The link is here. The tutorial recording is here. Passcode: %3L5$1U.VV Note: as pointed out in a Piazza announcement, there was some ambiguity about RAID 1 in the original slides and tutorial. Since then, we fixed RAID 1 for this class to mean pure duplication of one disk across all disks in the array. The updated slides now reflect this, but the videos do not. This means the slides supersede the material in the videos.
Week 3: Table and Buffer ManagementA table in a relational database contains data, but how do we store tables in storage devices such that we can scan them quickly and also insert, update and delete elements efficiently? This week we’ll learn how to efficiently store tables. We’ll also learn how to bring this data in and out of main memory efficiently for processing using buffer management techniques. In the textbook, read all of Chapter 9. The slides on table management are here. The slides on buffer management are here. The tutorial slides are here. The lecture video is here. Passcode: 7N0DzjO6@x The tutorial video is here. Passcode: Cx0zbUTuM&. Note that question 3 had issues, discussed more in Piazza. The corrected version is in the slides.
Week 4: Indexing with Hash Tables and B-TreesSome queries only need to access a small amount of data in a given table. To avoid having to scan the entire table, indexes are used to navigate to a given data entry quickly. This week, we'll learn about the numerous indexing options of database systems. There is a lot of textbook reading for this week, so start early. Please read all of Chapter 8 as an introduction to indexing. Chapter 10 will provide an in-depth look at B-trees (you can pay less attention to Part 10.2 as this technology is largely obsolete, though it's still worth reading about to understand how technology evolves). Please also read Chapter 11 Parts 11.1 and 11.2. You are encouraged to read the rest of Chapter 11 as well, but we won't cover it in the course. The lecture slides are here. The lecture video is here. Passcode: *$f0Za5Ukq
Week 5: Write-Optimized IndexingTraditional indexing techniques are optimized for reading but slow at ingesting new data. This week, we'll examine other types of database indexes used to optimize for data ingestion. Write-optimized indexes became popular over the past 15 years, so they do not appear in your textbook. Please read sections 1, 2, and 3 in the following: https://nivdayan.github.io/monkeykeyvaluestore.pdf The slides are here.
Week 6: Research Lecture on Scaling LSM-TreesIn recent years, the log-structured merge-tree (LSM-tree) has become the mainstream core data structure used by key-value stores to ingest and persist data quickly. LSM-tree enables fast writes by buffering incoming data in memory and flushing it as independent sorted batches to storage whenever the buffer is full. To enable fast reads, LSM-tree sort-merges batches in storage to restrict the number that reads have to search, and it also uses in-memory Bloom filters to enable point reads to probabilistically skip accessing batches that do not contain a target entry. In this talk, we show that such LSM-tree based designs do not scale well: as the data size increases, both reads and writes take increasingly long to execute. We pinpoint the problem to suboptimal core design: the Bloom filters have been attached to LSM-tree as an afterthought and are therefore not optimized to minimize the overall probability of access to storage. Point reads are therefore unnecessarily expensive. To compensate, more merging than necessary has to take place thereby making writes unnecessarily expensive. In this talk, we will show how to properly co-design LSM-tree with filters to achieve substantially superior performance properties. Required reading: Sections 1-4 in https://nivdayan.github.io/monkeykeyvaluestore.pdf and Sections 1-4 in: https://nivdayan.github.io/dostoevsky.pdf Feel free to also skim the following: https://nivdayan.github.io/LSM-bush.pdf https://nivdayan.github.io/chucky.pdf https://nivdayan.github.io/spooky.pdf The slides are here.
Week 7: External SortingSorting is a basic operation in database systems to construct indices or return sorted output for queries. You've probably encountered sorting algorithms like merge-sort and quick-sort in the past. However, it turns out these algorithms are no longer the best choice when most of our data reside in storage. This week, we'll learn about multi-way sort-merge, a sorting algorithm for vast amounts of data. Read Chapter 13 in the textbook. The slides are here.
Week 8: MidtermThe midterm will span two hours. It will cover all the material from weeks 1 to 6 in the course. Be prepared and ask for help if you need it! See the Quercus front page for the tutorial video where we go through the exam questions and answers.
Week 9: Circular Logs & Cuckoo FiltersThis week, we'll take a look at an emerging KV-stores paradigm that involves logging data in storage and indexing it from memory. We’ll study how to implement this index efficiently using Cuckoo hashing and Cuckoo filters. We’ll also examine the importance of separating hot and cold data to lower garbage-collection overheads, and we’ll introduce the count-min sketch to help us achieve this. We will also discuss how to recover the index when power fails. First, have a look at Bitcask, a very simple and easy to understand KV-store that indexes the full keys of data entries in storage Then, have a look at Cuckoo hashing, a hash table design that’s able to achieve very good hash table utilization and CPU efficiency at the same time. Then, read up on Cuckoo filters, which store fingerprints rather than full keys within a cuckoo hash table to further reduce memory requirements. The wikipedia articles on these topics are also recommended. Finally, have a look at FlashStore, a more memory-efficient KV-strore design than bitcask that stores fingerprints rather than full keys in memory and employs a Cuckoo filter variant to index them (since Cuckoo filters had still not been invented at the time). Two KV-stores out in the wild that employ this paradigm are FASTER and ForestDB. Feel free to browse them. The notion of hot/cold data identification and separation has been explored more in the context of SSDs and flash translation layers. The following paper, for instance, employs a similar construction to count-min. The slides are here. Update since class: a note on tombstones is added, and the cost models at the end are updated.
Week 10: Reading Week - Rest & StudyA good time to finish up the project :)
Week 11: Relational Operators & Query OptimizationThis week, we'll take a deeper look at how queries in a relational database are executed and planned. Please read Chapters 12, 14 and 15 in the textbook. The slides are here.
Week 12: Column-StoresWhile traditional databases store data in each table row by row, a newer breed of databases is storing the data column by column to optimize more for analytical queries and statistical calculations. This week, we'll learn about column-stores, the cornerstone of modern data warehousing. As Column-Stores rose to prominence over the past 15 years, they do not appear in your textbook. Therefore, please have a look at "The Design and Implementation of Modern Column-Oriented Database Systems" by Abadi et al: https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf Read Parts 1, 2, 3, 4.1, 4.2, 4.3, 4.4, 4.7, 4.8, and 5. The lecture slides are here.
Week 13: Transaction Management & Concurrency ControlWe'll introduce Transaction Management and Concurrency Control this week, the part of the database that ensures different operations can execute in parallel yet correctly, even if the system might fail at any moment. Please read Chapters 16 and 17 in the textbook. The lecture slides are here.
Week 14: RecoveryThis week, we'll look at how to recover the database into a consistent state if, say, power suddenly fails. Please read Chapter 18 in the textbook. The lecture slides are here.