top of page

CSC443H1

Database System Technology

Instructor: Niv Dayan

Lectures: Tuesday 2-4 pm, MS2170

Tutorials: Thursday, 3-4 pm, SS2110

TAs: Akshay Bapat & Pooyan Habibi

Office Hours: Tuesday & Thursday at 4 pm

Artificial Intelligence

Course Description

Database management systems are the bookkeepers of modern civilization. This course covers the algorithms and data structures that constitute the guts, bones, and arteries of these systems. Much of this course is about understanding the properties of modern hardware devices and designing data structures and access methods that optimize for them. We will also see that there is usually never one "correct" way for storing data: it is a question of managing trade-offs to optimize for what the user is trying to do. This course is important for anyone who intends to work with, tune, extend, or build data management systems. 

Piazza

We will be using Piazza as our main discussion board. You are responsible for reading all postings made by me or the TAs. Please use Piazza to ask questions about assignments and course lecture materials so that everyone can benefit.

Contact

Course announcements will arrive through Quercus. Aside to that, this course website is required reading. It contains essential material and will be updated throughout the semester. Please use Piazza whenever possible to ask questions about course material. For personal questions, email me to nivdayan@gmail.com. Please include "csc443" in the subject line along with your full name. If you do not hear back quickly, we are always available during office hours to help.

Final Project

The course involves a major group programming assignment. as described in the following google doc. The cheatsheet is here. The grading sheet is here

Final Exam

the final exam will be scheduled by Arts & Science and take place during the examination period. 

Accessibility

The University of Toronto is committed to accessibility. If you require accommodations or have any accessibility concerns, please visit Accessibility Services as soon as possible.

Prerequisites

Students should have taken the courses listed in this calendar entry or have equivalent knowledge in algorithms, data structures, relational algebra, SQL, and operating systems. Hands-on experience with C/C++, Java or Python is also required.

Reading Material

The course textbook is "Database Management Systems" by Raghu Ramakrishnan and Johannes Gehrke, 3rd Edition. It is available new and used at the UofT bookstore and in the library. We are interested in Parts III to V. While this textbook is classic, it is also dated (20 years old). Therefore, the reading material will also draw from various sources, including research papers and articles, about state-of-the-art advances in the field. 

Academic Integrity

Homeworks should be done individually and the work you submit must be your own. It is an academic offence to copy someone else's work. Whether you copy or let someone else copy, it is an offence. Academic offences are taken very seriously. At the same time, we want you to benefit from working with other students. It is appropriate to discuss course material related to homeworks, and we encourage you to do so.

Marking Scheme

There will be a midterm (20%), a final (40%) and a group project (40%) for this course. 

Academic Integrity

The project hand-ins must be the group’s own work. It is an academic violation to copy code or experimental results from other groups, whether you copy yourself or let someone else copy. That said, we encourage you to discuss course material widely with your fellow students within and across groups. 

Course schedule

  • Week 1: Introduction to the Course
    This week we'll introduce the course and what's to come. Please read Chapter 1 in the textbook. The lecture slides are here. The lecture video is here. Passcode: Y3%pm*76fD
  • Week 2: Storage Devices
    This week will take a deep look at the memory hierarchy. As we will see throughout the course, the properties of the memory hierarchy give rise to how modern databases are architected. For information on storage, read Chapter 8 Part 8.1, and Chapter 9 Parts 9.1 and 9.2. As you will notice, the textbook was written before SSDs became popular. I have written some background on SSDs here. Please study this carefully. You may also skim the following article. The slides on storage are here. The slides on RAID are here.
  • Week 3: Table and Buffer Management
    A table in a relational database contains data, but how do we store tables in storage devices such that we can scan them quickly and also insert, update and delete elements efficiently? This week we’ll learn how to efficiently store tables. We’ll also learn how to bring this data in and out of main memory efficiently for processing using buffer management techniques. In the textbook, read all of Chapter 9. The slides on table management are here. The slides on buffer management are here. The lecture video is here. Passcode: 7N0DzjO6@x
  • Week 4: Indexing with Hash Tables and B-Trees
    Some queries only need to access a small amount of data in a given table. To avoid having to scan the entire table, indexes are used to navigate to a given data entry quickly. This week, we'll learn about the numerous indexing options of database systems. There is a lot of textbook reading for this week, so start early. Please read all of Chapter 8 as an introduction to indexing. Chapter 10 will provide an in-depth look at B-trees (you can pay less attention to Part 10.2 as this technology is largely obsolete, though it's still worth reading about to understand how technology evolves). Please also read Chapter 11 Parts 11.1 and 11.2. You are encouraged to read the rest of Chapter 11 as well, but we won't cover it in the course. The lecture slides are here. The lecture video is here. Passcode: *$f0Za5Ukq
  • Week 5: Write-Optimized Indexing
    Traditional indexing techniques are optimized for reading but slow at ingesting new data. This week, we'll examine other types of database indexes used to optimize for data ingestion. Write-optimized indexes became popular over the past 15 years, so they do not appear in your textbook. Please read sections 1, 2, and 3 in the following: https://nivdayan.github.io/monkeykeyvaluestore.pdf The lecture slides are here. The tutorial slides are here. The lecture video is here. Passcode: d%Wd7dH$1k I'm not certain why but the tutorial recording failed. I am therefore adding the tutorial video from last year. It's the same questions and answers, but an older version of the slides. The video is here. Passcode: 710Q7B^qUQ
  • Week 6: Research Lecture on Scaling LSM-Trees
    In recent years, the log-structured merge-tree (LSM-tree) has become the mainstream core data structure used by key-value stores to ingest and persist data quickly. LSM-tree enables fast writes by buffering incoming data in memory and flushing it as independent sorted batches to storage whenever the buffer is full. To enable fast reads, LSM-tree sort-merges batches in storage to restrict the number that reads have to search, and it also uses in-memory Bloom filters to enable point reads to probabilistically skip accessing batches that do not contain a target entry. In this talk, we show that such LSM-tree based designs do not scale well: as the data size increases, both reads and writes take increasingly long to execute. We pinpoint the problem to suboptimal core design: the Bloom filters have been attached to LSM-tree as an afterthought and are therefore not optimized to minimize the overall probability of access to storage. Point reads are therefore unnecessarily expensive. To compensate, more merging than necessary has to take place thereby making writes unnecessarily expensive. In this talk, we will show how to properly co-design LSM-tree with filters to achieve substantially superior performance properties. Required reading: Sections 1-4 in https://nivdayan.github.io/monkeykeyvaluestore.pdf and Sections 1-4 in: https://nivdayan.github.io/dostoevsky.pdf If you find this material to be interesting, you are also welcome to read the following papers, though we won't cover them in the course. https://nivdayan.github.io/LSM-bush.pdf https://nivdayan.github.io/chucky.pdf https://nivdayan.github.io/spooky.pdf The slides are here. The video lecture is here. Passcode: c345XV5$?$ The tutorial slides are here. The tutorial video is here. Passcode: n5!fwAk6!i
  • Week 7: External Sorting
    Sorting is a basic operation in database systems to construct indices or return sorted output for queries. You've probably encountered sorting algorithms like merge-sort and quick-sort in the past. However, it turns out these algorithms are no longer the best choice when most of our data reside in storage. This week, we'll learn about multi-way sort-merge, a sorting algorithm for vast amounts of data. Read Chapter 13 in the textbook. The lecture slides are here. The lecture video is here. Passcode: gQ3&2%^Gwv The tutorial slides are here. The tutorial video is here. Passcode: aU#w1kuDab
  • Week 8: Midterm
    The midterm will span two hours. It will cover all the material covered so far in the course. Be prepared and ask for help if you need it! See the Quercus front page for the tutorial video where we go through the exam questions and answers.
  • Week 9: Circular Logs & Cuckoo Filters
    This week, we'll take a look at an emerging KV-stores paradigm that involves logging data in storage and indexing it from memory. We’ll study how to implement this index efficiently using Cuckoo hashing and Cuckoo filters. We’ll also examine the importance of separating hot and cold data to lower garbage-collection overheads, and we’ll introduce the count-min sketch to help us achieve this. We will also discuss how to recover the index when power fails. First, have a look at Bitcask, a very simple and easy to understand KV-store that indexes the full keys of data entries in storage Then, have a look at Cuckoo hashing, a hash table design that’s able to achieve very good hash table utilization and CPU efficiency at the same time. Then, read up on Cuckoo filters, which store fingerprints rather than full keys within a cuckoo hash table to further reduce memory requirements. The wikipedia articles on these topics are also recommended. Finally, have a look at FlashStore, a more memory-efficient KV-strore design than bitcask that stores fingerprints rather than full keys in memory and employs a Cuckoo filter variant to index them (since Cuckoo filters had still not been invented at the time). Two KV-stores out in the wild that employ this paradigm are FASTER and ForestDB. Feel free to browse them. The notion of hot/cold data identification and separation has been explored more in the context of SSDs and flash translation layers. The following paper, for instance, employs a similar construction to count-min. The slides are here. The lecture video is here. Passcode: WmY&p5fA*^
  • Week 10: Reading Week - Rest & Study
    A good time to finish up the project :)
  • Week 11: Relational Operators & Query Optimization
    This week, we'll take a deeper look at how queries in a relational database are executed and planned. Please read Chapters 12, 14 and 15 in the textbook. The slides are here. The lecture video is here. Passcode: 1Ln8W=h7W0 The tutorial slides are here. The tutorial video is here. Passcode: H!dO*jm5Gf
  • Week 12: Column-Stores
    While traditional databases store data in each table row by row, a newer breed of databases is storing the data column by column to optimize more for analytical queries and statistical calculations. This week, we'll learn about column-stores, the cornerstone of modern data warehousing. As Column-Stores rose to prominence over the past 15 years, they do not appear in your textbook. Therefore, please have a look at "The Design and Implementation of Modern Column-Oriented Database Systems" by Abadi et al: https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf Read Parts 1, 2, 3, 4.1, 4.2, 4.3, 4.4, 4.7, 4.8, and 5. The lecture slides are here. The lecture video is here. Passcode: P!Pf&5xa67 The tutorial slides are here. The tutorial lecture is here. Passcode: hJcUM&*426
  • Week 13: Transaction Management & Concurrency Control
    We'll introduce Transaction Management and Concurrency Control this week, the part of the database that ensures different operations can execute in parallel yet correctly, even if the system might fail at any moment. Please read Chapters 16 and 17 in the textbook. The lecture slides are here. The lecture video is here. Passcode: d6!*YPF%j8 The tutorial slides are here. The tutorial video is here. Passcode: f=9d!EjEq& Note that the tutorial video is from last semester as I forgot to record it this time. However, it's the same questions.
  • Week 14: Recovery
    This week, we'll look at how to recover the database into a consistent state if, say, power suddenly fails. Please read Chapter 18 in the textbook. The lecture slides are here. The tutorial slides are here. The lecture and tutorial video is here. Passcode: ^P1Qu#UT+x

We hope this course will get you excited about research. For students who excel in this course and seek research opportunites, check out my home page and get in touch. 

Contact
bottom of page