Why relations matter

Neo Ver
6 min read · Sep 29, 2018


Once upon a time… no, actually yesterday, I was working on a project where we faced a problem that got me thinking: why do relations matter, even in a database?

Okay, let me tell you the problem first. I had to store multiple sets of tuples, each set corresponding to its type, associated with a key in a relational database.

Overwhelmed? Don’t worry, the image below will help you understand.

Keys have a defined set of types, each of which contains a set of tuples

This data was going to be accessed via the keys to fetch the associated sets of tuples for different types. I had two choices:

a. Dump the data into the DB by converting those sets to delimited strings (CSV, maybe), and rebuild the sets while reading from the database by splitting on the same delimiter, or

b. Store the tuples in their respective “type” tables and assign a uniquely generated ID to each of them. Similarly, assign a generated ID to the main “key” and store it in the database, and then create “association” tables to map the relationships between the entities. Tuples are mapped to keys via foreign keys in the association tables.

Obviously, option (a) looks easy peasy, and in fact it makes sense at first too, looking at how the data is going to be accessed.

But obviously, as you’ve already guessed by now, I chose option (b) instead!

Reason? Well, read on…

1. We obeyed First Normal Form — 1NF is the least one could expect from a relational database. A database is in first normal form if it satisfies the following conditions:

• It contains only atomic values

• There are no repeating groups
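As an illustration, here is a minimal sketch of what option (b)’s normalized schema might look like, using SQLite via Python’s sqlite3. All table and column names here are hypothetical, not the project’s actual ones:

```python
import sqlite3

# In-memory database for illustration; schema names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE keys (
    key_id   INTEGER PRIMARY KEY,
    key_name TEXT NOT NULL UNIQUE
);
CREATE TABLE tuples (
    tuple_id INTEGER PRIMARY KEY,
    type     TEXT NOT NULL,       -- the tuple's "type"
    value    TEXT NOT NULL,
    UNIQUE (type, value)          -- each distinct tuple is stored only once
);
-- Association table mapping keys to tuples via foreign keys
CREATE TABLE key_tuples (
    key_id   INTEGER NOT NULL REFERENCES keys(key_id),
    tuple_id INTEGER NOT NULL REFERENCES tuples(tuple_id),
    PRIMARY KEY (key_id, tuple_id)
);
""")

conn.execute("INSERT INTO keys (key_name) VALUES ('key1')")
conn.execute("INSERT INTO tuples (type, value) VALUES ('colour', 'red')")
conn.execute("INSERT INTO key_tuples VALUES (1, 1)")

# Every cell holds a single atomic value; sets are rows, not strings.
rows = conn.execute("""
    SELECT k.key_name, t.type, t.value
    FROM keys k
    JOIN key_tuples kt ON kt.key_id = k.key_id
    JOIN tuples t      ON t.tuple_id = kt.tuple_id
""").fetchall()
print(rows)  # [('key1', 'colour', 'red')]
```

The `UNIQUE (type, value)` constraint is what enforces the “store each tuple once” property discussed in the next point.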

2. Optimal storage utilization — 1NF helped us remove redundancy in storage, thus saving space. Even if a tuple occurs thousands of times across different keys, we only store that tuple once in option (b), which is not the case with option (a), where we have to store the tuple as many times as it occurs.

3. Easier to query and analyze data — Consider the following questions to be answered via SQL queries:

i. Get the list of keys having a particular tuple under a given type — it’s a simple SQL join in option (b), while in option (a) we would have to scan every row and pattern-match the column value against the tuple we’re looking for.
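A sketch of that lookup in SQLite (hypothetical schema and data, just to show the shape of the join):

```python
import sqlite3

# Hypothetical option (b) schema with sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE keys (key_id INTEGER PRIMARY KEY, key_name TEXT);
CREATE TABLE tuples (tuple_id INTEGER PRIMARY KEY, type TEXT, value TEXT);
CREATE TABLE key_tuples (key_id INTEGER, tuple_id INTEGER);
INSERT INTO keys VALUES (1, 'alpha'), (2, 'beta');
INSERT INTO tuples VALUES (1, 'colour', 'red'), (2, 'colour', 'blue');
INSERT INTO key_tuples VALUES (1, 1), (1, 2), (2, 1);
""")

# Keys having the tuple 'red' under type 'colour': a plain two-hop join,
# no string splitting or LIKE-scanning involved.
rows = conn.execute("""
    SELECT k.key_name
    FROM keys k
    JOIN key_tuples kt ON kt.key_id = k.key_id
    JOIN tuples t      ON t.tuple_id = kt.tuple_id
    WHERE t.type = 'colour' AND t.value = 'red'
    ORDER BY k.key_name
""").fetchall()
print(rows)  # [('alpha',), ('beta',)]
```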

ii. Get the key with the maximum number of tuples for a particular type, along with that count — in option (b), functions like COUNT and MAX give you the answer readily, while to get the same result in option (a) we would have to split the values by the delimiter, count them, and repeat that for every row in the table. Done? Now consider doing this for a table containing 2 million rows, with each column holding around 25 thousand delimited tuples. How are you feeling?
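Under option (b)’s (hypothetical) schema, the question reduces to a GROUP BY with COUNT:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tuples (tuple_id INTEGER PRIMARY KEY, type TEXT, value TEXT);
CREATE TABLE key_tuples (key_id INTEGER, tuple_id INTEGER);
INSERT INTO tuples VALUES
    (10, 'colour', 'red'), (11, 'colour', 'blue'),
    (12, 'colour', 'green'), (20, 'size', 'XL');
INSERT INTO key_tuples VALUES (1, 10), (1, 11), (1, 12), (2, 10), (2, 11), (1, 20);
""")

# Key with the maximum number of 'colour' tuples: group, count, take the top.
row = conn.execute("""
    SELECT kt.key_id, COUNT(*) AS n
    FROM key_tuples kt
    JOIN tuples t ON t.tuple_id = kt.tuple_id
    WHERE t.type = 'colour'
    GROUP BY kt.key_id
    ORDER BY n DESC
    LIMIT 1
""").fetchone()
print(row)  # (1, 3) -- key 1 holds three 'colour' tuples
```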

4. Performant scans by creating indexes — to improve scan performance we can create indexes over the tuples, keys, and the uniquely generated IDs in option (b). An index over a column of delimited tuples, by contrast, is of no use for looking up individual tuples, so indexing can never improve our lookup performance in option (a).
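A small SQLite demonstration: with atomic values, an equality lookup can use an index, which SQLite’s query plan confirms (schema and index names are illustrative). A `WHERE csv_column LIKE '%red%'` lookup in option (a), by contrast, forces a full scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tuples (tuple_id INTEGER PRIMARY KEY, type TEXT, value TEXT);
INSERT INTO tuples VALUES (1, 'colour', 'red'), (2, 'colour', 'blue');
CREATE INDEX idx_tuples_type_value ON tuples (type, value);
""")

# The planner reports an index search for an equality lookup on (type, value).
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT tuple_id FROM tuples WHERE type = 'colour' AND value = 'red'"
).fetchall()
detail = plan[0][-1]
print(detail)  # mentions idx_tuples_type_value (index search, not a table scan)
```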

5. Maintenance made easy with partitions and sharding — if the data grows huge (which is usually the case), queries against it can no longer be completed within SLA requirements due to the amount of data that needs to be examined. We then have to partition the data to make it manageable and improve query fetch times. But even if we partition the data by keys in both options (a) and (b), the partitioned data in (a) will still be huge, because over time it grows horizontally (ever-wider delimited values) rather than vertically (more rows), which voids the benefit of partitioning.

6. What if my set grows enormously and breaches the specified limits — if we go with option (a), we can’t store a set longer than what fits in the column, right? Fortunately, we’ll never hit such limits in option (b), because the association table can grow vertically as much as you want; that is what a relational database has been designed for.

7. What if I want to access all tuples in some specified order — consider a case where we want to list all distinct tuples in alphabetically sorted order. You can imagine how long that takes in option (a), which has to identify all the unique values first (which will take ages to finish) and then sort them, while in option (b) they are already unique and sorting is trivial, of course.
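In option (b) this is a one-line DISTINCT plus ORDER BY (illustrative SQLite sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tuples (tuple_id INTEGER PRIMARY KEY, type TEXT, value TEXT);
INSERT INTO tuples VALUES
    (1, 'colour', 'red'), (2, 'colour', 'blue'), (3, 'colour', 'green');
""")

# Tuples are already stored once each, so sorting them is trivial.
values = [v for (v,) in conn.execute(
    "SELECT DISTINCT value FROM tuples WHERE type = 'colour' ORDER BY value"
)]
print(values)  # ['blue', 'green', 'red']
```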

8. What if we want to join the tuples with some other table in the future — as we know, we don’t know the future. If some use case requires mapping the tuples associated with keys to another data table later, we’ll be doomed with option (a), while option (b) will smile at us!

9. Better understanding of your data — with option (b) we can gain insight into our data via simple SQL, like the average tuple size, or the average number of tuples associated with a key for a given type. In fact, while inserting data we found that some tuples were being saved as “?” literals instead of the actual characters. That’s how we learned that our table’s collation didn’t support those Arabic characters; we changed it to an appropriate one and voila! We could never have noticed that while persisting delimited sets through option (a), and those tuples would have been sitting there, never accessed, with the original values lost.

10. No delimiter can be God — no matter which delimiter you choose, you can never sleep in peace knowing that at some point the delimiter could appear inside a tuple itself. Option (a) has that vulnerability: as soon as any of our tuples starts containing our chosen delimiter, things go haywire.
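A tiny Python sketch of that failure mode: once a value contains the delimiter, join and split no longer round-trip:

```python
# Option (a)'s hazard: the delimiter appears inside a legitimate value.
values = ["red", "blue, navy"]     # the second value genuinely contains ", "
packed = ", ".join(values)         # how option (a) would store the set

# Splitting back does not recover the original set:
unpacked = packed.split(", ")
print(unpacked)  # ['red', 'blue', 'navy'] -- three values, not the original two
assert unpacked != values
```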

Still not convinced? OK, look at what people say…

Sometimes you face the unfortunate situation that someone has stored list of comma-separated values in a table column. This is a very bad design, which is in direct conflict with the mindset that relational databases are designed from, to wit that a cell should hold a single value. These columns with comma-separated values should almost always be redesigned to be child tables. (The use of almost in the previous sentence is highly questionable. The exceptions are extremely rare.)

http://bit.ly/2IoKYQi

Accessing delimited data values embedded within a larger field generally results in [very] poor application performance. It almost always results in “Table Scans” because indexes can’t be meaningfully used.

The first antipattern that Bill Karwin discusses in his book “SQL Antipatterns” is Jaywalking (because an intersection is avoided). The antipattern specifically lists out the problems associated with storing comma separated values when you need to use them separately (for instance, querying the table becomes complicated)

http://bit.ly/2DDbnLm

Doing it like you have suggested by cramming the tags into one field may seem to make sense now, but it will prove to be difficult to maintain and difficult/time consuming to pull the values out efficiently as your application grows in size or the amount of data grows a lot larger.

http://bit.ly/2xYpNjg

In conclusion, I believe we should always try to go with relational schemas in a relational database unless we have a very compelling reason not to do so!

Let me know your thoughts…
