Alex Flores
7,864 Points

Should I create many different tables and join or fewer tables?

This may have been answered in one of the videos, but I didn't catch it. It seems like a lot of work to create a bunch of tables that only have one or two data categories. In the example in the video, why couldn't we just create one table that includes the Movie, Year made, and the genre? Why create two different tables and then join them?
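
For concreteness, here's roughly what I mean (the table and column names are just my guesses, not the actual schema from the video):

-- Option A: one table holding everything
CREATE TABLE movies (
    id INTEGER PRIMARY KEY,
    title TEXT,
    year_released INTEGER,
    genre TEXT  -- genre spelled out in every row
);

-- Option B: two tables, joined through a genre ID
-- (A and B are alternatives, not meant to exist at the same time)
CREATE TABLE genres (
    id INTEGER PRIMARY KEY,
    name TEXT
);

CREATE TABLE movies (
    id INTEGER PRIMARY KEY,
    title TEXT,
    year_released INTEGER,
    genre_id INTEGER REFERENCES genres (id)
);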

Is it strictly for speed or does it provide more flexibility when combining data tables?

Thanks

1 Answer

Alex, this is a good question, because it's an important step in learning how to build relational database systems. The common wisdom is that you should "normalize" all tables, which means reducing duplication (redundancy) of data as much as possible. The reason to do this is to make your database easier to maintain and less likely to contain errors (it should also keep your overall database size smaller). Normalizing does help with flexibility, but it does not help with speed; in fact, it will make your SQL responses (infinitesimally) slower, because joining across tables is a more complicated request for SQL to interpret -- but unless you are handling an enterprise-level number of requests per second, it's a meaningless difference.

What's funny is that I've heard some super-huge database-driven sites (like Facebook) actually do what you're suggesting: they go against common wisdom and use the anti-pattern of keeping more data in fewer tables, in order to get that teeny-tiny speed gain, which, at Facebook's scale, adds up to real performance savings. Let me reiterate, though, that any database you build, even for a decently successful corporate client, will not come close to needing that bad practice, so you should normalize your tables and make your database-administration life a lot easier.

Here are some off-the-cuff examples of why to normalize:

Using the movies database example, let's say you have 150,000 movies in the table, with the genre written out in every record -- no separate "genres" table. Now you want your website to have a drop-down menu so users can select a genre to filter movies by. How do you create that drop-down? If the genres are written directly into the 150,000 records, you have to scan every record in the entire table and return the set of unique genre values. That's a pretty significant performance problem. And what if some instances of a genre are spelled wrong? On the other hand, if you have a table named "genres" (and just store the genre's ID in the "movies" table), you simply loop through the dozen or so records in that table with a trivial SQL command (SELECT genre_name FROM genres), and you're done. You also eliminate any possibility of stray spelling mistakes.
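
To make that concrete, the drop-down query against a separate genres table versus the single-table design would look roughly like this (illustrative names, not taken from the course):

-- Normalized: a dozen or so rows, trivial to read
SELECT genre_name FROM genres;

-- Denormalized: has to scan all 150,000 movie rows to find the unique values
SELECT DISTINCT genre FROM movies;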

Another example: let's say you have a database for a small social-sharing website. One of your tables keeps track of a post's "likes," and it is not normalized. Its fields are: the post itself, the username that liked it, and the date/time they liked it. Let's say this website has been running for a couple of years; it has 20,000 users and 6 million records in this table. The website allows users to edit their posts, so whenever a post is edited, every record for that post in this table -- not to mention in every other table where the post is referenced -- has to be updated to reflect the new content. The website also allows users to change their username, so whenever any user changes their username, the database has to update every record in this table -- and every other table where their username is referenced -- to match the new username.

Do either of these seem like a good idea? Or do they seem like major performance problems? Not only in speed of execution, but also in their impact on your data's integrity: if a query were malformed, either of these updates could easily compromise hundreds or thousands of records, damaging posts' like-counts and users' like-histories across your site. If, on the other hand, the "likes" table simply used the post's ID and the user's ID, and there was a separate table called "posts" that kept track of the post's ID, content, etc., and a third table called "users" that kept track of the user's ID, username, etc., then whenever a post gets updated or a user changes their username, only that single record in the "posts" or "users" table needs to be updated, and all the relationships between the data remain intact.
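
Sketching that out (again with made-up names), the normalized version of that "likes" setup might be:

CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username TEXT NOT NULL
);

CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users (id),
    content TEXT
);

CREATE TABLE likes (
    post_id INTEGER REFERENCES posts (id),
    user_id INTEGER REFERENCES users (id),
    liked_at TIMESTAMP
);

-- A username change now touches exactly one row,
-- and every like still points to the right user:
UPDATE users SET username = 'new_name' WHERE id = 1234;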

The more you keep your types of data separate, and only use IDs and "mapping tables" to draw the relationships between those types of data, the easier it will be to access and maintain all of it. When you have 10 records in a starter database, it's easy to think normalizing is overkill, but as you get more comfortable making database-architecture decisions, always imagine how your plan would hold up if the table had 10 million records in it. (A whole lot of real-world databases do.)
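
And since the video you're on is about joining tables and aliasing anyway: pulling related data back together across those IDs is just a join, something like this (illustrative names again):

-- List comedies with their genre spelled out, using table aliases
SELECT m.title, m.year_released, g.genre_name
FROM movies AS m
JOIN genres AS g ON g.id = m.genre_id
WHERE g.genre_name = 'Comedy';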

Hope that helps you a little. Let me know if you have any questions or thoughts.

Alex Flores
7,864 Points

Eric, I really appreciate the detailed response. That was one of the best responses I think I've ever received and it was extremely helpful. I'm putting together my first database schema and it's much more difficult than I thought it was going to be.

I believe it's a pretty basic database. I'm storing customer information, customer product information, and an email_queue table. I have about 15 tables right now, and I'm tempted to use one primary key (customer_id) in the main customer table and then reference that key in 12 of the 15 other tables. To me it makes sense to use a single primary key to associate many of the other tables, but I'm hesitant to do this, because I've been looking at examples of other database models and I noticed that most of them don't do this. I'm wondering why that is.
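
To give a clearer picture of what I mean, here's a simplified sketch (not my exact tables):

CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    email TEXT
);

-- Most of the other tables would just reference customer_id:
CREATE TABLE customer_products (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers (customer_id),
    product_name TEXT
);

CREATE TABLE email_queue (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers (customer_id),
    subject TEXT,
    send_at TIMESTAMP
);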