Learn Database Normalization - 1NF, 2NF, 3NF, 4NF, 5NF

Decomplexify
21 Nov 202128:34

Summary

TLDR本视频深入探讨了数据库规范化的概念,解释了为什么要进行规范化以及如何进行规范化。规范化旨在防止数据冗余和不一致性,提高数据完整性。视频从第一范式到第五范式,通过实例讲解了每个范式的规则及其重要性,帮助观众理解规范化带来的益处和不规范化可能引发的问题。

Takeaways

  • 😀 数据库规范化是一个确保数据一致性和可维护性的过程。
  • 🔍 规范化可以防止数据插入、更新和删除异常。
  • 📚 第一范式(1NF)要求表的每一列都是不可分割的基本数据项,且每个字段都是原子性,不能进一步分解。
  • 🔑 第二范式(2NF)要求表中的非主键列必须完全依赖于主键,不能只依赖主键的一部分。
  • 🛠️ 第三范式(3NF)要求非主键列只能依赖于主键,不能依赖于其他非主键列,即禁止传递依赖。
  • 🎯 第四范式(4NF)要求表中只能存在对主键的多值依赖,不能存在非主键列之间的多值依赖。
  • 🌐 第五范式(5NF)要求表不能是其他表的连接结果,即表中的数据必须彼此独立,没有冗余关联。
  • 🚫 违反规范化原则可能导致数据不一致,比如一个客户有两个出生日期的情况。
  • 🔄 规范化可以减少数据冗余,提高数据的一致性和完整性。
  • 📈 通过规范化,数据库表更容易理解、维护和扩展。

Q & A

  • 什么是数据库的规范化(Normalization)?

    -规范化是数据库设计中的一种方法,它通过确保数据的逻辑一致性和减少数据冗余来优化数据库结构。

  • 为什么要进行数据库规范化?

    -进行数据库规范化可以防止数据的插入、更新和删除异常,提高数据的一致性和完整性,使数据库结构更易于理解和扩展。

  • 如果数据库没有进行规范化,可能会发生哪些问题?

    -不进行规范化可能会导致数据不一致、冗余信息、插入异常、更新异常和删除异常等问题。

  • 第一范式(1NF)要求数据库表满足哪些条件?

    -第一范式要求表的每一列都是不可分割的基本数据项,每一列的值都是原子的,不可以再分解,并且表中的每一行都有唯一的主键。

  • 第二范式(2NF)是如何定义的?

    -第二范式要求表中的非主键属性必须完全依赖于主键,即没有部分依赖于主键的情况。

  • 第三范式(3NF)的主要规则是什么?

    -第三范式要求表中的非主键属性不能依赖于其他非主键属性,即不存在传递依赖。

  • 什么是博伊斯-科德范式(Boyce-Codd Normal Form, BCNF)?

    -博伊斯-科德范式是第三范式的一种加强形式,要求表中每个属性都直接依赖于整个主键,而非其一部分。

  • 第四范式(4NF)的主要要求是什么?

    -第四范式要求表中只允许有关于主键的多值依赖,即表中不应该存在非平凡的且非候选键之间的多值依赖。

  • 第五范式(5NF)是如何定义的?

    -第五范式要求表不能被描述为是其他表的连接结果,即表中不应该存在连接依赖,除非它依赖于候选键。

  • 如何判断一个表是否满足第一范式?

    -要判断一个表是否满足第一范式,需要检查表中的每一列是否都是原子性的,并且表中是否有一个唯一的主键。

  • 在设计数据库时,如何避免插入异常?

    -为了避免插入异常,需要确保数据库表的设计满足第二范式,即非主键属性完全依赖于主键,这样可以避免因为主键不完整而导致无法插入记录的情况。

  • 什么是重复组(Repeating Groups)?为什么它违反了第一范式?

    -重复组是指在单个行中多次出现的数据集合。它违反了第一范式,因为这种设计允许在单个行中存储多个值,而这些值应该被分割到不同的行中,每行只包含一个值,以确保数据的原子性和一致性。

Outlines

00:00

🔍 数据库规范化简介

本段介绍了数据库规范化(Normalization)的概念、目的和方法。规范化是数据库设计中用来减少数据冗余和提高数据一致性的过程。通过实例讲解了规范化的好处,包括防止数据自身矛盾、提高数据库表的可理解性、便于扩展、防止插入、更新和删除异常。同时,解释了规范化的不同级别,即从第一范式到第五范式,并用桥梁安全评估的类比来形象说明不同范式之间的关系。

05:03

📚 第一范式(1NF)详解

详细解释了第一范式(1NF)的要求,包括:1. 不能使用行顺序来传达信息;2. 同一列内不允许混合数据类型;3. 表中必须有主键;4. 不允许出现重复组。通过举例说明违反1NF的情况,如行顺序传达信息、列中数据类型混合、缺少主键和重复组。同时,提供了符合1NF的设计方法,如为身高信息添加单独的列,确保每个列都有明确的数据类型,并为每个玩家设计单独的行来避免重复组。

10:04

🛠️ 第二范式(2NF)与删除、更新异常

本段讨论了第二范式(2NF)的概念,即表中的非键属性必须完全依赖于整个主键。通过玩家库存表的例子,说明了2NF防止删除异常和更新异常的重要性。如果非键属性只依赖于主键的一部分,就会产生问题,如玩家等级信息可能会因为库存变化而丢失。正确的做法是将玩家信息和库存信息分开存储,确保每个属性都依赖于整个主键。

15:09

🌐 第三范式(3NF)与传递依赖

第三范式(3NF)要求表中的非键属性只能依赖于主键,不能有传递依赖。通过玩家技能等级和玩家等级的例子,说明了3NF的重要性。如果玩家等级依赖于玩家技能等级,而技能等级又依赖于主键(玩家ID),则会产生数据不一致的问题。解决这个问题的方法是将玩家等级从玩家表中移除,创建一个新的表来存储技能等级和等级之间的关系。

20:13

🎨 第四范式(4NF)与多值依赖

第四范式(4NF)处理的是多值依赖问题,要求表中的多值依赖必须依赖于主键。通过DesignMyBirdhouse.com网站的例子,说明了如果颜色和风格依赖于模型,但表设计不当,可能会导致数据不一致。正确的做法是将模型、颜色和风格分开存储,确保任何多值依赖都依赖于主键。

25:16

🍨 第五范式(5NF)与连接依赖

第五范式(5NF)是最高级别的规范化,要求表不能是其他表连接的结果。通过冰激淋品牌和口味的例子,说明了如果表设计不当,可能会导致数据不一致。正确的做法是将品牌、口味和个人喜好分开存储,确保表的设计不会引入逻辑上的连接依赖。

Mindmap

Keywords

💡规范化(Normalization)

规范化是数据库设计中的一种技术,旨在减少数据冗余和提高数据完整性。在视频中,规范化通过多个‘正规形式’来实现,从第一正规形式到第五正规形式,每一步都旨在解决特定的数据一致性问题。例如,规范化可以防止一个客户有两个出生日期的情况出现,这在逻辑上是不可能的,属于数据完整性的失败。

💡第一正规形式(First Normal Form, 1NF)

第一正规形式要求关系型数据库表的每一列都是不可分割的基本数据项,即每一列的值都是原子性的,不可以再分解。在视频中,提到了违反1NF的几种情况,包括使用行顺序来传达信息、同一列中混合数据类型、没有主键以及存在重复组。例如,如果一个表中没有主键,那么就违反了1NF,因为无法保证每行的唯一性。

💡第二正规形式(Second Normal Form, 2NF)

第二正规形式是在满足1NF的基础上,要求表中的非主键列必须完全依赖于主键,不能只依赖主键的一部分。在视频中,通过玩家库存表的例子说明了2NF的应用,如果玩家的等级信息依赖于玩家标识,而不是整个主键(玩家标识和物品类型),那么就违反了2NF,可能会导致更新异常和删除异常。

💡第三正规形式(Third Normal Form, 3NF)

第三正规形式要求非主键列只能依赖于主键,而不能依赖于其他非主键列。在视频中,通过玩家技能等级和玩家评级的关系来说明3NF的重要性,如果玩家评级依赖于玩家技能等级(这是一个非主键属性),那么就违反了3NF,可能导致数据不一致性。

💡冗余(Redundancy)

冗余在数据库中指的是不必要的数据重复。在视频中,冗余信息可能导致数据不一致性,例如一个客户有两个出生日期。规范化的主要目的之一就是减少冗余,确保数据的一致性和准确性。

💡插入异常(Insertion Anomaly)

插入异常是指在向表中插入新数据时可能遇到的问题,由于表设计不当,可能无法插入某些数据。在视频中,如果一个新玩家没有任何库存,那么在设计不当的库存表中就无法记录这个玩家的等级,这就是插入异常的一个例子。

💡更新异常(Update Anomaly)

更新异常是指在更新表中的数据时可能遇到的问题,可能导致数据不一致。在视频中,如果一个玩家的等级更新了,但是其评级没有相应更新,就会产生不一致性,这就是更新异常的一个例子。

💡删除异常(Deletion Anomaly)

删除异常是指在删除表中的数据时可能遇到的问题,可能导致数据丢失。在视频中,如果一个玩家的所有库存都被删除了,那么该玩家的等级信息也会丢失,这就是删除异常的一个例子。

💡主键(Primary Key)

主键是表中的一个或多个列的组合,用于唯一标识表中的每一行。在视频中,主键的概念被用来解释1NF,因为没有主键的表无法确保每行的唯一性。例如,Beatle_Height表中的'Beatle'列被指定为主键,以确保每行只描述一个特定的Beatle。

💡外键(Foreign Key)

外键是一个表中的列或列组合,它在另一个表中作为主键,用于建立两个表之间的关系。虽然视频中没有直接提到外键,但是在规范化过程中,外键通常用于保持数据的引用完整性。

💡多值依赖(Multivalued Dependency)

多值依赖是指一个属性集合依赖于另一个属性集合的关系,其中左边的属性集合可以取多于一个值。在视频中,第四正规形式(4NF)要求表中的多值依赖必须依赖于整个主键,而不是主键的一部分。例如,鸟屋模型与颜色和风格的关联就是多值依赖。

Highlights

数据库规范化的目的和好处。

规范化的基本概念和实际操作。

不进行规范化可能导致的问题。

第一范式(1NF)的定义和要求。

违反第一范式的几种情况。

第二范式(2NF)的定义和要求。

如何识别和解决插入、更新和删除异常。

第三范式(3NF)的定义和要求。

第三范式与博伊斯-科德范式(BCNF)的关系。

第四范式(4NF)的定义和要求。

多值依赖和它的处理方法。

第五范式(5NF)的定义和要求。

如何避免表的逻辑连接导致的异常。

规范化的实用指南和设计原则。

规范化对数据库性能和一致性的影响。

规范化在实际数据库设计中的应用。

作者对视频内容的总结和观众反馈邀请。

Transcripts

play00:00

If you’ve had some exposure to relational  databases, you’ve probably come across the term  

play00:05

“normalization”. But what is normalization?  Why do we do it? How do we do it? And what  

play00:12

bad things can happen if we don’t do it? In this video, we’re going to explore database  

play00:18

normalization from a practical perspective.  We’ll keep the jargon to a minimum,  

play00:23

and we’ll use lots of examples as we go. By the  end of it, you’ll understand the so-called normal  

play00:29

forms from First Normal Form all the way up to  Fifth Normal Form – and you’ll have a clear sense  

play00:35

of what we gain by doing normalization,  and what we lose by failing to do it.  

play00:41

This is Decomplexify, bringing a welcome  dose of simplicity to complex topics.  

play00:48

Data: it’s everywhere. And some of it is wrong.  

play00:54

By and large, even a good database  design can’t protect against bad data.  

play00:59

But there are some cases of bad data that a good  database design can protect against. These are  

play01:05

cases where the data is telling us something  that logically cannot possibly be true:  

play01:12

One customer with two dates of birth is  logically impossible. It’s what we might  

play01:17

call a failure of data integrity. The data can’t  be trusted because it disagrees with itself.  

play01:24

When data disagrees with itself, that’s  more than just a problem of bad data.  

play01:29

It’s a problem of bad database design.  

play01:32

Specifically, it’s what happens when a  database design isn’t properly normalized.  

play01:38

So what does normalization mean?  When you normalize a database table,  

play01:43

you structure it in such a way that  can’t express redundant information.  

play01:48

So, for example, in a normalized table, you  wouldn’t be able to give Customer 1001 two dates  

play01:54

of birth even if you wanted to. Very broadly, the  table can only express one version of the truth.  

play02:03

Normalized database tables are not only  protected from contradictory data, they’re also:  

play02:08

easier to understand easier to enhance and extend  

play02:14

protected from insertion  anomalies, update anomalies,  

play02:18

and deletion anomalies (more on these later)  

play02:23

How do we determine whether a table isn’t  normalized enough – in other words, how do  

play02:27

we determine if there’s a danger that redundant  data could creep into the table? Well, it turns  

play02:34

out that there are sets of criteria we can use to  assess the level of danger. These sets of criteria  

play02:40

have names like “first normal form”, “second  normal form”, “third normal form”, and so on.  

play02:47

Think of these normal forms by analogy to safety  assessments. We might imagine an engineer doing a  

play02:54

very basic safety assessment on a bridge. Let’s  say the bridge passes the basic assessment,  

play03:00

which means it achieves “Safety Level  1: Safe for Pedestrian Traffic”.  

play03:05

That gives us some comfort, but suppose we want to  know if cars can safely drive across the bridge?  

play03:11

To answer that question, we need the engineer to  perform an even stricter assessment of the bridge.  

play03:16

Let’s imagine that the engineer goes ahead and  does this stricter assessment, and again the  

play03:21

bridge passes, achieving “Safety Level 2: Safe  for Cars”. If even this doesn’t satisfy us,  

play03:27

we might ask the engineer to assess the bridge  for “Safety Level 3: Safe for Trucks.” And so on.  

play03:35

The normal forms of database  theory work the same way.  

play03:39

If we discover that a table meets the  requirements of first normal form,  

play03:43

that’s a bare minimum safety guarantee. If  we further discover that the table meets  

play03:48

the requirements of second normal form, that’s  an even greater safety guarantee. And so on.  

play03:55

So let’s begin at the beginning,  with First Normal Form.  

play03:59

Suppose you and I are both  confronted by this question:  

play04:02

“Who were the members of the Beatles?” You might answer “John, Paul, George, and Ringo”.  

play04:08

I might answer “Paul, John, Ringo, and George”. Of course, my answer and your answer are  

play04:14

equivalent, despite having the  names in a different order.  

play04:18

When it comes to relational databases, the same  principle applies. Let’s record the names of the  

play04:23

Beatles in a table, and then let’s ask the  database to return those names back to us.  

play04:30

The results will get returned to us in an  arbitrary order. For example, they might  

play04:35

get returned like this. Or like this.  

play04:41

Or in any other order. There is no “right” order. Are there ever situations where there’s a right  

play04:48

order? Suppose we write down the members  of the Beatles from tallest to shortest,  

play04:53

like this. We title our list “Members Of  The Beatles From Tallest To Shortest”.  

play05:02

In this list, it’s not just the names that  convey meaning. The order of the names conveys  

play05:06

meaning too. Paul is the tallest, John is the  second-tallest, and so on. Lists like this are  

play05:12

totally comprehensible to us – but they’re not  normalized. Remember, there’s no such thing as row  

play05:19

order within a relational database table. So here  we have our first violation of First Normal Form.  

play05:25

When we use row order to convey information,  we’re violating First Normal Form.  

play05:31

The solution is very simple. Be explicit – if  we want to capture height information, we should  

play05:37

devote a separate column to it – like this. Or even better, like this.  

play05:45

So far, we’ve seen one way in  which a design can fail to achieve  

play05:49

First Normal Form. But there are others. A second way of violating First Normal Form  

play05:54

involves mixing data types. Suppose our  Beatle_Height dataset looked like this.  

play06:02

If you’re accustomed to spreadsheets, you’ll be  aware that they typically won’t stop you from  

play06:06

having more than one datatype within a single  column – for example, they won’t stop you from  

play06:11

storing both numbers and strings in a column. But  in a relational database, you’re not allowed to be  

play06:17

cagey or ambiguous about a column’s data type.  The values that go in the Height_In_Cm column  

play06:23

can’t be a mix of integers and strings. Once you  define Height_In_Cm as being an integer column,  

play06:29

then every value that goes into that column  will be an integer – no strings, no timestamps,  

play06:35

no data types of any kind other than  integers. So: mixing datatypes within a column  

play06:41

is a violation of First Normal Form, and in fact  the database platform won’t even let you do it.  

play06:48

A third way of violating First Normal Form is by  designing a table without a primary key. A primary  

play06:55

key is a column, or combination of columns,  that uniquely identifies a row in the table.  

play07:01

For example, in the table Beatle_Height,  our intention is that each row should tell  

play07:06

us about one particular Beatle, so we ought to  designate “Beatle” as the primary key of the  

play07:11

Beatle_Height table. The database platform will  need to know about our choice of primary key,  

play07:18

so we’ll want to get the primary key into  the database by doing something like this.  

play07:24

With the primary key in place, the  database platform will prevent multiple  

play07:27

rows for the same Beatle from ever  being inserted. That’s a good thing,  

play07:32

because multiple rows for the same Beatle would  be nonsensical, and perhaps contradictory.  

play07:38

Obviously, a Beatle can’t have  two different heights at once.  

play07:42

Every table we design should have a primary key.  If it doesn’t, it’s not in First Normal Form.  

play07:49

The last way of failing to achieve  First Normal Form involves the notion of  

play07:52

“repeating groups”. Suppose we’re designing  a database for an online multiplayer game.  

play07:58

At a given time, each player has a number  of items of different types, like arrows,  

play08:02

shields, and copper coins. We might  represent the situation like this.  

play08:09

A player’s inventory is what we call a “repeating  group”. Each inventory contains potentially many  

play08:15

different types of items: arrows, shields,  copper coins, and so on; and in fact there  

play08:21

may be hundreds of different types of items  that a player might have in their inventory.  

play08:26

We could design a database table that  represents the Inventory as a string of text:  

play08:32

But this is a terrible design because  there’s no easy way of querying it.  

play08:36

For example, if we want to know which players  currently have more than 10 copper coins,  

play08:41

then having the inventory data  lumped together in a text string  

play08:45

will make it very impractical to write  a query that gives us the answer.  

play08:50

We might be tempted to  represent the data like this.  

play08:54

This lets us record up to 4 items per inventory.  But given that a player can have an inventory  

play09:00

consisting of hundreds of different types of  items, how practical is it going to be to design  

play09:04

a table with hundreds of columns? Even if we were  to go ahead and create a super-wide table to hold  

play09:10

all possible inventory data, querying  it would still be extremely awkward.  

play09:16

The bottom line is that storing a repeating group  of data items on a single row violates First  

play09:21

Normal Form. So what sort of alternative  design would respect First Normal Form?  

play09:28

It would be this. To communicate the fact that  

play09:32

trev73 owns 3 shields, we have a row for Player  “trev73”, Item_Type “shields”, Item_Quantity 3.  

play09:39

To communicate the fact that  trev73 also owns 5 arrows,  

play09:43

we have a row for Player “trev73”, Item_Type  “arrows”, Item_Quantity 5. And so on.  

play09:49

And because each row in the table tells  us about one unique combination of Player  

play09:54

and Item_Type, the primary key is the  combination of Player and Item_Type.  

play10:00

So let’s review what we know  about First Normal Form.  

play10:04

1. using row order to convey  information is not permitted  

play10:08

2. mixing data types within the  same column is not permitted  

play10:13

3. having a table without a  primary key is not permitted  

play10:17

4. repeating groups are not permitted Next up: Second Normal Form.  

play10:25

Let’s look again at our Player Inventory table. This table is fully normalized. But suppose we  

play10:32

enhance the table slightly. Let’s imagine  that every player has a rating: Beginner,  

play10:37

Intermediate, or Advanced. We want to record the  current rating of each player – and to achieve  

play10:43

that, we simply include in our table  an extra column called Player_Rating.  

play10:49

Notice what’s happening here. Player  jdog21 has a Player_Rating of Intermediate,  

play10:55

but because jdog21 has two rows in the table,  both those rows have to be marked Intermediate.  

play11:01

Player trev73 has a Player_Rating of Advanced,  

play11:05

but because trev73 has four rows in the table, all  four of those rows have to be marked Advanced.  

play11:11

This is not a good design. Why not? Well,  suppose player gila19 loses all her copper coins,  

play11:19

leaving her with nothing in her inventory.  The single entry that she did have in the  

play11:24

Player_Inventory table is now gone. If we try to query the database to find  

play11:30

out what gila19’s Player Rating is, we’re out  of luck. We can no longer access gila19’s Player  

play11:36

Rating because the database no longer knows it.  This problem is known as a deletion anomaly.  

play11:43

And that’s not all. Suppose jdog21 improves  his rating from Intermediate to Advanced.  

play11:49

To capture his new Advanced rating  in the Player_Inventory table,  

play11:53

we run an update on his two records.  But let’s imagine the update goes wrong.  

play11:59

By accident, only one of jdog21’s records gets  updated, and the other record gets left alone.  

play12:06

Now the data looks like this. As far as the database is concerned,  

play12:11

jdog21 is somehow both Intermediate  and Advanced at the same time.  

play12:16

Our table design has left the door open  for this type of logical inconsistency.  

play12:20

This problem is called an update anomaly. Or suppose a new player called tina42 comes along.  

play12:27

She’s a Beginner and she doesn’t have anything  in her inventory yet. We want to record the fact  

play12:32

that she’s a Beginner, but because she  has nothing in her inventory, we can’t  

play12:37

insert a tina42 row into the Player_Inventory  table. So her rating goes unrecorded. This  

play12:42

problem is known as an insertion anomaly. The reason our design is vulnerable to these  

play12:49

problems is that it isn’t in Second Normal  Form. Why not? What is Second Normal Form?  

play12:56

Second Normal Form is about how a table’s non-key  columns relate to the primary key. In our table,  

play13:02

the non-key columns – or to use slightly  different terminology, non-key attributes – are  

play13:08

Item_Quantity and Player_Rating. They are columns  (also called attributes), that don’t belong  

play13:15

to the primary key. As we saw earlier, the primary  key is the combination of Player and Item Type.  

play13:22

Now we’re in a position to give a  definition of Second Normal Form.  

play13:25

The definition we’re going to give is  an informal one which leaves out some  

play13:29

nuances – but for most practical  purposes, that shouldn’t matter.  

play13:34

Informally, what Second Normal Form says  is that each non-key attribute in the table  

play13:38

must be dependent on the entire primary key. How does our table measure up to this definition?  

play13:45

Let’s examine our non-key attributes, which are  the attributes Item_Quantity and Player_Rating.  

play13:51

Does Item_Quantity depend on the entire primary  key? Yes, because an Item_Quantity is about a  

play13:57

specific Item_Type owned by specific  Player. We can express it like this.  

play14:03

The arrow signifies a dependency – or to give  it its proper name, a functional dependency.  

play14:08

This simply means that each value of the thing  on the left side of the arrow is associated with  

play14:13

exactly one value of the thing on the right side  of the arrow. Each combination of Player_ID and  

play14:18

Item_Type is associated with a specific value  of Item_Quantity – for example the combination  

play14:25

of Player_ID jdog21 / Item_Type “amulets”  is associated with an Item_Quantity of 2.  

play14:31

As far as Second Normal Form is  concerned, this dependency is fine,  

play14:34

because it’s a dependency on the entire primary  key. But what about the other dependency?  

play14:40

Does Player_Rating depend on the entire primary  key? No, it doesn’t. Player_Rating is a property  

play14:46

of the Player only. In other words, for any  given Player, there’s one Player_Rating.  

play14:52

This dependency on Player is the problem.  It’s a problem because Player isn’t the  

play14:58

primary key – Player is part of the  primary key, but it’s not the whole key.  

play15:02

That’s why the table isn’t in Second Normal Form,  and that’s why it’s vulnerable to problems.  

play15:08

At what point did our design go wrong, and  how can we fix it? The design went wrong  

play15:13

when we chose to add a Player_Rating column  to a table where it didn’t really belong.  

play15:18

The fact that a Player_Rating is a property  of a Player should have helped us to realise  

play15:22

that a Player is an important concept in its own  right – so surely Player deserves its own table:  

play15:29

Nothing could be simpler than that. A Player  table will contain one row per Player,  

play15:34

and in it we can include as columns the ID of  the player, the rating of the player, as well  

play15:39

as all sorts of other properties of the player  – maybe the player’s date of birth, for example,  

play15:43

maybe the player’s email address. Our other  table, Player_Inventory, can stay as it was.  

play15:50

For both tables, we can say that  there are no part-key dependencies.  

play15:54

In other words, it’s always the case that every  attribute depends on the whole primary key,  

play15:59

not just part of it. And so our  tables are in Second Normal Form.  

play16:06

Now let’s move on to Third Normal Form.  Suppose we decide to enhance the Player table.  

play16:12

We decide to add a new column  called Player_Skill_Level.  

play16:16

Imagine that in this particular multiplayer  game, there’s a nine-point scale for skill level.  

play16:22

At one extreme, a player with skill  level 1 is an absolute beginner;  

play16:25

at the opposite extreme, a player with skill  level 9 is as skilful as it’s possible to be.  

play16:30

And let’s say that we’ve defined exactly how  Player Skill Levels relate to Player Ratings.  

play16:36

“Beginner” means a skill level between  1 and 3. “Intermediate” means a skill  

play16:40

level between 4 and 6. And “Advanced”  means a skill level between 7 and 9.  

play16:46

But now that both the Player_Rating and the  Player_Skill_Level exist in the Player table,  

play16:51

a problem can arise. Let’s say that tomorrow,  player gila19’s skill level increases from 3  

play16:56

to 4. If that happens, we’ll update her row in the  Player table to reflect this new skill level.  

play17:04

By rights, we should also update her Player_Rating  to Intermediate – but suppose something goes  

play17:10

wrong, and we fail to update the Player_Rating.  Now we’ve got a data inconsistency. gila19’s  

play17:15

Player_Rating says she’s a Beginner, but her  Player_Skill_Level implies she’s Intermediate.  

play17:22

How did the design allow this happen? Second  Normal Form didn’t flag up any problems. There’s  

play17:27

no attribute here that depends only partially  on the primary key – as a matter of fact,  

play17:31

the primary key doesn’t have any parts; it’s  just a single attribute. And both Player_Rating  

play17:37

and Player_Skill_Level are dependent on it. But in what way are they dependent on it? Let’s  

play17:42

look more closely. Player_Skill_Level  is dependent on Player_ID.  

play17:48

Player_Rating is dependent on Player ID  too, but only indirectly – like this.  

play17:54

A dependency of this kind is called a transitive  dependency. Player Rating depends on Player Skill  

play18:00

Level which in turn depends on the primary  key: Player ID. The problem is located just  

play18:07

here – because what Third Normal Form forbids is  exactly this type of dependency: the dependency of  

play18:13

a non-key attribute on another non-key attribute. Because Player Rating depends on Player Skill  

play18:21

Level – which is a non-key attribute –  this table is not in Third Normal Form.  

play18:26

There’s a very simple way of repairing the  design to get it into Third Normal Form.  

play18:30

We remove Player Rating from the Player table;  so now the Player table looks like this.  

play18:35

And we introduce a new table  called Player_Skill_Levels.  

play18:40

The Player Skill Levels table tells us everything  we need to know about how to translate a player  

play18:44

skill level into a player rating. Third Normal Form is the culmination of everything  

play18:49

we’ve covered about database normalization so  far. It can be summarised in this way: Every  

play18:55

non-key attribute in a table should depend on  the key, the whole key, and nothing but the key.  

play19:01

If you commit this to memory, and keep it  constantly in mind while you’re designing a  

play19:05

database, then 99% of the time you will  end up with fully normalized tables.  

play19:12

It’s even possible to shorten this guideline  slightly by knocking out the phrase  

play19:15

“non-key” – giving us the revised guideline: every  attribute in a table should depend on the key, the  

play19:21

whole key, and nothing but the key. And this new  guideline represents a slightly stronger flavor of  

play19:27

Third Normal Form known as Boyce-Codd Normal Form.  In practice, the difference between Third Normal  

play19:33

Form and Boyce-Codd Normal Form is extremely  small, and the chances of you ever encountering  

play19:39

a real-life Third Normal Form table that doesn’t  meet Boyce-Codd Normal Form are almost zero.  

play19:45

Any such table would have to have what we call  multiple overlapping candidate keys – which gets  

play19:50

us into realms of obscurity and theoretical  rigor that are a little bit beyond the scope  

play19:55

of this video. So as a practical matter, just  follow the guideline that every attribute in a  

play20:00

table should depend on the key, the whole  key, and nothing but the key, and you can  

play20:05

be confident that the table will be in both  Third Normal Form and Boyce-Codd Normal Form.  

play20:13

In almost all cases, once you’ve normalized  a table this far, you’ve fully normalized  

play20:17

it. There are some instances where this  level of normalization isn’t enough.  

play20:22

These rare instances are dealt with  by Fourth and Fifth Normal Form.  

play20:27

So let’s move on to Fourth Normal Form. We’ll  look at an example of a situation where Third  

play20:32

Normal Form isn’t quite good enough and something  a bit stronger is needed. In our example, there’s  

play20:37

a website called DesignMyBirdhouse.com – the  world’s leading supplier of customized birdhouses.  

play20:44

On DesignMyBirdhouse.com, customers can  choose from different birdhouse models,  

play20:49

and, for the model they’ve selected,  they can choose both a custom color  

play20:53

and a custom style. Each model has its  own range of available colors and styles.  

play21:00

One way of capturing this information  is to put it all the possible  

play21:02

combinations in a single table, like this. This table is in Third Normal Form. The primary  

play21:10

key consists of all three columns: {Model,  Color, Style}. Everything depends on the key,  

play21:16

the whole key, and nothing but the key. And yet this table is still vulnerable  

play21:21

to problems. Let’s look at the rows  for the birdhouse model “Prairie”:  

play21:27

The available colors for the “Prairie”  birdhouse model are brown and beige.  

play21:32

Now suppose DesignMyBirdhouse.com decides  to introduce a third available color for  

play21:36

the “Prairie” model: green. This will mean we’ll  have to add two extra “Prairie” rows to the table:  

play21:44

one for green bungalow, and  one for green schoolhouse.  

play21:49

If by mistake we only add a row for green  bungalow, and fail to add the row for green  

play21:53

schoolhouse, then we have a data inconsistency. Available colors are supposed to be completely  

play22:00

independent of available styles. But our  table is saying that a customer can choose  

play22:05

green only for the bungalow style, not for  the schoolhouse style. That makes no sense.  

play22:11

The prairie birdhouse model is available in green,  so all its styles should be available in green.  

play22:16

Something about the way the table is designed has  allowed us to represent an impossible situation.  

play22:22

To see what’s gone wrong, let’s have a  closer look at the dependencies among Models,  

play22:26

Colors, and styles. Can we say that Color  has a functional dependency on Model?  

play22:33

Actually no, because a specific Model  isn’t associated with just one Color.  

play22:38

And yet it does feel as though Color has some  relationship to Model. How can we express it?  

play22:44

We can say that each Model has a specific set  of available Colors. This kind of dependency is  

play22:50

called a multivalued dependency, and we express  it with a double-headed arrow, like this:  

play22:58

And it’s equally true that each Model  has a specific set of available Styles.  

play23:04

What Fourth Normal Form says is that the only  kinds of multivalued dependency we’re allowed  

play23:08

to have in a table are multivalued dependencies  on the key. Model is not the key; so the table  

play23:15

Model_Colors_And_Styles_Available  is not in Fourth Normal Form.  

play23:20

As always, the fix is to split  things out into multiple tables.  

play23:26

Now, if DesignMyBirdhouse.com expands the range of  Prairie-Model colors to include green, we simply  

play23:32

add a row to the Model_Colors_Available table: And no anomalies are possible.  

play23:42

We’re now ready for Fifth Normal Form, the  last normal form covered in this video.  

play23:48

For our Fifth Normal Form example, we imagine  that there are three different brands of ice  

play23:52

cream available: Frosty’s, Alpine, and Ice  Queen. Each of the three brands of ice cream  

play23:58

offers a different range of flavors: Frosty’s offers vanilla, chocolate,  

play24:03

strawberry, and mint chocolate chip Alpine offers vanilla and rum raisin  

play24:08

Ice Queen offers vanilla,  strawberry, and mint chocolate chip  

play24:13

Now we ask our friend Jason what  types of ice cream he likes.  

play24:16

Jason says: I only like vanilla and chocolate.  And I only like the brands Frosty and Alpine.  

play24:24

We ask our other friend, Suzy, what types of  ice cream she likes. Suzy says: I only like  

play24:30

rum raisin, mint chocolate chip, and strawberry.  And I only like the brands Alpine and Ice Queen.  

play24:37

So, after a little bit of brainwork, we  deduce exactly which ice cream products  

play24:41

Jason and Suzy are willing to eat;  and we express this in a table:  

play24:48

But time passes, tastes change, and at some point  Suzy announces that she now likes Frosty’s brand  

play24:54

ice cream too. So we need to update our table. It won’t come as any surprise that we might get  

play25:03

this update wrong. We might successfully add a  row for Person Suzy – Brand Frosty’s – Flavor  

play25:09

Strawberry, but fail to add a row for Person Suzy  – Brand Frosty’s – Flavor Mint Chocolate Chip.  

play25:15

And this outcome wouldn’t just be wrong – it  would be logically inconsistent – because we’ve  

play25:20

already established that Suzy likes Frosty’s  brand, and likes Mint Chocolate Chip flavor,  

play25:26

and therefore there’s no way she can  dislike Frosty’s Mint Chocolate Chip.  

play25:32

In this example, we went wrong right at the  beginning. At the beginning, we were given  

play25:36

three pieces of information. First, we were told  which brands offered which flavors. Second, we  

play25:42

were told which people liked which brands. Third,  we were told which people liked which flavors.  

play25:49

From those three pieces of information, we  should have simply created three tables.  

play25:56

And that’s all we needed to do. All the  facts of the situation have been represented.  

play26:02

If we ever want to know what  specific products everyone likes,  

play26:05

we can simply ask the database platform,  expressing our question in the form of a  

play26:10

piece of SQL that logically deduces the  answer by joining the tables together.  

play26:17

To sum things up: if we want to ensure  that a table that’s in Fourth Normal  

play26:21

Form is also in Fifth Normal Form, we need  to ask ourselves whether the table can be  

play26:26

logically thought of as being the result  of joining some other tables together.  

play26:31

If it can be thought of that way,  then it’s not in Fifth Normal Form.  

play26:36

If it can’t be thought of that way,  then it is in Fifth Normal Form.  

play26:42

We’ve now covered all the normal forms from First  Normal Form to Fifth Normal Form. Let’s review,  

play26:49

keeping in mind that for a table to comply with  a particular normal form, it must comply with  

play26:54

all the lower normal forms as well. The rules for first normal form are:  

play27:00

1. using row order to convey  information is not permitted  

play27:04

2. mixing data types within the  same column is not permitted  

play27:08

3. having a table without a  primary key is not permitted  

play27:12

4. repeating groups are not permitted The rule for second normal form is:  

play27:19

Each non-key attribute in the table must  be dependent on the entire primary key.  

play27:25

The rule for third normal form is: Each non-key  attribute in a table must depend on the key,  

play27:30

the whole key, and nothing but the key. If we  prefer to drop the phrase “non-key”, we end up  

play27:37

with an even simpler and even stronger version of  third normal form called “Boyce-Codd Normal Form”:  

play27:43

Each attribute in a table must depend on the  key, the whole key, and nothing but the key.  

play27:49

The rule for fourth normal form is that  the only kinds of multivalued dependency  

play27:54

we’re allowed to have in a table are  multivalued dependencies on the key.  

play27:59

Finally, the rule for Fifth Normal Form  is: it must not be possible to describe  

play28:04

the table as being the logical result  of joining some other tables together.  

play28:11

I hope you’ve found this video helpful.  If you have any comments or questions  

play28:14

on what you’ve just seen, by all means  put them in the comments section below.  

play28:19

And if you have any suggestions for other  complex topics that you’d like to see explained  

play28:24

on Decomplexify, again let me know in the  comments. So long, and thanks for watching!

Rate This

5.0 / 5 (0 votes)

相关标签
数据库规范化第一范式数据完整性设计原则第三范式Boyce-Codd第四范式第五范式数据异常逻辑结构