W6L1_Union Find Data Structure

IIT Madras - B.S. Degree Programme
28 Sept 2021 · 29:49

Summary

TL;DR: The video script delves into the intricacies of Kruskal's algorithm for finding minimum cost spanning trees, emphasizing the role of the Union-Find data structure. It explains the initial naive implementation of Union-Find, where union operations are costly. The script then introduces an optimized approach using two dictionaries to track components and their sizes, allowing unions to be performed more efficiently. The concept of merging smaller components into larger ones to reduce the complexity of union operations is highlighted. The summary also touches on the amortized analysis that proves the improved union operation's efficiency, ultimately enhancing Kruskal's algorithm's performance to near-linear time complexity.

Takeaways

  • 🌳 The script discusses two algorithms for finding minimum cost spanning trees: Prim's and Kruskal's algorithms.
  • 🔍 Prim's algorithm extends a tree by finding the shortest edge connected to the current tree, similar to Dijkstra's algorithm.
  • 🌉 Kruskal's algorithm builds a tree by starting with disconnected nodes and adding the shortest edge possible without forming a cycle.
  • 🔢 Kruskal's algorithm involves sorting all edges in ascending order and then processing them to build the minimum cost spanning tree.
  • 🔄 A key challenge in Kruskal's algorithm is efficiently tracking the components (sets of connected vertices) as edges are added.
  • 🔑 The 'find' operation identifies the component a vertex belongs to, while the 'union' operation merges two components into one.
  • 📈 Initially, union operations are inefficient, taking O(n) time, as they require scanning all elements to find and relabel those in the component being merged.
  • 📚 The script introduces a data structure to improve union find operations, using dictionaries to track components and their members.
  • 🔑 By keeping track of component sizes and merging smaller components into larger ones, the union operation can be optimized.
  • 📊 The amortized complexity of the improved union find is O(log m) per operation, leading to significant efficiency gains in Kruskal's algorithm.
  • 📈 The overall time complexity of Kruskal's algorithm, with the optimized union find, becomes O((m + n) log n), i.e., O(N log N) in the size N of the graph, a substantial improvement over the naive O(n^2).

Q & A

  • What are the two algorithms mentioned for finding minimum cost spanning trees?

    -The two algorithms mentioned for finding minimum cost spanning trees are Prim's algorithm and Kruskal's algorithm.

  • How does Prim's algorithm extend a tree in the context of minimum cost spanning trees?

    -Prim's algorithm extends a tree by finding the shortest edge connected to the current tree.

  • What is the initial state of nodes in Kruskal's algorithm?

    -In Kruskal's algorithm, the initial state of nodes is disconnected.

  • How does Kruskal's algorithm build a minimum cost spanning tree?

    -Kruskal's algorithm builds a minimum cost spanning tree by starting with disconnected nodes and connecting them by adding the shortest edge possible, processed in ascending order.

  • What is the main difficulty in implementing Kruskal's algorithm efficiently?

    -The main difficulty in implementing Kruskal's algorithm efficiently is keeping track of the components and determining which nodes belong to the same component or different components as edges are added.

  • What is the purpose of the 'find' operation in the context of Kruskal's algorithm?

    -The 'find' operation is used to determine which partition a given vertex belongs to, which is crucial for checking if adding an edge would form a cycle.

  • What does the 'union' operation in Kruskal's algorithm achieve?

    -The 'union' operation in Kruskal's algorithm merges two components into a single partition when an edge connecting them is added to the spanning tree.

  • Why is the 'union' operation considered time-consuming in the initial implementation of Kruskal's algorithm?

    -The 'union' operation is considered time-consuming because it requires finding all other vertices in the same component as a given vertex and updating their partition, which involves a linear scan proportional to the size of the set.

  • How can the efficiency of the 'union' operation be improved in Kruskal's algorithm?

    -The efficiency of the 'union' operation can be improved by keeping track of the size of each component and always merging the smaller component into the larger one, which reduces the number of elements that need to be relabeled in each operation (a minimal sketch of this rule appears after this Q&A list).

  • What is the amortized complexity of the 'union' operation in the optimized version of Kruskal's algorithm?

    -The amortized complexity of the 'union' operation in the optimized version of Kruskal's algorithm is O(log m), where m is the number of union operations.

  • How does the optimized union-find data structure contribute to the overall efficiency of Kruskal's algorithm?

    -The optimized union-find data structure contributes to the overall efficiency of Kruskal's algorithm by reducing the time complexity of the 'union' operation, which in turn brings the overall time complexity of the algorithm down from O(n^2) to O((m + n) log n).
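
A minimal sketch of this merge-smaller-into-larger rule in Python (illustrative names, not necessarily the code used in the lecture), assuming size is a dictionary mapping component names to element counts:

    def pick_merge_order(size, c1, c2):
        # Always relabel the smaller component into the larger one;
        # returns (component to relabel, component that survives).
        if size[c1] <= size[c2]:
            return c1, c2
        return c2, c1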

Outlines

00:00

🌳 Introduction to Kruskal's Algorithm and Union-Find

This paragraph introduces Kruskal's algorithm for finding a minimum cost spanning tree, contrasting it with Prim's algorithm. It explains that Kruskal's builds the tree from the ground up by adding the shortest available edges without forming cycles. The challenge lies in efficiently tracking the components (sets of connected vertices) as edges are added. The paragraph outlines the need for 'find' and 'union' operations within a disjoint set data structure to manage these components, highlighting the inefficiency of naive union-find implementations due to the need to update all elements in a component.
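
To make this outline concrete, here is a self-contained Python sketch of Kruskal's algorithm with a deliberately naive inline union-find, the kind the rest of the lecture improves (vertex numbering and variable names are illustrative, not the lecture's own code):

    def kruskal(n, edges):
        # edges: list of (weight, u, v) triples; vertices are 0 .. n-1.
        component = {v: v for v in range(n)}    # naive union-find, refined later
        tree = []
        for w, u, v in sorted(edges):           # process edges in ascending order
            if component[u] != component[v]:    # 'find': different components, no cycle
                tree.append((u, v, w))
                c_old, c_new = component[u], component[v]
                for k in range(n):              # 'union': O(n) relabeling scan
                    if component[k] == c_old:
                        component[k] = c_new
        return tree

For example, kruskal(4, [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]) keeps the edges of weight 1, 2, and 4 and skips the weight-3 edge, which would close a cycle.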

05:05

🔍 Inefficiency of Naive Union-Find and the Need for Improvement

The second paragraph delves into the inefficiency of the naive union-find method used in Kruskal's algorithm. It describes how the union operation, which combines components, requires a full scan of all elements to update their component assignments. This results in an O(n) time complexity for each union, which becomes a bottleneck when performing multiple unions. The paragraph emphasizes the need for a more efficient data structure to handle the dynamic nature of component updates during the algorithm's execution.
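
Restating this cost bound in symbols (LaTeX, no claims beyond the paragraph above):

    \text{one union} = O(n), \qquad m \text{ unions} = O(m \cdot n)

and since Kruskal's algorithm performs n - 1 unions, the naive union-find alone contributes O(n^2) to the running time.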

10:09

🔄 Enhancing Union-Find with Component Tracking

The third paragraph introduces an enhanced union-find method to improve the efficiency of Kruskal's algorithm. It discusses the use of a 'members' dictionary to track which vertices belong to which components and a 'size' dictionary to keep count of the vertices in each component. This allows the union operation to be more efficient by only iterating over the vertices in the smaller component, rather than scanning all vertices. The paragraph outlines the updated union and find operations and explains how this approach reduces the time complexity of union operations.
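
A Python sketch of this enhanced structure, assuming elements are numbered 0 to n-1; the names component, members, and size follow the lecture's description, though the exact code shown in the video may differ:

    def make_union_find(n):
        component = {i: i for i in range(n)}   # element -> component name
        members = {i: [i] for i in range(n)}   # component name -> its elements
        size = {i: 1 for i in range(n)}        # component name -> element count
        return component, members, size

    def find(component, i):
        return component[i]                    # one dictionary lookup: O(1)

    def union(component, members, size, i, j):
        c_old, c_new = component[i], component[j]
        if c_old == c_new:
            return
        if size[c_old] > size[c_new]:          # always relabel the smaller
            c_old, c_new = c_new, c_old        # component into the larger one
        for k in members[c_old]:               # touch only the smaller side
            component[k] = c_new
            members[c_new].append(k)
            size[c_new] += 1
        members[c_old], size[c_old] = [], 0    # c_old is now empty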

15:13

📈 Amortized Analysis of the Improved Union-Find

This paragraph provides an amortized analysis of the improved union-find method. It explains that by always merging the smaller component into the larger one, the component containing a given vertex at least doubles in size each time that vertex is relabeled. Since the largest component can contain at most 2m elements (where m is the number of union operations), a vertex can be relabeled at most log m times. This insight leads to the conclusion that the amortized time complexity of a union operation is O(log m), significantly improving the overall performance of Kruskal's algorithm.
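
The counting argument, restated in LaTeX (nothing beyond the paragraph above):

    \text{relabelings of one element} \le \log_2(2m), \qquad
    \text{total relabeling work} \le 2m \cdot \log_2(2m) = O(m \log m)

so the amortized cost per union operation is O(log m).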

20:18

🔗 Impact of Improved Union-Find on Kruskal's Algorithm

The fifth paragraph discusses how the improved union-find method impacts the overall time complexity of Kruskal's algorithm. With the new method, the sorting of edges takes O(m log m) time, and the union operations, which are performed n-1 times (where n is the number of vertices), take O(n log n) time in total. The paragraph concludes that the overall time complexity of Kruskal's algorithm is now O(N log N), where N represents the size of the graph, a substantial improvement over the naive implementation.
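
The arithmetic, restated in LaTeX: sorting costs O(m log m), the n - 1 unions cost O(n log n) in total, and m \le n^2 implies \log m \le 2 \log n, so

    O(m \log m) + O(n \log n) = O((m + n) \log n) = O(N \log N)

where N = m + n is the size of the graph.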

25:21

🌐 Advanced Union-Find Techniques and Their Impact

The final paragraph explores advanced techniques for the union-find data structure, such as using trees to represent components and applying path compression to flatten these trees. This approach reduces the time complexity of both union and find operations to almost constant time, further optimizing Kruskal's algorithm. The paragraph encourages further exploration of these techniques, which bring each union-find operation down to nearly unit cost.
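
For reference, a compact Python sketch of the tree-based representation with path compression; this is the standard textbook formulation rather than code shown in the video, combined here with the union-by-size rule discussed earlier:

    def make_union_find(n):
        parent = list(range(n))   # parent[i] == i means i is a root (component name)
        size = [1] * n
        return parent, size

    def find(parent, i):
        root = i
        while parent[root] != root:            # walk up to the root
            root = parent[root]
        while parent[i] != root:               # path compression: point every node
            parent[i], i = root, parent[i]     # on the path directly at the root
        return root

    def union(parent, size, i, j):
        ri, rj = find(parent, i), find(parent, j)
        if ri == rj:
            return
        if size[ri] > size[rj]:                # hang the smaller tree under
            ri, rj = rj, ri                    # the larger one
        parent[ri] = rj
        size[rj] += size[ri]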

Keywords

💡Minimum Cost Spanning Trees

Minimum Cost Spanning Trees (MCST) are a theme central to the video. They are a subset of the edges of a connected, undirected graph that connects all the vertices together, without any cycles and with the minimum possible total edge weight. The video discusses two algorithms for finding MCSTs: Prim's and Kruskal's. Prim's algorithm starts from a single vertex and grows the tree one vertex at a time, while Kruskal's algorithm starts with all vertices as separate trees and merges them incrementally using the smallest available edge that doesn't form a cycle.
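
In symbols (a LaTeX restatement of this definition): an MCST of a connected, undirected graph G = (V, E) with edge weights w is an acyclic subset T \subseteq E with |T| = |V| - 1 that connects all of V and minimizes

    \sum_{e \in T} w(e)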

💡Prim's Algorithm

Prim's Algorithm is a greedy algorithm used to find the minimum spanning tree in a connected, undirected graph. It is mentioned in the script as the first algorithm studied for MCSTs. It works by starting with an arbitrary node and at each step, adds the cheapest edge that connects a vertex not yet in the tree to the tree.

💡Dijkstra's Algorithm

Dijkstra's Algorithm is a graph search algorithm that solves the single-source shortest path problem for a graph with non-negative edge weights, and it is referenced in the script in comparison to Prim's algorithm. Although not directly about minimum spanning trees, it shares the 'greedy' nature of Prim's in extending a solution incrementally by the shortest edge.

💡Kruskal's Algorithm

Kruskal's Algorithm is another algorithm for finding the minimum spanning tree, which is discussed extensively in the script. It is a 'bottom-up' approach that starts with all nodes as separate and progressively merges the lightest edges that do not form a cycle, eventually forming a single tree that spans all nodes.

💡Disjoint Sets

Disjoint sets are sets that do not have any elements in common. In the context of the video, disjoint sets are used to represent the collection of components in Kruskal's algorithm before they are merged to form the minimum spanning tree. The script discusses how to keep track of these disjoint sets efficiently using union-find data structures.
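
In LaTeX notation, matching this description: the components C_1, \dots, C_k form a partition of the vertex set S, meaning

    C_1 \cup C_2 \cup \dots \cup C_k = S, \qquad C_i \cap C_j = \emptyset \quad \text{for } i \ne j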

💡Union-Find

Union-Find, also known as Disjoint Set Union (DSU), is a data structure that keeps track of a set of elements partitioned into multiple disjoint subsets. It is crucial to Kruskal's algorithm as it helps to determine whether adding an edge would form a cycle. The script describes how to implement union-find with efficient operations for 'find' and 'union'.

💡Find Operation

The 'find' operation in union-find is used to determine which subset a particular element is in. The script explains how this operation is implemented in the context of Kruskal's algorithm to check if two vertices are in the same subset, which would mean adding the edge would form a cycle.

💡Union Operation

The 'union' operation in union-find is used to merge two subsets into a single subset. In the script, it is described as a critical operation in Kruskal's algorithm, where it is used to join two components into one when the corresponding edge is added to the minimum spanning tree.

💡Path Compression

Path Compression is an optimization technique used in union-find data structures to flatten the structure of the sets, reducing the cost of 'find' operations. It's mentioned in the script as a technique that can be applied to make the 'find' operation almost constant time, thus improving the efficiency of the algorithm.

💡Amortized Analysis

Amortized Analysis is a method to analyze the average time complexity of an algorithm over a sequence of operations, since the worst case for an individual operation may not reflect the overall performance. The script uses amortized analysis to argue that the 'union' operation in the improved union-find has an amortized time complexity of O(log m), which is crucial for the efficiency of Kruskal's algorithm.

💡Sorting

Sorting is a fundamental algorithmic process discussed in the script, specifically in relation to the initial step of Kruskal's algorithm where all edges are sorted by weight. The efficiency of sorting affects the overall time complexity of the algorithm, with the script noting that sorting takes O(m log m) time.
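
In symbols: sorting the m edges costs O(m \log m), and for comparison-based sorting this is optimal, since any such sort needs

    \Omega(m \log m)

comparisons in the worst case.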

Highlights

Introduction to minimum cost spanning trees and algorithms like Prim's and Kruskal's.

Explanation of Prim's algorithm and its approach to extending a tree by finding the shortest edge.

Description of Kruskal's algorithm and its method of building a tree bottom-up by adding the shortest edges.

The importance of sorting edges in Kruskal's algorithm and the process of adding edges without forming cycles.

Challenge of tracking components in Kruskal's algorithm and the need for efficient data structures.

Introduction to the concept of disjoint sets and their role in Kruskal's algorithm.

The function 'find' and its role in determining the partition a vertex belongs to.

Explanation of the 'union' operation and its impact on merging components.

Discussion on the inefficiency of the naive union-find implementation in terms of time complexity.

Introduction of the improved union-find data structure with 'members' and 'size' dictionaries.

Optimization of the 'union' operation by merging smaller components into larger ones.

Analysis of the amortized complexity of the improved union-find algorithm.

Explanation of how the improved union-find affects the efficiency of Kruskal's algorithm.

Final time complexity analysis of Kruskal's algorithm using the optimized union-find data structure.

Introduction to path compression and its role in further optimizing union-find operations.

Conclusion on the overall improvement of Kruskal's algorithm through the use of advanced union-find techniques.

Transcripts

play00:10

When we studied minimum cost spanning trees, we saw two algorithms. The first one was Prim's

play00:15

algorithm, which, like Dijkstra's algorithm, tried to extend a tree by finding the shortest

play00:21

edge connected to the current tree. And the other was Kruskal's algorithm.

play00:25

So, in Kruskal's algorithm, we built up a tree bottom up. We started with disconnected

play00:31

nodes, and then we connected them by adding the shortest edge possible. So, we begin by

play00:36

processing all the edges in ascending order, of course. So, we sort the edges, you pick

play00:41

up the smallest edge, and each time we pick up an edge, we see if the edge forms a cycle.

play00:46

So, we have a collection of components, which have already been joined together by the edges.

play00:52

And if the new edge does not create a cycle within a component, we add it. And in the

play00:57

process, what happens is that the endpoints of the edge u and v belonged earlier to different

play01:03

components. Because if they were in the same component, the component is connected, it

play01:06

would form a cycle. But having now connected these two components together, we must merge

play01:11

the components as we go along, so that we do not add another edge within this component.

play01:15

So, the main difficulty that we saw with implementing Kruskal's algorithm efficiently was precisely

play01:21

this task of keeping track of where the components were at any given time. So, we know that in

play01:26

the beginning, it is all disjoint because it is a bottom up algorithm, everything is

play01:30

a singleton. But as we go along and edges are added, the question is which nodes are

play01:36

part of the same component, which ones are different components, this was the problem.

play01:40

So, abstractly, what we have is a collection of vertices, which are partitioned. So, this

play01:46

is a collection of disjoint sets, every vertex belongs to one of the partitions. And no vertex

play01:53

belongs to two partitions. So, the collection of partitions together gives us the full set.

play02:00

But each collection in that partition has an empty intersection with every

play02:05

other collection. And now given this partition, so, we have this kind of picture that we have

play02:11

a set.

play02:12

And it has now been divided like this, into these disjoint partitions. And now I say,

play02:19

I give you a vertex u and I ask which partition it belongs to. So, that is this expression,

play02:24

this function called find. So, find tells us which partition a given vertex belongs

play02:30

to. And similarly, now when I take say u belongs to this, and say v belongs to this, then after

play02:38

this edge has been added. So, I have now found an edge which connects u and v.

play02:42

Now, these two become the same partition. So, I have to have a second operation, which

play02:46

says, given two elements of the set or the collection I am dealing with,

play02:51

take their partitions and make them into a single partition. So, the union operation

play02:55

and the find operation and what we actually saw in Kruskal s algorithm was this, this

play02:59

union operation is what took us time. Because we had to find all the other vertices which

play03:03

are in the same collection as u, same partition as u and convert them to the same partition

play03:08

as v.

play03:09

So, what we need is a data structure, which can implement these operations. So, essentially

play03:14

we have a set, a set of elements which is partitioned into components. So, as we said,

play03:19

partition means that C1 union C2 union ... union Ck is actually all of S. And each element, small

play03:25

s in S belongs to exactly one of the Ci's. So, this is what it means to be a partition.

play03:32

And now, we need to first create the initial partition.

play03:35

So, the initial partition is going to be the one where every element is on its own. So,

play03:40

you have a singleton partition containing just s for every small s, in your set. And

play03:47

then we have these two operations as we said, find and union. So, find tells us the name

play03:51

of the partition or the name of the component that s belongs to. And union given two set

play03:56

elements, s and s prime tells us, gives us back a single component for all the elements

play04:02

in the same component as s and s prime together.

play04:05

So, let us for simplicity assume that our elements are numbered 0 to n minus 1, this

play04:11

is what we typically do even for vertices when we deal with graphs. We deal it, we will

play04:16

typically use 0 to n minus 1 as the names of our vertex set. So, we will set up a mechanism

play04:25

to keep track of the components and the easiest way is to have either a list or an array or

play04:31

a dictionary, where now because these are 0 to n minus 1, I can interpret these as indices

play04:35

in the list or as keys in the dictionary or as positions in the array.

play04:39

So, if I look up the ith element of component, it will tell me which component the element

play04:46

i belongs to. So, now what does make union find do? Make union find basically creates

play04:52

names for the components, which are the same as the vertices. So, I have, I have elements

play04:56

0 to n minus 1 and I also have components. So, this is just for convenience, I do not

play05:05

want to create another set of names. So, the components are also called 0 to n minus 1.

play05:09

So, 0 goes to component 0, 1 goes to component 1, and so on. Because I need a name for the

play05:15

set. The component containing 7, the component containing 6, so I need a name for that set.

play05:22

And those names are going to be just the same 0 to n minus 1. So, component of the element,

play05:28

i will be the component i, that is what this is saying, component of the element i. So,

play05:33

this is the element, and this is the component number.

play05:41

This is how we set things up with this make union find. So, we have this, say,

play05:48

dictionary component. And now when I want to find out which component vertex i belongs

play05:53

to, or a set element i belongs to, I just query this dictionary. I will look it up with

play05:57

the index, or the key value i. And what is stored in the dictionary is the component

play06:03

number. So, initially, it is i and it will change as it goes along because of unions.

play06:06

So, let us see how union works. So union, as we saw is basically a sequential scan.

play06:11

So, you want to now take two elements, i and j, and combine their components. So, over

play06:18

time, they have reached some component name. So, if I look at component of i, it is some

play06:24

c old and c new because I am going to merge in this, I am assuming that I am going to

play06:29

put everything which is in component of i into the same component as j.

play06:33

So, I am not going to change the names of the component for the same component as j,

play06:40

I am going to change the name of the component for things which belong to the component of

play06:43

i. So, let us say component of i is old component c, and component of j is a new component c

play06:49

new. Now, I have to take every vertex, and this is the part that is inefficient. I have

play06:55

no idea which other nodes have the same component as i.

play06:59

So, I have to take every other vertex. And if that vertex has a component, which is the

play07:04

same as i, I will reset its component to be the same as j. So, this is one linear scan,

play07:11

which is proportional to the size of the set, which I have to do every time I do a union.

play07:15

It does not really matter whether the component of i was small or not. I have to look at every

play07:19

other vertex because I have no indirect information about which other vertices have been merged

play07:23

with i, so far.

play07:24

So, what does the component of i contain? So, with this, it is easy to see that this make

play07:30

union find takes order n. Because I have to go through and label the component number

play07:36

as i for every vertex i. So, this is just a simple linear scan across all the elements

play07:40

in my set. Find just looks up a dictionary. So, if we assume that this is a dictionary,

play07:44

or an array with a constant time lookup of the ith element, then in constant time, or

play07:50

O(1) time, I will find the component for a given element i.

play07:55

But union as we can see, takes time proportional to the size of the set. Because I have to

play08:01

go through every element in the set and check if it has got to be merged or not. Whether

play08:06

or not it has to be merged, I have to check it. So, this is much like an adjacency matrix

play08:09

representation of a graph, in order to find the neighbors of a vertex, you have to scan

play08:15

all the columns whether or not this vertex has few neighbors, or has many neighbors,

play08:19

you have to scan all the columns to find the neighbor.

play08:21

So, in the same sense, you have no idea how many elements share a component with i, so

play08:25

you have to scan all of them. And wherever you find something which shares a component

play08:29

with i, you have to move it to j. So, this means that one union operation takes order

play08:35

n time. So, if I do m union operations, then it is going to take m times n time. So,

play08:40

this is going to be the bottleneck that we saw.

play08:42

Because in Kruskal's algorithm, we have to add n edges or n minus 1 edges to make a spanning

play08:49

tree. So, we do order n merges, and naively speaking, if we do order n merges and each

play08:56

merge takes order n time, then I spent order n squared time executing all of Kruskal's

play09:00

algorithm. And that is what we are trying to improve.

play09:05

So, how do we improve this? Well, what is the bottleneck? So, the bottleneck is that

play09:09

we do not know, so if I am sitting here and if I have a component which contains i, I

play09:14

do not know what are the other vertices in this component. So, this whole component has

play09:21

some name c. So, I know that component of i is c. The question is what other vertices

play09:36

have component c. So, which are the other vertices in the same component?

play09:42

So, now we keep track of a dictionary going the other way. So, component tells us

play09:49

given a vertex, which component it belongs to. Members tells us given a component, which

play09:55

are the vertices which belong to it. So, members of c will be a list, it will give you a list

play10:02

of vertices. Initially, for example, I will have members of the ith component

play10:09

will just be the list containing i. Because initially every component has only one vertex, but as

play10:14

it grows, it will be a longer list.

play10:16

And of course, now since this is a list, I can take its length. I can take the length

play10:21

of this members c, and I will know how many vertices belong to this component. But I do

play10:26

not want to actually spend time computing the length each time. So, I will separately

play10:30

store this as size of c. So, I am keeping track of two new dictionaries. One is members

play10:35

of c, members of c says: give me a component and I will give you the list of vertices belonging to

play10:40

this component.

play10:41

Size of c says, how many vertices are there, which is just the length of that list. But

play10:44

I do not want to compute that list explicitly each time, so I will just keep track of it

play10:48

because it is easy to update as we will see. Now given this, let us see how to do our implementation.

play10:53

So, first we have to implement this make union find initial step, where we create the singletons.

play11:00

So, we had this old step, which we repeat: for every vertex, for every element

play11:04

i.

play11:05

I set the component of i to be the actual component i itself. So, remember that we were

play11:11

using i for both 0 to n minus 1 to name both the elements and the components. So, the component

play11:16

of i is just a component i. On the other hand, we have to now do the reverse map. So, we

play11:21

have to say what does component of i contain? Well members of i is the list containing i.

play11:26

And size of component i is 1. Because I have only singletons at this point. So, this is

play11:32

the initialization, very straightforward.

play11:35

What about find? Well, find does not update anything, find merely looks up the component.

play11:39

So, it does not have to look at members or size because it just needs the information

play11:43

in the dictionary component. So, as before find of i just looks up component of i. Union

play11:49

is where we were having problems, so let us see if we can make union a little more efficient.

play11:54

So, as before, we find out the names of the components of the two vertices or two elements

play12:00

we are dealing with i and j.

play12:01

So, we are going to as before, assume that we are moving everything in the component

play12:06

of i, to be in the component of j. So, we call the component of i as c old and the component

play12:11

of j as c new. And the difference now is that instead of going over everything, before

play12:17

we were doing this for loop over every element in our set, for every vertex in our set. So,

play12:26

we are going from 0 to n minus 1. But now I do not have to do it for every vertex, (())(12:30)I

play12:30

know which ones have to change.

play12:32

The ones that have to change are the ones that are in the same component as i. So, I

play12:38

just have to look at the list members of c old, and for every k in members of c old, I will

play12:45

first of all change its component. I will update its component from c old to c new.

play12:50

But now I have to keep track of these other things as I go along. So, I have to take this

play12:55

c new component and append this new element to its list. So, k is now being added as

play13:01

a new element in members of c new.

play13:04

So, as I change the name of the component of k from c old to c new, I go to the members

play13:10

of C new and I add k to it. And when I add k to it, its size also changes. So, I also

play13:15

increment the size of c new by 1. So, these are two extra steps, which happen when I

play13:20

relabel: when I move a vertex from this component to that component, I must add it to its list

play13:24

members and I must update the size by 1. So, this is simple enough.

play13:28

So, the question is what have we gained? Because what we have gained really is this step. Compared

play13:37

to our earlier implementation, this is still order n and this is still order one. So, the

play13:43

question is what have we gained in union? So, the point is that by going over members

play13:49

of c old we can get away with only working with this many elements rather than this many

play13:57

elements. So, when I relabel, instead of looking at order n elements, I am looking at order

play14:01

of the size of the component.

play14:03

So, this is similar to our graph theoretic things where we say that when we look at the

play14:08

neighbors of a vertex, we are interested in the degree of the vertex, the actual edges

play14:12

rather than all the neighbors which are not (())(14:14). So, in the same spirit, we are

play14:15

only interested in those elements which belong to this component. We are not interested in

play14:20

scanning all the components, so we are explicitly keeping that in this members list.

play14:24

So, the question is: this is fine, we are using members here, but why are we keeping

play14:31

track of the size? Because there is no apparent reason in this whole thing except that we

play14:35

have to update it correctly and all that. Nowhere are we using size in this union operation.

play14:39

We are just updating it. We are updating it for sure, but we are not using the fact that

play14:45

the size is available to us. So, the way we will use size is as follows.

play14:49

We will always choose which way to merge, so remember we have a choice. We can either go from i

play14:53

to j or j to i, that is, which component gets relabeled as which. Do I make everything in

play15:00

component of i point to j, or make everything in component of j point to i. So, that is

play15:06

where size will play a role. So, size will say that we always merge the smaller

play15:12

component into the bigger component.

play15:14

So, we have basically two components. So, if i belongs to c and j belongs to c prime,

play15:20

I have to decide whether to make everything in c the same as c prime, or everything in

play15:23

c prime. So, the final component that survives is going to be c or c prime? So, the rule

play15:28

is that the larger component will survive. So, if c is smaller than c prime, I will rename,

play15:33

relabel everything inside c to be c prime.

play15:36

And if c prime happens to be smaller than c, I will do the opposite. So, that is the

play15:40

crucial decision that size gives us. So, a priori, this does not seem to be so helpful.

play15:49

Remember, our problem was that when we were doing this over the whole range of n, blindly, we were

play15:57

spending order n time on every merge. But now we are spending something proportional

play16:02

to the size of the smaller set. But the size of the smaller set could still be almost half.

play16:08

It could be half plus a little bit, half minus a little bit. So, in the worst case, a single

play16:12

merge operation could still be order n. Because it could take n by 2 steps. So, we have

play16:17

not really guaranteed that an individual merge operation is less expensive now, up to an

play16:24

order constant factor. Of course, it will be half as expensive, but half is not good

play16:28

enough. It is a constant factor, we want something which is a little stronger than half.

play16:33

So, it turns out that this we cannot deny, I mean, it could be that an individual merge

play16:37

will cost half n time. But it will not happen frequently. So, we have to do our accounting

play16:45

a little carefully. So, we are not going to change the implementation. This is our final

play16:49

implementation. But we are going to argue that this is good enough, when we have this

play16:52

rule, of always merging the smaller component to the larger component, we are actually going

play16:58

to get a sizable improvement.

play17:00

So, why does this happen? Well, basically, if I take a vertex i and I track what happens

play17:07

to i. So, initially, its component is i. The first time it is forced to change its component,

play17:14

it changes its component from some c to c prime. Then from c prime to c double prime.

play17:20

But what I have guaranteed now is that when I take i and I look at the component it belongs

play17:26

to. And now supposing this is the other component, which has j, and I now merge this.

play17:32

Then i moves from c to c prime, but c prime was at least as big as c. Because c prime

play17:39

was a bigger component. So, if c had some k elements, then c prime has at least k elements.

play17:44

So, totally now, the component of i has gone from k to something at least 2k. So, the

play17:49

size of the component of i at least doubles every time it is relabeled. So basically, each time I re

play17:56

label the component of i, I move it to a set, which is twice as big as it was before, at

play18:01

least twice as big, it could be even bigger.

play18:05

Now the other point to note is that when I do a merge, how many different elements change

play18:12

their component from the initial state. So, initially, remember, I have 0 to n minus 1,

play18:18

as components, 0 to n minus 1. Now at some point, I do a merge. So, maybe I take i and

play18:24

j. So, I take these, and then they will both go say to j. So, at this point, I have touched

play18:30

i and j. So, I have changed their component structure. In the case of one I have changed

play18:35

the component name and the other one, I have changed it to be in a component, which has

play18:39

new elements.

play18:40

Now the question is, after m such things, after m union operations, how many elements

play18:46

have had their component touched in this way, either it has moved, or something has come

play18:50

into it? Well, you can see that, if I take two elements, then I had touched two. The

play18:57

next merge could, next union could take two other elements and touch two. So, I could

play19:01

do 2 plus 2 plus 2, and so on, which is 2m in total. The first union merges these two, the next

play19:08

union merges these two.

play19:10

But I cannot do any more than that, I claim. Because if I do not take two new elements,

play19:13

then I must be taking some set which has already been touched and one more element. So, at

play19:19

most, I can touch 2m elements. So, this means that at most 2m elements have had their component

play19:26

updated since the beginning. But if at most 2m elements have had their components updated,

play19:32

the worst that can happen is that all of them are merged into a single component.

play19:35

I cannot have a larger component than that. So, the largest component of any vertex in

play19:43

my set is at most 2 times m. So, many of them will still be singletons, this is what we

play19:49

are saying. So, after m union operations, the largest set in my partition

play19:55

cannot be bigger than 2m. So, we have now these two facts. The first is that each time

play20:01

I change the partition of i, it moves to a set which is at least twice as big.

play20:05

But it cannot go beyond a certain size because across m operations, it can only grow up to

play20:10

2m. So, combining this, we see that the size of a component will grow from 1 to 2 to 4

play20:18

to 8, but it has to stop before 2m. Because it cannot get beyond 2m, because 2m is the

play20:22

largest set I could possibly have, after m union operations. So, that means the doubling

play20:27

can happen at most log m times.

play20:28

That means if I pick a vertex, and I examine how many times it changes its component, it

play20:34

can be at most log m times. So, a single vertex cannot change its component more than log

play20:38

m times. So, this is what gives us now a better bound for union find.

play20:44

Because it says that basically, over m updates, first of all, we know that at most 2m elements

play20:52

are relabeled. And each of them is relabeled at most log m times. So, remember, our

play20:59

relabeling operation is now optimal: I never look at a vertex to relabel unless I need

play21:03

to. Because I am looking only at members of c. So, every time I decide whether to

play21:09

relabel a vertex or not, I am going to relabel it; I never look at a vertex and

play21:13

do not relabel it.

play21:14

So, I only need to know how many relabel operations there are, because I never look

play21:18

at vertices, or try to do anything with vertices in union when I am not relabeling them. But

play21:23

I know that there are at most 2m of them, because that is the limit of how many vertices

play21:27

ever get relabeled. And each one of them gets done at most log m times. So, if I add it

play21:32

up, it is like, again, it is a similar analysis. But in a more complicated setting.

play21:36

When we said that, when you do breadth first search or depth first search using an adjacency

play21:40

list, then as we process all the vertices, we do the sum of the degrees. So, an individual

play21:46

vertex might add a lot of neighbors to the queue in breadth first search but others will

play21:51

add less. And totally the number of neighbors I have to look at is going to be the sum of

play21:55

the degrees and therefore it gets bounded by the total number of edges.

play21:58

Similarly, here, we are saying that the total number of operations of relabeling across

play22:04

all the m union operations is going to be m times log m. Because 2m, order of m times

play22:12

log m, let us say. Because 2m elements are going to have their

play22:17

labels changed, and no element is going to be touched more than log m times. So, therefore,

play22:22

over m union operations, I will not do more than m times log m work, in terms of changing

play22:29

component numbers.

play22:31

So, this is where we have saved. So, instead of going to m times n, we have come to m times

play22:38

log n. So, this means that if I have, since I am assuming that there are m of them, I

play22:43

can take m times log m. And I can say what is the cost of one of them? I divide it by

play22:49

m. So, on average, in a sense, each of them is only taking logarithmic time. So,

play22:54

it is not that any one of them is actually taking logarithmic time for sure.

play22:57

Because I know that sometimes I have to merge a large component. But then many of the times

play23:01

I will be merging very small components. And if I add it up, so this is again, this notion

play23:06

that we saw with, as I said, with BFS and DFS, when we were dealing with adjacency lists,

play23:12

that we count the cost across the entire operation and say the total sum of the operation can

play23:16

only be so much. Even though individual operations can be more.

play23:20

So, this is, this notion of amortized, what you gain in one place, you will lose in another

play23:25

place. But overall, it will all work out. So, the amortized complexity of one union

play23:29

operation is log m. Even though individual union operations could be more expensive,

play23:35

totally m log m is the cost of m such operations.

play23:39

So, how does this affect Kruskal's algorithm? So, this is where we started looking at these

play23:48

operations, because this was the context in which we needed it. So, remember that, in

play23:52

Kruskal's algorithm, we start by sorting the edges. So, we take the edges, there are m

play23:55

of them, and we sort them in ascending order. After this, we have to start this business

play24:02

of creating the tree. So, we have to keep track of these components.

play24:06

So, we first do a make union find, which will take order n time for our n vertices. So,

play24:11

this is going to be order n time. Now, when I check whether a vertex needs to be added

play24:20

or not, I need to look up the component name of the two endpoints. But this is order 1

play24:25

time because I just have to look up my component (())(24:27). And now this is the step which

play24:31

was bothering us. So, earlier this was taking order n time when we did it naively.

play24:36

But now we know, in this amortized sense, it is going to take log m time. So, how many union

play24:42

operations are we going to actually perform? Well, because we are constructing a spanning

play24:46

tree on n vertices and a spanning tree on n vertices remember will have n minus 1 edges

play24:50

exactly. We are going to do exactly n minus 1 union operations. So, we can think of it

play24:55

as the order n union operations. So, we saw earlier that if we do order m, if you do m

play25:00

operations, it is m log m.

play25:02

So, here it is n operations, so it is n log n. So, the total cost of all these union operations

play25:08

across each addition that we actually perform, remember that if find u turns out to be find

play25:13

v we just throw away the edge, so we do not do anything. So, only when find u is not equal

play25:17

to find v, when they are in two different components, we have to do a merge. But now

play25:20

what we are saying is that this merge, will actually happen overall, the overall cost

play25:26

of these merges, all n of them, will be n log n.

play25:29

So, this is the amortized cost over all of the union operations. So, now, if we go back

play25:35

to the first step, this is another expensive step. This takes m log m time using an optimal

play25:41

sorting algorithm. So, we have, we have seen that, things like merge sort, for instance,

play25:45

are m log m. And a naive algorithm like insertion sort or selection sort could be n squared.

play25:51

But you cannot beat m log m; we said, without giving a proof, that that is a lower

play25:55

bound.

play25:56

So, we need to spend m log m time to sort. So, we need m log m time to sort and n log

play26:00

n time to do the unions. Now, it is a bit awkward to have log m here and log

play26:06

n there. But notice that m is at most n squared. So, if I take logs on both sides, I will say

play26:13

log m is less than or equal to 2 log n. So, this is how logs work. So, log of m is actually

play26:22

big O of log of n, it is some constant times log of n.

play26:25

So, I can say that m log m is the same as m log n. So, now I have the same thing on

play26:32

both sides. Now, I am in good shape, I have n log n and I have m log n.

play26:36

So, I can now combine them and claim that the overall time of Kruskal's algorithm is

play26:40

m plus n times log n. So, remember that this really is the size of G. So, we are really

play26:48

saying that in terms of the size of G, it is N log N. I mean, if you think of capital

play26:52

N as the size of G, it is within capital N log N.

play26:56

And this is probably the best that one could hope to do given that you have to do sorting

play27:00

and all that. So, this is how this union find algorithm actually helps us bring Kruskal's

play27:06

algorithm down from n squared to something like n log n.

play27:11

So, what we have seen is that if we use this two way mapping, so component is the obvious

play27:17

thing. It tells us given an element which component it belongs to. But if we keep the

play27:21

reverse mapping, given a component, which are the elements which belong to it, then

play27:25

we can relabeled these components efficiently by also keeping track of the size, always

play27:30

merge a smaller component into a bigger component.

play27:33

And if we just follow this rule of merging smaller components into bigger components

play27:37

and keep track of the components using this members array or members dictionary, then

play27:42

we can guarantee that across m operations, the amortized complexity of each union operation

play27:47

is log m. So, m log m is the total cost of all the unions, find remains constant time

play27:53

because it is just looking up a dictionary value.

play27:54

And make union find remains linear time because I have to actually reset the component to

play27:59

a singleton for everybody. So, this is a very direct and easy way of doing union find, there

play28:05

is a slightly more complicated way where instead of keeping this members of k as a list of

play28:10

vertices, you can keep it as a tree. And if we keep it as a tree, then it turns out that

play28:16

union operation becomes very simple. So, basically, I will have members of i. So, this will be

play28:23

some tree. And I will have members of say j and this will be some other tree.

play28:30

And the rule is that if you belong to the same tree, you belong to the same component. So, the union

play28:33

operation will just either connect this tree that way, or connect this tree this way, so

play28:38

it will just take the root of one tree and put it as a child of the root of the other

play28:42

tree. So, the union operation is very simple, I just stick one tree under the other tree.

play28:47

But now find means, I am sitting here and I must go up to the root, the root vertex

play28:52

will tell me which component I am in.

play28:54

So, we have to be careful that these components, these trees do not become very tall. And there

play28:59

is a clever way of keeping them not only from growing very tall, but actually flattening

play29:03

them. There is something called path compression that you can do. And if you do this path compression,

play29:08

eventually your tree will look like this, where, if I have a component,

play29:13

all the other vertices which belong to the component will be only one step below.

play29:18

So, it will only take me one lookup in order to tell you what the component is. So, it will

play29:23

have an amortized complexity of almost constant time. So, both union and find become cheap: union

play29:29

is constant and find is almost constant. So, this is really the ultimate implementation

play29:34

that you could have for this where you can keep track of this and then each operation

play29:37

is a unit cost. But we will not go into that now. But you can look it up. It is a very

play29:42

interesting optimization of this union find data structure.


Related tags
Graph Theory, Kruskal's Algorithm, Union-Find, Data Structures, Algorithm Optimization, Sorting Algorithms, Amortized Analysis, Computer Science, Spanning Trees, Efficiency