Your Private GitHub Repos Aren't as Private as You Think

ProtonPenguin
26 Jul 2024 14:58

Summary

TL;DR: This script reveals a significant GitHub weakness: data from private and deleted repositories can remain accessible indefinitely through the network of forks. Even after deletion, commits, including sensitive information such as API keys, can be retrieved by anyone who knows the commit hash. GitHub describes this as an intentional design decision, not a bug, which contradicts user expectations of privacy and of data destruction upon deletion. The script urges GitHub to reconsider this design to align with security best practices and user trust.

Takeaways

  • GitHub's design allows anyone to access data from deleted forks, deleted repositories, and even private repositories, indefinitely.
  • Users may mistakenly believe that deleting a repository removes all associated data from public access, but this is not the case on GitHub.
  • The issue stems from GitHub's repository network structure: commits made anywhere in a network of forks stay reachable through the other repositories in that network, even after the repository they were pushed to is deleted.
  • Commit hashes, even short prefixes of them, can be used to access deleted or private data, putting sensitive information at risk.
  • GitHub's response confirms that this is an intentional design decision, not a bug, and that it is documented in their security considerations.
  • The average user perceives the separation of private and public repositories as a security boundary, which this behavior contradicts.
  • The GitHub Archive website stores events that happen on GitHub, including commit hashes, making it possible to find and access deleted or private data.
  • Deleting a repository does not securely delete its data; the commits remain accessible as long as at least one fork exists.
  • Changing the visibility of a repository from private to public splits it into two repository networks, and commits made to private forks before the change become visible to the public.
  • For open-source projects, code committed to a private fork before the upstream repository was made public remains publicly accessible.
  • Rotate any API key or other secret that was accidentally committed; deleting the commit or the repository is not enough to secure it.

Q & A

  • What is the main concern raised in the blog post mentioned in the script?

    -The main concern is that anyone can access data from deleted forks, deleted repositories, and even private repositories on GitHub, and this data remains accessible forever, which is an intentional design feature of GitHub, not a bug.

  • How can one access data from a deleted fork on GitHub?

    -One can access data from a deleted fork by using the commit hash. Even if the fork is deleted, the commit is still accessible as long as the commit hash is known.
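
As a rough illustration, the lookup shown in the demo amounts to pasting the commit hash into a commit URL on the upstream repository. The sketch below assumes GitHub's usual `/<owner>/<repo>/commit/<hash>` web URL layout; the owner, repository, and hash are placeholders, not real values.

```python
def commit_url(owner: str, repo: str, sha: str) -> str:
    """Build the web URL for viewing a commit through a given repository."""
    return f"https://github.com/{owner}/{repo}/commit/{sha}"


# Placeholder values for illustration only.
print(commit_url("example-owner", "example-repo", "0123abc"))
```

The point of the demo is that the repository in the URL does not have to be the (now deleted) fork: any surviving repository in the same network can be used.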

  • What is the vulnerability with accessing deleted fork data?

    -The vulnerability lies in the fact that even after a fork is deleted, the code with potentially sensitive information remains accessible using the commit hash, contrary to what one might expect after deletion.

  • Why is the commit hash important in accessing deleted data?

    -The commit hash is crucial because it uniquely identifies a commit. Knowing even a part of the commit hash can allow access to the commit's data, even if the repository or fork has been deleted.

  • How secure is the commit hash against brute force attacks?

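    -The commit hash is not entirely secure against brute force attacks. In the demo, the first six characters of the hash were enough to reach the commit, which leaves 16^6 (roughly 16 million) possibilities. The video adds that Git accepts short hashes of as few as four characters, which leaves only 16^4 = 65,536 possibilities, well within brute-force range.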
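
The arithmetic behind the brute-force concern can be checked directly. This is a small sketch that only counts possible hexadecimal prefixes; the lengths 4 and 6 come from the video, and 40 is the length of a full SHA-1 commit hash.

```python
HEX_DIGITS = 16  # each hash character is one hexadecimal digit


def keyspace(prefix_length: int) -> int:
    """Number of distinct hexadecimal prefixes of the given length."""
    return HEX_DIGITS ** prefix_length


for n in (4, 6, 40):
    print(f"{n:>2} hex chars -> {keyspace(n):,} possibilities")
```

A four-character prefix gives 65,536 candidates and a six-character prefix gives 16,777,216, both tiny next to the full 40-character space.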

  • What is GitHub Archive and how does it relate to the vulnerability?

    -GitHub Archive is a website that archives every event that happens on GitHub, including commits. This means that commit hashes for almost every commit on every repository that was once public are available, potentially exposing private information.
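
To make the risk concrete, here is a minimal sketch of pulling commit hashes out of archived event records. It assumes the records are newline-delimited JSON and that push events carry a `payload.commits` list whose entries have a `sha` field; those field names are assumptions about the event format, and the sample records are made up.

```python
import json


def commit_shas(event_lines):
    """Yield (repo_name, sha) for every commit listed in PushEvent records."""
    for line in event_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") != "PushEvent":
            continue  # only push events list commits
        repo = event.get("repo", {}).get("name", "")
        for commit in event.get("payload", {}).get("commits", []):
            sha = commit.get("sha")
            if sha:
                yield repo, sha


# Made-up sample records, not real GitHub data.
sample = [
    '{"type": "PushEvent", "repo": {"name": "example/fork"}, '
    '"payload": {"commits": [{"sha": "aaaa1111"}, {"sha": "bbbb2222"}]}}',
    '{"type": "WatchEvent", "repo": {"name": "example/other"}, "payload": {}}',
]
print(list(commit_shas(sample)))
```

With hashes recovered this way, no brute forcing is needed at all.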

  • What happens when a public upstream repository that has been forked is deleted?

    -When a public upstream repository is deleted, instead of deleting the whole tree, GitHub reassigns the root node to one of the downstream forks, making all commits from the upstream repository still accessible via any fork.
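
The behavior described above can be pictured with a toy model. This is only an illustration of the idea of a shared commit store with a reassigned root; the class, its methods, and the rule for picking a new root are invented for this sketch and are not GitHub's implementation.

```python
class RepoNetwork:
    """Toy model: every repository in a fork network shares one commit store."""

    def __init__(self, root: str):
        self.root = root
        self.repos = {root}
        self.commits = {}  # sha -> content, shared by the whole network

    def fork(self, name: str) -> None:
        self.repos.add(name)

    def commit(self, repo: str, sha: str, content: str) -> None:
        if repo not in self.repos:
            raise KeyError(repo)
        self.commits[sha] = content

    def delete(self, repo: str) -> None:
        self.repos.remove(repo)
        if not self.repos:
            self.commits.clear()  # nothing left to view commits through
        elif repo == self.root:
            # Hypothetical rule: promote an arbitrary surviving fork to root.
            self.root = sorted(self.repos)[0]

    def view(self, via_repo: str, sha: str):
        """Return the commit content if via_repo still exists, else None."""
        if via_repo not in self.repos:
            return None
        return self.commits.get(sha)


net = RepoNetwork("upstream")
net.fork("fork-a")
net.commit("upstream", "c0ffee", "API_KEY=hypothetical")
net.delete("upstream")
print(net.root)
print(net.view("fork-a", "c0ffee"))
```

In this model the commit pushed to the upstream is still readable through the fork after the upstream is deleted, even though the fork never synced it.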

  • Why does GitHub keep the data accessible even after deletion?

    -According to the video, GitHub was designed from the start as a place for open code collaboration, where everything is meant to be publicly visible. Keeping the shared commits and reassigning the root of the repository network to a downstream fork means forks keep working even if the upstream repository is deleted.

  • What is the implication of this feature for users who open source a tool on GitHub?

    -The implication is that any code committed to a private fork before making the upstream repository public is also accessible to the public, as the commits are part of the same repository network.
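
The timing rule from the video can be written down as a one-line check. This is a simplified sketch of the rule as the video states it (private-fork commits made before the upstream goes public are exposed, later ones are not); the day numbers are hypothetical, and how a commit made at the exact moment of publication is treated is an assumption.

```python
def exposed_by_publication(commit_day: int, publish_day: int) -> bool:
    """Toy rule: a private-fork commit is exposed only if it predates publication."""
    return commit_day < publish_day


# Hypothetical timeline: the upstream repository goes public on day 10.
PUBLISH_DAY = 10
for day in (3, 9, 10, 15):
    print(day, exposed_by_publication(day, PUBLISH_DAY))
```

Commits from days 3 and 9 would be visible through the public repository, while those from day 10 onward stay in the separate private network.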

  • What does GitHub's response to the vulnerability reveal about their stance on this feature?

    -GitHub's response indicates that this is an intentional design decision and is working as expected. They do not have any immediate plans to change this functionality, as it is documented in their security considerations.

  • What are the takeaways for users regarding the security of their repositories on GitHub?

    -The main takeaway is that any commit to a repository network, including the upstream repo or downstream forks, will exist forever and cannot be deleted or hidden. Users must be cautious about committing sensitive information and consider rotating any accidentally committed private keys immediately.

Outlines

00:00

GitHub's Persistent Data Access Issue

This paragraph discusses a significant vulnerability in GitHub that allows anyone to access data from deleted forks, deleted repositories, and even private repositories indefinitely. The author demonstrates how deleted fork data can still be accessed by using commit hashes, which, although difficult to guess, are not entirely secure against brute force attacks. The GitHub Archive is mentioned as a place where commit hashes are stored, potentially exposing sensitive information that was mistakenly committed and thought to be deleted.

05:02

The Risks of Deleted and Private Repository Data

The second paragraph delves into the consequences of GitHub's design where data from deleted forks and repositories can still be accessed if at least one fork remains. It explains how users have unknowingly exposed API keys by hardcoding them into example files within forks, which they later delete, under the false assumption of privacy. The paragraph also covers how private commits made to a repository before it becomes public are accessible through the public repository network, highlighting a common yet risky workflow that many developers might not be aware of.

10:04

GitHub's Response to Data Accessibility Concerns

In the final paragraph, the author shares GitHub's response to the reported vulnerability, stating that the persistent data access is an intentional design feature, not a bug. The author argues that while GitHub's documentation mentions this functionality, most users are unaware of it and expect a higher level of privacy. The paragraph concludes with a call for GitHub to reconsider its security model, given the potential for misuse and the discrepancy between user expectations and actual privacy protections.

Keywords

GitHub

GitHub is a web-based platform for version control and collaboration that allows developers to manage and contribute to projects. In the video's context, it is the central platform where the discussed vulnerabilities and access to data occur. The script mentions GitHub's intentional design that allows access to deleted and private repository data, which is a key point of the video's theme.

Repository

A repository in the context of version control systems like Git, and by extension GitHub, is a storage location for a project's files and their revision history. The script discusses both public and private repositories, emphasizing the security issues that arise when one believes private data within a repository is secure from public access.

Fork

Forking a repository on GitHub means creating a copy of the project under a different user's account, allowing for independent development. The script highlights the vulnerability associated with forking, where even deleted forks can have their data accessed, contradicting the common assumption of data privacy post-deletion.

Commit

A commit in version control is a set of changes to a repository's files that are saved as a single unit. The video script discusses the permanence of commits, even after a repository is deleted, and the risks of accidentally committing sensitive information that remains accessible.

API Key

An API key is a unique code passed in by computer programs calling an API to identify the calling program, its developer, or its user. The script uses API keys as an example of sensitive information that could be inadvertently committed to a repository and remain accessible even after deletion, illustrating the security risks.

Vulnerability

In the context of cybersecurity, a vulnerability is a weakness in a system that can be exploited by attackers. The video script describes specific vulnerabilities on GitHub that allow unauthorized access to deleted and private data, which is a central theme of the video.

Brute Force Attack

A brute force attack involves attempting all possible combinations to crack a password or encryption key. The script mentions the possibility of brute forcing short commit hashes to access private data, indicating a potential security flaw in GitHub's system.

GitHub Archive

GitHub Archive is a project that aims to create an open, comprehensive archive of GitHub activity. The script refers to this archive as a place where commit hashes are stored, which could potentially be used to access private information if known.

Repository Network

A repository network on GitHub refers to the interconnected set of repositories that includes the original repository and all its forks. The script explains that commits in a repository network are accessible across the entire network, which is a key aspect of the discussed vulnerabilities.

Security Model

A security model defines how access to resources is controlled in a system. The video script suggests that GitHub's security model may be flawed or misunderstood by users, as it allows access to data that users expect to be private, indicating a need for a change in the platform's approach to security.

Deletion

Deletion, in the context of data and software, refers to the act of removing data or files. The script emphasizes that deletion on GitHub does not equate to data destruction, as commits remain accessible, challenging the user's trust in the platform's handling of private information.

Highlights

Data from deleted forks, deleted repositories, and even private repositories on GitHub can be accessed by anyone.

This data access is an intentional feature of GitHub, not a bug.

Deleted Fork data remains accessible forever, contrary to what users might expect.

A vulnerability exists where private information in commits can be accessed even after deletion.

Users mistakenly believe that deleting a repository removes all associated data from public access.

Accessing deleted fork data requires the commit hash, but a short prefix of the hash is enough.

GitHub archive exposes commit hashes, making it possible to access private information.

40 valid API keys were discovered from deleted Forks, indicating a real-world impact of this vulnerability.

Even if a public repository is deleted, its data remains accessible if at least one Fork exists.

GitHub's repository network structure means that deleting an Upstream repo does not delete its history.

A major tech company's private key was exposed due to misunderstanding GitHub's deletion and privacy features.

Any code committed to a public repository may be permanently accessible if there is at least one Fork.

When a private repository with private forks is made public, commits made to those forks before the change become visible to everyone.

GitHub acknowledges this feature in their documentation, but many users are unaware.

The security implications of GitHub's design are significant and may require a change in user expectations.

GitHub's response to the findings indicates that they do not plan to change this feature in the near future.

The blog post suggests that GitHub should revisit its security model to better align with user expectations.

Transcripts

play00:00

if you thought your private GitHub

play00:02

repositories were safe from prying eyes

play00:04

think again this blog post caught my

play00:07

attention today and I'm kind of

play00:09

surprised that no one's talking about it

play00:10

because this seems like a big deal

play00:12

anyone can access deleted and private

play00:14

repository data on GitHub specifically

play00:18

you can access data from deleted Forks

play00:21

deleted repositories and even private

play00:23

repositories on GitHub and it's

play00:25

available Forever This is known by

play00:28

GitHub and intentionally designed that

play00:30

way that's right this is a feature not a

play00:33

bug so what's the vulnerability here how

play00:35

can you access this data here's how the

play00:38

vulnerability Works accessing deleted

play00:40

Fork data so let's say you have a fork

play00:43

of a public repository you then commit

play00:46

code to your fork and you delete your

play00:48

fork so it would look something like

play00:50

this you would create the fork commit

play00:52

something let's say that code has some

play00:55

private information that you don't want

play00:56

people seeing for example you might have

play00:59

accidentally put an API key or a

play01:01

password or something like that in there

play01:03

and then you delete it now a reasonable

play01:06

person would say that okay this

play01:08

repository has been deleted this

play01:10

information should no longer be

play01:12

available so it's fine no big deal

play01:15

however that is a wrong assumption the

play01:18

code is actually still accessible even

play01:20

though it shouldn't be right you deleted

play01:22

it but it is and it's accessible forever

play01:25

out of your control there is absolutely

play01:27

nothing that you can do to remove that

play01:29

data from the public record here's

play01:31

how you would do that so I'm going to go

play01:33

to GitHub uh let's find some repository

play01:36

yt-dlp that looks good let me just

play01:39

create a fork of

play01:45

that let's edit this README

play01:49

file

play01:50

secret commit the

play01:52

changes

play01:55

oops and now let's say I have this

play01:58

information here this secret that I

play02:00

don't want to be

play02:01

public let me just copy that URL and

play02:05

let's say okay I noticed that I

play02:06

accidentally committed some information

play02:08

that I shouldn't have so I'm just going

play02:10

to delete the repository thinking that

play02:12

that will remove this information from

play02:14

the public record so I'll go to

play02:17

settings delete this repository I want

play02:20

to delete this repository

play02:26

yes okay now it's gone that information

play02:29

should no longer be publicly available

play02:31

right well maybe let's go back to the

play02:36

yt-dlp

play02:38

repository and if I paste in this URL

play02:41

that I had before which contains the

play02:43

commit hash let's see what we get so

play02:47

notice that here it says this commit

play02:49

does not belong to any branch on this

play02:51

repository and may belong to a fork

play02:52

outside the repository oh look here is

play02:56

that commit that I made where I deleted

play02:59

all of this information from the readme

play03:01

and added a secret so this information

play03:04

that I thought would no longer be

play03:05

accessible actually is hm interesting

play03:09

notice however that we needed to

play03:10

actually know the commit ID to do this

play03:13

and the commit ID is a pretty long hash

play03:16

so trying to brute force that is

play03:18

actually quite difficult so you might

play03:20

think okay well at least there's some

play03:22

kind of safety there but you don't need

play03:25

the full hash let's see how far we can

play03:27

get let's delete all that yep still

play03:30

accessible let's go back one more let's

play03:32

go back two more characters yep still

play03:34

accessible we delete the B4 okay then

play03:37

it's not found B still not found okay

play03:40

looks like in this case the minimum that

play03:43

we need is six characters which isn't

play03:47

enough to be safe against a Brute Force

play03:48

attack like that is not a huge number I

play03:51

think each one of these is 16 since it's

play03:54

hexadecimal and so 16 to the power of

play03:58

6 it's a large number 16 million but

play04:03

it's not large enough that I would

play04:04

consider this to be secure so anyone

play04:07

that knows the commit hashes and as I

play04:09

mentioned earlier even the short commit

play04:11

hashes which can possibly be brute

play04:13

forced then they'd be able to access all

play04:15

that private information the minimum

play04:17

number of characters that git requires

play04:19

for a short commit hash is actually Four

play04:22

so instead of 16 to the 6 the minimum is

play04:25

actually 16 to the 4 which is

play04:31

65,536 very much in the realm of being

play04:34

brute forcible and in fact the commit

play04:36

hash is actually discoverable remember

play04:39

how I mentioned that you need the commit

play04:40

hash to access that private information

play04:43

well here's a place where you can find

play04:44

it GitHub archive this is a website that

play04:47

basically archives every single event

play04:49

that happens on GitHub there are 15 plus

play04:52

event types which I won't go into the

play04:54

details of but basically this is a

play04:57

massive store of information about every

play04:59

single thing that happens on GitHub

play05:02

which includes commits this means that

play05:04

the hashes for just about every commit

play05:07

on every repository that was at once

play05:09

public are available on this

play05:12

website and how often can we find data

play05:14

from deleted Forks well the person that

play05:16

wrote this blog post Joe Leon from

play05:18

truffle security company found 40 valid

play05:21

API keys from deleted Forks in which

play05:24

apparently users did something like this

play05:26

first they forked the repo then they

play05:28

hardcoded an API key into an example

play05:30

file then they did some changes and then

play05:32

they deleted the fork like this this is

play05:35

something that a new user might want to

play05:36

do they'd see a placeholder in some

play05:39

example file showing how to use the

play05:42

program that's in a certain repository

play05:43

and they'll just change the example file

play05:46

to contain the API key that seems like a

play05:48

reasonable way to do things as a new

play05:50

user especially if you know that you're

play05:52

going to delete the fork later

play05:54

unfortunately this vulnerability shows

play05:56

that no you absolutely should not do

play05:58

that because even if you delete the

play06:00

fork you cannot trust GitHub to actually

play06:03

securely delete it however that's not

play06:05

the only vulnerability here it gets

play06:07

worse accessing deleted repo data so

play06:10

consider this situation you have a

play06:12

public repository on GitHub some user

play06:14

Forks your repo you commit data after

play06:16

they Fork it and they never sync their

play06:18

fork with your updates and you delete

play06:20

the entire repo in this case the code

play06:22

that you committed after they forked is

play06:24

still accessible so as long as at least

play06:27

one fork exists then that information

play06:29

will be publicly accessible forever so I

play06:32

mentioned earlier that this is a feature

play06:34

instead of a bug and why is that well

play06:36

let's go into the details so GitHub

play06:39

stores repositories and forks in a kind

play06:41

of repository network with the original

play06:43

Upstream repository acting as the root

play06:45

node it's like a tree the way that git

play06:48

itself is a tree right you have an

play06:50

initial commit and then you have

play06:52

branches on top of that and you have a

play06:54

whole history that comes back to this

play06:56

root however in this GitHub repository

play06:59

Network when a public Upstream

play07:01

repository that has been forked is

play07:02

deleted instead of just deleting the

play07:04

whole tree because well GitHub you know

play07:07

probably shouldn't do that you wouldn't

play07:10

really want your fork to be deleted if

play07:12

the Upstream goes away the way that

play07:13

GitHub solves this issue is by

play07:15

reassigning the root node to one of the

play07:17

downstream Forks however notice what

play07:19

happens here all of the commits from the

play07:21

Upstream repository still exist and are

play07:24

accessible via any fork and according to

play07:26

the author this isn't some hypothetical

play07:28

scenario and apparently this just

play07:30

happened last week quote I submitted a

play07:32

P1 vulnerability to a major tech company

play07:34

showing that they accidentally committed

play07:36

a private key for an employee's GitHub

play07:37

account that had significant access to

play07:39

their entire GitHub organization so

play07:42

obviously that is a pretty big security

play07:44

vulnerability the company should first

play07:45

of all get rid of that API key and

play07:48

second of all probably remove that from

play07:50

the history if possible well what they

play07:52

did is they immediately deleted the

play07:54

repository but since it had been forked

play07:57

you could still access the commit

play07:58

containing sensitive data via a fork despite

play08:01

the fork never syncing with the original

play08:03

Upstream repository which is very scary

play08:06

that seems like a huge violation of the

play08:09

trust that users have in GitHub the

play08:11

implication here is that any code

play08:12

committed to any public repository may

play08:15

be accessible forever as long as there

play08:17

is at least one fork of that repository

play08:20

so even if that fork doesn't have some

play08:22

commits that are on your Upstream

play08:25

version or on your private version or

play08:27

anywhere as long as one public Fork

play08:30

exists every commit in that repository

play08:33

network is public forever but it gets

play08:36

worse accessing private repo data okay

play08:39

so consider this common workflow for

play08:41

open sourcing a new tool on GitHub so

play08:43

step one is you create a private repo

play08:45

that will eventually be made public you

play08:47

know you might not want to create it

play08:48

publicly right off the bat because it's

play08:50

still in a very early State and maybe it

play08:53

just doesn't make sense to have people

play08:54

looking into it and you know you're just

play08:57

not ready to manage the community yet

play08:58

perfectly reasonable afterwards you

play09:00

create a private internal version of the

play09:02

repo via forking and commit additional

play09:04

code for features you're not going to

play09:06

make public again that makes sense let's

play09:08

say that this is something that you're

play09:10

trying to make money from well that

play09:12

seems like a reasonable way to do it you

play09:13

might have a public version that is

play09:16

fully open source that has all of its

play09:18

code accessible and then you add some

play09:20

kind of Enterprise features that you

play09:22

want to charge money for in a private

play09:24

fork okay seems reasonable and step

play09:27

three you make your Upstream repository

play09:29

public and keep your fork private

play09:31

this seems like a fairly common workflow

play09:33

and you might think that the private

play09:35

features that you added to your private

play09:37

Fork are inaccessible to the public but

play09:40

guess what they are 100% viewable by

play09:44

anyone any code committed between the

play09:45

time you created an internal Fork of

play09:47

your tool and when you open source the

play09:49

tool those commits are accessible on the

play09:51

public repository so just to clarify any

play09:54

commits you made to the private Fork

play09:56

after you make the Upstream repository

play09:58

public are not viewable and the

play10:00

reason for that is because changing the

play10:02

visibility of a private Upstream

play10:03

repository results in two repository

play10:05

networks one for the private version and

play10:07

one for the public version so looking at

play10:09

this graph these commits that are on

play10:12

this private Fork of the tool are public

play10:15

and again here's a demo video I'm not

play10:17

going to play it if you want to see the

play10:19

details check out the link it'll be in

play10:21

the description and this is a fairly

play10:23

common workflow right like creating

play10:25

something private and then creating a

play10:27

private Fork of it and then making the

play10:29

original public that seems like a

play10:30

totally reasonable thing that many

play10:32

people would do and would assume that

play10:34

everything that is in the private Fork

play10:36

stays private and everything that's in

play10:38

the public Upstream version is public

play10:41

but no apparently with GitHub that's not

play10:43

the case so what does GitHub have to say

play10:45

about this well the author submitted

play10:47

their findings to GitHub via their bug

play10:49

program and here's the response thanks

play10:52

for the submission this is an

play10:53

intentional design decision and is

play10:55

working as expected as noted in our

play10:57

documentation we may make this

play10:59

functionality more strict in the future

play11:01

but don't have anything to announce

play11:02

right now so it's a feature not a bug

play11:06

and it's pretty clear actually in their

play11:07

documentation that it was designed to

play11:09

work this way under important security

play11:12

considerations you can see commits to

play11:14

any repository in a fork Network can be

play11:16

accessed from any repository in the same

play11:18

Fork Network including the Upstream

play11:20

repository and under the section

play11:23

changing a private repository to a

play11:24

public repository they say when you

play11:27

change a private repository to public

play11:29

all the commits in that repository

play11:31

including any commits made in the

play11:32

repositories that it was forked into

play11:34

will be visible to everyone so there we

play11:36

go it's in the docs GitHub says it's a

play11:38

feature and that's how it's intended to

play11:40

work however I don't think users really

play11:44

understand that or at least most users

play11:46

don't at least to me I haven't looked

play11:48

into these pages in detail so you know

play11:50

what maybe that's entirely my fault but

play11:53

I think most users would assume that if

play11:55

you delete something that information

play11:57

will no longer be available and that

play11:59

information in a private Fork will

play12:01

always remain private so perhaps it's

play12:03

time for GitHub to revisit this and the

play12:05

author of the blog post agrees with me

play12:07

the average user views the separation of

play12:09

private and public repositories as a

play12:11

security boundary and understandably

play12:13

believes that any data located in a

play12:15

private repository cannot be accessed by

play12:17

public users unfortunately as we

play12:19

documented above that is not always true

play12:21

what's more the action of deletion

play12:23

implies the destruction of data however

play12:26

as we saw deleting a repository or Fork

play12:28

does not mean that your commit data is

play12:30

actually deleted now before any of you

play12:32

FSF shills jump in and say oh well

play12:35

clearly this is Microsoft and

play12:37

ifying the product and well if they

play12:39

hadn't acquired GitHub this wouldn't

play12:41

have happened and to that I say no I

play12:43

don't think so GitHub was designed this

play12:45

way from the start where it was supposed

play12:48

to be a place for open code

play12:49

collaboration where everything is

play12:52

visible publicly and all of this

play12:54

information is available to anyone that

play12:56

wants to see it so I would actually say

play12:58

that if anything Microsoft didn't Focus

play13:01

hard enough on what Enterprises want and

play13:04

that is privacy and the ability to hide

play13:06

information from people I think if they

play13:08

focused a bit harder on that this issue

play13:10

actually wouldn't have happened because

play13:12

this functionality would have been

play13:14

removed or changed earlier I'm not

play13:16

against open source but I'm just saying

play13:18

in this case it does seem like they're

play13:20

taking open architecture a bit too far

play13:23

so what are the takeaways from this well

play13:24

the main one is as long as a fork exists

play13:27

any commit to that Repository Network

play13:29

which includes commits on the Upstream

play13:31

repo or Downstream forks will exist

play13:34

forever you cannot delete them you

play13:36

cannot hide them they will always be

play13:38

publicly accessible this means that

play13:41

simply deleting a commit that

play13:42

accidentally added some private

play13:44

information is not enough if a private

play13:46

API key was committed for example you

play13:49

must immediately rotate that API key if

play13:52

some private information such as maybe a

play13:54

person's name and social security number

play13:56

were committed too bad that information

play13:59

will always be publicly accessible

play14:01

forever the second takeaway is that

play14:03

perhaps GitHub should change this now

play14:06

GitHub has a reputation for being very

play14:09

good about security right they put a lot

play14:12

of work into making sure that their

play14:14

products are very secure that things

play14:17

that are private are intended to be

play14:19

private right they have a very Advanced

play14:21

security program they have you know all

play14:23

these certifications by all these

play14:25

standards and basically they've put a

play14:27

lot of work into making sure that all of

play14:30

the information that is intended to be

play14:32

private is private so perhaps it's time

play14:35

to change the way that these repository

play14:37

networks work because security is only

play14:40

as good as the users and if users don't

play14:44

understand the security model and

play14:46

clearly they don't then perhaps it's

play14:48

time to change the security model but

play14:51

what do you think did you know that

play14:52

GitHub works this way cuz I sure as heck

play14:54

didn't let me know in the comments

Related Tags
GitHub Security, Data Privacy, Repository Networks, API Key Leak, Code Deletion, Fork Vulnerability, Commit Access, Open Source Risks, Private Information, Security Awareness