Scraping Dark Web Sites with Python
Summary
TLDR: In this video, the speaker demonstrates how to automate interactions with websites on the dark web using Tor and Python. They show how to install and configure Tor on a Kali Linux virtual machine, use the torify command to tunnel traffic, and access onion sites with curl. The video also explains creating a Python script with the requests_tor library to scrape dark web data. Additionally, they highlight tools like Flare for monitoring cyber threats and the dark web. The speaker aims to educate viewers on tracking cybercrime and automating data collection from the dark web.
Takeaways
- 🌐 The video discusses automating interactions with websites on the dark web using Tor and .onion addresses.
- 🛠️ The presenter demonstrates installing Tor on a Kali Linux virtual machine and using the 'torify' command to tunnel traffic through the Tor network.
- 🔒 The importance of configuring the Tor control port for secure communication with the Tor service is highlighted, including enabling authentication methods.
- 📝 The video shows how to edit the Tor configuration file (torrc) to uncomment the control port line and change the cookie authentication setting.
- 🔄 The presenter explains how to restart the Tor service after configuration changes and verify the new IP address through Tor.
- 🕵️ The video mentions using Tor for threat intelligence gathering, tracking cybercrime, and understanding the activities of threat actors on the dark web.
- 🛑 The 'requests_tor' library for Python is introduced to automate HTTP requests through Tor.
- 🤖 An example Python script is provided to demonstrate how to scrape content from .onion websites using Tor.
- 🔎 The video showcases the use of tools like Flare for cyber threat intelligence and attack surface management, emphasizing the value of tracking threat actors and ransomware groups.
- 📈 The presenter discusses the potential for using Tor to scrape and monitor changes on dark web marketplaces, forums, and leak sites for intelligence purposes.
- 🔗 The script concludes with a mention of various resources and libraries for further exploration of Tor usage in Python and command-line tools.
Q & A
What is the main purpose of the video?
-The main purpose of the video is to demonstrate how to automate interactions with websites on the dark web using tools like Tor, Curl, and Python.
Why does the speaker use a Kali Linux virtual machine?
-The speaker uses a Kali Linux virtual machine because it is a popular environment for cybersecurity and penetration testing, providing necessary tools for the demonstration.
What is the command to install Tor as a service in Kali Linux?
-The command to install Tor as a service in Kali Linux is `sudo apt install tor`.
What is the purpose of the 'torify' command?
-The 'torify' command is used to wrap other commands and tunnel their traffic through the Tor network.
Why does the speaker modify the Tor configuration file?
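-The speaker modifies the Tor configuration file (torrc) after the first `torify` attempt fails, enabling the control port and adjusting its authentication settings. The control port lets programs manage the Tor service; the speaker notes that some tools do not strictly need it.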
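As a rough illustration of that configuration step, the sketch below reads torrc-style text and reports which options are active (uncommented). The sample text is a hypothetical excerpt, not the speaker's actual file, and `parse_torrc` is a helper name invented here. The values mirror what the transcript describes: the control port uncommented and cookie authentication toggled to 0. The torrc comments themselves recommend enabling an authentication method whenever the control port is on.

```python
def parse_torrc(text):
    """Return a dict of active (uncommented) options from torrc-style text."""
    options = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and lines disabled with a leading '#'.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        options[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return options


# Hypothetical excerpt for illustration; a real torrc has many more lines.
SAMPLE_TORRC = """\
#SocksPort 9050
ControlPort 9051
#HashedControlPassword
CookieAuthentication 0
"""

settings = parse_torrc(SAMPLE_TORRC)
print(settings)  # {'ControlPort': '9051', 'CookieAuthentication': '0'}
```

A check like this could run before a scraping job to confirm the control port is actually enabled.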
How can you verify that your IP address is routed through Tor using Curl?
-You can verify that your IP address is routed through Tor using Curl by running the command `torify curl ifconfig.me` to see the IP address that Curl reports.
What are the two main ports used by Tor and what are their purposes?
-The two main ports used by Tor are 9050 (for the Socks proxy) and 9051 (for the control port). The Socks proxy port is used for routing traffic through Tor, and the control port is used for configuration and management of the Tor service.
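To make the split between the two ports concrete, here is a minimal sketch that builds a proxy mapping pointing at the SOCKS port. This is an alternative to the `requests_tor` approach used in the video, not the speaker's code: it assumes the plain `requests` library with SOCKS support installed (`pip install "requests[socks]"`), and `tor_proxies` is a helper name made up for this example.

```python
TOR_SOCKS_PORT = 9050    # SOCKS proxy: carries the actual web traffic
TOR_CONTROL_PORT = 9051  # control port: configuration and management only


def tor_proxies(host="127.0.0.1", port=TOR_SOCKS_PORT):
    """Build a proxies mapping in the shape the requests library accepts.

    The 'socks5h' scheme asks the proxy to resolve hostnames, which is what
    lets .onion names resolve inside Tor instead of through local DNS.
    """
    proxy_url = f"socks5h://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


print(tor_proxies())
```

The mapping would then be passed as `requests.get(url, proxies=tor_proxies())`. The control port never appears in it because it is not used for fetching pages.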
What Python library does the speaker install to make requests through Tor?
-The speaker installs the `requests_tor` library with `pip install requests_tor` to make requests through Tor.
How does the speaker automate accessing a dark web URL in Python?
-The speaker automates accessing a dark web URL in Python by creating a `RequestsTor` client from the `requests_tor` library, configured with the SOCKS port (9050) and the control port (9051), and making a GET request to the URL through the Tor network.
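A minimal sketch of the script described in this answer might look like the following. It assumes `requests_tor` is installed and a Tor service is listening on the default ports. The function names and the optional `client` parameter are additions for this example (the parameter allows a stand-in client when Tor is not available), and the URL in the usage comment is a placeholder rather than a real onion address.

```python
#!/usr/bin/env python3
"""Fetch a page through a local Tor service with requests_tor (sketch)."""


def make_tor_client(socks_port=9050, control_port=9051):
    # Imported lazily so this module still loads where requests_tor is absent.
    from requests_tor import RequestsTor

    # tor_ports is a tuple of SOCKS ports; tor_cport is the single control port.
    return RequestsTor(tor_ports=(socks_port,), tor_cport=control_port)


def fetch_html(url, client=None):
    """Return the raw HTML for url, fetched through the given client."""
    if client is None:
        client = make_tor_client()
    response = client.get(url)
    return response.text


# Usage (needs a running Tor service; the address is a placeholder):
#   print(fetch_html("http://<your-target>.onion/"))
```

Note the trailing comma in `(socks_port,)`: as the speaker points out, the ports value is a tuple even when there is only one port.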
What kind of information can be gathered from dark web scraping according to the speaker?
-Information that can be gathered from dark web scraping includes threat intelligence, cyber crime activities, ransomware updates, leaked credentials, personal identifiable information (PII), and other cyber threats.
What tool does the speaker mention for tracking cyber threats and managing attack surfaces?
-The speaker mentions 'Flare' as a tool for tracking cyber threats and managing attack surfaces, providing visibility into various threats and vulnerabilities.
Outlines
🌐 Automating Dark Web Interactions with Tor
The speaker begins by apologizing for the hotel room setting and introduces the topic of automating interactions with websites on the dark web using Tor, a tool that enables anonymous communication. The speaker demonstrates how to install and configure Tor on a Kali Linux virtual machine, including enabling the control port for Tor to allow for programmatic interaction. The video also covers using 'torify' to tunnel traffic through Tor and the importance of configuring the Tor service correctly to achieve this. The potential applications of such automation in threat intelligence and tracking cybercrime are briefly mentioned.
🔎 Exploring Cyber Threat Intelligence with Onion Addresses
This paragraph delves into the use of Tor for scraping and automating interactions with .onion addresses to gather threat intelligence. The speaker discusses the importance of tracking cybercriminal activities, such as ransomware attacks, by accessing their leak sites and forums on the dark web. Tools like Flare are highlighted for their ability to provide insights into threat actors and potential data breaches. The speaker also demonstrates how to use Tor to access an onion link and retrieve HTML content from a ransomware group's site, showcasing the practical application of Tor in threat intelligence gathering.
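The curl step described above can also be driven from Python. The sketch below only assembles and runs the same kind of command the speaker types, using curl's `--socks5-hostname` option against the local SOCKS port. The helper names are invented for this example, and it assumes `curl` and a running Tor service are present on the machine.

```python
import subprocess


def build_curl_command(url, host="127.0.0.1", port=9050):
    """Return the argv list for fetching url through Tor's SOCKS proxy."""
    # --socks5-hostname makes the proxy resolve the name, needed for .onion.
    return ["curl", "--silent", "--socks5-hostname", f"{host}:{port}", url]


def fetch_with_curl(url, timeout=120):
    """Run curl and return the page body; raises if curl exits non-zero."""
    result = subprocess.run(
        build_curl_command(url),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout


print(build_curl_command("http://example.onion/"))
```

The generous timeout reflects the speaker's remark that requests take a while because they pass through several nodes.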
🛠️ Automating Dark Web Data Retrieval with Python
The speaker transitions to discussing the automation of dark web data retrieval using Python, starting with installing the 'requests_tor' library to make HTTP requests through Tor. A Python script is created to demonstrate how to use the library's 'RequestsTor' class to send requests to .onion addresses and retrieve web page data. The script is then modified to access a different dark web URL, revealing the versatility of the approach. The speaker also touches on the discovery of a 'website seized' notice, hinting at the dynamic nature of content on the dark web and the importance of staying updated with the latest changes.
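The idea of being alerted when a leak site changes, which the speaker mentions but does not implement, could be approached by comparing fingerprints of the fetched HTML between runs. The sketch below is one simple way to do that and is not taken from the video. The two page strings are made-up stand-ins for earlier and later fetches.

```python
import hashlib


def fingerprint(html):
    """Return a stable SHA-256 hex digest for a page's HTML."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, html):
    """True when the page content differs from the stored fingerprint."""
    return previous_fingerprint != fingerprint(html)


# Made-up stand-ins for two fetches of the same URL at different times.
old_page = "<html><body>leak site listing</body></html>"
new_page = "<html><body>THIS WEBSITE HAS BEEN SEIZED</body></html>"

baseline = fingerprint(old_page)
print(has_changed(baseline, old_page))  # False
print(has_changed(baseline, new_page))  # True
```

In practice pages often contain timestamps or session tokens, so a real monitor would likely fingerprint only the relevant part of the page.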
📚 Resources for Automating Tor Interactions and Dark Web Scraping
The final paragraph provides a list of resources and further reading for those interested in automating interactions with Tor and scraping the dark web. The speaker mentions various Python libraries such as 'torpy', 'stem', and 'torrequest', which can be used for different levels of Tor interaction and control. Additionally, the paragraph references a Medium article and a resource by Dan Nadir that provide detailed instructions and insights on working with Tor in Python. The speaker concludes by emphasizing the value of these tools for tracking and understanding the ever-changing landscape of cyber threats.
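For the control-port side that these resources cover, a sketch using the `stem` library might look like this. It assumes `stem` is installed and the control port is enabled in torrc. The function name and the optional `controller` parameter are additions for illustration. Requesting a new identity asks Tor to use fresh circuits for later connections, which usually, but not necessarily, results in a different exit IP address.

```python
def request_new_identity(control_port=9051, password=None, controller=None):
    """Send Tor's NEWNYM signal so later connections use fresh circuits."""
    if controller is None:
        # Imported lazily so this module loads even without stem installed.
        from stem.control import Controller

        controller = Controller.from_port(port=control_port)
    with controller:
        # With no password given, stem tries the methods Tor advertises.
        controller.authenticate(password=password)
        # "NEWNYM" is the string value of stem.Signal.NEWNYM.
        controller.signal("NEWNYM")


# Usage (needs a running Tor service with the control port enabled):
#   request_new_identity()
```

If a hashed control password is set in torrc, the plain-text password would be passed as `password=...`.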
Keywords
💡Tor
💡Onion Addresses
💡Curl
💡Requests Library
💡Kali Linux
💡Threat Intelligence
💡Torify
💡Dark Web
💡Control Port
💡Cybercrime
Highlights
Automating interactions with websites using command line tools like curl and scripting languages like Python.
Introduction to automating website interactions on the dark web using Tor hidden services and onion addresses.
Installing Tor as a service on a Kali Linux virtual machine to tunnel through the onion router.
Using the torify command to wrap other commands and tunnel traffic through Tor.
Configuring Tor by editing the torrc file to enable the control port and adjust authentication settings.
Restarting the Tor service to apply changes and enable Tor to handle traffic.
Using curl with torify to access and pull information from onion sites, including viewing current IP addresses.
Exploring the reasons for scraping and automating interactions with onion addresses, such as threat intelligence and tracking cybercrime.
Using tools like Flare to monitor the dark web for exposed attack surfaces, leaked credentials, and other cyber threats.
Creating a Python script using the requests_tor library to automate requests to onion sites.
Demonstrating how to use the requests_tor library to create a Tor client and make GET requests to onion sites.
Automating the process of scraping dark web pages and extracting HTML content using Python.
Discussing the challenges and alternatives for automating Tor-based scraping, including other libraries like TorPy and Stem.
Showcasing practical examples and use cases for scraping dark web sites, such as tracking ransomware groups and monitoring leak sites.
Exploring additional resources and articles for further reading on Tor automation and dark web scraping techniques.
Transcripts
hi I am out on travel so this is a hotel
room video please don't hate me I'm
sorry in a lot of other videos I've
showcased how you can automate
interactions with the website whether
you're on the command line using tools
like curl or in scripting languages like
python where you can use libraries and
packages modules like requests but I
haven't showcased how we might be able
to automate this or scrape different
websites that might be in the dark web
using Tor hidden services or .onion
addresses so in this video that's
what we're going to dive into thankfully
this is really easy to do so I am inside
of my Kali Linux virtual machine I'll
hit Control Alt t on my keyboard to open
up a terminal f11 to full screen zoom in
to make this text a little bit easier
for you to read and I will go ahead and
install Tor just as a service that
might run in the background so that I
could tunnel through the onion router
and access some of those dark websites
like onion addresses moving through
different relays and nodes across that
network we'll sudo apt install tor -y to
automatically confirm enter my password
for Kali and then go ahead and install
Tor now one command that is actually
bundled with the Tor package is this
thing called torify and if I actually
wanted to take a look at the Man pages
for that we could see it is a wrapper
for torsocks and Tor so like a SOCKS
proxy how we might move through and have
some network communication through that
protocol think of this like proxy chains
on the command line you could basically
put it in front of other commands you'd
want to run and that tunnels your
traffic all through Tor I'll hit Q to
get out of that so say I were to use
Curl on the command line and access just
if config doso and that will give me hey
my current public IP address I'm fine
with that but if I wanted to wrap that
through torify let's see if I could get
that to come through for me m not
working all that well turns out we
actually need to configure and enable
that inside of the Tor configuration
file so I could sudo nano /etc/tor/torrc
for that configuration file and
having this open in our text editor I
want to scroll through and try to find
the configuration settings that I'll
change in this case we want to enable
the control Port we want to ensure that
is uncommented and you could add a
little bit more security here as it
notes if you enable the control Port be
sure to enable one of these
Authentication methods to prevent
attackers from accessing it so you could
add your own hashed control password for
the sake of Simplicity just cruise into
this demo I won't do that but I will
actually uncomment and again maybe you
had an octothorpe or hashtag present
there for cookie authentication and this
value was originally the number one I'll
toggle that just to zero inside nano Ctrl+O
to save the file Ctrl+X to exit and
with that I will service I think tor
restart is all that we should need to go
ahead and restart that service now
fingers crossed I'll be able to do this
torify curl command and finally get a new
IP address separate from what I would
have had originally just naturally going
through Tor or without Tor in this
case but through Tor on that end
toggling on that control port and
manipulating and changing some of the
authentication to actually interact with
the Tor service and maybe authenticate
or change your IP address or manipulate
what routes or nodes that you move
through is kind of optional in sometimes
but honestly probably good to do for
this case however in some tools that you
might use it's not always necessary and
of course this was just a cutesy example
trying to see our current IP address to
validate that we're moving through Tor
but we might be asking actually why
would we even do this why would you want
to scrape or interrogate or automate
interactions with like onion addresses
things that we haven't even dug into yet
but consider all of the threat
intelligence or just hey maybe tracking
cyber crime and threat actors and
adversaries that you might be able to do
with that if you build and create your
own threat intelligence feed or automate
what's out there on onion sites you
might like in this too some really
awesome tools like Flare that awesome
cyber threat intelligence and attack
surface management solution where
attackers threat actors and adversaries
no longer have the information Advantage
because you can get out in front of it
let me log in here super quick spinning
up our dashboard we can take a look at
our threat risk assessment our exposed
attack surface and maybe get a better
understanding of look how secure are we
and our business our organization and
our company are there any leaked
credentials are there any personal
identifiable information or pii that's
out across the dark web or even the
clear net in Shady cyber crime telegram
groups or for sale on marketplaces or
within data breaches all of that awesome
stuff we could dig into within flare and
even on top of that getting a better
idea as to what cyber criminals are up
to and what damage they're doing like I
tend to track ransomware threat actors
and adversaries that do damage
encrypting the devices and data of
companies and businesses we could see oh
Play ransomware or 8Base or LockBit 3.0
what they're up to and what data they
might be dumping for those victims that
could be really worthwhile information
to keep tabs on and honestly we could
just use this as basically a Google
Across the dark web and do just Global
searches for any severity of a threat or
information exposure or risk that we
want to track in any of these different
categories like the open internet leaky
S3 buckets GitHub repositories or Pastebin
posts maybe just then the dark web
marketplaces where malware could be
bought and sold forum posts where threat
actors are chatting with each other or
telegram the real social media for cyber
crime look I could just look for oh
info stealer malware I'll put that in
quotes and then we'll see what's popping
up maybe hey I'll go ahead and change
the date just to say look I'll do a
custom range here so we aren't getting
anything super duper recent about
November 2023 up to just the start of
December 2023 and if I search for this
look at all of the crazy shenanigans
that we might be able to dig into and of
course you'll actually get all of the
links all of the references flare will
just outright give that to you alongside
the actor maybe some uh summary of the
content whether or not you want to take
down information that's pertinent to you
your company your business and then
maybe some artificial intelligence to
help translate Russian languages or
again vernacular that you're not
familiar with like I totally can't read
that I don't understand that language oh
here's a good example looks like AI was
able to offer a quick synopsis little
bit of a summary here the details it's
worth digging into and even some
remediation or mitigation guides you
could of course create your own
identifiers for things that you want to
track like your business your company
your name whatever you want and maybe
track down o the flow of threats as to
what might feed into the other as
Associated events and Trends across the
cyber crime or threat intelligence
industry and Supply chains just as well
if ransomware attacks actually have a
third-party maybe trickle-down effect
onto your world anyway I'm driving down
that road to note how we might be able
to dig into those threat actors cyber
crime and stuff out on the dark web that
we might want to track so just as an
example this is an onion link that I'm
actually viewing through the Tor
Browser hey that graphical user
interface the web browser to just simply
go to any onion address that we want to
big long V3 URL with a .onion TLD or top
level domain you can't naturally access
that with curl but if we funnel it
through Tor or maybe scrape it in
Python we totally could let me get back
to C and I'll show you look if I were to
try to curl that big on ransomware URL
that was a page that had a listing of
those different ransomware groups and
maybe their own onion leak sites that we
might want to keep track of
unfortunately curl will tell us hey we
don't know how to do that in this case
not going to resolve an onion address
but we might be able to tell curl look
we have Tor set up and installed we
should actually still pull that info cuz
if I were to try and torify this I think
it'll still whine at me but we could
tell Tor excuse me we could tell curl
look let's actually use a SOCKS5 hostname
and we'll specify our current localhost
127.0.0.1 with our port
9050 which is that default port that the
Tor service is listening on not the
control port in this case because we're
not manipulating or tweaking and tuning
some Tor settings but we just want that
socks proxy to funnel through if I add
these arguments in and then I paste my
URL fingers crossed will be able to pull
down this onion site across the dark web
in an automated way not using just a
Tor Browser let me hit enter on this
and hopefully I got that syntax right
takes a little bit cuz we're funneling
through all those nodes but take a look
now we've got all of the HTML specific
to that exact web page and looks it's
listing out all of those different gangs
all those different sites all those
illicit underground cyber crime
syndicates and includes a couple other
links that we might be able to dig into
that's pretty cool at least for a
one-off on the command line curl we
could pull down onion sites now of
course the better question well okay how
do we automate that and maybe a
scripting language like python let me
show you how on the command line inside
of our Kali Linux or whatever virtual
machine you might like we could use pip
to install a new python Library I'll go
ahead and pip install requests_tor
and that will allow us to make
requests across Tor we'll go ahead and
install that get it staged set up for us
and then I'll create a new script maybe
requests_tor_testing.py and now inside my
text editor I'll add my usual shebang
line #!/usr/bin/env python3 and
we'll go ahead and import requests_tor
but truthfully there is one
sort of submodule or piece of data in
this package that I'm most interested in
so actually change that to from requests_tor
I want to go ahead and import
RequestsTor with a capital R
capital T and no underscore in this case
now with that module imported we can go
ahead and create sort of a client or the
way we could interact with it and if you
wanted to get used to oh just how you
naturally type requests.get or requests.post
when you use some scripting
language stuff in Python like this we
can call that object just requests and
I'll create a new RequestsTor object
and I'll pass in some parameters here
we'll specify tor_ports like the actual
proxy ports that Tor might be listening
on
9050 as we saw and I'll add a comma
there just to denote hey that's
usually a tuple we just got to make sure
that value is set and I'll actually
supply another tor_cports for our control
port that 9051 that you saw set in the
Tor configuration file that should be
9051 with that set and staged again this
is super duper simple all we need to do
is a usual requests.get and we can
supply any URL that could still be a
onion address across the dark web let me
Define a variable for that here we'll
just paste in the ransomware sites we
had been using previously and I'll
Define that as a variable let me capture
that and we'll print out the response or
the text of that request.get just like
we normally do in Python and actually
since that is usually just one port we
should toggle that variable name the
keyword argument to tor_cport singular
no s at the very very end now again
super simple with all this set I can get
back to my command line and let's try to
run my python3 requests_tor_testing.py
script fingers crossed again it'll
take a little bit because we're
tunneling through all that traffic but
look we have all that output and we can
in an automated way within python
interact with those dark web URLs V3
onion addresses and Tor hidden services
this is all the actual output the raw
HTML in the source of the web page
that's returned to us when we view this
in our Tor Browser in that graphical
user web browser interface here but let
me actually go pull this a little bit
further because again we're not doing
anything too crazy but we're just
demonstrating that we can access onion
sites let me see if I can pull down that
ALPHV or BlackCat ransomware blog and
maybe we could track oh specific new
updates on leak sites or maybe get the
alerts when we're seeing new changes
across the dark web let's use this as an
example I'll go back to my script and of
course we can make this whatever we want
but let's just change that URL to now
get to ALPHV and their leak site back to
the command line super easy we'll just
run this one more time but take a look
at what we've got here obviously
requesting a new page and we'll get the
HTML that gives us some interesting
breadcrumbs scrolling up to the top here
this tells us oh the website has been
seized and this is actually kind of a
little gimmick a little bit of a trick
an exit scam that that threat actor
ransomware gang ALPHV and BlackCat had
been up to recently where they're trying
to scam out a lot of their Affiliates
the whole cyber threat Intel community
and a lot of infosec pros were digging
into this previously but if we took a
look at the web page here I'll get back
to that ransomware leak site take a look
if we open the link that brings us to
the this website has been seized page
and it is looking like oh a formal
official law enforcement operation to
take down that ransomware gang and their
online presence in their leak site
however it's something that a lot of
folks have been tracking and saying look
it's not I even wanted to get my head
straight on this over on Twitter or X or
whatever and it was just validating hang
on was this an April Fool's joke or was
it still the exit scam whatever or is
this a real interdiction and some folks
chimed in look that is still the exit
scam I thank them and they linked this
really cool little thread and write-up
from Fabian here I love the fact that
they end up changing the file path just
like we saw if we were digging into the
actual HTML source code of the
application just like we saw from our
python code output look if you actually
dig into this you can see maybe even the
stupid copy pasta clone mirroring any
open directory that would like save page
as for any regular actual law
enforcement interdiction and takedown I
think that's just a little cool bit and
worthwhile to mix in here now hey if you
wanted to dig into any of these
resources online there are a lot of
references articles blogs and write-ups
that showcase even torify like we started
with some of the tricks that you got in
the mix and maybe some of the control
Port communication altering or tweaking
some of the settings maybe getting a new
IP address through all the nodes relays
and tunnels alongside the documentation
or at least the sort of public page
showcasing that requests_tor package
and library in Python you could dig into
and actually see a little bit more of
what you might be able to do here I do
like the advanced usage section
where they show you look you could very
easily just check your IP get a new
identity test some things or make any
other HTTP method request that you want
and look I'll be the first to admit
maybe you have another solution or a
better tool or a better trick some
techniques to actually accomplish this
script and automate some scraping of
Tor hidden services V3 onion addresses or
the dark web there's torpy just as well
a pure Python implementation of the Tor
protocol so you don't even need to have
that Tor client installed like we did
to begin with or stem or other libraries
that we might be able to dig into they
showcase this with some command line
examples that are really kind of slick
and even a little bit of the Python
syntax itself if you wanted to import it
and use it within your own code and
scripts here's another simple one torrequest
you can find this online and
that's pretty basic pretty similar just
like we did with requests_tor
import this thing hey you could have a
context manager if you wanted to
requests.get as usual and that is one
easy way to do it you could have maybe a
little bit more communication there and
actually stage and set up some of the
passwords like the client or control
Port authentication and let me actually
dive into that a little bit this
resource danan Madar is actually really
awesome because it talks a little bit
about all of this it actually offers
some more links lets you install Tor
just as we did to begin with take a look
at the version go ahead and check out
the status of that service bounce
restart stop and start as you need to
and then maybe even interact with that
control Port super duper simple you can
just try to authenticate but once you
validate hey we actually have the
control Port running configured and set
in r RC file then maybe you could
authenticate and set up a new hashed
password just like we saw in that file
you could generate one with just a
simple command T tac tac hashen password
and whatever you want slap that into the
config file and you're good to go you
can do your authentication you can then
check your IP with torify you could then
manipulate and change your IP address or
you can even use stem one really awesome
library that lets you manipulate and do
a little bit more fine-tuning with a lot
of the really or nodes that you travel
and Traverse through while you move
through that tour onion router protocol
you can validate this with other tools
like privoxy you could go ahead and dig
into other libraries that might change
your own IP address so there is a lot
out there and it's just a matter of
Googling and playing with what you're
interested in I do want to give another
shout out to this medium article because
it digs into a really cool use case of
Tor within Python and they dig into
that stem library that python module
that I just kind of alluded to with a
little bit more detail on how you could
dig into specific relays or nodes that
you move through they have some cool
visuals and they set up using stem and
this I think gives you a little bit more
of an idea for the syntax or code that
you might be able to use and get some
better fine tooth comb granularity and
what you're going to do as you move
through Tor but at the end of the day
look it's still automating interaction
with onion websites across the dark web
Tor hidden services and if you want you
can scrape whatever data like oh
ransomware updates or potential breaches
or just changes or modifications to
forums that you might be tracking
marketplaces where you want to see
whether or not things are actually being
modified up down Sales reviews anything
that is worth your attention you could
put together with your own code if you'd
like to build out something custom but I
will acknowledge look there's a whole
lot out there and there's almost too
much of it it's a little overwhelming
and look if you just want a solution
that's quick and easy already done for
you and manages this with so much insane
telemetry and visibility please do take a
look at flare big thanks to flare for
sponsoring this video they are seriously
incredible they have so much cool data
and I love just being able to look
around and see what threats are out
there and know and assess my own attack
surface thank you so much for watching
hope you enjoyed this video please do
those YouTube algorithm things like
comment subscribe and I'll see you in
the next one in the hotel cuz I'm still
on travel so this is it for a little
bit