This is How I Scrape 99% of Sites
Summary
TLDR: This video explains how to scrape data from e-commerce websites by identifying the backend API that hydrates the frontend with its data. It demonstrates the use of tools like Chrome's DevTools and techniques for extracting data in JSON format, including product availability and pricing. It also covers avoiding blocks with high-quality proxies. The author shows how to automate the process using libraries like curl_cffi in Python, with a focus on efficient, respectful extraction of public data.
Takeaways
- 🔍 The key to successful e-commerce scraping is finding the backend API that feeds the frontend its data (see the sketch after this list).
- 🛠️ Chrome's network inspection tools, especially filtering by 'Fetch/XHR', are essential for finding JSON responses.
- 💡 Don't scrape the HTML directly; instead, look for the API endpoints that deliver the data you need.
- 🌍 Using proxies, such as those from Proxy Scrape, is crucial to avoid blocks when scraping at scale.
- 📄 Residential and mobile proxies are ideal for getting past websites' antibot protections.
- 📈 Adjusting the API's search parameters, such as the 'start index', lets you page through search results efficiently.
- 🧰 Model the retrieved data with tools like Pydantic to organize and process it effectively.
- 👨💻 Using a library like curl_cffi avoids blocks caused by TLS fingerprinting and gets a successful response from the server.
- ⚙️ Once you understand the API, the next step is to integrate it into your code by automating it with Python functions.
- 📊 Be considerate about how much data you pull and avoid overloading servers so you don't get blocked.
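To make the core takeaway concrete, here is a minimal Python sketch of calling a discovered backend endpoint directly instead of parsing HTML. The URL and product code are placeholders, not the actual site from the video; curl_cffi is used as the video recommends.

```python
# Minimal sketch: skip the HTML and call the JSON API that hydrates the
# frontend. Find the real endpoint in Chrome DevTools (Network > Fetch/XHR).
from curl_cffi import requests

API_URL = "https://www.example-shop.com/api/products/{code}"  # placeholder URL

response = requests.get(
    API_URL.format(code="IF1234"),  # hypothetical product code from the site
    impersonate="chrome",           # present a real browser's TLS fingerprint
)
response.raise_for_status()
data = response.json()
print(data.get("name"), data.get("pricing_information"))  # keys are assumptions
```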
Q & A
What is the main technique mentioned for scraping e-commerce websites?
-The main technique is identifying the backend API the site uses to populate its interface, rather than trying to scrape the HTML directly.
Why is it not advisable to try to extract links or data from a website's HTML?
-Because many websites load data dynamically through backend APIs, so the HTML alone doesn't contain all the information needed for scraping.
Which Chrome tool is used to inspect a website's network requests?
-Chrome's Inspect tool, specifically the 'Network' tab, filtering requests by Fetch/XHR and focusing on the JSON responses.
What is the purpose of using proxies when scraping data at scale?
-Proxies help avoid being blocked by letting you rotate IP addresses or hold a single IP for an extended session, which is useful for getting past antibot protections.
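As a rough illustration, a sticky residential proxy can be attached to a curl_cffi session in one line, as the video mentions. The gateway host, port, and credentials below are placeholders for whatever your provider (e.g. Proxy Scrape) gives you.

```python
# Sketch: route all session traffic through an authenticated proxy gateway.
from curl_cffi import requests

PROXY = "http://username:password@gateway.example-proxy.com:6060"  # placeholder

session = requests.Session(impersonate="chrome")
session.proxies = {"http": PROXY, "https": PROXY}

# With a sticky session the provider keeps the same exit IP for a few minutes,
# so every request here appears to come from one residential address.
print(session.get("https://httpbin.org/ip").json())
```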
What kind of data can be obtained from the JSON responses of an e-commerce site's API?
-Information such as product availability, SKU numbers, images, prices, and metadata.
How do you find product codes to make requests against a website's API?
-Product codes can be found by browsing categories or using the site's search feature, then watching the requests that appear in the browser's Network tab.
What is needed to get past blocked API requests when using 'curl' or 'requests' in Python?
-A tool like curl_cffi is needed to correctly replicate a browser's TLS fingerprint, which makes the requests look legitimate so they aren't blocked.
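A hedged side-by-side sketch of the behavior described above: the same endpoint often returns 403 to plain requests even with a browser User-Agent, but 200 to curl_cffi impersonating Chrome. The URL is a placeholder.

```python
# Plain requests vs curl_cffi: same URL, different TLS fingerprint.
import requests as plain_requests
from curl_cffi import requests as curl_requests

url = "https://www.example-shop.com/api/products/IF1234"  # placeholder
headers = {"User-Agent": "Mozilla/5.0 ..."}  # a browser UA alone is not enough

r1 = plain_requests.get(url, headers=headers)
print(r1.status_code)  # often 403: the TLS handshake betrays a non-browser client

r2 = curl_requests.get(url, impersonate="chrome")
print(r2.status_code)  # 200: the fingerprint now matches a real Chrome browser
```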
How can data obtained from an API be modeled efficiently in Python?
-Libraries like Pydantic can structure the data into objects, making the information extracted from the API easier to manipulate and understand.
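A sketch of the Pydantic models the video describes. The field names follow what the speaker lists (product ID, model ID, price, sale price, display name, rating; search term, count, start index, items) but are assumptions to adapt to the JSON you actually see.

```python
# Model the interesting slice of the API response instead of passing raw dicts.
from pydantic import BaseModel

class SearchItem(BaseModel):
    productId: str
    modelId: str
    price: float
    salePrice: float | None = None
    displayName: str
    rating: float | None = None

class SearchResponse(BaseModel):
    searchTerm: str
    count: int        # total results for the query (431 in the video's example)
    startIndex: int   # pagination offset exposed by the API
    items: list[SearchItem]

# Unpack a raw response into the model (the key path is an assumption):
# search = SearchResponse(**raw_json["raw"]["itemList"])
```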
Why is it important to use headers like the 'User-Agent' when scraping?
-The 'User-Agent' makes requests look as if they come from a real browser, which can help avoid blocks or restrictions imposed by the website.
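For plain requests, setting the User-Agent looks like the sketch below. Note the video demonstrates this alone still produced a 403 on the target site, because TLS fingerprinting operates below the header level.

```python
# Send a browser-like User-Agent header (necessary, but often not sufficient).
import requests

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    )
}
response = requests.get(
    "https://www.example-shop.com/api/products/IF1234",  # placeholder URL
    headers=headers,
)
print(response.status_code)
```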
What is the author's advice on extracting data without getting blocked?
-The author recommends being careful and not firing off too many requests too quickly, since that could get your IP address blocked. It's better to pull only the data you need, at a controlled pace.
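One simple way to stay controlled is a randomized pause between requests, sketched below; the product codes, URL, and delay range are arbitrary placeholder choices, not from the video.

```python
# Throttle requests with a small random delay so the server isn't hammered.
import random
import time

from curl_cffi import requests

session = requests.Session(impersonate="chrome")
product_codes = ["IF1234", "IG5678"]  # hypothetical codes gathered from search

for code in product_codes:
    resp = session.get(f"https://www.example-shop.com/api/products/{code}")  # placeholder
    resp.raise_for_status()
    print(resp.json().get("name"))
    time.sleep(random.uniform(1.0, 3.0))  # back off between requests
```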
Outlines
🛒 Scraping Analysis for E-commerce Data
The author explains how he scrapes e-commerce data for competitor and product analysis. He stresses not scraping the HTML directly, but finding the backend API that supplies the frontend with its data. He uses the browser's inspection tools (Chrome) and filters for the relevant JSON responses. As projects scale, high-quality proxies become crucial to avoid blocks; the video is sponsored by Proxy Scrape, a provider offering rotating proxies and sticky sessions. The author explains the value of geo-targeted residential or mobile proxies for evading websites' antibot protection systems.
🔍 How to Find Product Availability Data
The author details the process of finding product availability in a page's backend. He uses the API's availability endpoint and experiments with product codes to retrieve the desired data. He shows how to find those codes on the site by searching for terms like 'boots', and explains how to identify the API requests that deliver the listed products. He also notes the importance of manipulating parameters like the 'start index' to pull different pages of products, and how the process can be repeated for other data.
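A sketch of that pagination idea: bump the start parameter by the page size until the reported count is exhausted. The endpoint, parameter names, and JSON keys ('count', 'viewSize', 'items') are assumptions based on what the video shows in DevTools.

```python
# Page through search results by manipulating the start index.
from curl_cffi import requests

session = requests.Session(impersonate="chrome")
SEARCH_URL = "https://www.example-shop.com/api/search"  # placeholder

def search(query: str, start: int) -> dict:
    resp = session.get(SEARCH_URL, params={"query": query, "start": start})
    resp.raise_for_status()
    return resp.json()

first = search("boots", start=0)
total = first["count"]          # 431 results in the video's example
page_size = first["viewSize"]   # 48 items per page in the video

for start in range(0, total, page_size):
    items = search("boots", start=start)["items"]
    print(f"start={start}: {len(items)} items")
```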
🛠 Code Tweaks and Configuration for Scraping
Here the author digs into using curl and how to handle common errors like access denials when making requests to the APIs. He explains how to get around TLS fingerprinting with tools like curl_cffi, which make requests look as if they were made by a real browser. He also covers handling headers properly and how using proxies helps avoid blocks. This section serves as an introduction to integrating these adjustments into larger, automated scraping projects.
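The video leans on raise_for_status() to surface blocks as exceptions; the retry-with-backoff wrapper below is my own addition on top of that idea, sketched with placeholder names.

```python
# Fetch JSON, retrying with exponential backoff before giving up.
import time

from curl_cffi import requests

def get_json(session: requests.Session, url: str, retries: int = 3) -> dict:
    for attempt in range(retries):
        resp = session.get(url)
        if resp.status_code == 200:
            return resp.json()
        time.sleep(2 ** attempt)  # 1s, 2s, 4s... before retrying
    # Still failing: raise so the caller knows we are likely being blocked.
    raise RuntimeError(f"{url} failed after {retries} attempts "
                       f"(last status {resp.status_code})")
```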
💻 Structuring and Modeling Data in Scraping Projects
In this section, the author focuses on structuring the scraping project. He explains how to model API data in Python, using tools like Pydantic to structure the product information. He details creating request sessions through proxies and structuring the functions that interact with the APIs. He demonstrates handling errors and unexpected responses, and introduces a loop that processes the results and prints product names. Finally, he encourages viewers to follow the project and tailor the scraper to their own needs, emphasizing the ethical use of public data.
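Putting the pieces together, the overall shape of the script the video describes might look like the sketch below: a session factory, a search call, and a per-item detail fetch. Function names, URLs, and JSON keys are assumptions; swap in the models from the Pydantic sketch above as needed.

```python
# End-to-end sketch: session -> search -> detail loop, printing product names.
from curl_cffi import requests

BASE = "https://www.example-shop.com/api"  # placeholder

def new_session() -> requests.Session:
    session = requests.Session(impersonate="chrome")
    # session.proxies = {"http": PROXY, "https": PROXY}  # optional sticky proxy
    return session

def get_search(session: requests.Session, query: str, start: int = 0) -> dict:
    resp = session.get(f"{BASE}/search", params={"query": query, "start": start})
    resp.raise_for_status()
    return resp.json()

def get_detail(session: requests.Session, product_id: str) -> dict:
    resp = session.get(f"{BASE}/products/{product_id}")
    resp.raise_for_status()
    return resp.json()

def main() -> None:
    session = new_session()
    results = get_search(session, "hoodie")
    for item in results.get("items", []):
        detail = get_detail(session, item["productId"])  # key name assumed
        print(detail.get("name"))

if __name__ == "__main__":
    main()
```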
Keywords
💡Web Scraping
💡API Backend
💡Proxies
💡Fetch XHR
💡JSON
💡Product ID
💡Antibot Protection
💡IP Rotation
💡TLS Fingerprinting
💡Pydantic
Highlights
The speaker focuses on scraping e-commerce data for competitor analysis and product analysis.
Emphasizes that scraping HTML directly isn't effective; instead, finding the backend API is key.
Demonstrates how to use Chrome's inspect tool to find the API that hydrates a website's frontend.
Introduces the concept of Fetch XHR responses, with a focus on JSON data as the target for scraping.
Stresses the importance of using high-quality proxies, especially when scraping larger projects.
Mentions Proxy Scrape as a preferred proxy provider, offering various proxy types like residential, datacenter, and mobile.
Describes how to extract product availability and stock information from JSON data (see the availability sketch after this list).
Explains how to manipulate API requests to fetch product data, such as changing product codes.
Covers techniques for locating product IDs on e-commerce sites using search queries.
Demonstrates how to paginate through large search results using API request manipulation.
Introduces modeling of scraped data using tools like Pydantic for better data organization.
Shows how to handle API requests using libraries like `curl_cffi` to bypass bot protection.
Explains how to fingerprint TLS requests to mimic real browser behavior and avoid 403 errors.
Walks through the process of setting up a Python script for automating e-commerce data scraping.
Concludes with advice on scraping responsibly to avoid overwhelming websites and getting blocked.
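The availability endpoint mentioned above follows the same pattern; a hedged sketch, with the URL shape and JSON keys assumed from the video's description of an availability response keyed by product code:

```python
# Fetch stock/SKU availability for one product code.
from curl_cffi import requests

session = requests.Session(impersonate="chrome")
code = "IF1234"  # hypothetical product code found via search

resp = session.get(f"https://www.example-shop.com/api/products/{code}/availability")
resp.raise_for_status()
for variant in resp.json().get("variation_list", []):  # key names assumed
    print(variant.get("sku"), variant.get("availability"))
```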
Transcripts
a large part of the work I do in
scraping is e-commerce data competitor
analysis product analysis and all that
and I want to show you in this video how
I go about scraping almost every single
site that I come up against especially
ones like this so I've covered this
before but what you want to do is you
absolutely don't want to be trying to
pull out links and trying to um you know
scrape the HTML that's just not going to
work I know if you look over my head
here I'll make it a bit bigger I mean
this is just parsing HTML this is
just not going to work what we want to
do is we want to find the backend API
that this site uses to hydrate the front
end to basically populate this data to
find that we want to open up our inspect
tool our tools here in Chrome go to
network I'll try and make this a little
bit
bigger and then we need to start
interrogating the site now the first
thing I always do pretty much is just
sort of scroll around and see what pops
up I'm going to click on Fetch xhr and
it's responses that are Json that we are
going to be interested in uh you can
either move around go to different
categories or click on a product we'll
do just fine when you start to scale up
projects like this one you'll find that
your requests start to get blocked and
that's where you need to start using
high quality proxies and I want to share
with you the proxy provider that I use
and the sponsor of this Video Proxy
scrape proxy scrape gives us access to
high quality secure fast and ethically
sourced proxies that cover residential
Data Center and mobile with rotating and
sticky session options there's 10
million plus proxies in the pool to use
with unlimited concurrent sessions
from countries all over the globe
enabling us to scrape quickly and
efficiently my go-to is either geo-
targeted residential proxies based on
the location of the website or the
mobile proxies as these are the best
options for passing antibot protection
on sites and with auto rotation or
sticky sessions it's a good first step
to avoid being blocked for the project
we're working on today I'm going to use
the sticky sessions with residential
proxies holding on to a single IP for
about 3 minutes it's still only one line
of code to add to your project and then
we can let proxy scrape handle the rest
from there and also any traffic you
purchase is yours to use whenever you
need as it doesn't ever expire so if
this all sounds good to you go ahead and
check out proxy scrape at the link in
the description below let's get on with
the video so let's go ahead and look at
what we've got here um so here right
away I can see a load of images and a
load of Json data here the one that I'm
interested in straight away says
availability and this has all the
product availability the like you know
the basically the stock numbers and the
SKS etc for this item that's pretty
handy that's very relevant and the other
one is right here which is sort of the
whole product data everything that uh
comes with it so we can see we've got
all the images and stuff like that and
there's there's pricing information in
here metadata if I Collapse these uh we
can see everything coming up pricing
information so this is essentially the
data that I want now I've shown you all
this before in other videos and if this
is new to you then I will cover
everything you need to do to get started
with this but what I haven't done before
is I haven't showed you more of a full
project which is what I'm going to go
through in a minute um the first
thing that I want to do though is we
need to understand the API and the
endpoints and what's happening so I'm
going to go ahead and I'm just going to
copy the request URL for this one which
is the product now we can see that this
is basically essentially just their API
and by hitting it like this we do indeed
get the Json response for this data now
what that means is we could effectively
take a different um
product for example uh let's see if I
can grab the data for this one the code
for this one and just put it on the end
here and we're going to get that
information but how do we go about
getting these product codes well there's
another way that we can do this and uh
I'm going to keep this one open so now
I've got the sort of the product link
here and I'm going to open the um the
availability one as well so we can have
all three and have a look where is the
availability here so again here the
availability it's basically very
straightforward so just going to paste
this in here we get the availability
again if I change the product
code it's going to give us the
availability for that product now to
actually find the product IDs well how
would you find them on the website well
you could either go to a category or you
might want to search and this is kind of
where I tend to go to start
with so I might type something like
boots into the search again with this
open on this side you know here we go
431 results this is how I would
typically sort of look to get this
information so if I come over back to
the um the the data here that I had I
need to scroll to the bottom somewhere
around here we're going to find a um a
request wish it wouldn't show me all of
these actually what I'm going to do is
I'm going to delete all this I had all
the other ones and I'm going to search
again just so it comes up at the top
okay so this is it loading up you can
see it's loading up all these products
and this is because these are the the
products that have come from the search
so this endpoint is actually slightly
different it's going to give you
different bits of information we'll we
will cover that the one I'm looking for
is the actual um search one here search
query there we I found it so what this
is is this is like basically hitting
the API endpoint with the search query
that we gave it and again you know I can
put this in here put this in I wish this
would go away I don't know what this is
for and I can put this in here
and here is the response now I'm going
to just collapse a lot of this
information uh get rid of all of this
cuz we're not that interested in this
information but what we are interested
in if I make this full screen and we
have a good look is we have a view size
a view set size we have the count which
is 431 which was the whole of the search
uh we have the search term and then we
have the items at 48 per page which was
the view size we also have the current
set which I believe uh no there should
be another one start index here we go so
what we can actually do is we can start
to see are any of these parameters
available for us to manipulate so if I
change the start index to 10 what
happens okay that wasn't the right one
um I think it's actually so start index
didn't work so I'm going to change it
and quite often it's just start maybe
okay start is start index okay that's
fine to find that out if you were I mean
you could try and guess it like that but
what you could do is you can uh if we
just come back here and we manually go
to the next page with the uh developer
tools open you would see that and it
would it would be there so if we scroll
down somewhere along here
start is 48 we can see that there so you
can start to do everything that you
would do on the page um and just keep an
eye on the uh the actual Network Tab and
you'll see everything come through so
now that I know that the uh the start
index works oh way too
big we can start to put together
something that we can
use to search we want to start on zero
index I guess yeah and then we can go
through the items so what we have in the
actual items response is somewhere down
here we have a lot of good information
actually and in some cases this is
enough but a lot of cases you do want to
go actually deep into the product itself
we have a product ID so this product is
some kind of kid Superstar boots right
so now we come back to our products part
end point and we hit this in here here's
the product straight away has come back
and it's given us all this information
and the one that I want to look at the
most is the pricing information it's got
a discount all this cool stuff right
here then we can of course go to the
availability one put the product code in
and here's the available availability
and this one has some availability so
you can see that we're starting to work
out how their API works now this is not
that difficult especially if you've
either worked with REST APIs before or
built your own APIs before but my best advice
as I said is just to look through the
website so what I want to do now is to
take this and I want to turn it into
something we can repeat within our code
uh so I'm going to get rid of this at
the moment I don't think I'm going to
need this uh we can always actually we
can always come back to it and I've got
my um terminal open here in a new folder
let's make this a bit bigger and I'm
going to create a virtual
environment like
so I'm going to activate it what I want
to show you now is a couple of
interesting things so I'm going to go
and I'm going to use Curl I'm going to
take this endpoint that we know that
works in our browser we can see it works
there I'm going to paste it here and we
get denied so this is a curl error and
this is basically you know akin to
you know we can't get this data like
this well let's try it with requests so
let's Import in requests and we'll do
our response is equal to requests.get
let's put the URL in there we're getting
you can see that we we're having issues
here we're not able to stream the data
for whatever reason so I'm going to
change the headers I can't clear this up
can I clear this up we'll do it this way
we're going to change the headers so we
I'll say our headers are equal to
because you know you always want to do a
good user agent right user agent and let
me just grab one my user
agent this one will be fine put that in
here oh uh I need to sanitize and paste
please there we go cool so now we'll
import requests again and we'll do our
response is equal to requests
doget and we'll grab our URL again this
one will be fine put you in there we'll
say our headers is equal to the headers
that we just created which is the user
agent and response. status code
403 now this is because of TLS
fingerprinting I'm going to cover this
much more in a video much more in depth
coming up so if you're interested in
finding out really why this is happening
and what you can do to avoid it and how
you know everything works underneath the
hood you want to subscribe for that
video but essentially what we want to do
is we're going to um I'm going to come
out of this just so I don't get any
namespace issues actually I don't need to
we'll do um import we'll do uh from curl
cffi we're going to import in requests
as curl requests curl cffi is going to give
us a more consistent fingerprint that
looks like a real browser so what I can
do now is I can go up to here we don't
need this one we just want this and
instead of using actual requests I'm
going to use the curl cffi requests
and I'll do request. status code and I
got 403 because I forgot to do this
impersonate is equal to and we can just
put Chrome in here you don't have to put
the version and now if I do response do
status code we get our 200 our response.
Json is all the data so we basically
needed to uh get our fingerprint sorted
for the um to make the request you
notice I didn't need any cookies I
didn't need any headers I didn't need
anything other than what curl cffi or
other you know TLS fingerprint um sort
of spoofers do there's a few out there
and I will as I said I'll cover that in
a following video so now that I know
that this is going to work what I'm
going to do is I'm going to go into my
we need to activate this one here I'm
going to do pip3 and we're going to
use that curl cffi library pip3 install
curl cffi and I'm going to use uh rich
always use Rich for printing we're also
going to use pydantic because I want to
get it to a point where we have modeled
the data a bit better um so I will
install these I think that should
probably be enough for us in this
instance and I'm going to touch
main.py and we'll make this open here
now I've imported everything that we're
going to need I'm going to look at
modeling my data a little bit closer now
I've done this already but essentially
what I'm going to do is I'm going to
take so from this the products one and
the search one so we can get that
information I haven't done the
availability one but you can add that
one on nice and easy now that you know
the the end point here so we're going to
model this information I'm basically
just going to take what I want from here
and create a pydantic model with it so the
first one is the search item which I'm
going to have the product ID the model
ID price sale price and the display name
and the rating so that's all comes from
that search endpoint and then the same
thing I'm going to have with the search
response which means I can easily find
out and manipulate what page and count
Etc like this so we can see the search
term the count uh of total items for
that search and the start index which I
talked about earlier and then the items
is the list of search items then I've
modeled the item detail um which is the
the information that I was after before
so I've just basically put the product
description and the pricing information
in as dictionaries rather than
modeling them because this is quite
Dynamic this data I found some products
they don't have all of this information
so it was easier just to do it like this
again with the product description so
it's up to you but basically what I'm
saying is model your data from here I'm
creating a new session now I
created a function for this because
initially I thought maybe I would want
to expand on this project and then be
able to import this new session function
into a different uh you know
different file or different
part of the project so all I'm saying is
I'm creating a session I'm using
requests. session and again this is curl
cffi so we have this impersonate here
and I also am importing my proxy now I
talked about sticky proxies earlier and
that's what I'm going to be using here
it's not actually essential to do so
with this specific site but there are
sites that will be um that will sort of
match your fingerprint or your request
with the IP address and if it starts
to differ
it starts to get flagged that's a lot
less common though so this should be
fine and now I'm going to model a
function that's going to go ahead and
query the search API we need our session
which we're going to create our query
string and our start number and I've
just put in an F string into the URL
here to do that and then I'm going to
basically just get the data from here we
want to put in something to handle if we
get a bad response so basically I've put
raise for status which is going to
throw me an exception if we get anything
that isn't a 200 response basically
going to let me know if we're starting
to get blocked um I'm not too fond of
this I think there's probably a more
elegant way of handling it but this will
work just fine for now then we are
basically taking the response data and
uh pushing it into our model our search
response model we're unpacking it and
I'm unpacking from the raw and item list
which is essentially this piece of here
so raw I'm going to go to this one and
then this one here and then I'm going to
unpack everything that fits into my
models like so again it's up to you how
you model your data and then I'm going
to return the search which is a type of
the search response model I'm going to
do exactly the same now for the detail
API very very similar we're going to put
the item. product ID and this is why I
like to use models with my data because
now look I can clearly see in this
function that this takes in the search
item and then we use the item ID to
put into our URL rather than just
having you know the whatever piece of
data from a dictionary I find this much
much easier to see raise for
status again and the
same thing we're going to push our
response Json into our item detail model
we're going to return that out and
here's our main function we're going to
create a new session we're going to go
and put a search term in here so again
this is our session that we're giving it
the search query parameter which I
Define in the other function is hoodie
start index I put as one that should
probably be zero but you get the idea
I'm just going to Loop through all of
these and we're going to print out the
name of the product as we go through so
I've got it to this point here I wanted
to show you up to here because this is
kind of like the main part of getting
the data which is absolutely the hardest
part of web scraping and then sort of
understanding how you can go through and
figure out how the sites backend apis
work and then manipulate them slightly
to get the information that you're after
once you've got that data it's entirely
up to you what you're going to do with
it I mean you could collect more here
you probably want to do the availability
etc etc so I'm going to save this and
I'm going to come over here and I'm
going to run main.py and we should
hopefully start to see some of the
product names coming through so I've
searched for hoodie and we're now this
is the information that's coming back so
I'm just looping through the um products
that were on that first search page it
was 48 and I'm querying their API as if
I was a browser like I showed you on
this page here and just pulling the data
out so this is the absolute best and
easiest way to get data from websites
like this website owners and site
designers will find it very very
difficult to protect their backend API
in such a way that their front end can
still access it just by the nature of it
it happens a lot now it's not always
going to be as easy as this but you will
be surprised how often it is the only
thing I will say is that if you're going
to do this you're going to be able to
pull a lot of data quite quickly so I
would always say you know be
considerate and don't hammer it if you
hammer it you're probably going to get
blocked they'll find out anyway but pull
the data that you need it's all publicly
available data I'm not doing anything
there I'm not using any API Keys here
I'm not using anything that I shouldn't
do this is all publicly available data
I'm just pulling it in the most
convenient and easy
fashion as possible so hopefully you got
the idea and you can mimic this now with
your own uh projects Etc if you've
enjoyed this video I'd really appreciate
like comment subscribe it makes a whole
load of difference to me uh it really
does check out the patreon I always post
stuff early on there or consider uh
joining the YouTube
channel down below as well um there's
another video right here which if you
watch this one now you'll continue my
watch time across YouTube and they will
promote my channel more thanks bye