This is How I Scrape 99% of Sites

John Watson Rooney
15 Sept 2024 · 18:27

Summary

TLDR: This video explains how to scrape data from e-commerce websites by identifying the backend API that hydrates the frontend with data. It demonstrates tools such as Chrome's DevTools and techniques for extracting data as JSON, including product availability and pricing. It also covers using high-quality proxies to avoid blocks when scraping at scale. The author shows how to automate the process with Python libraries such as curl_cffi, with a focus on extracting public data efficiently and respectfully.

Takeaways

  • 🔍 The key to successful e-commerce scraping is finding the backend API that feeds data to the frontend.
  • 🛠️ Chrome's network inspection tools, especially the 'Fetch/XHR' filter, are essential for finding the JSON responses.
  • 💡 Don't scrape the HTML directly; look for the API endpoints that deliver the data you need.
  • 🌍 Using proxies, such as those from Proxy Scrape, is crucial to avoid blocks when scraping at scale.
  • 📄 Residential and mobile proxies are ideal for getting past websites' antibot protections.
  • 📈 Adjusting the API's search parameters, such as the 'start index', lets you page through search results efficiently.
  • 🧰 Model the retrieved data with tools like Pydantic to organize and process it effectively.
  • 👨‍💻 Libraries like curl_cffi avoid blocks caused by TLS fingerprinting and get a successful response from the server.
  • ⚙️ Once you understand the API, the next step is to automate the requests with Python functions in your code.
  • 📊 Be considerate about how much data you pull and avoid overloading servers so you don't get blocked.

Q & A

  • What is the main technique mentioned for scraping e-commerce websites?

    - The main technique is to identify the backend API the site uses to populate its frontend, rather than trying to scrape the HTML directly.

  • Why is it not advisable to try to extract links or data from a site's HTML?

    - Because many sites load their data dynamically through backend APIs, so the HTML alone does not contain all the information needed for scraping.

  • Which tool is used in Chrome to inspect a site's network requests?

    - Chrome's Inspect tool, specifically the Network tab, filtering for Fetch/XHR requests and focusing on the JSON responses.

  • What is the purpose of using proxies when scraping data at scale?

    - Proxies help avoid being blocked by letting you rotate IP addresses or hold a sticky session on a single IP, which is useful for getting past antibot protection.

  • What kind of data can be obtained from the JSON responses of an e-commerce site's API?

    - Information such as product availability, SKU numbers, images, prices, and metadata.

  • How do you find product codes to use in requests against a site's API?

    - By browsing categories or using the site's search function, then watching the requests that appear in the browser's Network tab.

  • What is needed to get past blocked requests to the API when using curl or Python's requests?

    - Tools like curl_cffi, which replicate a real browser's TLS fingerprint so the requests look legitimate and are not blocked.

  • How can data obtained from an API be modeled efficiently in Python?

    - With Pydantic models that structure the data into objects, making the extracted data easier to understand and work with.

  • Why is it important to use headers like the User-Agent when scraping?

    - The User-Agent makes requests look as if they come from a real browser, which can help avoid blocks or restrictions imposed by the site.

  • What is the author's advice for extracting data without getting blocked?

    - Be careful and don't fire off too many requests too quickly, since that can get your IP address blocked. It is better to pull only the data you need in a controlled way, as in the sketch below.
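
To illustrate that last answer, a controlled extraction loop can be as simple as spacing requests out. This is a minimal sketch, assuming a hypothetical fetch_page callable and an arbitrary one-second delay rather than anything prescribed in the video.

```python
import time

def scrape_politely(product_codes, fetch_page, delay_seconds=1.0):
    """Fetch data for each product code with a pause between requests.

    fetch_page is a hypothetical callable that performs one request and
    returns parsed JSON; the one-second delay is an arbitrary example.
    """
    results = []
    for code in product_codes:
        results.append(fetch_page(code))  # one request per product code
        time.sleep(delay_seconds)         # be considerate: don't hammer the server
    return results
```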

Outlines

00:00

🛒 Scraping E-commerce Data

The author explains how he scrapes e-commerce data for competitor and product analysis. The emphasis is on not scraping the HTML directly but finding the backend API that feeds data to the frontend. Chrome's inspection tools are used to filter the relevant JSON responses. As projects scale, high-quality proxies become crucial to avoid blocks; the video is sponsored by Proxy Scrape, a provider offering rotating proxies and sticky sessions. The author explains the value of geo-targeted residential or mobile proxies for getting past sites' antibot protection.

05:02

🔍 Finding Product Availability Data

The author details how to find product availability data in a page's backend. The API's availability endpoint is used, and different product codes are swapped in to fetch the desired data. He shows how to find these codes on the site by searching for a term like 'boots' and how to identify the API requests that return the listed products. He also points out the importance of manipulating parameters such as the 'start index' to pull different pages of products, and notes that the process can be repeated for other data.

10:02

🛠 Code Setup and Adjustments for Scraping

Here the author digs into using curl and handling common errors such as access denials when calling the APIs. He explains how to get around TLS fingerprinting with tools like curl_cffi, which make requests look as if they come from a real browser. He also covers handling headers properly and how proxies help avoid blocks. This section serves as an introduction to wiring these adjustments into larger, automated scraping projects.

15:02

💻 Structuring and Modeling Data in Scraping Projects

In this section the author focuses on structuring the scraping project. He explains how to model the data extracted from the APIs in Python, using tools like Pydantic to structure product information. He covers creating request sessions that use proxies and structuring the functions that call the APIs. He demonstrates handling errors and unexpected responses, and introduces a loop that processes the results and prints product names. Finally, he encourages viewers to follow the project and adapt the scraping to their own needs, emphasizing the ethical use of public data.

Keywords

💡Web Scraping

Web scraping is the automated extraction of data from websites. In the video, the author explains how he uses this technique to collect product information and run competitor analysis on e-commerce sites. The concept is central to the video, which focuses on scraping efficiently and avoiding blocks by using proxies.

💡Backend API

A backend API is the interface a website uses to move data between its server and its frontend. The author notes that the key to getting data from a site is finding the API it uses to populate the frontend. This API returns JSON, which makes the data easy to extract when scraping.

💡Proxies

Proxies are intermediary servers used to hide the original IP address when making requests online. The video explains that high-quality proxies, such as those from Proxy Scrape, are essential to avoid blocks when making many requests to a site during scraping, especially on sites protected by anti-bot systems.

💡Fetch/XHR

Fetch/XHR is a filter in Chrome DevTools that shows the data requests a website makes. The author uses it to detect and analyze the JSON responses the site's backend sends to the browser, which reveals key information such as product availability and prices.

💡JSON

JSON (JavaScript Object Notation) is a data format commonly used to exchange information between web servers and clients. The author notes that product data such as availability and prices comes back as JSON, which makes the scraping process straightforward.

💡Product ID

The product ID is a unique identifier assigned to each product on a website. It is essential for requesting a product's detailed information from the API. The author explains how this ID can be pulled from the JSON response and used to fetch additional data such as prices and availability.

💡Antibot Protection

Antibot protection is a system websites deploy to stop bots from making automated requests. The author notes that as scraping projects scale, requests are likely to be blocked by these protections, and that residential or mobile proxies are an effective way around such blocks.

💡IP Rotation

IP rotation means periodically changing the IP address used to make requests to a server, which helps prevent a website from blocking them. The author mentions that Proxy Scrape offers automatic IP rotation, which keeps a scraping session running without being blocked.

💡TLS Fingerprinting

TLS fingerprinting is a technique web servers use to identify clients from patterns in the TLS connection. The author mentions that libraries like curl_cffi help get around blocks caused by this technique, because they replicate a real browser's behaviour when making requests.

💡Pydantic

Pydantic is a Python library for modeling data in a structured way and validating it easily. In the video, the author uses Pydantic to create models that represent the structure of the JSON data extracted from the API, making it easier to handle and process in code.

Highlights

The speaker focuses on scraping e-commerce data for competitor analysis and product analysis.

Emphasizes that scraping HTML directly isn't effective; instead, finding the backend API is key.

Demonstrates how to use Chrome's inspect tool to find the API that hydrates a website's frontend.

Introduces the concept of Fetch XHR responses, with a focus on JSON data as the target for scraping.

Stresses the importance of using high-quality proxies, especially when scraping larger projects.

Mentions Proxy Scrape as a preferred proxy provider, offering various proxy types like residential, datacenter, and mobile.

Describes how to extract product availability and stock information from JSON data.

Explains how to manipulate API requests to fetch product data, such as changing product codes.

Covers techniques for locating product IDs on e-commerce sites using search queries.

Demonstrates how to paginate through large search results using API request manipulation.

Introduces modeling of scraped data using tools like Pydantic for better data organization.

Shows how to handle API requests using libraries like `curl_cffi` to bypass bot protection.

Explains how TLS fingerprinting causes 403 errors and how impersonating a real browser's fingerprint avoids them.

Walks through the process of setting up a Python script for automating e-commerce data scraping.

Concludes with advice on scraping responsibly to avoid overwhelming websites and getting blocked.

Transcripts

00:00

A large part of the work I do in scraping is e-commerce data: competitor analysis, product analysis and all that, and I want to show you in this video how I go about scraping almost every single site that I come up against, especially ones like this. I've covered this before, but you absolutely don't want to be trying to pull out links and scrape the HTML; that's just not going to work. If you look over my head here, I'll make it a bit bigger, this is just parsing HTML, and that's just not going to work. What we want to do is find the backend API that this site uses to hydrate the frontend, to basically populate this data. To find that, we open up our inspect tools here in Chrome, go to Network, I'll try and make this a little bit bigger, and then we need to start interrogating the site. The first thing I always do is just scroll around and see what pops up. I'm going to click on Fetch/XHR, and it's the responses that are JSON that we are interested in. You can either move around, go to different categories, or click on a product; that will do just fine.

01:07

When you start to scale up projects like this one, you'll find that your requests start to get blocked, and that's where you need to start using high-quality proxies. I want to share with you the proxy provider that I use and the sponsor of this video: Proxy Scrape. Proxy Scrape gives us access to high-quality, secure, fast and ethically sourced proxies that cover residential, datacenter and mobile, with rotating and sticky session options. There are 10 million plus proxies in the pool from countries all over the globe, with unlimited concurrent sessions, enabling us to scrape quickly and efficiently. My go-to is either geo-targeted residential proxies based on the location of the website, or the mobile proxies, as these are the best options for passing antibot protection on sites, and with auto-rotation or sticky sessions it's a good first step to avoid being blocked. For the project we're working on today I'm going to use sticky sessions with residential proxies, holding on to a single IP for about 3 minutes. It's still only one line of code to add to your project, and then we can let Proxy Scrape handle the rest from there. Also, any traffic you purchase is yours to use whenever you need, as it doesn't ever expire. So if this all sounds good to you, go ahead and check out Proxy Scrape at the link in the description below. Let's get on with the video.
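
The "one line of code" mentioned there is just a proxies setting on the HTTP client. Here is a minimal sketch, assuming a placeholder username, password and gateway address rather than real Proxy Scrape credentials; curl_cffi is the client the video settles on later.

```python
from curl_cffi import requests  # the HTTP client used later in the video

# Placeholder credentials and gateway -- substitute your own proxy details.
PROXY = "http://USERNAME:PASSWORD@gateway.example-proxy-provider.com:8080"

# Routing the request through a proxy really is a one-line addition:
response = requests.get(
    "https://httpbin.org/ip",                 # httpbin simply echoes the caller's IP
    proxies={"http": PROXY, "https": PROXY},  # same proxy for both schemes
    impersonate="chrome",                     # browser-like TLS fingerprint (covered later)
)
print(response.json())
```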

02:16

So let's go ahead and look at what we've got here. Right away I can see a load of images and a load of JSON data. The one that I'm interested in straight away says "availability", and this has all the product availability: basically the stock numbers, the SKUs etc. for this item. That's pretty handy, that's very relevant. The other one is right here, which is sort of the whole product data, everything that comes with it, so we can see we've got all the images and stuff like that, and there's pricing information in here, metadata. If I collapse these we can see everything coming up: pricing information. So this is essentially the data that I want. Now, I've shown you all this before in other videos, and if this is new to you then I will cover everything you need to do to get started with this, but what I haven't done before is show you more of a full project, which is what I'm going to go through in a minute.

03:09

The first thing I want to do, though, is understand the API and the endpoints and what's happening. So I'm going to go ahead and copy the request URL for this one, which is the product. We can see that this is essentially just their API, and by hitting it like this we do indeed get the JSON response for this data. What that means is we could effectively take a different product, for example; let's see if I can grab the code for this one and just put it on the end here, and we're going to get that information. But how do we go about getting these product codes? Well, there's another way that we can do this. I'm going to keep this one open, so now I've got the product link here, and I'm going to open the availability one as well, so we have all three to look at. Where is the availability here? So again, the availability is basically very straightforward: I'm just going to paste this in here and we get the availability. Again, if I change the product code, it's going to give us the availability for that product.
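
To make that concrete, here is a rough sketch of swapping product codes into the two endpoints. The base URLs, JSON keys and product codes below are hypothetical stand-ins; in practice the real paths are the request URLs copied from the Network tab, and curl_cffi (introduced later in the video) is used because plain requests gets blocked.

```python
from curl_cffi import requests

# Hypothetical endpoint shapes -- in practice these are the request URLs
# copied straight from DevTools' Network tab.
PRODUCT_URL = "https://www.example-shop.com/api/products/{code}"
AVAILABILITY_URL = "https://www.example-shop.com/api/products/{code}/availability"

def fetch_product(code: str) -> dict:
    """Return the full product JSON (pricing, images, metadata) for one code."""
    resp = requests.get(PRODUCT_URL.format(code=code), impersonate="chrome")
    resp.raise_for_status()
    return resp.json()

def fetch_availability(code: str) -> dict:
    """Return the stock/SKU availability JSON for one code."""
    resp = requests.get(AVAILABILITY_URL.format(code=code), impersonate="chrome")
    resp.raise_for_status()
    return resp.json()

# Swapping in a different code is all it takes to get another product:
for code in ("ABC123", "XYZ789"):  # made-up codes for illustration
    print(fetch_product(code), fetch_availability(code))
```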

04:07

Now, to actually find the product IDs, how would you find them on the website? You could either go to a category or you might want to search, and this is where I tend to start. So I might type something like "boots" into the search, again with this open on this side. Here we go: 431 results. This is how I would typically look to get this information. If I come back over to the data here, I need to scroll to the bottom; somewhere around here we're going to find a request. I wish it wouldn't show me all of these. Actually, what I'm going to do is delete all this, since I had all the other requests in there, and search again just so it comes up at the top. Okay, so this is it loading up; you can see it's loading all these products, and that's because these are the products that came from the search. This endpoint is actually slightly different and gives you different bits of information, and we will cover that. The one I'm looking for is the actual search one here, search query. There we go, I found it. So what this is doing is basically hitting the API endpoint with the search query that we gave it, and again I can put this in here, and here is the response.

05:22

Now I'm going to collapse a lot of this information and get rid of all of this, because we're not that interested in it. What we are interested in, if I make this full screen and we have a good look, is that we have a view size, a view set size, we have the count, which is 431, which was the whole of the search, we have the search term, and then we have the items at 48 per page, which was the view size. We also have the current set, and there should be another one, start index, here we go. So what we can actually do is start to see whether any of these parameters are available for us to manipulate. If I change the start index to 10, what happens? Okay, that wasn't the right one, so start index didn't work, so I'm going to change it, and quite often it's just "start". Okay, "start" is the start index, that's fine. To find that out, you could try and guess it like that, but what you could also do is come back here and manually go to the next page with the developer tools open, and you would see it there. So if we scroll down, somewhere along here, start is 48; we can see that there. So you can do everything that you would normally do on the page and just keep an eye on the actual Network tab, and you'll see everything come through. So now that I know that the start index works (oh, way too big), we can start to put together something that we can use to search, and we want to start on index zero, I guess.
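
Since the search endpoint pages in blocks of 48 and reports the total count (431 for "boots" here), walking the whole result set is just a matter of stepping the start parameter. A minimal sketch, assuming a hypothetical search URL and the field names seen in the response:

```python
from curl_cffi import requests

# Hypothetical search endpoint -- the real URL comes from the Network tab.
SEARCH_URL = "https://www.example-shop.com/api/search?query={query}&start={start}"
PAGE_SIZE = 48  # the site returns 48 items per page

def iter_search_items(query: str):
    """Yield every item for a search term by stepping the start index."""
    start = 0
    while True:
        resp = requests.get(SEARCH_URL.format(query=query, start=start),
                            impersonate="chrome")
        resp.raise_for_status()
        data = resp.json()
        yield from data.get("items", [])   # key name assumed from the response shown
        start += PAGE_SIZE
        if start >= data.get("count", 0):  # 431 total results in the video's example
            break

# Example: count how many "boots" results come back in total.
print(sum(1 for _ in iter_search_items("boots")))
```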

07:03

Then we can go through the items. What we have in the actual items response, somewhere down here, is a lot of good information, and in some cases this is enough, but in a lot of cases you do want to go deeper into the product itself. We have a product ID, so this product is some kind of kids' Superstar boot. So now we come back to our product endpoint, put this in here, and here's the product: it has come back straight away and given us all this information, and the part I want to look at most is the pricing information. It's got a discount, all this cool stuff, right here. Then we can of course go to the availability endpoint, put the product code in, and here's the availability, and this one has some stock. So you can see that we're starting to work out how their API works. Now, this is not that difficult, especially if you've worked with REST APIs before or built APIs before, but my best advice, as I said, is just to look through the website.

07:59

What I want to do now is take this and turn it into something we can repeat within our code. I'm going to get rid of this for the moment; I don't think I'm going to need it, and we can always come back to it. I've got my terminal open here in a new folder. Let's make this a bit bigger. I'm going to create a virtual environment, like so, and I'm going to activate it.

08:25

What I want to show you now is a couple of interesting things. I'm going to use curl: I'm going to take this endpoint that we know works in our browser, we can see it works there, paste it here, and we get denied. So this is a curl error, and it's basically akin to "we can't get this data like this". Well, let's try it with requests. Let's import requests and do response = requests.get with the URL in there, and you can see that we're having issues here; we're not able to stream the data for whatever reason. So I'm going to change the headers; we'll do it this way. I'll say our headers are equal to, because you always want to use a good user agent, right? Let me just grab my user agent, this one will be fine, put that in here (oh, I need to sanitize and paste, there we go). Cool, so now we'll import requests again, do response = requests.get, grab our URL again, put it in there, say headers is equal to the headers that we just created, which is the user agent, and response.status_code: 403.

09:44

Now, this is because of TLS fingerprinting. I'm going to cover this in much more depth in an upcoming video, so if you're interested in finding out really why this is happening, what you can do to avoid it, and how everything works underneath the hood, you'll want to subscribe for that video. But essentially what we want to do is this: I'm going to come out of this just so I don't get any namespace issues (actually I don't need to), and we'll do from curl_cffi import requests. curl_cffi is going to give us a more consistent fingerprint that looks like a real browser. So what I can do now is go up to here, we don't need this one, we just want this, and instead of using actual requests I'm going to use the curl_cffi request, and I'll do response.status_code, and I got 403 because I forgot to do this: impersonate equals, and we can just put "chrome" in here, you don't have to put the version. Now if I do response.status_code we get our 200, and response.json() is all the data. So we basically just needed to get our fingerprint sorted to make the request. Notice I didn't need any cookies, I didn't need any headers, I didn't need anything other than what curl_cffi, or other TLS fingerprint spoofers, do; there are a few out there, and as I said, I'll cover that in a following video.
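
Condensed into a script, that comparison looks roughly like this. The URL is a placeholder for the product endpoint copied from DevTools, and the exact status codes obviously depend on the target site.

```python
import requests                          # plain requests: tripped up by TLS fingerprinting
from curl_cffi import requests as creq  # curl_cffi: impersonates a real browser's TLS stack

URL = "https://www.example-shop.com/api/products/ABC123"  # placeholder endpoint
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# A browser-like User-Agent alone is not enough: the TLS handshake still
# identifies the client, so the site answers 403.
plain = requests.get(URL, headers=HEADERS)
print("requests:", plain.status_code)     # 403 in the video

# Impersonating Chrome's TLS fingerprint is what gets the 200 back,
# with no cookies or extra headers required.
spoofed = creq.get(URL, impersonate="chrome")
print("curl_cffi:", spoofed.status_code)  # 200 in the video
print(spoofed.json())                     # the product JSON
```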

11:21

So now that I know that this is going to work, I'm going to go into my project; we need to activate the environment here. I'm going to do pip3 install curl_cffi, and I'm going to use rich (I always use rich for printing). We're also going to use pydantic, because I want to get to a point where we have modeled the data a bit better. I'll install these; I think that should be enough for us in this instance, and I'm going to touch main.py and open it here.

11:54

Now, I've imported everything that we're going to need, and I'm going to look at modeling my data a little more closely. I've done this already, but essentially what I'm going to do is take the products response and the search response so we can get that information. I haven't done the availability one, but you can add that on nice and easy now that you know the endpoint. So we're going to model this information: I'm basically just going to take what I want from here and create a pydantic model with it. The first one is the search item, which is going to have the product ID, the model ID, price, sale price, the display name and the rating; that all comes from the search endpoint. Then the same thing with the search response, which means I can easily find out and manipulate the page, count etc., like this: we can see the search term, the count of total items for that search, the start index which I talked about earlier, and then the items, which is the list of search items. Then I've modeled the item detail, which is the information I was after before. I've just put the product description and the pricing information in as dictionaries rather than modeling them, because this data is quite dynamic; I found some products that don't have all of this information, so it was easier to do it like this, again with the product description. It's up to you, but basically what I'm saying is: model your data from here.
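
A sketch of what those models might look like follows; the exact field names and types are reconstructed from the narration, not copied from the author's file.

```python
from pydantic import BaseModel


class SearchItem(BaseModel):
    """One product row from the search endpoint."""
    product_id: str
    model_id: str
    price: float
    sale_price: float
    display_name: str
    rating: float | None = None  # assumed optional; not every product is rated


class SearchResponse(BaseModel):
    """The search endpoint's envelope: term, totals, paging and items."""
    search_term: str
    count: int        # total results for the query (431 in the earlier example)
    start_index: int
    items: list[SearchItem]


class ItemDetail(BaseModel):
    """The product detail endpoint; nested parts are kept as plain dicts
    because their shape varies from product to product."""
    product_description: dict
    pricing_information: dict
```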

13:14

I'm creating a new session, and I created a function for this because initially I thought I might want to expand on this project and be able to import this new session function into a different file or a different part of the project. So all I'm doing is creating a session using requests.Session, and again this is curl_cffi, so we have the impersonate here, and I'm also importing my proxy. I talked about sticky proxies earlier, and that's what I'm going to be using here. It's not actually essential to do so with this specific site, but there are sites that will match your fingerprint or your request with the IP address, and if it starts to differ it starts to get flagged. That's a lot less common though, so this should be fine.

14:05

Now I'm going to write a function that's going to query the search API. We need our session, which we're going to create, our query string and our start number, and I've just put an f-string into the URL here to do that, and then I'm basically just going to get the data from here. We want to put in something to handle a bad response, so I've used raise_for_status, which is going to throw an exception if we get anything that isn't a 200 response; it's basically going to let me know if we're starting to get blocked. I'm not too fond of this, I think there's probably a more elegant way of handling it, but it will work just fine for now. Then we take the response data and push it into our model, our search response model. We're unpacking it, and I'm unpacking from the raw and itemList keys, which is essentially this piece here: so raw, then this one here, and then I unpack everything that fits into my models, like so. Again, it's up to you how you model your data. Then I return the search, which is of the SearchResponse model type.
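
Put together, the session factory and the search function could look like the following sketch, building on the Pydantic models above. The proxy URL and endpoint are placeholders, and the raw/itemList key names follow the narration rather than the author's exact code.

```python
from curl_cffi import requests

# Placeholder sticky residential proxy and search endpoint.
PROXY = "http://USERNAME:PASSWORD@gateway.example-proxy-provider.com:8080"
SEARCH_URL = "https://www.example-shop.com/api/search?query={query}&start={start}"


def new_session() -> requests.Session:
    """A curl_cffi session that impersonates Chrome and routes through the proxy."""
    return requests.Session(
        impersonate="chrome",
        proxies={"http": PROXY, "https": PROXY},
    )


def search(session: requests.Session, query: str, start: int) -> SearchResponse:
    """Query the search API and unpack the response into the SearchResponse model."""
    resp = session.get(SEARCH_URL.format(query=query, start=start))
    resp.raise_for_status()  # raise an exception if we start getting blocked
    data = resp.json()
    # Key names assumed from the narration ("raw" and the item list within it).
    return SearchResponse(**data["raw"]["itemList"])
```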

15:14

I'm going to do exactly the same now for the detail API; it's very similar. We're going to put in the item.product_id, and this is why I like to use models with my data: now I can clearly see that this function takes in the search item and that we use the item's ID to build our URL, rather than just having whatever piece of data from a dictionary. I find this much, much easier to read. Raise for status again, and the same thing: we push our response JSON into our item detail model and return it. And here's our main function: we create a new session and put a search term in. So again, this is our session that we're passing in, the search query parameter, which I define in the other function, is "hoodie", and the start index I put as one (that should probably be zero, but you get the idea). I'm just going to loop through all of these, and we're going to print out the name of the product as we go through.
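
Continuing the same sketch, the detail call and the main loop might look like this (same placeholder URLs and the models defined earlier; the search term and loop mirror what the video describes).

```python
DETAIL_URL = "https://www.example-shop.com/api/products/{product_id}"  # placeholder


def get_detail(session: requests.Session, item: SearchItem) -> ItemDetail:
    """Fetch the full detail JSON for one search result."""
    resp = session.get(DETAIL_URL.format(product_id=item.product_id))
    resp.raise_for_status()
    return ItemDetail(**resp.json())


def main() -> None:
    session = new_session()
    results = search(session, query="hoodie", start=0)  # first page of "hoodie" results
    for item in results.items:
        detail = get_detail(session, item)
        print(item.display_name, detail.pricing_information)


if __name__ == "__main__":
    main()
```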

16:09

So I've got it to this point, and I wanted to show you up to here because this is the main part of getting the data, which is absolutely the hardest part of web scraping: understanding how you can go through and figure out how the site's backend APIs work and then manipulate them slightly to get the information you're after. Once you've got that data, it's entirely up to you what you do with it; you could collect more here, you'd probably want to do the availability, etc. So I'm going to save this, come over here and run main.py, and we should hopefully start to see some of the product names coming through. I've searched for "hoodie", and this is the information that's coming back. I'm just looping through the products that were on that first search page (it was 48), querying their API as if I were a browser, like I showed you on this page here, and just pulling the data out.

17:02

This is the absolute best and easiest way to get data from websites like this. Website owners and site designers will find it very difficult to protect their backend API in such a way that their frontend can still access it; just by the nature of it, this happens a lot. Now, it's not always going to be as easy as this, but you will be surprised how often it is. The only thing I will say is that if you're going to do this, you're going to be able to pull a lot of data quite quickly, so I would always say be considerate and don't hammer it. If you hammer it you're probably going to get blocked, and they'll find out anyway, so pull only the data that you need. It's all publicly available data; I'm not using any API keys here, I'm not using anything that I shouldn't, this is all publicly available data, and I'm just pulling it in the most convenient and easy fashion possible.

17:54

So hopefully you got the idea and you can mimic this now with your own projects. If you've enjoyed this video I'd really appreciate a like, comment and subscribe, it makes a whole load of difference to me, it really does. Check out the Patreon, I always post stuff early on there, or consider joining the YouTube channel down below as well. There's another video right here which, if you watch it now, will continue my watch time across YouTube and they will promote my channel more. Thanks, bye.


Related Tags
web scraping · e-commerce · REST API · data analysis · rotating proxies · antibot protection · product analysis · competitor data · scraping automation · residential proxies