Prestashop Tutorial - Web-scraping a shop to retrieve the products of a Prestashop 1.6 store
In this new Prestashop 1.6 tutorial we will see how to scrape a shop's data in order to retrieve all of its products in CSV format, with a view to importing them later into another shop.
To go further: in 2024 I wrote an article on Prestashop web scraping with Puppeteer.
Disclaimer:
https://fr.wikipedia.org/wiki/Web_scraping
OS used: Ubuntu 16.04
Scrapy version: 1.3.3
Target shop version: Prestashop 1.6.*
Target shop theme: default-bootstrap
0) Installing Scrapy (a Python web-crawling framework) in a terminal:
alexandre@ordi: sudo apt install python-pip
alexandre@ordi: sudo pip install Scrapy
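A quick way to check that the installation went through:
alexandre@ordi: scrapy version
Scrapy 1.3.3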
1) Creating the project:
alexandre@ordi: scrapy startproject prestashop16
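The command generates a project skeleton roughly like the following (Scrapy 1.3 layout):
prestashop16/
    scrapy.cfg
    prestashop16/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py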
2) Edit items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy.item import Item, Field
class Prestashop16Item(Item):
    # the fields for our scraped item:
    url = Field()
    balise_title = Field()
    balise_meta_description = Field()
    h1 = Field()
    reference = Field()
    quantity = Field()
    description_courte = Field()
    description_longue = Field()
    prix_ttc = Field()
    images = Field()
    main_image = Field()
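A note in passing: a Scrapy Item behaves like a dict with a fixed set of keys, so only declared fields can be set. A quick sketch in a Python shell (the assigned value is hypothetical):
>>> from prestashop16.items import Prestashop16Item
>>> item = Prestashop16Item()
>>> item['h1'] = [u'Faded Short Sleeve T-shirts']  # declared field: works like a dict
>>> item['foo'] = 1  # undeclared field: raises KeyError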
3) Edit settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for prestashop16 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'prestashop16'
SPIDER_MODULES = ['prestashop16.spiders']
NEWSPIDER_MODULE = 'prestashop16.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Or impersonate Googlebot:
# USER_AGENT = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
USER_AGENT = 'prestashop16 (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Crawl one page per second
DOWNLOAD_DELAY = 1
# Save the scraped data to a CSV file
FEED_URI = '/home/your_username/Desktop/liste_produits_prestashop.csv'
# We want CSV output
FEED_FORMAT = 'csv'
FEED_EXPORTERS_BASE = {
    'csv': 'scrapy.exporters.CsvItemExporter',
}
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'prestashop16.middlewares.Prestashop16SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'prestashop16.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'prestashop16.pipelines.Prestashop16Pipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
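Side note: instead of hard-coding FEED_URI and FEED_FORMAT, the same CSV can be requested at launch time with Scrapy's -o option (the output format is inferred from the file extension):
alexandre@ordi:~/prestashop16$ scrapy crawl presta_bot -o /home/your_username/Desktop/liste_produits_prestashop.csv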
4) In the spiders folder, create the file presta_bot.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LinkExtractor
from prestashop16.items import Prestashop16Item
from scrapy.selector import Selector
from scrapy.http import Request
class prestashop16(CrawlSpider):
    name = "presta_bot"
    # only allow crawling of the site listed in allowed_domains
    allowed_domains = ['demo-prestashop-16.terracode.de']
    # id of the first product to fetch
    start_id_product = 1
    # id of the last product to fetch
    end_id_product = 5
    # loop the request over the id range (end_id_product + 1 so the last id is included)
    def start_requests(self):
        for i in range(self.start_id_product, self.end_id_product + 1):
            yield Request('https://demo-prestashop-16.terracode.de/index.php?controller=product&id_product=%d' % i,
                          callback=self.parse_items)

    def parse_items(self, response):
        # extract the collected data (content of the product page)
        sel = Selector(response)
        # prepare the item
        item = Prestashop16Item()
        item['url'] = response.url
        item['balise_title'] = sel.xpath('//title/text()').extract()
        item['balise_meta_description'] = sel.xpath('/html/head/meta[@name="description"]/@content').extract()
        item['h1'] = sel.xpath('//h1/text()').extract()
        item['reference'] = sel.xpath('//span[contains(@itemprop, "sku")]/@content').extract()
        item['quantity'] = sel.xpath('//span[@id="quantityAvailable"]/text()').extract()
        item['description_courte'] = sel.xpath('//div[@id="short_description_content"]//p/text()').extract()
        item['description_longue'] = sel.xpath('//section[@class="page-product-box"]//div[@class="rte"]//p/text()').extract()
        item['prix_ttc'] = sel.xpath('//span[contains(@itemprop, "price")]/@content').extract()
        item['images'] = sel.xpath('//ul[@id="thumbs_list_frame"]/li/a/@href').extract()
        item['main_image'] = sel.xpath('//div[@id="image-block"]//span[@id="view_full_size"]//img/@src').extract()
        # hand the item over to the rest of the processing chain
        yield item
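The XPath expressions above target the default-bootstrap theme; on a customized theme they will likely need adjusting. Before launching a full crawl, each selector can be tested interactively in the Scrapy shell, for example:
alexandre@ordi:~/prestashop16$ scrapy shell 'https://demo-prestashop-16.terracode.de/index.php?controller=product&id_product=1'
>>> response.xpath('//h1/text()').extract()
>>> response.xpath('//span[@id="quantityAvailable"]/text()').extract()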
5) Run the bot
alexandre@ordi:~/prestashop16$ scrapy crawl presta_bot
6) Retrieve the CSV file from the desktop
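One last optional touch: each field filled by .extract() is a list of strings, and the CSV exporter joins multivalued fields with commas by default. If you prefer clean single-string cells, an item pipeline can flatten the values before export. A minimal sketch for pipelines.py, assuming the default generated pipeline name (enable it by uncommenting the ITEM_PIPELINES block in settings.py):
# -*- coding: utf-8 -*-

class Prestashop16Pipeline(object):
    def process_item(self, item, spider):
        # join each list of extracted strings into a single trimmed string
        for field in item.fields:
            value = item.get(field)
            if isinstance(value, list):
                item[field] = ' '.join(v.strip() for v in value if v.strip())
        return item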