{"id":170577,"date":"2026-01-28T07:36:08","date_gmt":"2026-01-28T06:36:08","guid":{"rendered":"https:\/\/liora.io\/en\/?p=170577"},"modified":"2026-07-25T17:00:33","modified_gmt":"2026-07-25T16:00:33","slug":"scrapy-everything-you-need-to-know-about-this-python-web-scraping-tool","status":"publish","type":"post","link":"https:\/\/liora.io\/en\/scrapy-everything-you-need-to-know-about-this-python-web-scraping-tool","title":{"rendered":"Scrapy: Everything you need to know about this Python web scraping tool"},"content":{"rendered":"<p><strong>During internet browsing, many websites do not allow direct saving of data for personal use. The simplest solution in this case is to manually copy and paste the data, which can quickly become tedious and time-consuming. That&#8217;s why web scraping techniques are often used to extract data from websites.<\/strong><\/p>\n<strong>Web Scraping<\/strong> is the automated process of <a href=\"https:\/\/liora.io\/en\/machine-learning-python-where-to-start\">extracting data from websites.<\/a> This operation is performed using scraping tools often known as web scrapers. These tools allow you to load and extract specific data from websites based on users&#8217; needs. They are most often custom-designed for a single site and then configured to work with other websites that have the same structure.\n\nWith the <a href=\"https:\/\/liora.io\/en\/python-the-most-popular-programming-language\">Python programming language, the most commonly used tools<\/a> in the field of Web Scraping are BeautifulSoup and Scrapy. 
In this article, we will present some differences between these two tools and focus on Scrapy later on.\n<h2 class=\"wp-block-heading\" id=\"h-web-scraping-vs-web-crawling\">Web Scraping vs Web Crawling<\/h2>\nBefore delving into the subject, it&#8217;s quite interesting to understand the difference between Web Scraping and Web Crawling techniques:\n<h3 class=\"wp-block-heading\" id=\"h-web-scraping\">&#8211; Web Scraping<\/h3>\nWeb Scraping uses bots to programmatically analyze a web page and extract content from it. With Web Scraping, it&#8217;s necessary to target and extract specific data.\n\nExample of web data extraction: Extracting prices of various specific products on Amazon or any other e-commerce website.\n<h3 class=\"wp-block-heading\" id=\"h-web-crawling\">&#8211; Web Crawling<\/h3>\nThe term crawling is used as an analogy to how a spider crawls (that&#8217;s also why web crawlers are often called spiders). Web Crawling tools also use <strong>robots (bots called crawlers)<\/strong> to systematically browse the World Wide Web, typically for the purpose of indexing it.\n\nThis involves looking at a page in its entirety and cataloging all elements, including the last letter and period on the page. The bots used will then, during their navigation through heaps of data and information, locate and retrieve information that resides in the deepest layers.\n\nAs examples of <strong>Web Crawling<\/strong> tools, you can consider all search engines like Google, Yahoo, or Bing. 
They crawl web pages and use the extracted information to index them.\n<h2 class=\"wp-block-heading\" id=\"h-beautifulsoup-vs-scrapy\">BeautifulSoup vs Scrapy<\/h2>\nLet&#8217;s continue with a quick comparison between BeautifulSoup and Scrapy, the two most widely used Web scraping libraries.\n<h3 class=\"wp-block-heading\" id=\"h-beautifulsoup\">&#8211; BeautifulSoup<\/h3>\nBeautifulSoup is a very popular Python library that can be used to parse HTML or XML documents and represent them as a navigable tree structure. This makes it easy to find and extract specific data from web pages. BeautifulSoup is fairly easy to learn and has good, comprehensive documentation.\n\nThe advantages of BeautifulSoup:\n\n&#8211; Very good documentation (very useful when you&#8217;re just starting out).\n&#8211; Large community of users.\n&#8211; Easy for beginners to learn and master.\n\nDisadvantages:\n\n&#8211; Dependence on other external Python libraries.\n<h3 class=\"wp-block-heading\" id=\"h-scrapy\">&#8211; Scrapy<\/h3>\nScrapy is a <a href=\"https:\/\/liora.io\/en\/the-new-champion-of-open-source-llm-falcon\">comprehensive open-source framework<\/a> and is among the most powerful libraries used for web data extraction. 
Scrapy natively integrates functions for extracting data from HTML or XML sources using CSS and XPath expressions.\n\nSome advantages of Scrapy:\n<ul>\n \t<li>Efficient in terms of memory and CPU.<\/li>\n \t<li>Built-in functions for data extraction.<\/li>\n \t<li>Easily extendable for large-scale projects.<\/li>\n \t<li>Relatively high performance and speed compared to other libraries.<\/li>\n<\/ul>\nAs disadvantages, we can mention the limited documentation, which can be discouraging for beginners.\n\nTo summarize all the points mentioned above:\n\n<style type=\"text\/css\">\n.tg  {border-collapse:collapse;border-spacing:0;}\n.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;\n  overflow:hidden;padding:10px 5px;word-break:normal;}\n.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;\n  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}\n.tg .tg-8xxg{background-color:#9b9b9b;border-color:inherit;font-size:22px;font-weight:bold;text-align:center;vertical-align:top}\n.tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top}\n<\/style>\n<table style=\"table-layout: fixed; width: 800px\">\n<colgroup>\n<col style=\"width: 400px\">\n<col style=\"width: 400px\">\n<\/colgroup>\n<thead>\n<tr>\n<th>Scrapy<\/th>\n<th>BeautifulSoup<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>\n<ul>\n \t<li>A framework<\/li>\n \t<li>High processing speed due to its built-in functions<\/li>\n \t<li>Best choice for complex projects and tasks<\/li>\n \t<li>Scrapy is more complex than BeautifulSoup<\/li>\n \t<li>Works more like a web crawler (spider = Python classes defining what to crawl, how to do it, and how to extract information)<\/li>\n \t<li>Integrates features for creating pipelines<\/li>\n<\/ul>\n<\/td>\n<td>\n<ul>\n \t<li>Python library\/module<\/li>\n \t<li>Can become slow depending 
on task complexity<\/li>\n \t<li>Ideal for small projects<\/li>\n \t<li>Ideal for beginners<\/li>\n \t<li>Considered as a syntax analyzer (parser)<\/li>\n<\/ul>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<figure>\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image3-2.png\" title=\"\" alt=\"\" loading=\"lazy\"><figcaption><\/figcaption><\/figure>\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center\"><div class=\"wp-block-button \"><a class=\"wp-block-button__link wp-element-button \" href=\"https:\/\/liora.io\/en\/courses\/data-ai\/data-analyst\">Discover our courses<\/a><\/div><\/div>\n\n<h2 class=\"wp-block-heading\" id=\"h-scrapy-architecture\">Scrapy architecture<\/h2>\nWhen a project is created, various files are used to interact with Scrapy&#8217;s main components. Scrapy&#8217;s architecture, as described in the official documentation, is shown below:\n<figure>\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" width=\"1042\" height=\"621\" src=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/09\/image9.png\" alt=\"\" loading=\"lazy\" srcset=\"https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/09\/image9.png 1042w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/09\/image9-300x179.png 300w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/09\/image9-1024x610.png 1024w, https:\/\/liora.io\/app\/uploads\/sites\/9\/2023\/09\/image9-768x458.png 768w\" sizes=\"(max-width: 1042px) 100vw, 1042px\">\n\n<figcaption><\/figcaption><\/figure>\nIf we analyse Scrapy&#8217;s architecture diagram, we can see that its central element, the engine, controls four executive components:\n<ul>\n \t<li>Spiders<\/li>\n \t<li>Item Pipelines<\/li>\n \t<li>The Downloader<\/li>\n \t<li>The Scheduler<\/li>\n<\/ul>\nAt the beginning of the process, communication occurs through Spiders, which allow the transmission of requests (containing the URLs to be scraped and the information to be 
extracted) to the engine.\n\nThe engine then forwards the request to the Scheduler to enqueue it <strong>(if multiple URLs are provided).<\/strong>\n\nThe engine also receives requests from the Scheduler, which has ordered tasks previously, and forwards them to the Downloader module, which downloads the HTML code of the page and transforms it into a Response object. The Response object is then passed to the Spider, and the items the Spider extracts are sent on to the Item Pipeline. This process repeats for each URL to be scraped.\n\nWe can now better define the roles of the components:\n<ul>\n \t<li><strong>Spiders:<\/strong> User-written classes that define the scraping methods. The methods are invoked by Scrapy when necessary.<\/li>\n \t<li><strong>Scrapy Engine:<\/strong> Controls the data flow and triggers all events.<\/li>\n \t<li><strong>Scheduler:<\/strong> Communicates with the Engine regarding the order of tasks to be performed.<\/li>\n \t<li><strong>Downloader:<\/strong> Receives requests from the Engine to download the content of web pages.<\/li>\n \t<li><strong>Item Pipeline:<\/strong> Successive transformation steps<a href=\"https:\/\/liora.io\/en\/data-cleaning-definition-methods-and-relevance-in-data-science\"> (for cleaning, data validation, or database insertion)<\/a> applied to the raw extracted data.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\" id=\"h-scrapy-installation\">Scrapy Installation<\/h2>\nScrapy is fairly easy to install. Simply run the commands below in an Ubuntu terminal. 
You can easily find the equivalent commands for other operating systems:\n<pre style=\"padding-left: 40px;\"># create a virtual environment (OPTIONAL)\nvirtualenv scrapy_env\n\n# activate the environment (OPTIONAL)\nsource scrapy_env\/bin\/activate\n\n# install Scrapy\npip install scrapy\n\n# check that the installation works\nscrapy\n\n# run a quick benchmark to see how Scrapy performs on your hardware\nscrapy bench<\/pre>\n<h2 class=\"wp-block-heading\" id=\"h-the-scrapy-command-prompt\">The Scrapy command prompt<\/h2>\nDuring the experimentation phase, when you are searching for the code syntax to extract information from web pages, Scrapy provides a dedicated command-line interface for interacting with the Engine: the Scrapy Shell.\n\nThe Scrapy Shell is built on Python, so you can import any modules you need.\n\nTo access this command-line interface (once Scrapy is installed), simply execute the following command:\n<pre style=\"padding-left: 40px;\"># Open the Scrapy shell\nscrapy shell \"URL-of-the-web-page\"\n# example: scrapy shell \"https:\/\/www.ville-ideale.fr\/abries_5001\"<\/pre>\nOnce launched, it is within the Shell that you can execute commands to actually extract information from the specified web page. You can interactively test different commands and extraction approaches.\n\nAfter various experiments, the extraction code lines will be grouped into a Spider class for automation.\n\n<a href=\"\/en\/courses\/data-ai\/\">\nLearn Scrapy\n<\/a>\n<h2 class=\"wp-block-heading\" id=\"h-css-and-xpath-selectors\">CSS and XPath selectors<\/h2>\nDuring the creation of a <strong>Spider class<\/strong>, the most important step is to create the code responsible for data extraction (code determined in the previous step from the Scrapy Shell).\n\nTo indicate which data from the website should be extracted by <strong>Scrapy<\/strong>, you can use:\n\nXPath Selectors:\n\nXPath selectors are frequently used in web scraping due to their extensive capabilities. 
For example, you can use them to:\n\n&#8211; Specify the exact element to extract from the page.\n&#8211; Retrieve the text associated with an element.\n&#8211; Select the parent or child elements of an element.\n&#8211; Fetch adjacent elements.\n&#8211; Retrieve elements that start\/end with keywords.\n&#8211; Obtain elements whose attributes satisfy a mathematical condition.\n<ul>\n \t<li aria-level=\"1\">CSS Selectors:\n\nCSS selectors provide an easier alternative, especially for beginners who are already familiar with CSS syntax. CSS selectors have slightly fewer capabilities than XPath, but in Scrapy they have been extended with additional syntax (::text and ::attr(name)) for retrieving an element&#8217;s text or attributes.<\/li>\n \t<li aria-level=\"1\">BeautifulSoup Library:\n\nSince Scrapy is written in Python, it&#8217;s entirely possible to import other libraries for specific tasks if needed. This includes the BeautifulSoup library, which can be imported and used when defining data extraction classes (Spiders).<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\" id=\"h-example-of-data-extraction-with-scrapy\">Example of data extraction with Scrapy<\/h2>\nTo give you a concrete idea of what Scrapy can do, we&#8217;re going to extract some data about Liora from the <a href=\"https:\/\/fr.trustpilot.com\/review\/liora.io\">https:\/\/fr.trustpilot.com\/review\/liora.io<\/a> website.\n<figure>\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image2-3.png\" title=\"\" alt=\"\" loading=\"lazy\">\n\n<figcaption><\/figcaption><\/figure>\nThe idea is to gather in a single CSV file all the comments and ratings of learners available on Trustpilot. 
For the sake of simplicity, we will limit ourselves to reviews given in French.\n\nRecommended methodology for Web Scraping with Scrapy:\n<ul>\n \t<li>Analyse and locate the information to be extracted on the web page<\/li>\n \t<li>Prototype in the Scrapy Shell the various commands for extracting each of the elements identified in the previous step<\/li>\n \t<li>Create a Scrapy project and create the Spider (to define how to extract information from all the pages)<\/li>\n \t<li>Test the Spider on one page<\/li>\n \t<li>Apply the Spider to all the pages to retrieve all the information<\/li>\n<\/ul>\n<h3 class=\"wp-block-heading\" id=\"h-step-1-analysing-and-locating-the-information-to-be-extracted\">Step 1: Analysing and locating the information to be extracted<\/h3>\nThe aim of this fairly manual stage is simply to locate the useful information and identify the associated HTML tags. If we focus on a review (below):\n<figure>\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image12.png\" title=\"\" alt=\"\" loading=\"lazy\">\n\n<figcaption><\/figcaption><\/figure>\nWe&#8217;ll extract the following information:\n<ul>\n \t<li>The comment<\/li>\n \t<li>The date of the comment<\/li>\n \t<li>The training date<\/li>\n \t<li>The title of the comment<\/li>\n \t<li>The rating<\/li>\n<\/ul>\nTo access the HTML code of the page, simply (in the Edge or Firefox browser) right-click on the page and choose the Inspect option. 
The different tags associated with the different elements of the page will appear on the right.\n<figure>\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image6-2.png\" title=\"\" alt=\"\" loading=\"lazy\">\n\n<figcaption><\/figcaption><\/figure>\nTo access the tag associated with a specific element directly, simply select the element and repeat the previous operation (right-click + inspect).\n<figure>\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image8.png\" title=\"\" alt=\"\" loading=\"lazy\">\n\n<figcaption><\/figcaption><\/figure>\nFor example, after inspecting the title of the comment, we can read the main class associated with it: .typography_heading-s__f7029. This is a CSS class (usable as a CSS selector) from which we only need to extract the text.\n\nThis operation will be repeated for all the elements you wish to extract, so that each of them is associated with its corresponding CSS selector.\n\n<a href=\"\/en\/courses\/data-ai\/\">\nDiscover our courses\n<\/a>\n<h3 class=\"wp-block-heading\" id=\"h-step-2-experimenting-with-the-scrapy-shell\">Step 2: Experimenting with the Scrapy Shell<\/h3>\nOnce the tags are clearly identified, you can enter the Scrapy command prompt to work out the complete extraction commands.\n\nTo enter the Scrapy Shell, you will use the following command (after activating the virtual environment):\n<pre style=\"padding-left: 40px;\"># Open the Scrapy shell on the Trustpilot website\nscrapy shell \"https:\/\/fr.trustpilot.com\/review\/liora.io\"<\/pre>\nThe command above allows you to:\n\n1. Retrieve all the elements of the specified page using the Scrapy API. These elements will be stored in a &#8220;response&#8221; variable.\n2. Open the Scrapy Shell to interact with the web page and test extraction commands.\n\nUsing the CSS selectors identified in the previous step, we can use the &#8220;response&#8221; variable to extract the precise information we are looking for.\n\nTo extract the information in a simple and iterative way (and to ensure that each piece of information stays associated with the right comment), the first element to be extracted is the list of all the information block selectors on a page.\n<figure>\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image5-1.png\" title=\"\" alt=\"\" loading=\"lazy\">\n\n<figcaption><\/figcaption><\/figure>\nThe CSS selector associated with a block is as follows: .styles_reviewContentwrapper__zH_9M.\n\n<img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image4-3.png\" title=\"\" alt=\"\" loading=\"lazy\">\n\nYou can then run the following command in the Scrapy shell to extract the list of all the blocks.\n<pre style=\"padding-left: 40px;\"># Extraction of all blocks\nresponse.css('.styles_reviewContentwrapper__zH_9M')<\/pre>\nThe output is as follows:\n<figure>\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image11.png\" title=\"\" alt=\"\" loading=\"lazy\">\n\n<figcaption><\/figcaption><\/figure>\nThis is a list of selectors, one per block; each element of the list can be used to retrieve the data contained in its block. 
We can then extract everything we need with the following lines of code (each line must be executed separately in the Scrapy shell):\n<pre style=\"padding-left: 40px;\"># retrieve all the information blocks\nselectors = response.css('.styles_reviewContentwrapper__zH_9M')\n\n# extract the rating from one element\nselectors[0].css('img::attr(alt)').extract()\n\n# extract the title from one element\nselectors[0].css('.typography_heading-s__f7029::text').extract()\n\n# comment and training date from one element\nelt = selectors[0].css('.typography_color-black__5LYEn::text').extract()\nexp_date = elt[-1]\ncomment = ''.join(elt[:-1])\n\n# extract the date of the comment (one day later than what is displayed)\nselectors[0].css('div.typography_body-m__xgxZ_.typography_appearance-subtle__8_H2l.styles_datesWrapper__RCEKH &gt; time::text').extract()<\/pre>\n<h3 class=\"wp-block-heading\" id=\"h-step-3-creating-a-scrapy-project\">Step 3: Creating a Scrapy project<\/h3>\n<a href=\"\/en\/courses\/data-ai\/\">\nLearn Scrapy\n<\/a>\n\nOnce the prototyping of the <strong>Scrapy code lines<\/strong> is completed, you can easily create a Spider class that consolidates all the code lines above within a single Python file.\n\nScrapy provides native commands for initializing a Scrapy project (and thus initializing the Spider class files). To create a Scrapy project, simply run the following command in a terminal (not in the Scrapy Shell):\n<pre style=\"padding-left: 40px;\"># Creating a project (example: trustdst project)\nscrapy startproject trustdst<\/pre>\nWhen the command has finished, you will see:\n<figure>\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image7-1.png\" title=\"\" alt=\"\" loading=\"lazy\">\n\n<figcaption><\/figcaption><\/figure>\nThe command we have just run has created a folder with initialised Python files. 
The architecture can be seen below:\n\n<img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image10.png\" title=\"\" alt=\"\" loading=\"lazy\">\n<h3 class=\"wp-block-heading\" id=\"h-step-4-creating-the-spider\">Step 4: Creating the Spider<\/h3>\nWe&#8217;re going to use the architecture created in the previous step to create the Python class file that will extract all the information from a page at once. Here again, Scrapy will allow us to initialise the file in question with the command below:\n<pre style=\"padding-left: 40px;\"># creation of the Spider class for scraping\nscrapy genspider trustpilotspider fr.trustpilot.com<\/pre>\nThis command will create the trustpilotspider.py file that we will modify and use for data scraping.\n<figure>\n\t\t\t\t\t\t\t\t\t\t<img decoding=\"async\" src=\"https:\/\/liora.io\/app\/uploads\/2023\/04\/image1-2.png\" title=\"\" alt=\"\" loading=\"lazy\">\n\n<figcaption><\/figcaption><\/figure>\nYou can then modify the file as follows:\n<pre style=\"padding-left: 40px;\"># Import scrapy\nimport scrapy\n\n\n# Definition of the spider\nclass TrustpilotspiderSpider(scrapy.Spider):\n    \"\"\"\n    name: a class attribute that gives the spider its name.\n    We'll use it when we run the spider later with\n    scrapy crawl &lt;spider_name&gt;.\n\n    allowed_domains: a class attribute that tells Scrapy that it\n    should only ever fetch pages from the fr.trustpilot.com domain.\n    This prevents the spider from following links to other domains.\n\n    start_urls: a class attribute that tells Scrapy the first URL\n    that it should explore. We'll modify it in a moment.\n\n    parse: the parse method is called after a response has been\n    received from the target website.\n    \"\"\"\n    name = \"trustpilotspider\"\n    allowed_domains = [\"fr.trustpilot.com\"]\n    start_urls = [\"https:\/\/fr.trustpilot.com\/review\/liora.io\"]  # modified URL\n\n    def parse(self, response):\n        \"\"\"\n        Method for extracting information\n        \"\"\"\n\n        # Select all the information blocks\n        selectors = response.css('.styles_reviewContentwrapper__zH_9M')\n\n        # Iterative extraction of information\n        for selector in selectors:\n            # Text fragments of the review; the last one is the training date\n            texts = selector.css('.typography_color-black__5LYEn::text').extract()\n\n            # Information to be returned\n            yield {\n                'notes': selector.css('img::attr(alt)').extract(),\n                'title': selector.css('.typography_heading-s__f7029::text').get(),  # like .extract()[0], but returns None if nothing matches\n                'exp_date': texts[-1],\n                'comments': ''.join(texts[:-1]),\n                'comment_date': selector.css('div.typography_body-m__xgxZ_.typography_appearance-subtle__8_H2l.styles_datesWrapper__RCEKH &gt; time::text').get()\n            }<\/pre>\nTo run the spider, simply execute one of the commands below from the project folder:\n<pre style=\"padding-left: 40px;\"># print the scraped items in the terminal\nscrapy crawl trustpilotspider\n\n# or save the result in a JSON file\nscrapy crawl trustpilotspider -O myonepagescrapeddata.json<\/pre>\n<h2 class=\"wp-block-heading\" id=\"h-conclusion\">Conclusion<\/h2>\n<a href=\"https:\/\/liora.io\/en\/data-mining-everything-you-need-to-know-about-data-mining\">Data is one of the most valuable assets that a company can possess.<\/a> It is at the heart of Data Science and Data Analysis. Companies actively collecting data can gain a competitive advantage over those that do not. 
With enough data, organizations can better determine the root causes of problems and make informed decisions.\n\nThere are scenarios where an organization may not have enough data to draw the necessary insights. This is often the case for startups that typically begin with little to no data. One solution in such cases is to employ a data acquisition technique like <strong>Web Scraping.<\/strong>\n\nScrapy is an <a href=\"https:\/\/liora.io\/en\/framework-what-is-it\">open-source framework that efficiently extracts data from the web<\/a> and has a large community of users. It is well-suited for large-scale Web Scraping projects as it provides a clear structure and tools for processing the retrieved information.\n\nThis article was intended to introduce Scrapy with some of its basic features used by <a href=\"https:\/\/liora.io\/en\/data-engineer-role-skills-salary\">Data Engineers,<\/a> Data Scientists, or Data Analysts for information extraction.\n\nIf you want to go further with Scrapy, feel free to consult the official documentation: <a href=\"https:\/\/docs.scrapy.org\/en\/latest\/intro\/tutorial.html\">Scrapy Tutorial &#8211; Scrapy 2.8.0 documentation<\/a>. To learn more about information acquisition technologies and methodologies, <strong>feel free to check out Liora&#8217;s training courses<\/strong>.\n<a href=\"\/en\/courses\/data-ai\/\">\nDiscover our courses\n<\/a>\n<script type=\"application\/ld+json\">\n{\n  \"@context\": \"https:\/\/schema.org\",\n  \"@type\": \"FAQPage\",\n  \"mainEntity\": [{\n    \"@type\": \"Question\",\n    \"name\": \"Scrapy Architecture\",\n    \"acceptedAnswer\": {\n      \"@type\": \"Answer\",\n      \"text\": \"When a project is created, various files are used to interact with the main Scrapy components. The architecture of Scrapy as described in the official documentation can be seen below.\"\n    }\n  },{\n    \"@type\": \"Question\",\n    \"name\": \"Scrapy installation\",\n    \"acceptedAnswer\": {\n      \"@type\": \"Answer\",\n      \"text\": \"Scrapy is quite simple to install. Just run the command below in an Ubuntu terminal. You can easily find the equivalent commands for other operating systems.\"\n    }\n  },{\n    \"@type\": \"Question\",\n    \"name\": \"Web Scraping Vs Web Crawling\",\n    \"acceptedAnswer\": {\n      \"@type\": \"Answer\",\n      \"text\": \"Web scraping uses robots to programmatically analyse a web page in order to extract content. With Web Scraping it is therefore necessary to search for data in a precise manner. The term crawling is used as an analogy with the way a spider crawls (which is also why web crawlers are often called spiders). Web crawling tools will also use robots (bots called crawlers) to systematically crawl the World Wide Web, usually with the aim of indexing it.\"\n    }\n  }]\n}\n<\/script>","protected":false},"excerpt":{"rendered":"<p><strong>During internet browsing, many websites do not allow direct saving of data for personal use. The simplest solution in this case is to manually copy and paste the data, which can quickly become tedious and time-consuming. 
That&#8217;s why web scraping techniques are often used to extract data from websites.<\/strong><\/p>\n","protected":false},"author":78,"featured_media":170579,"comment_status":"open","ping_status":"open","sticky":false,"template":"elementor_theme","format":"standard","meta":{"_acf_changed":false,"editor_notices":[],"footnotes":""},"categories":[2434],"class_list":["post-170577","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cloud-dev"],"acf":[],"_links":{"self":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/170577","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/users\/78"}],"replies":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/comments?post=170577"}],"version-history":[{"count":3,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/170577\/revisions"}],"predecessor-version":[{"id":208866,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/170577\/revisions\/208866"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media\/170579"}],"wp:attachment":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media?parent=170577"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/categories?post=170577"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}