{"id":165614,"date":"2023-01-05T14:38:37","date_gmt":"2023-01-05T13:38:37","guid":{"rendered":"https:\/\/liora.io\/en\/?p=165614"},"modified":"2026-02-06T09:08:12","modified_gmt":"2026-02-06T08:08:12","slug":"pyspark-the-python-library","status":"publish","type":"post","link":"https:\/\/liora.io\/en\/pyspark-the-python-library","title":{"rendered":"PySpark: Everything about the Python library"},"content":{"rendered":"<style>\n.elementor-heading-title{padding:0;margin:0;line-height:1}.elementor-widget-heading .elementor-heading-title[class*=elementor-size-]>a{color:inherit;font-size:inherit;line-height:inherit}.elementor-widget-heading .elementor-heading-title.elementor-size-small{font-size:15px}.elementor-widget-heading .elementor-heading-title.elementor-size-medium{font-size:19px}.elementor-widget-heading .elementor-heading-title.elementor-size-large{font-size:29px}.elementor-widget-heading .elementor-heading-title.elementor-size-xl{font-size:39px}.elementor-widget-heading .elementor-heading-title.elementor-size-xxl{font-size:59px}<\/style>\n<h3>When we talk about database processing in Python, we immediately think of the pandas library. However, when dealing with very large databases, calculations become too slow. Fortunately, there is another Python library, very similar to pandas, which allows for the processing of very large amounts of data: PySpark.\nIn this article, we will present the central elements of Spark, starting with RDDs, the most basic structure of Spark. 
We will then study the DataFrame type, a richer structure than RDDs that is optimized for machine learning.<\/h3>\n<h3>What is Apache Spark?<\/h3>\n<b>Apache Spark<\/b> is an open-source framework developed by UC Berkeley&#8217;s AMPLab that allows for the processing of large databases using <b>distributed computing<\/b>, a technique that spreads the computation for a single project across multiple units organized in clusters in order to reduce the execution time of a query.\n\nSpark was developed in <b>Scala<\/b> and performs best in its native language. However, the PySpark library allows it to be used from the <a href=\"https:\/\/liora.io\/en\/python-the-most-popular-programming-language\">Python language<\/a> while maintaining performance similar to Scala implementations.\n\nPySpark is therefore a <b>good alternative to the pandas library<\/b> when the data sets you need to process are so large that calculations become time-consuming.\n<h3>How is PySpark structured?<\/h3>\nFirst of all, it is important to understand the basics of how Spark works.\n\nWhen you interact with Spark through PySpark, you <b>send instructions to the Driver<\/b>, which coordinates all operations. You communicate with the Driver through a SparkContext object, which <b>coordinates the different calculations<\/b> across the different clusters.\n\nThe big advantage of Spark is that your code is completely independent of the SparkContext, so you can <b>develop your code<\/b> locally on any machine.\n<h3>What is a Resilient Distributed Dataset (RDD)?<\/h3>\nAn RDD is the Spark <b>representation of a data table<\/b>: a distributed collection of elements that can contain tuples, dictionaries, lists, and so on.\n\nThe strength of an RDD <b>lies in its lazy evaluation<\/b>: calculations are deferred until they are absolutely necessary.\n\nFor example, when importing a file, <b>only a pointer to it is created<\/b>. 
The calculation actually happens only at the last moment, when you try to display or use a result.\n\nTo go further with RDDs, you can refer to the official Spark documentation.\n\nAn <b>RDD is read line by line<\/b>, which makes it effective for <b>processing text files<\/b> (counting the number of occurrences of each word in the full text of Les Misérables, for example), but it is an <b>unsuitable structure for column-wise calculations<\/b>.\n\nTo do machine learning, we need to introduce a new structure: the DataFrame.\n<h3>The PySpark DataFrame<\/h3>\nThe PySpark DataFrame is the <b>structure most optimized<\/b> for machine learning. It builds on the foundations of an RDD but is organized into rows and columns, like an SQL table. Its shape is <b>inspired by the DataFrame<\/b> of the pandas module.\n\nThanks to the DataFrame structure, we can perform efficient calculations through a familiar interface (similar to pandas), avoiding the cost of learning a new functional language: Scala.\n\n<b>Spark SQL<\/b> is a Spark module that allows you to <b>work on structured data<\/b>; it is within this module that the Spark DataFrame was developed.\n\nSpark SQL has fairly rich one-page documentation, with both examples and explanations. Unlike much of what you can find on the internet, this <b>documentation is perpetually updated<\/b> with the latest version of Spark.\n\nThis article is just an introduction to the <b>main concepts<\/b> of PySpark. Our training courses contain an entire module on learning this essential tool for handling big data. 
If you want to master this tool, consider one of our <b>data science training courses.<\/b>\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex is-content-justification-center\"><div class=\"wp-block-button \"><a class=\"wp-block-button__link wp-element-button \" href=\"\/en\/courses\/data-ai\/\">Discover our trainings<\/a><\/div><\/div>\n","protected":false},"excerpt":{"rendered":"<p>When we talk about database processing in Python, we immediately think of the pandas library. However, when dealing with very large databases, calculations become too slow. Fortunately, there is another Python library, very similar to pandas, which allows for the processing of very large amounts of data: PySpark. In this article, we will present central [&hellip;]<\/p>\n","protected":false},"author":79,"featured_media":44186,"comment_status":"open","ping_status":"open","sticky":false,"template":"elementor_theme","format":"standard","meta":{"_acf_changed":false,"editor_notices":[],"footnotes":""},"categories":[2433],"class_list":["post-165614","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-ai"],"acf":[],"_links":{"self":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/165614","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/users\/79"}],"replies":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/comments?post=165614"}],"version-history":[{"count":2,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/165614\/revisions"}],"predecessor-version":[{"id":206467,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/posts\/165614\/revisions\/206467"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media\/44186"}],"wp:attachment":[
{"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/media?parent=165614"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/liora.io\/en\/wp-json\/wp\/v2\/categories?post=165614"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}