When we talk about database processing in python, we immediately think of the pandas library. However, when dealing with very large databases, calculations become too slow. Fortunately, there is another python library, very similar to pandas, which allows for the processing of very large amounts of data: PySpark.
In this article, we will present central elements of Spark, starting with RDDs, the most basic structure of Spark. We will then study the DataFrame type, a richer structure than RDDs that is optimized for machine learning.
What is Apache Spark ?
Apache Spark is an open-source framework developed by UC Berkeley’s AMPLab that allows for the processing of large databases using
distributed computing, a technique that uses multiple units of computation distributed in clusters for a single project in order to divide the execution time of a query.
Spark was developed in
Scala and is at its best in its native language. However, the PySpark library allows it to be used with the
Python language while maintaining similar performance to Scala implementations.
Pyspark is therefore a
good alternative to the pandas library when looking to process data sets that are too large and lead to time-consuming calculations.
How structured PySpark is ?
First of all, it is important to understand the basis of how Spark works.
When you interact with Spark through PySpark, you
send instructions to the Driver. The Driver coordinates all operations. The Driver can be communicated by a SparkContext object. This object
coordinates the different calculations on the different clusters.
The big advantage of Spark is that the code is completely independent of the SparkContext. So you can
develop your code locally on any machine.
What’s the definition of Resilient Distributed Data (RDD) ?
An RDD is the Spark
representation of a data table. It is a collection of elements that can be used to contain tuples, dictionaries, lists…
The strength of an RDD
lies in its ability to evaluate the code lazily: the start of the calculations is postponed until absolutely necessary.
For example, when importing a file,
only a pointer to it is created. It is really only at the last moment, when you are looking to display or use a result, that the calculation is done.
To go further in the handling of an RDD, we can use the documentation available here
An
RDD reads line by line, which makes it effective for
processing text files (counting the number of occurrences of each word in the miserable integral for example), but it is an
unsuitable structure for calculations per column.
To do Machine Learning, we need to introduce a new structure: DataFrames.
DataFrame pyspark
The pyspark DataFrame is the
most optimized structure in Machine Learning. It uses the underlying bases of an RDD but has been structured in columns as well as rows in an SQL structure. Its shape is
inspired by the DataFrame of the panda module.
Thanks to the DataFrame structure, we can make efficient calculations through a familiar language (similar to pandas), avoiding the cost of learning a new functional language: Scala.
Spark SQL is a Spark module that allows you to
work on structured data. It is therefore within this module that the Spark DataFrame was developed.
Spark SQL has a fairly rich one-page documentation, both in examples and explanations. Contrary to what you can find on the internet, this
documentation is the only document perpetually updated with the latest version of Spark.
This article is just an introduction to the
main concepts of Pyspark. Our trainings contains an entire module on learning this essential tool for handling big data. If you want to master this tool, let yourself be tempted by one of our
data science trainings.