Pyspark Cheat Sheet

Let's configure PySpark in PyCharm on Ubuntu.
First, download Spark from the official site: http://spark.apache.org/downloads.html
You probably already know Apache Spark, the fast, general, open-source engine for big data processing with built-in modules. Spark SQL is its module for working with structured data.
The configuration is a simple two-step process.
First, set the Spark home, SPARK_HOME, in '/etc/environment':
SPARK_HOME=location-to-downloaded-spark-folder
In my case, the downloaded Spark is at /home/pujan/Softwares/spark-2.0.0-bin-hadoop2.7
Remember to restart your system to reload the environment variables.
Second, in the PyCharm IDE, open the project in which you want to configure PySpark and go to File -> Settings.
Then, in the project section, click on 'Project Structure'.
We need to add two files, py4j-0.10.1-src.zip and pyspark.zip, to the 'Content Root' of 'Project Structure'.
In my case, the project's name is Katyayani, so the menu path is Settings -> Project: Katyayani -> Project Structure. On the right side, click on 'Add Content Root' and add 'py4j-0.10.1-src.zip' [/home/pujan/Softwares/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip] and 'pyspark.zip' [/home/pujan/Softwares/spark-2.0.0-bin-hadoop2.7/python/lib/pyspark.zip].
After this configuration, let's test that we can access Spark from PySpark. For this, write a Python script in PyCharm. A very simple script that starts a Spark session is enough; a successful run prints Spark's log messages.
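A minimal sketch of such a test script. It assumes SPARK_HOME is set as described above; the helper names spark_python_archives and smoke_test are made up for illustration and are not part of PySpark.

```python
import glob
import os
import sys


def spark_python_archives(spark_home):
    """Return the py4j and pyspark zip archives found under SPARK_HOME/python/lib."""
    lib_dir = os.path.join(spark_home, "python", "lib")
    py4j = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    pyspark_zip = os.path.join(lib_dir, "pyspark.zip")
    return py4j + ([pyspark_zip] if os.path.exists(pyspark_zip) else [])


def smoke_test():
    """Start a local SparkSession, count a tiny DataFrame, and stop the session."""
    # Outside PyCharm, the two archives have to be put on sys.path by hand.
    for archive in spark_python_archives(os.environ["SPARK_HOME"]):
        if archive not in sys.path:
            sys.path.insert(0, archive)

    # Imported late, so the sys.path changes above are already in effect.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("smoke-test").getOrCreate()
    try:
        return spark.range(5).count()
    finally:
        spark.stop()
```

Calling smoke_test() from the PyCharm project should print Spark's startup log and return the row count if the content roots are set correctly.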
This concludes our successful configuration of PySpark in PyCharm.
This page contains a collection of Spark pipeline transformation methods that we can use for different problems. Use it as a quick cheat sheet for how to do a particular operation on a Spark DataFrame or in PySpark.
These code snippets are tested on spark-2.4.x. Most also work on spark-2.3.x, but I am not sure about older versions.
Read the partitioned json files from disk
Applicable to all supported file types.
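A minimal sketch of the read, assuming an existing SparkSession passed in as spark. The function names and the example path are hypothetical.

```python
def read_partitioned_json(spark, path):
    """Read every JSON part file under `path` into one DataFrame.

    `spark` is an existing SparkSession; `path` is a hypothetical directory
    such as "data/events/" holding part files with one JSON object per line.
    """
    return spark.read.json(path)


def read_partitioned(spark, path, fmt):
    """Same idea for other built-in sources, e.g. fmt="parquet", "csv" or "orc"."""
    return spark.read.format(fmt).load(path)
```

For example, read_partitioned_json(spark, "data/events/") loads all part files in that directory at once.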
Save partitioned files into a single file
Here we merge all the partitions into one file and dump it to disk. All the data is funneled through a single node, so be careful with the size of the data set you are dealing with. Otherwise, that node may run out of memory.
Use the coalesce method to adjust the number of partitions based on our needs.
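A minimal sketch of both ideas, assuming df is an existing DataFrame. The function names and output path are hypothetical.

```python
def save_as_single_file(df, output_dir):
    """Write `df` as a single JSON part file inside `output_dir`.

    coalesce(1) merges all partitions into one, so one task receives the
    whole data set and writes one part file.
    """
    df.coalesce(1).write.mode("overwrite").json(output_dir)


def shrink_partitions(df, target):
    """Reduce the partition count to at most `target`.

    coalesce only merges partitions; it never increases their number.
    """
    return df.coalesce(target)
```

Note that Spark still creates output_dir as a directory; the single part file sits inside it.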
Filter rows which meet particular criteria
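A minimal sketch of two ways to write the filter. The column names age and country are assumptions for illustration only.

```python
def adults_in(df, country):
    """Keep rows where age >= 18 and the country matches.

    Column expressions are combined with & and |, and each condition
    needs its own parentheses.
    """
    return df.filter((df["age"] >= 18) & (df["country"] == country))


def adults_sql(df):
    """A similar filter written as a SQL expression string."""
    return df.filter("age >= 18")
```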
Map with case class
Use a case class if you want to map over multiple columns with a complex data structure.
Or use the Row class.
Use selectExpr to access inner attributes
It provides easy access to nested data structures like JSON, which you can filter with any existing UDFs, or use your own UDF for more flexibility.
How to access RDD methods from the pyspark side
Using standard RDD operations via the pyspark API isn't straightforward. To get them we need to invoke .rdd, which converts the DataFrame to an RDD that supports these features.
For example, here we convert a sparse vector to dense and sum it column-wise.
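A minimal sketch of that example, assuming df has a column of pyspark.ml.linalg vectors. The column name features and the helper names are hypothetical.

```python
def to_dense_list(vector):
    """Convert a sparse or dense ML vector to a plain list of floats."""
    # toArray() expands a sparse vector, filling the missing slots with zeros.
    return [float(x) for x in vector.toArray()]


def add_elementwise(left, right):
    """Add two equal-length lists position by position."""
    return [a + b for a, b in zip(left, right)]


def column_wise_sum(df, vector_col="features"):
    """Sum the vectors in `vector_col` across all rows, one total per dimension."""
    # .rdd exposes the DataFrame as an RDD of Row objects, which gives us
    # the plain map / reduce operations.
    return df.rdd.map(lambda row: to_dense_list(row[vector_col])).reduce(add_elementwise)
```

The reduce runs on the executors and only the final list of totals comes back to the driver.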
Pyspark Map on multiple columns
Filtering a DataFrame column of type Seq[String]
Filter a column with custom regex and udf
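A minimal sketch, assuming a string column and a made-up hashtag pattern. The names HASHTAG, has_hashtag and filter_by_hashtag are hypothetical.

```python
import re

# Hypothetical pattern: a '#' followed by at least one word character.
HASHTAG = re.compile(r"#\w+")


def has_hashtag(text):
    """Return True when `text` contains a hashtag; null values never match."""
    return text is not None and HASHTAG.search(text) is not None


def filter_by_hashtag(df, column):
    """Keep only the rows whose `column` value matches the regex."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    has_hashtag_udf = F.udf(has_hashtag, BooleanType())
    return df.filter(has_hashtag_udf(F.col(column)))
```

For a plain pattern like this one, the built-in Column.rlike avoids the Python UDF round trip; a UDF is mainly useful when the matching logic goes beyond a single regex.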
Sum a column's elements
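A minimal sketch, assuming df has a numeric column. The function name is hypothetical.

```python
def sum_column(df, column):
    """Return the sum of a numeric column as a plain Python number."""
    # agg with a {column: function} dict yields a one-row DataFrame whose
    # single cell holds the total; first()[0] pulls that cell out.
    return df.agg({column: "sum"}).first()[0]
```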
Remove Unicode characters from tokens
Sometimes we only need to work with ASCII text, so it's better to clean out other characters.
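A minimal sketch, assuming the tokens live in an array-of-strings column. The column name tokens and the helper names are hypothetical.

```python
def to_ascii(token):
    """Drop every non-ASCII character from a single token."""
    return token.encode("ascii", "ignore").decode("ascii")


def clean_tokens(tokens):
    """Clean each token and discard the ones that end up empty."""
    cleaned = (to_ascii(token) for token in tokens or [])
    return [token for token in cleaned if token]


def with_ascii_tokens(df, column="tokens"):
    """Replace `column` with its ASCII-only version."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    clean_udf = F.udf(clean_tokens, ArrayType(StringType()))
    return df.withColumn(column, clean_udf(F.col(column)))
```

Tokens made up entirely of non-ASCII characters disappear from the array, so the array length can shrink.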
Connecting to jdbc with partition by integer column
When using Spark to read data from a SQL database and then run the rest of the pipeline on it, it's recommended to partition the data according to its natural segments, or at least on an integer column. That way Spark can fire multiple SQL queries to read data from the SQL server and operate on them separately, with each query's results going to its own Spark partition.
The commands below are in pyspark, but the APIs are the same for the Scala version.
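A minimal sketch of a partitioned read with DataFrameReader.jdbc, assuming an existing SparkSession and a JDBC driver jar already on the classpath. The function name and every argument value in the usage note are hypothetical.

```python
def read_table_partitioned(spark, url, table, column, lower, upper, partitions, user, password):
    """Read `table` over JDBC with one query per partition.

    `column` must be a numeric column. lowerBound and upperBound only decide
    how the value range is split into `partitions` strides; they do not
    filter rows, so values outside the bounds still land in the first or
    last partition.
    """
    return spark.read.jdbc(
        url=url,
        table=table,
        column=column,
        lowerBound=lower,
        upperBound=upper,
        numPartitions=partitions,
        properties={"user": user, "password": password},
    )
```

A call could look like read_table_partitioned(spark, "jdbc:postgresql://dbhost:5432/shop", "orders", "order_id", 1, 1000000, 8, "reader", "secret").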
Parse nested json data
This is very helpful when working with pyspark and you want to pass deeply nested JSON data between the JVM and Python processes. Lately the Spark community relies on the Apache Arrow project to avoid multiple serialization/deserialization costs when sending data from Java memory to Python memory or vice versa.
So to process the inner objects you can use the getItem method to filter out the required parts of the object and pass them over to Python memory via Arrow. In the future Arrow might support arbitrarily nested data, but right now it doesn't support complex nested formats. The generally recommended option is to go without nesting.
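A minimal sketch of pulling one inner value out with getItem, assuming the nested column is already parsed into a map or struct (not a raw JSON string). The function name and the example column names are hypothetical.

```python
def pick_inner(df, outer, inner, alias):
    """Select one inner field of a nested column as a flat top-level column."""
    # df[outer] is a Column; getItem(inner) extracts the value stored under
    # that key, and alias gives the flattened column a simple name.
    return df.select(df[outer].getItem(inner).alias(alias))
```

For example, pick_inner(df, "payload", "user_id", "user_id") leaves a single flat column that is cheap to hand over to Python.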
'string ⇒ array' conversion
The type annotation .as[String] avoids relying on an assumed implicit conversion.
A crazy string collection and groupby
This is a stream of operations on a column of type Array[String] that collects the tokens and counts the n-gram distribution over all the tokens.
How to access AWS s3 on spark-shell or pyspark
Most of the time we need a cloud storage provider like s3 / gs to read and write the data for processing. Very few keep an in-house HDFS to handle the data themselves. For the majority, I think cloud storage is easy to start with, and there is no need to worry about size limitations.
Supply the aws credentials via environment variable
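A minimal sketch that copies the standard AWS environment variables into the s3a Hadoop properties. It assumes the hadoop-aws jar and its AWS SDK dependency are on the classpath; the function names are hypothetical.

```python
import os


def s3a_conf_from_env(env=None):
    """Map the AWS environment variables to Spark's s3a Hadoop properties."""
    env = os.environ if env is None else env
    return {
        # The "spark.hadoop." prefix forwards a property to the Hadoop configuration.
        "spark.hadoop.fs.s3a.access.key": env["AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": env["AWS_SECRET_ACCESS_KEY"],
    }


def session_with_s3(app_name="s3-example"):
    """Build a SparkSession whose s3a:// paths use those credentials."""
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_conf_from_env().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With that session, paths of the form s3a://bucket/key can be passed to the usual read and write calls.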
Supply the credentials via default aws ~/.aws/config file
Recent versions of awscli expect their configuration to be kept in the ~/.aws/credentials file, but old versions look at the ~/.aws/config path. Spark 2.4.x looks at the ~/.aws/config location, since it comes with default hadoop jars of version 2.7.x.
Set spark scratch space or tmp directory correctly
This might be required when working with a huge dataset that your machine can't hold entirely in memory for the given pipeline steps. In those cases the data is spilled to disk and saved in the tmp directory.
Set the properties below to ensure you have enough space in the tmp location.
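A minimal sketch, assuming the property in question is spark.local.dir (Spark's scratch directory setting). The original snippet is not available, so that choice, the helper names, and the 10 GiB threshold are my assumptions.

```python
import shutil


def free_gib(path):
    """Return the free disk space at `path` in GiB."""
    return shutil.disk_usage(path).free / 2**30


def session_with_scratch(scratch_dir, min_free_gib=10.0):
    """Start a SparkSession that spills to `scratch_dir`, after a space check."""
    if free_gib(scratch_dir) < min_free_gib:
        raise RuntimeError("not enough free space in " + scratch_dir)

    from pyspark.sql import SparkSession

    # spark.local.dir must be set before the session starts; it has no
    # effect on a session that is already running.
    return SparkSession.builder.config("spark.local.dir", scratch_dir).getOrCreate()
```

On a cluster, the cluster manager's own setting (for example SPARK_LOCAL_DIRS in standalone mode) takes precedence over spark.local.dir.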
Pyspark doesn't support all the data types.
When using Arrow to transport data between JVM and Python memory, Arrow may throw the error below if the types aren't compatible with the existing converters. Fixes may come in the future from the Arrow project. I'm keeping this here to show how pyspark gets data from the JVM and what can go wrong in that process.
Work with spark standalone cluster manager

Start the spark cluster in standalone mode
Once you have downloaded the same version of the Spark binary across the machines, you can start the Spark master and slave processes to form the standalone Spark cluster. You could also run both services on the same machine.
In standalone mode:
A worker can have multiple executors.
A worker is like a node manager in YARN.
We can set the worker's maximum core and memory usage.
When defining the Spark application via spark-shell or similar, define the executor memory and cores.
To get 10 executors with 1 CPU and 2 GB of RAM each when submitting the job:
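A minimal sketch that builds the spark-submit command line for that case. The flags are standard spark-submit options; the function name, master URL and script name are hypothetical.

```python
def spark_submit_command(master_url, app, executors=10, cores_each=1, memory_each="2g"):
    """Build a spark-submit argument list for a standalone cluster.

    Standalone mode has no direct executor-count flag here: the count follows
    from total cores divided by cores per executor, as long as the workers
    have enough free cores and memory.
    """
    return [
        "spark-submit",
        "--master", master_url,
        "--executor-cores", str(cores_each),
        "--executor-memory", memory_each,
        "--total-executor-cores", str(executors * cores_each),
        app,
    ]
```

The list can be passed to subprocess.run, or joined with spaces and pasted into a shell, for example spark_submit_command("spark://master-host:7077", "job.py").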
This page will be updated as and when I find reusable code snippets for Spark operations.