The transition from Python to Spark can be quite challenging at first, as we are introduced to a whole new set of terminology: clusters, pools, mounts, the Spark API, and so on. In this post, we will look at how Microsoft Azure Databricks can be used to build a data-processing workflow.
To process data, we first need a cluster. Databricks offers two kinds: interactive clusters and job clusters. Interactive clusters are used when the emphasis is on exploratory data analytics, whereas job clusters run automated workflows; we typically use the former during the DEV phase and the latter in production. If our cluster is busy, our jobs take a long time to run. We can mitigate this by creating a pool: clusters attach to the pool, so when we need a new cluster we don't have to request fresh VMs from the cloud provider; the cluster draws on the pool's idle, ready-to-use instances instead.
Now we can use notebooks to start processing the data. It is good practice to modularize notebooks by using markdown headers effectively: data load and configuration, data filters, joins, business logic. We can also make each run auditable by creating a logger for every run. A separate notebook that loads all libraries and helper functions can be invoked from the driver notebook with the '%run notebook path' magic command. Likewise, shell commands can be run with the '%sh' magic.
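As a sketch, the driver notebook might pull in helpers and inspect the filesystem like this (the notebook path and directory below are hypothetical, and each magic command must live in its own cell):

```
%run /Shared/helpers/common_functions

%sh ls /dbfs/mnt/raw
```

'%run' executes the referenced notebook inline, so any functions and variables it defines become available in the driver notebook.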
To access our data, we first mount the storage to a mount point. After mounting, we can check that the data is present and accessible using the 'dbutils' utilities. We are then ready to load the data into a Spark DataFrame. Giving meaningful names to our datasets can often save us time during debugging.
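A minimal sketch of mounting Azure Blob Storage and sanity-checking it with 'dbutils' (the storage account, container, secret scope, file name, and mount point are all placeholders; this only runs inside a Databricks notebook, where 'dbutils', 'spark', and 'display' are predefined):

```python
# Mount a blob container to /mnt/raw (all names below are hypothetical).
dbutils.fs.mount(
    source="wasbs://raw@mystorageacct.blob.core.windows.net",
    mount_point="/mnt/raw",
    extra_configs={
        "fs.azure.account.key.mystorageacct.blob.core.windows.net":
            dbutils.secrets.get(scope="my-scope", key="storage-key")
    },
)

# Verify we can see the files, then load one into a DataFrame
# with a meaningful name.
display(dbutils.fs.ls("/mnt/raw"))
sales_df = spark.read.csv("/mnt/raw/sales.csv", header=True, inferSchema=True)
```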
Now we can perform some data cleansing using regex functions. Date-time functions and datatype casts also come in handy for data analysis and deep data discovery. Knowing how to transform our data to JSON can be quite useful, as it can then be used for looking up values much like a dictionary in Python.
We are finally ready to do some data engineering, since Spark gives us a rich set of APIs. But to leverage those powerful APIs, we first need to import them. Once imported, here are some useful functions to add to your data engineering toolset:
withColumn() – creation of a derived column. When we want to apply a transformation to an existing column, we can use withColumn(). This step can be very handy in data cleansing.
regexp_replace() – sometimes our data has extra characters prepended or appended that we need to remove. This can be done with the 'regexp_replace' function, passing the correct regular expression and replacement as parameters.
when() & otherwise() – useful when we are interested in case logic. When a field holds only numerical values, we need to ensure they all belong to the same family of 'units'; in such scenarios, the 'when-otherwise' construct can normalize them.
filter() and isin() – if we must filter a column by multiple values, we can simply pass a list of values to isin() inside the filter() call. Note that when the filter set is large, it is better to achieve the same result with a join for performance.
to_date() – often, when we wish to perform a delta or incremental load, we only need the last few periods of data. If the dates are stored as strings, we first convert them to a standard date type with to_date() so that we can filter by date.
rank() & partitionBy() – sometimes an update is generated for an existing record in the data lake, but both the previous value and the new value are retained. We are interested in fetching the most recent transaction, based on whatever logic identifies the right transactional details. A combination of the window functions – rank() over a partition – gets the job done.
After all the processing is complete, it is best practice to reconcile the data with the source using different metrics, because as techies we can easily miss some business logic, especially when dealing with calculated fields.
Once we have our golden data (verified against the source systems), we can start creating pipelines to load it into the serving database.
Refer to the Databricks documentation for a deeper understanding of its fundamental concepts.