Apache Spark helps businesses solve data processing problems thanks to its fast data-processing capabilities. The performance of any data processing program is a crucial factor to consider while it is being developed. The following are the most important performance optimization strategies in Apache Spark.
In Spark, data is sent across the distinct nodes of a cluster. The graph below depicts the shuffling process in Spark: data is shuffled between the processing nodes on the left and the processing nodes on the right.
Shuffling process in Spark
The data exchanged during the shuffling process travels over the network, which is why shuffling is costly. Large data transfers over the network are always expensive in terms of system performance.
Shuffling data across the different nodes of a Spark cluster is expensive in terms of performance. Depending on the fields on which the aggregate operations are carried out on the cluster, the amount of data shuffling can be reduced.
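As a quick illustration, the physical plan of a key-based aggregation makes the shuffle visible. The sketch below is a minimal example in PySpark; the session name, table contents, and column names are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# Illustrative data: region and sale amount (names are made up).
sales_df = spark.createDataFrame(
    [("north", 100.0), ("south", 250.0), ("north", 75.0)],
    ["region", "amount"],
)

# Grouping by a key forces rows with the same key onto the same node,
# so Spark inserts a shuffle before the aggregation.
totals_df = sales_df.groupBy("region").agg(F.sum("amount").alias("total"))

# The physical plan shows the shuffle as an "Exchange hashpartitioning(region, ...)" node.
totals_df.explain()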
Partitioning a Spark dataframe appropriately can go a long way in optimizing performance. For instance, consider a group-by operation on a Spark dataframe to get the sum of a field for each group. If there are further operations on the same grouping, where you need to compute various other parameters for each group, then partitioning the entire dataframe on that key is a great performance optimization step. It ensures that the data belonging to the same group always lands on the same node, which reduces the need for shuffling because the data is already segregated the way we want.
The graph below depicts how partitioning a Spark dataframe works.
So, partitioning is a process by which data is segregated into small chunks on the basis of a certain field. As a result, the data belonging to the same group always goes to the same location, or partition. Since records with the same field value end up in the same location, shuffling is reduced. For example, if employees are partitioned by department, every employee record belonging to the same department goes to the same partition. Any operation on employee records grouped by department will therefore avoid data shuffling between nodes, as the data is already segregated and ready for the operation.
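The department example can be sketched in PySpark roughly as follows. The dataframe and column names are assumptions for illustration, and the snippet reuses the spark session from the earlier sketch; the point is that a single up-front repartition on the grouping key lets the subsequent grouped computations reuse that layout instead of shuffling again for every aggregation.

from pyspark.sql import functions as F

# Illustrative employee records (reusing the spark session from the sketch above).
employees_df = spark.createDataFrame(
    [("alice", "engineering", 95000),
     ("bob", "sales", 60000),
     ("carol", "engineering", 105000)],
    ["name", "department", "salary"],
)

# One up-front shuffle places each department's records together...
by_dept_df = employees_df.repartition("department")

# ...so these grouped computations can build on that layout instead of
# redistributing the full dataset for every aggregation.
avg_salary_df = by_dept_df.groupBy("department").agg(F.avg("salary").alias("avg_salary"))
headcount_df = by_dept_df.groupBy("department").agg(F.count("*").alias("headcount"))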
Repartitioning data in a Spark dataframe itself involves shuffling: partitioning is an operation in which records with the same key values need to be grouped together, and that requires transferring data across different nodes.
A join operation combines two or more data sets on the basis of a key. This often leads to values being transferred across nodes so that matching records can be brought together, so effectively a join operation leads to shuffling.
Effectively, any operation based on keys can result in shuffling.
Contrary to the above, a mapping operation doesn't require data to be transferred across nodes. The same is true of operations like filter and union. These operations transform the data within each individual node, and hence they don't add to the shuffling. The sketch after this paragraph contrasts the two kinds of operations.
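The contrast can be sketched as below. The dataframes and columns are invented for illustration, and the spark session from the earlier sketches is assumed; the narrow steps (filter, select, union) add no shuffle to the plan, while the join redistributes data by key (or broadcasts the smaller side, depending on its size).

from pyspark.sql import functions as F

# Illustrative dataframes (reusing the spark session from the earlier sketches).
people_df = spark.createDataFrame(
    [("alice", "engineering", 95000), ("bob", "sales", 60000)],
    ["name", "department", "salary"],
)
dept_df = spark.createDataFrame(
    [("engineering", "building-a"), ("sales", "building-b")],
    ["department", "location"],
)

# Narrow transformations: each partition is processed where it sits,
# so no data moves between nodes.
well_paid_df = people_df.filter(F.col("salary") > 50000)
slim_df = well_paid_df.select("name", "department")
combined_df = slim_df.union(slim_df)

# Wide transformation: the join brings rows together by the join key
# (or broadcasts the smaller side), which is where shuffling appears.
joined_df = combined_df.join(dept_df, on="department")
joined_df.explain()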
It's always handy to apply filtering operations to a dataframe beforehand, so that subsequent operations work on this smaller, filtered dataset.
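For example, something along these lines, assuming a hypothetical orders_df with order_date, customer_id, and amount columns: filtering first means the later shuffle only has to move the reduced dataset.

from pyspark.sql import functions as F

# Hypothetical orders_df; filter first, then aggregate the smaller dataset.
recent_orders_df = orders_df.filter(F.col("order_date") >= "2023-01-01")
totals_df = recent_orders_df.groupBy("customer_id").agg(F.sum("amount").alias("total"))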
Spark executes operations lazily. For example, the code for a filter operation on a Spark dataframe doesn't get executed at the point where it is written. Instead, Spark records the order in which these steps will be executed in a structure called a DAG (directed acyclic graph). Effectively, if the filtered result is used by several processes, the computation of the filtered dataframe happens again in each of those cases.
To explain this more clearly, consider the code below. It is a looped operation in which, for each iteration of the loop, a computation is done on a dataframe called filtered_df. Assume that filtered_df is the result of a filtering operation on another dataframe called inbound_df. Every time the operation below is called on filtered_df, the filtering operation is re-executed on inbound_df because Spark evaluates lazily.
for obj in list_objects:
    # Each iteration triggers the full lineage of filtered_df again,
    # re-running the filter on inbound_df before computing the result.
    compute_df = compute_dataframe(filtered_df, obj)
    percentage_df = calculate_percentage(compute_df)
    export_as_csv(percentage_df)
To avoid recomputing the dataframe every time an operation is called, as shown above, a technique called caching is employed. The idea is to store the dataframe on which the computations are called in a cache. So, instead of recomputing the dataframe on each invocation of an operation, the dataframe previously stored in the cache is retrieved and the subsequent operations are invoked on it.
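Applied to the looped example above, a minimal sketch could look like this. The filter condition is an assumption for illustration; cache() marks the dataframe for reuse, and the first action materializes it so later iterations read from memory instead of re-running the filter on inbound_df.

# The filter condition is assumed for illustration.
filtered_df = inbound_df.filter(inbound_df["status"] == "active")
filtered_df.cache()      # mark the dataframe for caching
filtered_df.count()      # an action that materializes the cached data

for obj in list_objects:
    # Each iteration now reads filtered_df from the cache instead of
    # re-running the filter on inbound_df.
    compute_df = compute_dataframe(filtered_df, obj)
    percentage_df = calculate_percentage(compute_df)
    export_as_csv(percentage_df)

filtered_df.unpersist()  # release the cached data once the loop is done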
As can be seen from the examples above, Apache Spark is an excellent tool for processing massive amounts of data very quickly. However, when it comes to use cases involving large-scale data processing, it's important to follow its best practices. The points mentioned above are some of the most crucial suggestions for maximizing Apache Spark's performance.