Performance Optimization for Apache Spark

Apache Spark helps businesses solve data processing problems thanks to its fast data-processing capabilities. The performance of any data processing program is a crucial factor to consider during development. The following are the most important performance optimization strategies in Apache Spark.
Shuffling - Key Performance Tuning Method
In Spark, data is sent across distinct nodes in the cluster. The figure below depicts the shuffling process in Spark: data is shuffled between the processing nodes on the left and the processing nodes on the right.
Figure: Shuffling process in Spark
Cost associated with Shuffling
The data sent during the shuffling process travels over the network, which is why shuffling is costly. Large data transfers over the network always carry a significant performance cost.
Key points
Shuffling data across different nodes in a Spark cluster is expensive in terms of performance.
Depending on the fields on which the aggregate operations are carried out, the amount of data shuffling can be reduced, as illustrated in the sketch below.
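As a minimal illustration, the PySpark sketch below uses an assumed employees DataFrame (the names, columns, and values are not from this article) to show a keyed aggregation that forces a shuffle; the corresponding Exchange step can be seen in the physical plan printed by explain().

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# Hypothetical employee data; the schema is an assumption for illustration.
employees = spark.createDataFrame(
    [("Alice", "HR", 50000), ("Bob", "IT", 60000), ("Carol", "IT", 65000)],
    ["name", "department", "salary"],
)

# groupBy is a keyed (wide) operation: rows with the same department must be
# brought to the same node, so Spark inserts a shuffle (an Exchange stage).
dept_totals = employees.groupBy("department").agg(F.sum("salary").alias("total_salary"))

# The physical plan shows the Exchange (shuffle) step triggered by the aggregation.
dept_totals.explain()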
Partitioning for Optimized Performance
Partitioning a Spark DataFrame appropriately can go a long way toward optimizing performance. For instance, consider a group-by operation on a Spark DataFrame to get the sum of a field for each group. If there are further operations on the same group where you need to compute various other parameters for it, then partitioning the entire DataFrame on this key is a great performance optimization step. It ensures that data belonging to the same group always goes to the same node, which reduces the need for shuffling because the data is already segregated the way we want. The figure below depicts how partitioning works on a Spark DataFrame.
So, partitioning is a process by which data is segregated into small chunks on the basis of a certain field. As a result, data belonging to the same group always goes to the same location, or partition. Since records with the same field value go to the same location, shuffling is reduced. For example, if employees are partitioned by department, every employee record belonging to the same department goes to the same partition. Any operation on employee records grouped by department thus avoids data shuffling between nodes, because the data is already segregated and ready for the operation. A sketch of this is shown below.
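The following is a minimal PySpark sketch of this idea; the employees DataFrame and its columns are illustrative assumptions. Repartitioning on the grouping key up front co-locates each department's records, so subsequent department-level aggregations can typically reuse that layout instead of each one shuffling the full dataset.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

# Hypothetical employee data; the schema is an assumption for illustration.
employees = spark.createDataFrame(
    [("Alice", "HR", 50000), ("Bob", "IT", 60000), ("Carol", "IT", 65000)],
    ["name", "department", "salary"],
)

# Repartition once by the grouping key so records of the same department
# land in the same partition.
by_dept = employees.repartition("department")

# Department-level aggregations can now reuse that partitioning instead of
# each one shuffling the full dataset again.
totals = by_dept.groupBy("department").agg(F.sum("salary").alias("total_salary"))
averages = by_dept.groupBy("department").agg(F.avg("salary").alias("avg_salary"))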
Key Points related to Shuffling and Partitioning
Repartitioning data in a Spark DataFrame involves shuffling. Partitioning is an operation where data with similar key values needs to be grouped together, which requires transferring data across different nodes.
A join operation combines two or more data sets on the basis of a key. This often leads to values being transferred across nodes so that records with the same key end up together. So, effectively, a join operation leads to shuffling.
Effectively, any operation based on keys can result in shuffling.
Contrary to the above, a map operation doesn't require data to be transferred across nodes. The same holds for operations like filter and union. These operations transform the data within each individual node and hence don't add to the shuffling.
It's always handy to perform filtering operations on a DataFrame beforehand, so that subsequent operations work on the smaller, filtered dataset, as in the sketch below.
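As a rough sketch of filtering early, the example below assumes two Parquet datasets and their columns (the paths, events, users, event_date, and user_id are all illustrative, not from this article); the filter and column pruning happen before the join, so far less data is shuffled across the network.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filter-early-demo").getOrCreate()

# Hypothetical datasets; paths and schemas are assumptions for illustration.
events = spark.read.parquet("/data/events")   # large fact table
users = spark.read.parquet("/data/users")     # smaller dimension table

# Filter and select only what is needed BEFORE the join, so the shuffle
# moves far fewer rows and columns.
recent_events = (
    events.filter(events.event_date >= "2023-01-01")
          .select("user_id", "event_type")
)

joined = recent_events.join(users, on="user_id", how="inner")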
Lazy Execution in Spark
Spark executes operations lazily. For example, the code for a filter operation on a Spark DataFrame doesn't get executed at the point where it is written. Instead, Spark records the order in which these steps will be executed in a plan called a DAG (directed acyclic graph). Effectively, if the filtered result is going to be used by several processes, the filter is recomputed in each of these cases. To explain this more clearly, consider the code below. It is a looped operation in which, for each iteration of the loop, a computation is done on a DataFrame called filtered_df. Assume that filtered_df is the result of a filtering operation on another DataFrame called inbound_df. Every time the operation below is called on filtered_df, the filtering operation is re-run on inbound_df, because Spark evaluates lazily.

# compute_dataframe, calculate_percentage and export_as_csv are illustrative
# helper functions; each iteration re-triggers the filter on inbound_df.
for obj in list_objects:
    compute_df = compute_dataframe(filtered_df, obj)
    percentage_df = calculate_percentage(compute_df)
    export_as_csv(percentage_df)
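For completeness, here is a small self-contained sketch of lazy evaluation itself; the DataFrame, the filter condition, and the repeated count() actions are assumptions chosen purely for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

inbound_df = spark.range(1_000_000)   # DataFrame with a single "id" column

# No work happens here: filter() only records a transformation in the DAG.
filtered_df = inbound_df.filter(inbound_df.id % 2 == 0)

# Each action replays the whole lineage, so the filter runs twice below.
print(filtered_df.count())
print(filtered_df.count())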
Caching in Spark
To avoid recomputing the DataFrame every time an operation is called, as shown above, a technique called caching is employed. The idea is to store the DataFrame on which the computations are called in a cache. So, instead of recomputing the DataFrame on each invocation of an operation, the DataFrame previously stored in the cache is retrieved and the subsequent operations run on it.
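Here is a minimal sketch of how caching could be applied to the loop above, reusing the article's hypothetical helpers and DataFrames (compute_dataframe, calculate_percentage, export_as_csv, inbound_df and list_objects are assumed to exist; the filter condition is invented for illustration).

# Build the filtered DataFrame once and mark it for caching; Spark materializes
# it on the first action and reuses it afterwards.
filtered_df = inbound_df.filter(inbound_df.status == "active")   # assumed filter condition
filtered_df.cache()

for obj in list_objects:
    # Each iteration now reads filtered_df from the cache instead of
    # re-running the filter on inbound_df.
    compute_df = compute_dataframe(filtered_df, obj)
    percentage_df = calculate_percentage(compute_df)
    export_as_csv(percentage_df)

# Release the cached data when it is no longer needed.
filtered_df.unpersist()

For DataFrames, cache() stores the data at the default MEMORY_AND_DISK storage level; persist() can be used instead when a different storage level is needed.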
Conclusion
As the examples above show, Apache Spark is an excellent tool for processing massive amounts of data very quickly. However, for use cases involving large-scale data processing, it is important to follow its best practices. The points mentioned above are some of the most important suggestions for maximizing Apache Spark's performance.