
Architectural Patterns for Analytics Journey Leveraging Azure Cloud Tech-Stack

Importance of Data Infrastructure in an Organization

In today's world, any organization that is not using its data to gain insights will fall behind its competitors. Data has become an integral part of how an organization works, whether for monitoring, analyzing, or predicting requirements. Every organization has its own data requirements: some need historical data, some need periodic data, and others need live data. To ensure that your current and future data requirements are met, it is advisable to create proper infrastructure for storing the data available in the organization. The stages an organization goes through can be broadly classified as:
Low-Tier (Very low data volumes/PoC).
Mid-Tier (Small to medium data volume).
Advanced-Tier (Large data volume with data science requirements).
Cloud Technologies for Corporate Analytics Implementation

Before discussing the technologies available for each of the above classifications, let us take a quick look at cloud technologies. Backend cloud technologies are the services available for storing and processing data in the cloud. Just like a basic computer system, computation in the cloud needs a CPU, RAM, and storage. The cloud has the advantage that we can scale resources up or down, or even shut them down, as our needs change. Different clouds offer different technologies for processing data, and the right choice depends on the cloud used, reporting needs, reporting frequency, pre-aggregation on the dataset, and so on. Although all the major clouds have technologies to handle every kind and volume of data, we will focus on the Azure cloud in this discussion.

Analytic Architectures Available in the Azure Cloud

Now that we know what backend technologies are, let us discuss some of the options available when setting up your business analytics architecture in the Azure cloud, depending on the volume of data.

1) Low-Tier: Low-Tier is for very small organizations with data volumes of a few MBs. These technologies can also be used for proof-of-concept analytics in larger organizations. They are very cost-effective, and the turnaround time is short.

a) Microsoft Excel: The easiest option for building dashboards is to use Microsoft Excel/CSV files as an input to Power BI. The files can be stored in a data lake or on an FTP server and refreshed by adding new data to that location. This option is suggested only when the data volume is low and the computation does not require complicated measures or columns; once the data volume increases, this architecture cannot sustain the load because it cannot be scaled up. It is also the easiest architecture to develop, as it does not require expertise in complicated backend technologies.

2) Mid-Tier: Mid-Tier technologies work best when data volumes are in the range of 10 to 15 GB. They are comparatively inexpensive, and development costs are moderate.

a) Azure Analysis Services/SQL Server Analysis Services (SSAS): SSAS is the traditional Microsoft service for adding computation capacity to Power BI reports. Its advantage is that all calculated columns and measures can be moved from Power BI into an SSAS model. The server created to support the backend can store multiple models and can be scaled up or down, which makes this the go-to option for adding computation capacity to Power BI reports. A traditional SSAS server can be scaled from one tier to another, but it does not automatically scale up or down with demand. To address this, Microsoft launched Azure Analysis Services, which gives more control over the server size and hence optimizes the cost of operation according to the load. Once the model is created, we can connect a Power BI report to it with a DirectQuery connection. A report created using DirectQuery does not store any data in the Power BI service; all the data is computed on the backend server and then displayed in the report. If we create an Import connection instead, all the advantages of having the SSAS/Azure AS server are lost.

b) SQL Server: SQL Server is the most widely used database for storing corporate data. When starting your BI journey, it can be the stepping stone for creating dashboards directly from SQL databases. Power BI supports both DirectQuery and Import modes when connecting to SQL Server, and it also lets you write SQL queries and send them to the server, which removes the need to create views or stored procedures. This works well because complicated queries can be materialized as tables in SQL Server and used directly for reporting, and data can be aggregated in the backend before visualization, reducing the computation required in Power BI.
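For example, a pre-aggregating query of this kind can be tested from Python before being pasted into Power BI's SQL statement box. This is a minimal sketch using pyodbc; the server, database, table, and column names are hypothetical placeholders.

```python
import pyodbc

# Hypothetical Azure SQL server and database; replace with your own.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=SalesDW;"
    "UID=report_user;PWD=<password>;"
    "Encrypt=yes;"
)

# Aggregate on the server so Power BI only receives the summarized rows.
query = """
SELECT region,
       YEAR(order_date) AS order_year,
       SUM(amount)      AS total_sales
FROM dbo.Orders
GROUP BY region, YEAR(order_date);
"""

for row in conn.cursor().execute(query):
    print(row.region, row.order_year, row.total_sales)

conn.close()
```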
c) Azure Function with Storage Account: All the technologies we have seen above support a fair level of transformation and aggregation. If your requirement is to run complicated transformations or machine learning models, however, you need the flexibility of a programming language for your analysis. Azure Functions answer this need, as they let you code your requirements in a variety of languages: Python, Node.js, PowerShell, .NET, and Java. As an Azure Function has no persistent storage of its own, the output of the analysis is stored in a Storage Account (ADLS) in Azure, and the Power BI dashboard is then created from the processed data in the Storage Account. A SQL Server can also be used here if the data volume is large.
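A minimal sketch of this pattern is below, assuming the Python programming model with a timer trigger (the binding itself lives in function.json) and a hypothetical app setting and container; the transformation is a placeholder.

```python
import os

import azure.functions as func
import pandas as pd
from azure.storage.blob import BlobServiceClient


def main(mytimer: func.TimerRequest) -> None:
    """Timer-triggered function: transform raw data, write the result to ADLS."""
    # Hypothetical app setting holding the storage connection string.
    blob_service = BlobServiceClient.from_connection_string(
        os.environ["STORAGE_CONNECTION_STRING"]
    )

    # Placeholder transformation; a real function would read raw input
    # (from another container, an API, a queue, ...) and process it here.
    df = pd.DataFrame({"region": ["east", "west"], "total_sales": [120.0, 95.5]})

    # Write the processed output where Power BI (or SQL Server) can pick it up.
    blob = blob_service.get_blob_client(container="processed", blob="sales.csv")
    blob.upload_blob(df.to_csv(index=False), overwrite=True)
```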
3) Advanced-Tier: Advanced-Tier technologies are for very large data volumes and for AI/ML requirements on datasets up to around 100 GB. These technologies are expensive, highly scalable, and immensely powerful, and they can even be used for live streaming data.

a) Power BI Premium: Power BI Premium is the latest cloud-based offering for giving the Power BI Service more computation power to handle large datasets. It costs more than an SSAS server, but there is no extra development cost for moving the model to an SSAS server. It uses dedicated Power BI embedded capacity for storage and compute, and one embedded capacity can be shared by multiple workspaces.

b) SQL Server Hyperscale Tier: If the General Purpose/Standard or Premium tier is insufficient for your data needs, Microsoft offers the Hyperscale tier, based on the vCore purchasing model, which enables your database to scale up to 100 TB. It also gives the option to add more compute capacity depending on processing requirements, and scaling up or down on demand optimizes cost according to your computing needs.

c) Databricks with Storage Account: Databricks is a managed PaaS (platform as a service) built on Apache Spark and available on all major cloud platforms. It supports parallel processing and is well integrated with Azure Storage Accounts, which lets it process large datasets quickly and efficiently. It is user-friendly, allowing users to code in SparkSQL, Scala, PySpark, and R, the major languages used by data engineers and data analysts all over the world. It also supports machine learning and artificial intelligence libraries, which makes the data science part easy to implement and productionize. Databricks uses Jupyter-like notebooks that enable multiple users to collaborate, and its Git/DevOps integration lets users roll back to a previous version of a notebook. As Databricks has no storage on its own servers, processed data is stored in Azure Storage Accounts. Databricks clusters can be switched on by Azure Data Factory and shut down automatically when not in use, which makes this a very cost-effective solution; the same cluster can also be shared by multiple notebooks. Once the data is processed and saved in the Storage Account, it can be accessed by Power BI directly or through a SQL Server.
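A notebook following this pattern might look like the sketch below; the storage account, container, and column names are hypothetical, and `spark` is the session Databricks provides to every notebook.

```python
from pyspark.sql import functions as F

# Hypothetical ADLS Gen2 paths (access is configured on the cluster).
raw_path = "abfss://raw@mystorageacct.dfs.core.windows.net/sales/"
curated_path = "abfss://curated@mystorageacct.dfs.core.windows.net/sales_by_region/"

# Read the raw CSV files in parallel across the cluster.
df = spark.read.option("header", "true").csv(raw_path)

# Aggregate before reporting so Power BI only loads the summarized data.
sales_by_region = (
    df.withColumn("amount", F.col("amount").cast("double"))
      .groupBy("region")
      .agg(F.sum("amount").alias("total_sales"))
)

# Write back to the Storage Account for Power BI or a SQL Server to consume.
sales_by_region.write.mode("overwrite").parquet(curated_path)
```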
d) Azure Synapse with Storage Account: Azure Synapse is the proprietary data analytics, machine learning, and artificial intelligence platform developed by Microsoft. It provides a serverless SQL (DW) pool and an Apache Spark pool for data warehousing, code-free data flows for big data transformation, data integration and orchestration to integrate and operationalize your data, and a web-based studio to access all these capabilities. Because it supports a serverless architecture, cost can be kept to a minimum. It also supports a variety of languages: SQL in the SQL DW pool, and Python, Scala, PySpark, C#, and SparkSQL in the Apache Spark pool. In this architecture, the SQL pool can store all the data generated after processing the raw data. It is priced higher than all the technologies discussed above but provides a complete package for your analytics needs.
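As a sketch of the serverless side, the query below uses Synapse's OPENROWSET to read Parquet files in place from the storage account, again via pyodbc; the workspace endpoint, database, storage path, and SQL credentials are hypothetical placeholders.

```python
import pyodbc

# Hypothetical serverless SQL endpoint of a Synapse workspace (SQL auth assumed).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"
    "DATABASE=demo;"
    "UID=synapse_user;PWD=<password>;"
    "Encrypt=yes;"
)

# Serverless SQL scans the files where they sit; nothing is loaded in advance,
# so you pay only for the data the query reads.
query = """
SELECT region, SUM(total_sales) AS total_sales
FROM OPENROWSET(
    BULK 'https://mystorageacct.dfs.core.windows.net/curated/sales_by_region/*.parquet',
    FORMAT = 'PARQUET'
) AS result
GROUP BY region;
"""

for row in conn.cursor().execute(query):
    print(row.region, row.total_sales)

conn.close()
```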
Conclusion

The technologies discussed above are just a few of the options for starting your analytics journey in the cloud, and the ones covered here lean towards big data. Other technologies for analyzing streaming data, such as Kafka and Event Hubs, will be discussed in subsequent articles. Overall, the goal when starting your BI journey is a sustainable and scalable architecture: each technology has its pros and cons and a cost attached to it. There are no standard rules for designing analytics architectures, so it is advisable to get help from a cloud architect when starting out; it will ensure you have less trouble addressing your future needs.