In today's world, any organization that is not using its data to gain insights will fall behind its competitors. Data has become an integral part of how an organization works, whether for monitoring, analysis, or forecasting requirements.
Every organization has its own data requirements: some need historical data, some need periodic data, and others need live data. To ensure that current and future requirements are met, it is advisable to build proper infrastructure for storing the data available in the organization.
The stages an organization goes through can be broadly classified as:
Low-Tier (Very low data volumes/PoC).
Mid-Tier (Small to medium data volume).
Advanced-Tier (Large data volume with data science requirements).
Before discussing the technologies available for each of the above classifications, let us have
a quick overview of cloud technologies.
Backend cloud technologies are the services available in the cloud for storing and processing data. Just like a basic computer system, computation in the cloud needs a CPU, RAM, and storage. The cloud has the advantage that we can scale resources up or down, or even shut them down, as per our needs.
Different clouds offer different technologies for processing data. The right choice depends on the cloud used, reporting needs, reporting frequency, pre-aggregation on the dataset, and so on.
Though all the major clouds have technologies to handle every kind and volume of data, this discussion focuses on the Azure cloud.
Now that we know what backend technologies are, let us discuss some of the options available
when setting up your Business Analytics Architecture in Azure Cloud depending on the volume of
data available.
1) Low-Tier: Low-tier is for very small organizations with data volumes of a few MBs. These technologies can also be used for proof-of-concept analytics in larger organizations. They are very cost effective, and the turnaround time is short.
a) Microsoft Excel: The easiest option for getting your dashboards is to use Microsoft Excel/CSV files as an input to Power BI. The files can be stored in a Data Lake or on an FTP server and can be refreshed by adding new data to that location (a sketch of such a refresh follows below). This option is suggested only when the data volume is low and the computation does not require complicated measures/columns: once the data volume increases, this architecture cannot sustain the load, as it cannot be scaled up. It is also the easiest architecture to develop, because it does not require expertise in complicated backend technologies.
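To illustrate the refresh step, here is a minimal sketch using the azure-storage-file-datalake Python SDK to drop a new CSV into the Data Lake folder that Power BI reads from; the account, container, and file paths are hypothetical.

```python
# Minimal sketch: drop a new month's CSV into the Data Lake folder that
# Power BI reads from. Account, container, and paths are hypothetical.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://contosodatalake.dfs.core.windows.net",
    credential="<storage-account-key>",  # or an Azure AD credential
)
fs = service.get_file_system_client("reporting")      # container
file = fs.get_file_client("sales/2023-01.csv")        # target path

with open("2023-01.csv", "rb") as data:
    file.upload_data(data, overwrite=True)            # refresh the source file
```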
2) Mid-Tier: Mid-tier technologies work best when data volumes are in the range of 10 to 15 GB. They are comparatively inexpensive, and the development costs are moderate.
a) Azure Analysis Services/SQL Server Analysis Services (SSAS): SSAS is the traditional service provided by Microsoft for adding computation capacity to Power BI reports. This service has the advantage that all the calculated columns and measures can be moved from Power BI to an SSAS model. The server created to support the backend can store multiple models and can be scaled up or down. This is the “go-to” option for adding computation capacity to Power BI reports.
A traditional SSAS server can be scaled up from one tier to another, but it does not automatically scale up or down with demand. To address this, Microsoft launched Azure Analysis Services, which gives more control over the server size and hence optimizes the cost of operation depending on the load.
Once the model is created, we can connect it to a Power BI report using a DirectQuery connection. A Power BI report created with DirectQuery does not store any data on the Power BI service: all the data is computed on the backend server and then displayed in the report. If we create an Import connection in Power BI instead, all the advantages of having the SSAS/Azure AS server would be lost.
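Because this elasticity is the main draw of Azure Analysis Services, a minimal sketch of pausing an idle server follows. It assumes the azure-mgmt-analysisservices management SDK; the resource group and server names are hypothetical, and the exact operation names may differ between SDK versions.

```python
# Minimal sketch: pause an Azure Analysis Services server outside business
# hours to save cost. Resource group and server names are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.mgmt.analysisservices import AnalysisServicesManagementClient

client = AnalysisServicesManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# Suspend (pause) the server; begin_resume() brings it back online.
client.servers.begin_suspend("analytics-rg", "ssasmodels").result()
```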
b) SQL Server: SQL Server is among the most widely used databases for storing corporate data. For starting your journey in BI, it can be the stepping stone for creating dashboards directly from SQL databases. Power BI supports both DirectQuery and Import modes when connecting to SQL Server. Moreover, Power BI also gives the option of writing SQL queries and sending them to the SQL server, which removes the need to create views or stored procedures in SQL.
This works well because complicated queries can be materialized as tables in SQL Server and then used directly for reporting. Data can also be aggregated in the backend before visualization, reducing the computation required in Power BI.
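As a minimal sketch of this pattern, the snippet below materializes an aggregated table with pyodbc; the connection string, table, and column names are hypothetical.

```python
# Minimal sketch: materialize a pre-aggregated reporting table in SQL Server
# so Power BI can query it directly. Connection string, table, and column
# names are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=corp-sql.database.windows.net;DATABASE=salesdb;"
    "UID=<user>;PWD=<password>"
)
cur = conn.cursor()
cur.execute("DROP TABLE IF EXISTS dbo.sales_summary")  # keep the job rerunnable
cur.execute("""
    SELECT region, product, SUM(amount) AS total_sales
    INTO dbo.sales_summary          -- aggregated table used for reporting
    FROM dbo.sales
    GROUP BY region, product
""")
conn.commit()
```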
c) Azure Function with Storage Account: All the technologies we have seen above support a high level of transformation and aggregation, but if your requirement is to run complicated transformations or machine learning models, you need the flexibility of a programming language for your analysis.
Azure Functions becomes the answer to your needs in this case, as it offers a variety of languages in which to code your requirements. As an Azure Function has no built-in storage of its own, the output of the analysis is stored in a Storage Account (ADLS) in Azure. The Power BI dashboard can then be created using the processed data from the Storage Account. We can also add a SQL Server to this setup if the data volume is large.
Azure Functions supports programming languages such as Node.js, Python, PowerShell, .NET, and Java.
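Here is a minimal sketch of such a setup: a timer-triggered Azure Function in Python that runs a small transformation with pandas and writes the result to a Storage Account. The connection string, container, and blob names are hypothetical.

```python
# Minimal sketch of a timer-triggered Azure Function (Python) that runs a
# transformation and writes the result to a Storage Account, since the
# function itself keeps no data. Connection string, container, and blob
# names are hypothetical.
import azure.functions as func
import pandas as pd
from azure.storage.blob import BlobServiceClient

def main(mytimer: func.TimerRequest) -> None:
    # Hypothetical transformation: aggregate some input data.
    df = pd.DataFrame({"region": ["east", "west"], "sales": [100, 200]})
    result = df.groupby("region", as_index=False).sum()

    # Persist the output for Power BI to pick up.
    blob = BlobServiceClient.from_connection_string("<connection-string>") \
        .get_blob_client(container="processed", blob="sales_summary.csv")
    blob.upload_blob(result.to_csv(index=False), overwrite=True)
```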
3) Advanced-Tier: Advanced-Tier technologies are for very large data volumes and AI/ML requirements on datasets of up to 100 GB. These technologies are expensive, highly scalable, and immensely powerful; they can also be used for live streaming data.
a) Power BI Premium: Power BI Premium is the latest cloud-based offering for giving more computation power to handle large datasets in the Power BI service. Power BI Premium costs more than an SSAS server, but there is no extra development cost, since the model does not have to be moved to an SSAS server. It uses dedicated Power BI embedded capacity for storage and compute, and an embedded capacity can be shared by multiple workspaces.
b) SQL Server Hyperscale Tier: In case the General Purpose/Standard or Premium tier is insufficient for your data needs, Microsoft offers a Hyperscale tier based on the vCore purchasing model. This enables your database to scale up to 100 TB. It also gives the option to add more compute capacity depending on processing requirements, and the ability to scale up or down keeps the cost in line with your computing needs.
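Scaling to Hyperscale is itself a T-SQL statement; the minimal sketch below issues it through pyodbc. The server, database, and service objective are hypothetical, and the statement can equally be run from any SQL client or the Azure portal.

```python
# Minimal sketch: move an Azure SQL database to the Hyperscale tier with
# T-SQL. Server, database, and service objective are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=corp-sql.database.windows.net;DATABASE=master;"
    "UID=<user>;PWD=<password>",
    autocommit=True,  # ALTER DATABASE cannot run inside a transaction
)
conn.cursor().execute(
    "ALTER DATABASE salesdb "
    "MODIFY (EDITION = 'Hyperscale', SERVICE_OBJECTIVE = 'HS_Gen5_2')"
)
```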
c) Databricks with Storage Account: Databricks is a managed PaaS (platform as a service) based on Apache Spark and is available on all major cloud platforms. Databricks supports parallel processing and is very well integrated with Azure Storage Accounts, which enables it to process large datasets quickly and efficiently. It is user-friendly, allowing users to code in SparkSQL, Scala, PySpark, and R, the major languages used by data engineers and data analysts all over the world. It also supports machine learning and artificial intelligence libraries, which makes the data science part easy to implement and productionize.
Databricks uses Jupyter-like notebooks, which enable multiple users to collaborate. It has Git/DevOps integration, enabling users to roll back to a previous version of a notebook.
As Databricks does not have any storage on its own servers, data is stored in Azure Storage Accounts after processing. Databricks clusters can be switched on using Azure Data Factory and shut down automatically when not in use, which makes it a very cost-effective solution. The same cluster can also be shared by multiple notebooks.
Once the data is processed and saved in the Storage Account, it can be accessed by Power BI directly or through a SQL Server.
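A minimal sketch of this flow in a Databricks (PySpark) notebook cell is shown below: read raw files from the Storage Account, aggregate in parallel, and write the result back for Power BI. The `spark` session is provided by Databricks in every notebook; the account, containers, and column names are hypothetical.

```python
# Minimal sketch of a Databricks (PySpark) notebook cell: read raw data from
# an Azure Storage Account, aggregate it in parallel, and write the result
# back for Power BI. Account, containers, and columns are hypothetical.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("abfss://raw@contosodatalake.dfs.core.windows.net/sales/"))

summary = raw.groupBy("region", "product").sum("amount")

(summary.write
 .mode("overwrite")
 .parquet("abfss://processed@contosodatalake.dfs.core.windows.net/sales_summary/"))
```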
d) Azure Synapse with Storage Account: Azure Synapse is the proprietary data analytics, machine learning, and artificial intelligence platform developed by Microsoft. It supports serverless SQL pools and Apache Spark pools for data warehousing, code-free data flows for big data transformation, data integration and orchestration to integrate and operationalize your data, and a web-based studio to access all these capabilities.
As it supports a serverless architecture, the cost can be kept to a minimum. It also supports a variety of coding languages: SQL when using the SQL pool, and Python, Scala, PySpark, C#, and SparkSQL when using the Apache Spark pool.
This architecture can use the SQL pool to store all the data generated from processing the raw data. It is priced higher than all the technologies discussed above but gives a complete package for your analytics needs.
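As a minimal sketch of the serverless side, the snippet below queries Parquet files in the Storage Account in place through a serverless SQL pool using OPENROWSET, via pyodbc; the workspace endpoint, file path, and columns are hypothetical.

```python
# Minimal sketch: query Parquet files in the Storage Account directly from a
# Synapse serverless SQL pool using OPENROWSET, paying only per query.
# Workspace endpoint and file path are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=contoso-synapse-ondemand.sql.azuresynapse.net;DATABASE=master;"
    "UID=<user>;PWD=<password>"
)
rows = conn.cursor().execute("""
    SELECT region, SUM(total_sales) AS total_sales
    FROM OPENROWSET(
        BULK 'https://contosodatalake.dfs.core.windows.net/processed/sales_summary/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sales
    GROUP BY region
""").fetchall()
```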
The technologies discussed above are just a few examples for starting your analytics journey in the cloud. They are more inclined towards big data; other technologies used for analyzing streaming data, such as Kafka and Event Hubs, will be discussed in subsequent articles.
Overall, designing a sustainable and scalable architecture is the goal when starting your BI journey, as each of these technologies has its pros and cons and a cost attached to it. There are no standard rules for designing analytics architectures, so it is advisable to get help from a cloud architect when starting out; this will ensure you have less trouble addressing your future needs.