Featured

Azure Kubernetes and Druid Database Service Architecture


Apache Druid is a scalable time-series OLAP database system. OLAP systems are at the heart of Business Intelligence, multidimensional analytical queries, and Data Science applications. As of this writing, Druid can be deployed to Azure only as an IaaS service: there is no PaaS offering yet, nor a Marketplace product. The assumption for the architecture below is that Druid currently runs on on-premise Kubernetes infrastructure with a Pure Storage (PureBlock and PureFile) backbone, and the migration maps those storage tiers and technologies to their Azure-native equivalents.

Azure Data Storage Architecture for Druid

The diagram below shows how Azure storage can be distributed across multiple storage tiers for the key components of the Druid datastore:

  • Azure Managed Disk – Historical nodes need low-latency, high-throughput disk I/O for fast mapping of segments into the memory address space. This could be the Standard SSD or Premium/Ultra SSD tier for optimized performance. Each Historical pod needs access to its own dynamically provisioned managed-disk mount.
  • Azure File Storage – MiddleManager nodes need fast batch data access from an NFS-like file share, which Azure Files offers natively over the NFS or SMB protocols. Since multiple MiddleManager pods need access to common storage, Azure Files is an easy fit (a sketch of such a shared claim follows this list).
  • Azure Data Lake Storage Gen2 – Druid deep storage needs fast but cheap storage as an alternative to S3 or HDFS, which ADLS Gen2 provides.
  • Azure MySQL or PostgreSQL – The Druid metadata store can be either MySQL or PostgreSQL. If HA is needed, PostgreSQL with Patroni is the preferred choice.
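
As an illustration of the shared Azure Files option for the MiddleManagers, a ReadWriteMany claim could be provisioned roughly as sketched below; the claim name, size and storage class are placeholders, not taken from a real deployment:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: druid-middlemanager-shared
spec:
  accessModes:
  - ReadWriteMany               # shared across multiple MiddleManager pods
  storageClassName: azurefile   # built-in AKS Azure Files class (or an NFS-enabled file class)
  resources:
    requests:
      storage: 100Gi
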
Extending the architecture from: https://anskarl.github.io/post/2019/druid-part-1/

Distribution of Druid pods and provisioned storage between Azure Managed Disk and Azure File Storage

Druid deep storage should be provisioned from Azure Blob Storage (ADLS Gen2), analogous to S3 on AWS. Deep storage on Azure is configured through the Druid Azure extensions.

Druid needs the properties below to be set before using ADLS as deep storage:

Property | Description | Possible Values | Default
druid.storage.type | | azure | Must be set.
druid.azure.account | Azure Storage account name. | | Must be set.
druid.azure.key | Azure Storage account key. | | Must be set.
druid.azure.container | Azure Storage container name. | | Must be set.
druid.azure.prefix | A prefix string that will be prepended to the blob names for the segments published to Azure deep storage. | | ""
druid.azure.protocol | The protocol to use. | http or https | https
druid.azure.maxTries | Number of tries before canceling an Azure operation. | | 3
druid.azure.maxListingLength | Maximum number of input files matching a given prefix to retrieve at a time. | | 1024
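
For reference, a minimal common.runtime.properties fragment for Azure deep storage could look like the sketch below, assuming the druid-azure-extensions extension is available on all Druid nodes; the account, key and container values are placeholders:

# Load the Azure extension (must be present on all Druid nodes)
druid.extensions.loadList=["druid-azure-extensions"]

# Deep storage on Azure Blob / ADLS Gen2 (placeholder account, key and container)
druid.storage.type=azure
druid.azure.account=mydruidstorageacct
druid.azure.key=<storage-account-key>
druid.azure.container=druid-deepstorage
druid.azure.prefix=segments
druid.azure.protocol=https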

Storage account and Containers for Druid in Azure

Imply documents the deployment steps for AKS and the configuration for the Azure storage account here.

Storage provisioned for Druid data servers (MiddleManager and Historical)

  • Create a StorageClass (Azure Managed Premium Disk)
  • Create a Persistent Volume Claim
  • Create Pods that use the PVC and mount it at a dedicated mount point

Create Azure Kubernetes Storage Class

Add a new StorageClass to the Kubernetes cluster.

Apply the YAML below:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: managed-premium-retain
provisioner: kubernetes.io/azure-disk
reclaimPolicy: Retain
parameters:
  storageaccounttype: Premium_LRS
  kind: Managed
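
Assuming the manifest above is saved as azure-sc.yaml (the file name is arbitrary), apply and verify it with:

kubectl apply -f azure-sc.yaml
kubectl get storageclass managed-premium-retain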

Create the Persistent Volume Claim by applying the YAML below:


apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: azure-managed-disk
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: managed-premium   # built-in AKS class; use managed-premium-retain to bind to the StorageClass created above
  resources:
    requests:
      storage: 1Gi

Create a Pod and mount the Persistent Volume into it

As before, apply the YAML below to create a Pod that uses the claim:

kind: Pod
apiVersion: v1
metadata:
  name: nginx
spec:
  containers:
    - name: myfrontend
      image: mcr.microsoft.com/oss/nginx/nginx:1.15.5-alpine
      volumeMounts:
      - mountPath: "/mnt/Azure"
        name: volume
  volumes:
    - name: volume
      persistentVolumeClaim:
        claimName: azure-managed-disk
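
Assuming the Pod manifest above is saved as azure-pod.yaml (an assumed file name), apply it and confirm the claim is bound:

kubectl apply -f azure-pod.yaml
kubectl get pvc azure-managed-disk   # STATUS should show Bound
kubectl get pod nginx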

This creates the nginx pod, up and running and bound to the PVC in ReadWriteOnce mode (note that the Azure Disk StorageClass cannot be ReadWriteMany), with the storage mounted at /mnt/Azure.

Inside the pod it looks like this; note that the requested 1 Gi mount is provisioned under /dev/sdc:

PS C:\windows\system32> kubectl exec -it nginx -- /bin/bash
root@nginx:/# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 124G 23G 102G 18% /
tmpfs 64M 0 64M 0% /dev
tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
/dev/sdc 976M 2.6M 958M 1% /mnt/Azure
/dev/sda1 124G 23G 102G 18% /etc/hosts
shm 64M 0 64M 0% /dev/shm
tmpfs 7.9G 12K 7.9G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 7.9G 0 7.9G 0% /proc/acpi
tmpfs 7.9G 0 7.9G 0% /proc/scsi
tmpfs 7.9G 0 7.9G 0% /sys/firmware
root@nginx:/#

Manually create a static Azure disk and attach it to AKS

These steps are clearly outlined here.
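
In outline, and using placeholder names, the disk is created with the Azure CLI and then referenced from the Pod spec by its resource ID:

# Create a managed disk in the cluster's node resource group (names are placeholders)
az disk create --resource-group <node-resource-group> --name myAKSDisk --size-gb 64 --query id --output tsv

The returned disk ID is then referenced from the Pod; the legacy in-tree azureDisk syntax is shown here, and newer clusters may prefer a CSI-backed PersistentVolume instead:

  volumes:
    - name: azure
      azureDisk:
        kind: Managed
        diskName: myAKSDisk
        diskURI: /subscriptions/<subscription-id>/resourceGroups/<node-resource-group>/providers/Microsoft.Compute/disks/myAKSDisk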

Backup and Restore of the Persistent Volume Snapshots

The steps outlined here show how to back up a Persistent Volume and restore it in AKS.
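
On clusters running the Azure Disk CSI driver with the snapshot controller installed, a point-in-time snapshot of the claim created earlier can be declared roughly as below; the class and snapshot names are illustrative. Restoring is then a matter of creating a new PVC whose dataSource points at the VolumeSnapshot.

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-azuredisk-vsc
driver: disk.csi.azure.com        # Azure Disk CSI driver
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: azure-managed-disk-snapshot
spec:
  volumeSnapshotClassName: csi-azuredisk-vsc
  source:
    persistentVolumeClaimName: azure-managed-disk   # the PVC created above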

Featured

Governance across Azure Naming Standards/Conventions and Best Practices

Azure Naming Best Practice and Governance


“There are only two hard things in Computer Science: cache invalidation and naming things.” 

Phil Karlton

Preamble: 

As enterprises start to utilize Azure resources, even a reasonably small footprint can begin to accumulate thousands of individual resources. This means that the resource count for much larger enterprises could quickly grow to hundreds of thousands of resources.

Establishing a naming convention during the early stages of establishing Azure architecture for your enterprise is vital for automation, maintenance, and operational efficiency. For most enterprises, these aspects involve both humans and machines, and hence the naming should cater to both of them.

Objective – Naming standards save teams countless hours otherwise spent deliberating, making mistakes, and carrying those mistakes forward at everyone else's expense, and they instill discipline in a cloud ecosystem. This post outlines the technical guardrails we can adopt to turn naming into a practice. The idea is to follow the KISS principle: "Keep it simple, stupid".

Naming conventions always end up in heated debates, and someone will inevitably propose a "one-size-fits-all" theory. But relax: let's first understand what we can do best for our own benefit.

WHY ADOPT BEST PRACTICES / RECOMMENDATIONS

Synchronized, rhythmic, orderly and pleasant to look at, or do whatever you want and go all over the place?

What are Azure resources?

Any component that does some work is a resource in the Azure world. Here is the simple resource hierarchy in Azure:

  • Management groups: These groups are containers that help you manage access, policy, and compliance for multiple subscriptions. All subscriptions in a management group automatically inherit the conditions applied to the management group.
  • Subscriptions: A subscription logically associates user accounts and the resources that were created by those user accounts. Each subscription has limits or quotas on the amount of resources you can create and use. Organizations can use subscriptions to manage costs and the resources that are created by users, teams, or projects.
  • Resource groups: A resource group is a logical container into which Azure resources like web apps, databases, and storage accounts are deployed and managed.
  • Resources: Resources are instances of services that you create, like virtual machines, storage, or SQL databases.

When naming resources in the cloud, a degree of governance and guidance is very important for driving a successful, tidy, thoughtful environment.

Hence tagging and naming are BOTH important for developers, administrators and configuration managers to follow.

What are the important factors driving naming standards?

  • Consistency – makes DevOps engineers' jobs easy
  • Removes redundancy
  • Scalability and flexibility
  • Avoids confusion
  • Ease of management – easy to follow and easy to identify resources and pinpoint product and application owners in a few seconds
  • Billing / chargeback – easy to find resource usage cost by tagging and naming
  • Better monitoring and logging
  • Make it a habit – you will live with these names and spend a lot of time with them, so make them a pleasure to work with
  • Renaming resources – Azure resources and resource groups can't be renamed once created; renaming means moving all your resources from one group to another, which can potentially impact a lot of places

What is excluded?

The following will not follow the naming conventions, as they are generated by Azure itself:

  • Names auto-generated by Azure, like managed disk names (e.g. <VMName>_OsDisk_1_25fbb34c5b074882bcd1179ee8b87eeb)
  • The supporting resource group for Azure Kubernetes Service
  • Resources that require a globally unique name
  • Internal resource group names related to Databricks, e.g. databricks-rg-dev000hidpuse02dbricks001-44gxjl4c7ailk

Azure Regions for Deployment of Resource Group and Resources

  • EAST US 2 is about 10% cheaper than EAST US and has better resource availability
  • Co-locating resources in the same region helps reduce ingress and egress data-transfer costs for any integration work

Key Resource Group(KRG) Naming

  • Resource group prefix – e.g. rg (short for resource group); this simplifies searching when creating any resources
  • Department / business unit name – e.g. hosbi, crs; a preceding BU name helps identify the subscription within the merged corporate subscription tree
  • Type of the resources in that resource group – e.g. db, app, devops, ml, k8s
  • Environment type – e.g. dev, prd
  • Regional instance identifier (Azure datacenter) – e.g. eastus, westus; not mandatory to mention explicitly, especially when the business has agreed that all resources use a specific region
  • Incrementor – 01, 02, 03 etc. if required (can be used for a redundant copy of an existing resource group, or a brand new resource group used by a different team under the same business unit)

For example: rg-hosbi-db-dev, rg-hosbi-app-prd
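
As a quick illustration with the Azure CLI (the names below simply follow the convention above and are not real resources):

# Resource group for the hosbi business unit's database resources in dev
az group create --name rg-hosbi-db-dev --location eastus2 \
  --tags business-unit=hosbi team=application environment=dev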

Key Resource Identifier (KRI) Strings

  • Environment type – e.g. dev, tst, prd
  • Azure resource identifier – e.g. graphql, react, druid, postgres, kafka, eventhubs, redis, adls, vm, nsg, lb, aks, airflow, datafactory, databricks, couchbase, oracle, splunk, storage
  • Role / function / primary use of the resource – e.g. operation, config, cache, analytics, oltpds, olapds, db, web, mail, shared (when the resource is shared across resource groups)
  • Regional instance identifier (Azure datacenter) – e.g. eastus, westus; not mandatory to mention explicitly, especially when the business has agreed that all resources use a specific region
  • Incrementor – 01, 02, 03 etc. if required, e.g. for HA resources like cluster nodes or VM scale sets

For example: dev-databricks-olap, prd-aks-analytics-eastus, dev-redis-cache, prd-postgres-oltp-config

Note: BU names and product names are subject to change with organizational mergers, so it is better to keep them in tags for each KRI rather than in the name itself.

Microservice container naming convention

These can follow the standard pod naming that comes with deploying the resource through the corresponding Helm charts. Certain pod names can't be controlled because they are managed within Kubernetes' own realm.

Naming standard challenges:

There are a lot of naming rules across Azure services. Some resources allow hyphens or periods in names while others do not; some allow underscores while others prohibit anything except alphanumeric characters.

Some limit names to a certain number of characters while others allow pretty lengthy names, and some allow uppercase while others are strictly lowercase.

Unifying the Naming Construct and Separators

  • CASE – Always use lowercase. camelCase, PascalCase and Initcap are not allowed.
  • WHITESPACE – No spaces are allowed anywhere in a name.
  • ABBREVIATION – Where a resource name component is longer than 8 characters, use a conformed and consistent abbreviated form, like df for datafactory.
  • HYPHEN or DASH as separator – Use a hyphen ("-") in resource or resource group names as the separator between resource identifier values, e.g. prd-react-analytics-eastus. If a hyphen or dash is not allowed, use an underscore ("_") instead. If no special characters are allowed at all, use 1 as the separator, e.g. poc1storage1db, poc1storage1olapds, poc1storage1shared; this is a rare case, for example storage account names, which accept neither hyphens nor any other special characters (although the file/blob services and containers underneath them can be named with hyphens). Do not use more than one hyphen ("--") as a separator, e.g. dev--aks--olapds. Don't use hyphens inside the resource type name itself, e.g. eventhubs should never be written as event-hubs; use continuous naming or an abbreviation for long names.
  • START and END characters – Never start or end a resource name with a hyphen ("-") or other special character (like a period "."). Never start with a numeric value; ending with a numeric value is allowed, e.g. dev-couchbase-config-eastus2.
  • INCREMENTOR – Use only where it makes sense, e.g. clusters, multiple nodes of similar resources, VM scale sets. Don't use 00 when no incrementor is present. Use zero padding: never 1/2/3, always 01/02/03.
  • ENVIRONMENT name – Should be exactly 3 characters (dev, tst, stg, prd, prf, uat, poc); qa should be treated as analogous to the tst or stg environment.
  • CHARACTER limit – If the KRI string is too long to fit the allowed limit, use an additional tag to identify the resource; in some cases abbreviations can be used.
  • REDUNDANT NAMING across regions – Avoid creating an unnecessary number of resources for redundancy and HA, because some services have built-in high availability (fault domains and update domains) and do not need copies in different regions.
  • RESOURCE function – It is very important to identify the purpose of the resource; in the long term this pays off when identifying what a resource is for.

Important Limitations

Scope | Resources | Constraint
Alpha-numeric | Storage account name | Cannot contain dashes or dots
Azure cloud | SQL Server name, storage account name | Must be unique across Azure, not just the subscription
Length | Search Service and Virtual Machines | 2 to 15 characters
Lower case | Storage account name | Cannot contain uppercase characters

Sample abbreviated naming for some Azure Resources

Resource/Service | Short Code | Long Code
Subscription | sub | sub
Resource Group | rg | rg
Virtual Machine | vm | vm
Availability Set | avs | avset
Virtual Network | vn | vnet
Subnet | sub | subnet
Public IP Address | pip | publicip
Network Security Group | nsg | networksg
Storage Account | stg | storage
Traffic Manager | tm | trafficmgmr
Load Balancer | lb | loadbalancer
Application Gateway | agw | appgateway
App Service | svc | appsvc
Key Vault | kv | kv
App Service Plan | asp | appplan
SQL Database | sdb | sqldatabase
SQL Server | sql | sqlserver
Disk | dsk | disk
DNS Zone | dns | dnszone
Log Analytics | loa | loganalytics
Logic App | log | logapp
Network Interface | nif | netinterface

Tagging: very useful for finding resource usage in Azure by tag name. It also helps categorize similar resources under one application or product.

Another great use of tagging is billing: it is a great way to report cost analysis. Note that tags are key/value pairs and can be changed at any time, whereas changing a KRG / KRI is not a simple rename.
There would be multi level tagging options:

  • Subscription
  • Resource Group
  • Resource

Tag name/value length: 512 / 256

Tag name rules:

The name is the key; it can't be a duplicate string.

Never use a sequence like 1, 2, 3 in the tag name key.

Tagging Naming Conventions for Resource group (All parameters below are mandatory)

A common and good set of tag name and value combinations is shown below:

Name | Sample Value | Allowed Values | Enforced by policy | Description (why needed?)
business-unit | hosbi | should be only one value | Yes | if multiple products, use a hyphen ("-") separator
owner | debashis-paul | use full name | | identify who owns what; if there is a middle name, add another hyphen (-)
email | <email address> | valid email address | | to identify the right person when two people share a name
team | application | application, analytics, qa, esd, external | Yes | pick one of them
environment | poc | dev, tst, stg, prd, prf, uat, poc | | pick one of them
billingowner | hosbi-<companyname> | hospitality BI division company | | easy to filter and report the cost sheet
region | eastus2 | eastus, eastus2, westus | | to be enforced by default policy
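
For example, the mandatory pairs can be set or corrected on an existing resource group with the Azure CLI; the generic --set syntax shown below is one option, and the values are illustrative:

az group update --name rg-hosbi-db-dev \
  --set tags.business-unit=hosbi tags.owner=debashis-paul tags.team=application tags.environment=dev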

Since tag name/value pairs set at the resource group level are not applied recursively to the resources inside it, we still need to apply the tag pairs at the individual resource level.

Tagging Naming Conventions for Resource Name  (All parameters below are mandatory)

Name | Sample Value | Description (why needed?)
business-unit | hosbi | if multiple products, use a hyphen ("-") separator
owner | debashis-paul | identify who owns what; if there is a middle name, add another hyphen (-)
email | <email Id> | key to identifying the owner when two people share a name
team | analytics | application, analytics, qa, esd, external; pick one of them
resource-type | storage | e.g. storage, network, database, compute, log, monitor, vm, cache, web, application, queue, stream, orchestrator, iaas, paas, saas
product-name | rate | if multiple products, use a hyphen ("-") separator
environment | poc | used for which platform: poc, dev, test, production?
thirdparty-sw | yes/no/all | e.g. couchbase, mongo, kafka, splunk, oracle; marketplace products and non-managed resources
billingowner | hosbi-<companyname> | easy to filter and report the cost sheet

Script the creation of resources and the enforcement of these policies, and bookmark those scripts. Non-prod and prod subscriptions should not allow resource names or groups to be created without a script.

Enforcing the naming and tagging practice:

ARM templates should be executed via the Azure CLI, PowerShell or the Azure API when creating resources and tags, following the rules above for prod and non-prod environments (the POC environment excepted).

POC subscription resource groups should be easy to create without scripts, but should still adhere to the naming principles above.
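
One lightweight way to enforce a mandatory tag is a custom Azure Policy definition; the definition name, file name and tag below are assumptions for illustration, and the built-in "Require a tag" policies can be used instead:

# require-business-unit-tag.rules.json
{
  "if": {
    "field": "tags['business-unit']",
    "exists": "false"
  },
  "then": {
    "effect": "deny"
  }
}

# Create the definition and assign it at the subscription scope
az policy definition create --name require-business-unit-tag --rules require-business-unit-tag.rules.json --mode Indexed
az policy assignment create --policy require-business-unit-tag --scope /subscriptions/<subscription-id>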

NOTE

Resources with a public endpoint already have an FQDN that accurately describes what they are, so in some cases the resource name is self-explanatory just from looking at the default public endpoint URL Azure creates:

Examples:

Resource type | Public endpoint
App Service (web/logic/function app) | name.azurewebsites.net
Traffic Manager | name.trafficmanager.net
Storage account | name.blob.core.windows.net
Azure SQL | name.database.windows.net
Public IP (load balancer, VM, App Gateway, etc.) | name.location.cloudapp.azure.com
Cosmos DB | name.documents.azure.com
Service Bus | name.servicebus.windows.net

Ref Doc: 

Organize Azure resources effectively

MS recommendation on Naming and Tagging

DataWarehouse vs DataLake vs Lakehouse


DW vs DataLake vs Lakehouse
DATA WAREHOUSE | DATA LAKE | LAKEHOUSE
It is a traditional, legacy architecture | It is not an architecture but a data storage repository | It is a modern lakehouse architecture, a blend of data lake and data warehouse
Ideal for structured data management solutions, not for complex variations of open-source data types; uses closed, proprietary formats; purpose-built | Collects large amounts of structured or unstructured data in a variety of formats; data semantics and abstraction are not provided | Collects literally any data at any scale, processes any data at any scale, and creates outbound data at any scale
Supports business intelligence with structured data | Usability issues and severe limitations; poor DW and BI support; unreliable data swamps | Any DW or DL limitation is taken care of; usability is achieved via an optimized Delta Lake that adds a transaction layer on top of the data lake
High storage and processing cost, at least compared with sources like a data lake | Suitable for storing bulk volumes of data at low cost; complex setup | Brings the best of data lake and data warehouse together; a lakehouse is a scalable, low-cost option that unifies data, analytics and AI
Supports performance with a set of optimization processes; a degree of data quality can be enforced; versioning needs to be customized and maintained | Lacks some critical reliability and performance features like transactions, data quality enforcement and consistency/isolation | Enables critical capabilities like ACID transactions, schema enforcement/evolution and data versioning that provide reliability, performance and security to your data lake
Seamless unification of batch and streaming workloads is hard to achieve; limited streaming capabilities | Difficult to handle an enormous number of files; not easy to work with files unless there are clear semantics | Delta Lake provides a unified framework for batch and streaming workloads on one simplified architecture that avoids complex, redundant systems and operational challenges while improving the efficiency of data transformation pipelines and downstream activities like BI, data science and ML
Automatic data versioning is not possible; no direct support for data science and ML | Data versioning is not applicable, as it is just an ingest source | Delta Lake provides the ability to version your datasets, which is a really important feature for reproducibility; being able to pin your models to a specific version of your dataset is extremely valuable because it allows other members of your data team to step in and reproduce your model training and get exactly the same results
Vendor lock-in | Open object store | No vendor lock-in, based on an open-source data lake and Spark
No built-in GDPR or CCPA compliance | N/A | GDPR and CCPA data compliance leveraged via managed tables in Delta Lake
Doesn't provide a wide variety of access modules for ACLs; nothing comes OOB | N/A | With Delta Lake on Databricks you can use access control lists (ACLs) to configure permissions on workspace objects (folders, notebooks, experiments and models), clusters, pools, jobs, and data schemas, tables, views, etc.

Azure Storage Options


Difference between Azure Storage / File protocols and usages

Azure has multiple storage service options, like Blob, ADLS Gen2, Managed Disk and File Storage, and many services use very specific access protocols to suit specific needs.

I have outlined the differences between those services and protocols in the diagrams below:

  • ADLS Gen2 and its usages
  • AKS and the multiple storage provisioners available in Azure
  • Using Azure File Storage
  • ADLS Gen2 storage account access and usages
  • Azure NetApp Files usages

Netapp Storage CVO (Cloud Volume ONTAP) in Azure


NetApp volumes in the cloud are exciting, especially the NetApp ONTAP product. ONTAP is an abbreviation of Open Network Technology for Appliance Products.

ONTAP is a separately licensed NetApp storage option in the public cloud offerings, and it uses separate storage appliances in the cloud.

ONTAP is great at providing low-latency, high-throughput I/O at scale and speed for block storage, typically used for high volumes of data transactions with millisecond responses. ONTAP also offers data tiering and simplifies cloud data management behind a single pane of glass. Below are some key data points for ONTAP in the cloud:

  • Simplified storage management with easy setup via the Azure Portal, CLI, PowerShell or REST API
  • Multi-protocol support (NFS / SMB / CIFS / iSCSI)
  • Three performance tiers (Standard, Premium and Ultra) that can be changed on the fly with a simple click
  • Data encryption at rest, RBAC, AD auth and network-based ACLs available
  • HIPAA and GDPR compliant, with 99.99% availability by default
  • Easy management of storage
  • ONTAP features thin/dynamic provisioning, data compression, de-duplication, data tiering, flexible cloning, snapshot copies and ease of management via GUI or RESTful API, reducing the cost footprint
  • ONTAP manages NetApp block storage volumes with multi-instance read/write (typical for K8s StatefulSet workloads), cloning, data protection, data replication across AZs and disaster recovery across Azure regions for HA
  • ONTAP supports the iSCSI protocol for block storage and NFS/SMB for file
  • Non-disruptive (zero performance penalty for running applications), instantaneous point-in-time snapshots, integrated with applications and databases
  • Cross-site SnapMirror data replication for backup / DR / restore / cloning and archive
  • Data tiering to move hot block-storage data to object storage and back to save cost
  • Supports hybrid-cloud data movement and easy management via NetApp Cloud Manager
  • Multi-AZ, dual-node HA solution, designed for zero data loss (RPO = 0) and under 60-second recovery

Let's see a demo of how the ONTAP options work, assuming you have already picked the ONTAP product from the Azure Marketplace offerings page.

The demo screenshots walk through:

  • A 20-minute HA cluster deployment
  • API-driven setup
  • Cross-region deployment
  • Multi-protocol provisioning
  • Adding a new volume
  • Choosing data tiering / retention
  • Copying the IQN from the Windows VM
  • Adding a volume for the iSCSI mapping to the host, mapping it, and making it available
  • Initialization of the disk
  • Replication setup
  • Target volume for replication
  • Choosing the replication policy
  • Volume replication
  • Primary-to-DR-site replication
  • Replication features

And That’s IT !

Delta Sharing Open source by Databricks


Delta Sharing is a vendor-agnostic open protocol for sharing data in real time. It simplifies secure cross-organization data sharing and makes the shared data available for analysis by an array of tools, typically BI tools.

The excerpts below are from the 2021 Spark AI Summit and hopefully summarize the current capabilities of Delta Sharing.

DELTA SHARING DEMO

Sharing the data with a recipient

Data sharing

For a data analyst connecting via a pandas DataFrame

For an analyst connecting to Delta Sharing via BI tools like Tableau

Accessing Delta Sharing data from Power BI

Databricks & Perl in Azure Cloud


Really! Probing Perl "the legacy" to see how compatible it is with a modern "5th generation" data platform! I guess this is a terrible subject to pick.
Yes, this thread is intended to criticize Perl programming and discourage using it. Perl is unique and exotic, but it has not attracted the developer community the way mainstream languages have for the past couple of years. Its demise is not sudden, but gradually painful for the Perl community.

The objective of this thread is to show how you can use Perl to connect to the modern Delta Lakehouse platform Databricks, not in a simple way but with all kinds of highly discouraging bells and whistles. This is an assessment with facts, so let's dive in:

Perl usability in Azure

To use Perl in Azure, there are two routes:

  • Using the Perl Azure SDK – not recommended, as it is not yet supported by Microsoft. It is alpha-quality code from the open-source community, not yet stable, and hence not available to the CPAN (Comprehensive Perl Archive Network) community. The last GitHub commit is two years old.

       For reference: Azure Perl SDK GitHub

  • Via the Azure REST API – the recommended approach, which provides service endpoints that support sets of HTTP operations (methods) allowing create, retrieve, update or delete access to the service's resources. Azure REST API Index

       For reference: Azure REST API GitHub

Connection to Databricks

  • There is no native or direct out-of-the-box connectivity, unlike the Perl DBI modules that connect to data sources like MySQL or Oracle directly
  • The connection must happen via an external JDBC or ODBC connector to the Databricks Spark framework
  • Unlike Python, R, Java and Scala, which are directly supported by Databricks notebooks, Perl is not supported, and there is no plan to add that support
  • Perl can still be used to create Databricks resources in Azure and manage them via the Azure API, but interacting with resources (clusters, jobs, notebooks) inside a Databricks workspace requires the Databricks REST API

Database access using Perl DBI and SQL

Architecture

The Perl scripts use DBI, which in turn uses the appropriate database driver (e.g. DBD::Oracle for Oracle, DBD::Pg for PostgreSQL and DBD::SQLite for SQLite).

Those drivers are compiled together with the C client libraries of the respective database engines. In the case of SQLite, the whole database engine is embedded in the Perl application.

As an example, for Oracle connectivity:
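
A plain DBI script looks like the sketch below; the hostname, SID and credentials are fictitious, and DBD::Oracle plus the Oracle client libraries must already be installed:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# DBD::Oracle talks to Oracle through the Oracle client libraries
my $dbh = DBI->connect(
    "dbi:Oracle:host=dbhost.example.com;sid=ORCL;port=1521",
    "scott", "tiger",
    { RaiseError => 1, AutoCommit => 1 }
) or die "Failed to connect: $DBI::errstr\n";

my $sth = $dbh->prepare("SELECT product_id, product_name FROM products");
$sth->execute();
while ( my @row = $sth->fetchrow_array ) {
    print join(", ", @row), "\n";
}
$sth->finish;
$dbh->disconnect;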

Connecting Databricks from Perl (Using JDBC)

The Thrift JDBC/ODBC Server (aka Spark Thrift Server or STS) is Spark SQL's port of Apache Hive's HiveServer2 that allows JDBC/ODBC clients to execute SQL queries over the JDBC and ODBC protocols on Apache Spark.

The Spark Thrift Server converts ODBC/JDBC calls into a format that can be distributed to and processed by a highly parallel engine like Spark in an efficient manner.

With the Spark Thrift Server, business users can work with their shiny Business Intelligence (BI) tools, e.g. Tableau or Microsoft Excel, and connect to Apache Spark using the JDBC/ODBC interface. That brings the in-memory distributed capabilities of Spark SQL's query engine (with all the Catalyst query optimizations) to environments that were initially "disconnected".

STEPS

PREREQUISITE

  • A Virtual Machine in Azure (either Windows or Linux) – In this case I am using Ubuntu 18.04 bionic image
  • Perl 5.8.6 or higher
  • DBI 1.48 or higher
  • Convert::BER 1.31
  • DBD::JDBC module
  • Java Virtual Machine compatible with JDK 1.4 or Above
  • A JDBC driver – Simba JDBC Driver as Databricks recommended and supported
  • log4j 1.2.13 extension

Install Perl and associated Dependencies in VM

There are many ways to install Perl, but the best is to use perlbrew. Execute all of the following after SSHing to the VM in the cloud.

dxpadmin@dxptestvm:~$ \curl -L https://install.perlbrew.pl | bash

OR

$wget --no-check-certificate -O - http://install.perlbrew.pl | bash

Note: install gcc, csh and any other dependent packages if required, should perlbrew fail.

Now initialize Perl:

perlbrew init
perlbrew install perl-5.32.1
perlbrew switch perl-5.32.1

Install all prerequisites for the DBD::JDBC module:

$sudo apt install default-jre
$sudo apt install default-jdk
$java -version
openjdk version "11.0.10" 2021-01-19
$javac -version
javac 11.0.10

Download DBD::JDBC module from CPAN:

wget https://cpan.metacpan.org/authors/id/V/VI/VIZDOM/DBD-JDBC-0.71.tar.gz

Install the dependent module:

$cpan -i Convert::BER

Unarchive it, then edit the file below and set this property:

$vi /home/dxpadmin/DBD-JDBC-0.71/log4j.properties
log4j.logger.com.vizdom.dbd.jdbc = ALL

Correction of DBD module and Makefile :

Edit the 03_hsqldb.t file inside the extracted DBD archive directory (/home/dxpadmin/DBD-JDBC-0.71/t), search for 'exit 1' at the end of the file, comment it out and save. Then run:

$perl Makefile.PL
$make
$make test

Note: for a different version of Perl or a different OS package you might need gmake or dmake, whichever is compatible.

If required use this:

$cpan force install DBD::JDBC

SIMBA SPARK JDBC and Connectivity with Databricks

Download SimbaSpark JDBC Driver 4.2 

$wget https://databricks-bi-artifacts.s3.us-east-2.amazonaws.com/simbaspark-drivers/jdbc/2.6.17/SimbaSparkJDBC42-2.6.17.1021.zip

Note: Simba 4 / 4.1 / 4.2 are all supported for this connectivity. For the Simba JDBC driver's Spark support, look here.

Start the JDBC Proxy Server as below:

$export CLASSPATH=/home/dxpadmin/DBD-JDBC-0.71/dbd_jdbc.jar:/home/dxpadmin/DBD-JDBC-0.71/t/hsqldb/log4j-1.2.13.jar:.:/home/dxpadmin/simba/SparkJDBC42.jar:$CLASSPATH
$source ~/.bashrc
$DRIVERS="com.simba.spark.jdbc.Driver"
$java -Djdbc.drivers=$DRIVERS -Ddbd.port=9001 com.vizdom.dbd.jdbc.Server

It will start accepting inbound connections, as shown below:

PERL code to connect Databricks (perldatabricksconntest.pl)
#!/home/dxpadmin/perl5/perlbrew/perls/perl-5.32.1/bin/perl
# Its FAKE user password token and Fake HOST/URL so please don't try to make FAKE attempt to access it !

use strict;
use DBI;
 
my $user = "token";
my $pass = "dapidbf97bbmyFAKEpassautha564de4d68055";
my $host = "adb-85321FAKE86014.14.azuredatabricks.net";
my $port = 9001;
 
my $url = "jdbc:spark://adb-853217FAKE2886014.14.azuredatabricks.net:443/default%3btransportMode%3dhttp%3bssl%3d1%3bhttpPath%3dsql/protocolv1/o/85321FAKE86014/1005-FAKE-okra138%3bAuthMech%3d3%3b"; # Get this URL from JDBC string of Databricks cluster. the URL encoding is VERY Important else it leads to failure connection with weird errors
 
### my $url = "jdbc:spark://adb-853217FAKE2886014.14.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/85321FAKE86014/1005-FAKE-okra138;AuthMech=3";  # (Not use this as IT WILL GENERATE Authentication ERROR with SSL / Certificates)
 
 
my %properties = ('user' => $user,
'password' => $pass,
'host.name' => $host,
'host.port' => $port);
 
my $dsn = "dbi:JDBC:hostname=localhost;port=$port;url=$url";
my $dbh = DBI->connect($dsn, undef, undef,
{ PrintError => 0, RaiseError => 1, jdbc_properties => \%properties })
or die "Failed to connect: ($DBI::err) $DBI::errstr\n";
my $sql = qq/select * from products/;
my $sth = $dbh->prepare($sql);
$sth->execute();
my @row;
while (@row = $sth->fetchrow_array) {
print join(", ", @row), "\n";
}

In the example above I am running a simple select from the products table in the default schema of the Databricks workspace. Let's look at the table data:

Now let's execute the Perl script:

This is what the JDBC proxy server logged (ignore the log message, as per the Simba documentation):

For a Windows Azure VM, set up Strawberry Perl, download the DBD::JDBC module and copy the contents of the extracted DBD::JDBC into the Strawberry directory: C:\Strawberry\perl\vendor\lib\DBD

There is no need to run Makefile.PL; it should be ready to start the JDBC proxy server and accept connections.

This is how it looks in an Azure Windows VM.

Now pick the Spark driver and add the JDBC connection string (as provided by Databricks) to see a successful connection (add the string without URL encoding and it will work here):

NOTE: an attempt with a similar setup configuration from a local Windows machine had no success because of certificate issues.

TROUBLESHOOTING

If the URL encoding is not done properly in the 'url' string inside the Perl code above, the JDBC proxy server will throw the error below:

And this in turn leads to very confusing Authentication / SSL / Certificate / SocketException errors like the ones below:

Though various other methods exist to connect via Simba, like No Authentication, OAuth 2.0, LDAP user/password, and KeyStore with SSL, I had no success with any approach except the simple Databricks auth with a token.

Challenges with Perl Language

Popularity :

The programming index below shows that Perl's standing diminishes year by year.

https://www.tiobe.com/tiobe-index//

<1% in the programming community index

Community Updates

Slow community updates. Updates to the Perl SDK on top of Azure (or any cloud) lag by 2-3 years. There is no Microsoft support for the Perl SDK yet.

Usability / Sustainability

Not developer friendly: a high learning curve and a more complex syntax and programming paradigm than a language like Python.

This puts an immediate risk on the long-term sustainability of putting Perl in front of any modern language to talk to a database.

The Perl community can't attract new developers and beginner users the way Python successfully has.

Community Support

Dwindling day by day, with a lack of support on popular developer channels like Stack Overflow. I had a couple of Stack Overflow threads on Perl with very few views and replies from just one person, which was not helpful either. So the community is no longer attractive to developers.

FINAL VERDICT

Use a modern language with modern database support for better compatibility, better adoption, and the availability of easy and fast knowledge. That helps you scale and keeps the business sustainable for years.

Given the level of trouble I had with Perl, I never felt it was worth trying the Perl ODBC module for DB connectivity.

So if you love pain in your programming platform, with zero to no help, sure, go ahead, but get ready to hit a roadblock!