Databricks & Perl in Azure Cloud


Really! Probing Perl "the legacy" to see how compatible it is with a modern "5th generation" data platform! I guess this is a terrible subject to pick.
Yes, this thread is intended to criticize Perl programming and discourage its use. Perl is unique and exotic, but it has not attracted the developer community the way mainstream languages have for the past couple of years. The demise has not been sudden, but it has been gradually painful for the Perl community.

The objective of this thread is to show how you can use Perl to connect to Databricks, the modern Delta Lakehouse platform, not in a simple way but with all kinds of highly discouraging bells and whistles. This is an assessment with facts, so let's dive in:

Perl usability in Azure

To use Perl in Azure, there are two options:

  • using the Perl Azure SDK – Not recommended, as it is not yet supported by Microsoft. It is alpha-quality code from the open-source community, not yet stable, and hence not available to the CPAN (Comprehensive Perl Archive Network) community. The last commit on GitHub is 2 years old.

       For reference : Azure Perl SDK GitHub

  • via the Azure REST API – Recommended approach, which provides service endpoints that support sets of HTTP operations (methods) allowing create, retrieve, update, or delete access to the service's resources: Azure REST API Index

       For reference: Azure REST API GitHub
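As a sketch of the REST route, the snippet below lists the resource groups in a subscription via the Azure Resource Manager endpoint. It assumes LWP::UserAgent and JSON::PP are available; the subscription ID and bearer token are placeholders you must supply yourself (e.g. from `az account get-access-token`):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON::PP;

# Placeholders: your own subscription ID and an AAD bearer token.
my $subscription_id = "<your-subscription-id>";
my $token           = "<your-bearer-token>";

my $ua  = LWP::UserAgent->new;
my $url = "https://management.azure.com/subscriptions/$subscription_id"
        . "/resourcegroups?api-version=2021-04-01";

my $res = $ua->get($url, 'Authorization' => "Bearer $token");
die "Request failed: " . $res->status_line unless $res->is_success;

# Print the name of every resource group in the subscription
my $data = decode_json($res->decoded_content);
print "$_->{name}\n" for @{ $data->{value} };
```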

Connection to Databricks

  • There is no native or direct out-of-the-box connectivity, unlike the Perl DBI modules that connect directly to data sources such as MySQL or Oracle
  • The connection must happen via an external JDBC or ODBC connector to the Databricks Spark framework
  • Unlike Python, R, Java, and Scala, which are directly supported by Databricks notebooks, Perl is not supported, and there is no plan to add that support either
  • Perl can still be used to create Databricks resources in Azure and manage them via the Azure API, but interacting with resources (clusters, jobs, notebooks) inside the Databricks workspace requires the Databricks REST API
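To illustrate that last point, here is a minimal sketch that calls the Databricks REST API from Perl to list the clusters in a workspace. The workspace URL and personal access token are placeholders, and LWP::UserAgent / JSON::PP are assumed to be installed:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON::PP;

# Placeholders: your workspace URL and a Databricks personal access token.
my $workspace = "https://adb-1234567890123456.7.azuredatabricks.net";
my $token     = "<your-databricks-pat>";

my $ua  = LWP::UserAgent->new;
my $res = $ua->get("$workspace/api/2.0/clusters/list",
                   'Authorization' => "Bearer $token");
die "Request failed: " . $res->status_line unless $res->is_success;

# List each cluster name with its current state
my $data = decode_json($res->decoded_content);
for my $c (@{ $data->{clusters} // [] }) {
    print "$c->{cluster_name}: $c->{state}\n";
}
```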

Database access using Perl DBI and SQL

Architecture

The Perl scripts use DBI, which in turn uses the appropriate database driver (e.g. DBD::Oracle for Oracle, DBD::Pg for PostgreSQL, and DBD::SQLite for SQLite).

Those drivers are compiled together with the C client libraries of the respective database engines. In the case of SQLite, of course, the whole database engine gets embedded in the Perl application.

As an example, for Oracle connectivity:
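A minimal sketch of such an Oracle connection through DBI, assuming DBD::Oracle and the Oracle client libraries are installed (the host, SID, and credentials below are placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Placeholders: host, SID, and credentials for your Oracle instance.
my $dsn = "dbi:Oracle:host=dbhost.example.com;sid=ORCL;port=1521";
my $dbh = DBI->connect($dsn, "scott", "tiger",
                       { RaiseError => 1, AutoCommit => 1 })
    or die $DBI::errstr;

# Run a simple query and print each row
my $sth = $dbh->prepare("SELECT ename, job FROM emp");
$sth->execute();
while (my @row = $sth->fetchrow_array) {
    print join(", ", @row), "\n";
}
$dbh->disconnect;
```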

Connecting to Databricks from Perl (Using JDBC)

The Thrift JDBC/ODBC Server (aka Spark Thrift Server or STS) is Spark SQL's port of Apache Hive's HiveServer2 that allows JDBC/ODBC clients to execute SQL queries over the JDBC and ODBC protocols on Apache Spark.

Spark Thrift Server converts ODBC/JDBC calls into a format that can be distributed to and processed efficiently by a highly parallel engine like Spark.

With Spark Thrift Server, business users can work with their shiny Business Intelligence (BI) tools, e.g. Tableau or Microsoft Excel, and connect to Apache Spark using the JDBC/ODBC interface. That brings the in-memory distributed capabilities of Spark SQL's query engine (with all the Catalyst query optimizations) to environments that were initially "disconnected".

STEPS

PREREQUISITES

  • A virtual machine in Azure (either Windows or Linux) – in this case I am using an Ubuntu 18.04 (Bionic) image
  • Perl 5.8.6 or higher
  • DBI 1.48 or higher
  • Convert::BER 1.31
  • DBD::JDBC module
  • A Java Virtual Machine compatible with JDK 1.4 or above
  • A JDBC driver – the Simba JDBC driver, as recommended and supported by Databricks
  • log4j 1.2.13 extension

Install Perl and associated dependencies in the VM

There are many ways to install Perl, but it is best to use perlbrew. Execute all of the following after SSHing into the VM in the cloud.

dxpadmin@dxptestvm:~$ \curl -L https://install.perlbrew.pl | bash

OR

$wget --no-check-certificate -O - http://install.perlbrew.pl | bash

Note: Install gcc, csh, and any other dependent packages required for the above if perlbrew fails.

Now initialize Perl:

perlbrew init
perlbrew install perl-5.32.1
perlbrew switch perl-5.32.1

Install all prerequisites for the DBD::JDBC module

$sudo apt install default-jre
$sudo apt install default-jdk
$java -version
openjdk version "11.0.10" 2021-01-19
$javac -version
javac 11.0.10

Download DBD::JDBC module from CPAN:

wget https://cpan.metacpan.org/authors/id/V/VI/VIZDOM/DBD-JDBC-0.71.tar.gz

Install Dependent Module:

$cpan -i Convert::BER

Unarchive the module, then edit the file and property below:

$vi /home/dxpadmin/DBD-JDBC-0.71/log4j.properties
log4j.logger.com.vizdom.dbd.jdbc = ALL

Correct the DBD module test and run the Makefile:

Edit the 03_hsqldb.t file inside the extracted DBD archive directory (/home/dxpadmin/DBD-JDBC-0.71/t), search for 'exit 1' at the end of the file, comment it out, and save. Then do the following:

$perl Makefile.PL
$make
$make test

Note: for a different version of Perl or a different OS package, you might need gmake or dmake, whichever is compatible.

If required use this:

$cpan -f -i DBD::JDBC

SIMBA SPARK JDBC and Connectivity with Databricks

Download the Simba Spark JDBC Driver 4.2:

$wget https://databricks-bi-artifacts.s3.us-east-2.amazonaws.com/simbaspark-drivers/jdbc/2.6.17/SimbaSparkJDBC42-2.6.17.1021.zip

Note: Simba 4 / 4.1 / 4.2 are all supported for this connectivity. For Simba JDBC driver support for Spark, look here.

Start the JDBC proxy server as below:

$export CLASSPATH=/home/dxpadmin/DBD-JDBC-0.71/dbd_jdbc.jar:/home/dxpadmin/DBD-JDBC-0.71/t/hsqldb/log4j-1.2.13.jar:.:/home/dxpadmin/simba/SparkJDBC42.jar:$CLASSPATH
$source ~/.bashrc
$DRIVERS="com.simba.spark.jdbc.Driver"
$java -Djdbc.drivers=$DRIVERS -Ddbd.port=9001 com.vizdom.dbd.jdbc.Server

It will be accepting inbound connections as below:

Perl code to connect to Databricks (perldatabricksconntest.pl)
#!/home/dxpadmin/perl5/perlbrew/perls/perl-5.32.1/bin/perl
 
use strict;
use DBI;
 
my $user = "token";
my $pass = "dapidbf97bbmyFAKEpassautha564de4d68055";
my $host = "adb-8532171222886014.14.azuredatabricks.net";
my $port = 9001;
 
my $url = "jdbc:spark://adb-853217host2886014.14.azuredatabricks.net:443/default%3btransportMode%3dhttp%3bssl%3d1%3bhttpPath%3dsql/protocolv1/o/8532171222886014/1005-143428-okra138%3bAuthMech%3d3%3b"; # Get this URL from the JDBC connection string of the Databricks cluster. The URL encoding is VERY important; otherwise the connection fails with weird errors.
 
### my $url = "jdbc:spark://adb-853217host2886014.14.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/8532171222886014/1005-143428-okra138;AuthMech=3";  # (Do not use this form, as IT WILL GENERATE authentication errors with SSL / certificates)
 
 
my %properties = (
    'user'      => $user,
    'password'  => $pass,
    'host.name' => $host,
    'host.port' => $port,
);

my $dsn = "dbi:JDBC:hostname=localhost;port=$port;url=$url";
my $dbh = DBI->connect($dsn, undef, undef,
    { PrintError => 0, RaiseError => 1, jdbc_properties => \%properties })
    or die "Failed to connect: ($DBI::err) $DBI::errstr\n";
my $sql = qq/select * from products/;
my $sth = $dbh->prepare($sql);
$sth->execute();
while (my @row = $sth->fetchrow_array) {
    print join(", ", @row), "\n";
}

In the above example I am running a simple select from the Products table in the default schema of the Databricks workspace. Let's look at the table data:

Now let's execute the Perl script:

This is what the JDBC proxy server logged (ignore the log message, as per the Simba documentation):

A Windows Azure VM requires setting up Strawberry Perl: download the DBD::JDBC module and copy the contents of the extracted DBD::JDBC archive into the Strawberry directory: C:\Strawberry\perl\vendor\lib\DBD

There is no need to run Makefile.PL; it should be ready to start the JDBC proxy server and accept connections.

This is how it looks in an Azure Windows VM:

Now pick the Spark driver and add the JDBC connection string (as provided by Databricks) to see a successful connection. (Add the string without URL encoding and it will work here.)

NOTE: An attempt with a similar setup from a local Windows machine had no success because of the certificate issues thrown up.

TROUBLESHOOTING

If the URL encoding is not done properly in the 'url' string inside the above Perl code, the JDBC proxy server will throw the error below:

And this in turn leads to very confusing Authentication / SSL / Certificate / SocketException errors like the ones below:

Though various other authentication methods exist for connecting via Simba (No Authentication, OAuth 2.0, LDAP user/password, keystore and SSL), I had no success with any approach except simple Databricks auth with a token.

Challenges with Perl Language

Popularity :

The programming index below shows that Perl's standing diminishes year by year.

https://www.tiobe.com/tiobe-index/

<1% in the programming community index

Community Update

Slow community updates. Updates to a Perl SDK on top of Azure or any cloud lag by 2-3 years. There is no Microsoft support for a Perl SDK yet.

Usability / Sustainability

Not developer friendly: a high learning curve, and more complex syntax and programming paradigms than languages like Python.

This puts an immediate long-term sustainability risk on choosing Perl over any modern language to talk to a database.

The Perl community cannot attract new developers and beginners the way Python so successfully has.

Community Support

Dwindling day by day, with a lack of support on popular developer channels like Stack Overflow. I had a couple of Stack Overflow threads on Perl with very few views, and replies from just one person, which was not helpful either. So the community is no longer attractive to developers.

FINAL VERDICT

Use a modern language with modern database support for better compatibility, better adoption, and easy, fast access to knowledge. This helps you scale and keeps the business sustainable for years.

Given the level of trouble I had with Perl, I never felt it was worth trying the Perl ODBC module for DB connectivity.

So if you love pain with your programming platform, with zero to no help, sure, go ahead, but get ready to hit a roadblock!

Message from LinkedIn


Somebody will pat me on the back, somebody will envy me, somebody will congratulate me, somebody will ignore me... that's life! Out of LinkedIn's 200+ million members, my profile was viewed among the top 1%! Thanks, LinkedIn, for capturing every moment.

 
