Blog Post: A simple example: how to call Python from Hive in HDInsight

Introduction

The Hadoop framework automatically distributes code execution across a multi-node cluster, so that each node runs the code against its own part of the dataset. Hadoop code can be developed in Java, in which case one has to implement a map function and a reduce function; both take keys and values as inputs and outputs. At a higher level, two scripting languages simplify the code: PIG is a dedicated scripting language, and HIVE looks like SQL, which makes it quite easy to use. HIVE has a number of extension functions (called user-defined functions) to transform data, such as regular expression tools. A developer can add user-defined functions by developing them in Java. Another way to add procedural logic that complements the set-based SQL language is to use a language like Python.

The goal of this post is to show an example of such a combination.

Here is how that could look on a small cluster. The workload is distributed across the worker nodes.

At the worker node level, one Python process is created per core. Each process receives its part of the whole dataset.
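The idea that each process only sees its part of the dataset can be imitated locally. The sketch below is only an illustration and not how Hadoop actually schedules work: the helpers `split_into_parts` and `label_row` are hypothetical, the rows are made up, and Python 3 is assumed. It shows that a row-by-row transformation gives the same rows whether one process handles everything or several processes each handle a part.

```python
def split_into_parts(rows, part_count):
    """Deal rows round-robin into part_count lists (hypothetical helper)."""
    parts = [[] for _ in range(part_count)]
    for index, row in enumerate(rows):
        parts[index % part_count].append(row)
    return parts


def label_row(row):
    """Row-wise transformation: the result depends on one row only."""
    clientid, devicemake, devicemodel = row
    return (clientid, devicemake + " " + devicemodel)


# Made-up rows, for illustration only.
rows = [
    ("1", "Apple", "iPhone"),
    ("2", "LG", "VS910"),
    ("3", "RIM", "9650"),
    ("4", "Samsung", "SCH-i400"),
]

# Each part stands for the rows that one Python process would receive.
parts = split_into_parts(rows, part_count=2)
partitioned_result = [label_row(row) for part in parts for row in part]
single_result = [label_row(row) for row in rows]

# Same rows either way; only their order may differ.
assert sorted(partitioned_result) == sorted(single_result)
```

This is why a script that treats each input line independently is a good fit for this kind of job: it does not need to know how the rows were split.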

Windows Azure offers Hadoop as a service, called HDInsight. It makes it possible to run HIVE, PIG, and other map/reduce jobs a few minutes after requesting the creation of a cluster. For HIVE, HDInsight comes with a sample table. Let’s run a HIVE + Python job against that table, hivesampletable.

Hive and Python Script

In this example, we use a Python module to calculate the hash of a label in the sample table.

Hive is used to get the data, partition it and send the rows to the Python processes which are created on the different cluster nodes. Here is the code:

add file simple_sample.py;

SELECT TRANSFORM (clientid, devicemake, devicemodel)
  USING 'D:\Python27\python.exe simple_sample.py' AS
    (clientid string, phoneLabel string, phoneHash string)
FROM hivesampletable
ORDER BY clientid LIMIT 50;

This can be read as: for the rows of the hivesampletable table, select clientid, devicemake and devicemodel, and pass them to the simple_sample.py Python script, which is run with D:\Python27\python.exe. The script sends back the columns clientid (a string), phoneLabel (a string) and phoneHash (a string). The result is ordered by clientid and limited to the first 50 rows.
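To make that reading concrete, here is a small in-memory imitation of the query, written in Python 3. It is a sketch of my understanding of the query and not Hive's implementation: `run_transform_query` and `upper_case_label` are hypothetical names, the stand-in transformation is deliberately trivial, and the three input rows are made up.

```python
def run_transform_query(rows, transform_line, limit):
    """Imitate SELECT TRANSFORM ... ORDER BY clientid LIMIT n in memory."""
    output_rows = []
    for row in rows:
        # Each input row reaches the script as one tab-separated line.
        line_in = "\t".join(row)
        line_out = transform_line(line_in)
        # Each output line is split on tabs to fill the columns of the AS clause.
        output_rows.append(tuple(line_out.split("\t")))
    # ORDER BY clientid, then LIMIT. clientid is declared as a string,
    # so this is a string comparison.
    output_rows.sort(key=lambda columns: columns[0])
    return output_rows[:limit]


def upper_case_label(line):
    """Stand-in for the external script (hypothetical)."""
    clientid, devicemake, devicemodel = line.split("\t")
    return "\t".join([clientid, (devicemake + " " + devicemodel).upper(), "n/a"])


# Made-up rows, for illustration only.
sample_rows = [
    ("100042", "Apple", "iPhone 4.2.x"),
    ("100004", "Motorola", "Droid X"),
    ("100035", "LG", "VS910"),
]
result = run_transform_query(sample_rows, upper_case_label, limit=2)
for columns in result:
    print(columns)
```

The point of the sketch is the contract between Hive and the script: tab-separated lines in, tab-separated lines out, with ordering and limiting done outside the script.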

Hive sends data to the simple_sample.py script. Here is the code of that script:

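# Python 2 script (it is run with D:\Python27\python.exe)
import sys
import string
import hashlib

while True:
    # Read one row sent by Hive; an empty string means end of input
    line = sys.stdin.readline()
    if not line:
        break

    # Remove the trailing newline, then split the tab-separated columns
    line = string.strip(line, "\n ")
    clientid, devicemake, devicemodel = string.split(line, "\t")
    phone_label = devicemake + ' ' + devicemodel
    # Write the three output columns to stdout, separated by TAB
    print "\t".join([clientid, phone_label, hashlib.md5(phone_label).hexdigest()])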

This script reads lines from stdin. It parses them and obtains the columns passed by Hive: clientid, devicemake, devicemodel. From those columns, it derives the resulting columns: clientid, phoneLabel, phoneHash. To calculate phoneHash, it uses an imported module (hashlib). To output the result, the Python script writes it to stdout, with the columns separated by TAB.
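Before uploading anything, the same logic can be tried locally without Hive. The following is a hedged Python 3 variant of the script above, not the version that runs on the cluster (which uses Python 2.7). The function names `transform_line` and `run` are hypothetical, UTF-8 encoding of the label is an assumption, and unlike the original this variant skips blank lines instead of failing on them.

```python
import hashlib
import io
import sys


def transform_line(line):
    """Turn 'clientid<TAB>devicemake<TAB>devicemodel' into the output line."""
    clientid, devicemake, devicemodel = line.strip("\n ").split("\t")
    phone_label = devicemake + " " + devicemodel
    # Python 3 hashes bytes, so the label is encoded first (UTF-8 assumed).
    phone_hash = hashlib.md5(phone_label.encode("utf-8")).hexdigest()
    return "\t".join([clientid, phone_label, phone_hash])


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Write one output line per non-blank input line, until end of input."""
    for line in stdin:
        if line.strip():
            stdout.write(transform_line(line) + "\n")


# Local check with in-memory streams standing in for Hive.
fake_stdin = io.StringIO("100004\tMotorola\tDroid X\n100035\tLG\tVS910\n")
fake_stdout = io.StringIO()
run(fake_stdin, fake_stdout)
print(fake_stdout.getvalue(), end="")
```

Keeping the per-line logic in its own function makes it easy to test with a few hand-written lines, while `run` keeps the stdin/stdout contract that Hive relies on.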

Let’s run it with PowerShell

Here is a sample PowerShell script that:

creates an HDInsight cluster
runs the job
gets the result
removes the cluster

Before running the script, the HIVE script and the Python script must have been copied to Windows Azure storage (in this example, to the messcripts container of the monstockageazure storage account).

Here is the PowerShell script:

Import-Module azure
Add-AzureAccount

$Subscription = 'Azdem169A44055X'
$defaultStorageAccount = 'monstockageazure'
$clusterName = 'monclusterhadoop'
$clusterVersion='2.1'
$clusterAdmin = 'cornac'
$clusterPassword = 'LElzgqy#n87'

$passwd = ConvertTo-SecureString $clusterPassword -AsPlainText -Force
$clusterCredentials = New-Object System.Management.Automation.PSCredential ($clusterAdmin, $passwd)

Set-AzureSubscription -SubscriptionName $Subscription -CurrentStorageAccount $defaultStorageAccount
Select-AzureSubscription -Current $Subscription

$storageAccount1 = (Get-AzureSubscription $Subscription).CurrentStorageAccountName
$key1 = Get-AzureStorageKey -StorageAccountName $storageAccount1 | %{ $_.Primary }

New-AzureHDInsightClusterConfig -ClusterSizeInNodes 3 |
    Set-AzureHDInsightDefaultStorage -StorageAccountName "${storageAccount1}.blob.core.windows.net" -StorageAccountKey $key1 `
        -StorageContainerName $clusterName |
    New-AzureHDInsightCluster -Name $clusterName -Version $clusterVersion -Location "North Europe" -Credential $clusterCredentials

Use-AzureHDInsightCluster "monclusterhadoop"

$hiveJobVT = New-AzureHDInsightHiveJobDefinition -File "wasb://messcripts@monstockageazure.blob.core.windows.net/simple_sample.hql"
$hiveJobVT.Files.Add("wasb://messcripts@monstockageazure.blob.core.windows.net/simple_sample.py")
$startedHiveJobVT = $hiveJobVT | Start-AzureHDInsightJob -Credential $clusterCredentials -Cluster "monclusterhadoop"

$startedHiveJobVT | Wait-AzureHDInsightJob -Credential $clusterCredentials

Get-AzureHDInsightJobOutput -StandardError -JobId $startedHiveJobVT.JobId -Cluster "monclusterhadoop"
Get-AzureHDInsightJobOutput -StandardOutput -JobId $startedHiveJobVT.JobId -Cluster "monclusterhadoop"

Remove-AzureHDInsightCluster -Name $clusterName

Here is a sample execution result:

PS C:\benjguin\BigData_Hadoop\demos\simple> Import-Module azure
Add-AzureAccount


PS C:\benjguin\BigData_Hadoop\demos\simple> Import-Module azure
Add-AzureAccount

$Subscription = 'Azdem169A44055X'
$defaultStorageAccount = 'monstockageazure'
$clusterName = 'monclusterhadoop'
$clusterVersion='2.1'
$clusterAdmin = 'cornac'
$clusterPassword = 'LElzgqy#n87'

$passwd = ConvertTo-SecureString $clusterPassword -AsPlainText -Force
$clusterCredentials = New-Object System.Management.Automation.PSCredential ($clusterAdmin, $passwd)

Set-AzureSubscription -SubscriptionName $Subscription -CurrentStorageAccount $defaultStorageAccount
Select-AzureSubscription -Current $Subscription

$storageAccount1 = (Get-AzureSubscription $Subscription).CurrentStorageAccountName
$key1 = Get-AzureStorageKey -StorageAccountName $storageAccount1 | %{ $_.Primary }

New-AzureHDInsightClusterConfig -ClusterSizeInNodes 3 |
    Set-AzureHDInsightDefaultStorage -StorageAccountName "${storageAccount1}.blob.core.windows.net" -StorageAccountKey $key1 `
        -StorageContainerName $clusterName |
    New-AzureHDInsightCluster -Name $clusterName -Version $clusterVersion -Location "North Europe" -Credential $clusterCredentials


ClusterSizeInNodes    : 3
ConnectionUrl         : https://monclusterhadoop.azurehdinsight.net
CreateDate            : 03/03/2014 14:15:50
DefaultStorageAccount : monstockageazure.blob.core.windows.net
HttpUserName          : cornac
Location              : North Europe
Name                  : monclusterhadoop
State                 : Running
StorageAccounts       : {}
SubscriptionId        : 0fa85b4c-aa27-44ba-84e5-fa51aac32734
UserName              : cornac
Version               : 2.1.4.0.526800
VersionStatus         : Compatible

PS C:\benjguin\BigData_Hadoop\demos\simple> Use-AzureHDInsightCluster "monclusterhadoop"

$hiveJobVT = New-AzureHDInsightHiveJobDefinition -File "wasb://messcripts@monstockageazure.blob.core.windows.net/simple_sample.hql"
$hiveJobVT.Files.Add("wasb://messcripts@monstockageazure.blob.core.windows.net/simple_sample.py")
$startedHiveJobVT = $hiveJobVT | Start-AzureHDInsightJob -Credential $clusterCredentials -Cluster "monclusterhadoop"

$startedHiveJobVT | Wait-AzureHDInsightJob -Credential $clusterCredentials

Get-AzureHDInsightJobOutput -StandardError -JobId $startedHiveJobVT.JobId -Cluster "monclusterhadoop"
Get-AzureHDInsightJobOutput -StandardOutput -JobId $startedHiveJobVT.JobId -Cluster "monclusterhadoop"
Successfully connected to cluster monclusterhadoop


Cluster         : monclusterhadoop
ExitCode        : 0
Name            : Hive: simple_sample.hql
PercentComplete : map = 100%,  reduce = 100%
Query           :
State           : Completed
StatusDirectory : b4328d2f-589c-412e-83e5-f8a544cb321c
SubmissionTime  : 03/03/2014 14:36:48
JobId           : job_201403031426_0003


Logging initialized using configuration in file:/C:/apps/dist/hive-0.11.0.1.3.5.0-03/conf/hive-log4j.properties
Added resource: simple_sample.py
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201403031426_0004, Tracking URL = http://jobtrackerhost:50030/jobdetails.jsp?jobid=job_201403031426_0004
Kill Command = "C:\apps\dist\hadoop-1.2.0.1.3.5.0-03\bin\hadoop.cmd" job  -kill job_201403031426_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-03-03 14:37:20,821 Stage-1 map = 0%,  reduce = 0%
2014-03-03 14:37:25,883 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.469 sec
2014-03-03 14:37:26,915 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.469 sec
2014-03-03 14:37:27,946 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.469 sec
2014-03-03 14:37:28,962 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.469 sec
2014-03-03 14:37:29,977 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.469 sec
2014-03-03 14:37:30,993 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.469 sec
2014-03-03 14:37:32,008 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.469 sec
2014-03-03 14:37:33,024 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.469 sec
2014-03-03 14:37:34,024 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 5.469 sec
2014-03-03 14:37:35,040 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 5.469 sec
2014-03-03 14:37:36,055 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 9.265 sec
2014-03-03 14:37:37,055 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 9.265 sec
2014-03-03 14:37:38,055 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 9.265 sec
MapReduce Total cumulative CPU time: 9 seconds 265 msec
Ended Job = job_201403031426_0004
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 9.265 sec   HDFS Read: 266 HDFS Write: 2684 SUCCESS
Total MapReduce CPU Time Spent: 9 seconds 265 msec
OK
Time taken: 36.86 seconds, Fetched: 50 row(s)

100004    Motorola Droid X    02a4198bedd37119dabcbb2e8fb4ec92
100015    Apple iPod Touch 4.3.x    d9bc8c98d6a6556656e774a64f7b8bb2
100015    Apple iPod Touch 4.3.x    d9bc8c98d6a6556656e774a64f7b8bb2
100035    LG VS910    b4bfdffa3e288ed0283ae8c8a37c455e
100035    LG VS910    b4bfdffa3e288ed0283ae8c8a37c455e
100035    LG VS910    b4bfdffa3e288ed0283ae8c8a37c455e
100035    LG VS910    b4bfdffa3e288ed0283ae8c8a37c455e
100035    LG VS910    b4bfdffa3e288ed0283ae8c8a37c455e
100035    LG VS910    b4bfdffa3e288ed0283ae8c8a37c455e
100035    LG VS910    b4bfdffa3e288ed0283ae8c8a37c455e
100035    LG VS910    b4bfdffa3e288ed0283ae8c8a37c455e
100035    LG VS910    b4bfdffa3e288ed0283ae8c8a37c455e
100035    LG VS910    b4bfdffa3e288ed0283ae8c8a37c455e
100035    LG VS910    b4bfdffa3e288ed0283ae8c8a37c455e
100035    LG VS910    b4bfdffa3e288ed0283ae8c8a37c455e
100035    LG VS910    b4bfdffa3e288ed0283ae8c8a37c455e
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100036    Samsung SCH-i400    6b314786cda6123fc06eeb855825aea7
100041    RIM 9650    d476f3687700442549a83fac4560c51c
100041    RIM 9650    d476f3687700442549a83fac4560c51c
100041    RIM 9650    d476f3687700442549a83fac4560c51c
100041    RIM 9650    d476f3687700442549a83fac4560c51c
100041    RIM 9650    d476f3687700442549a83fac4560c51c
100041    RIM 9650    d476f3687700442549a83fac4560c51c
100041    RIM 9650    d476f3687700442549a83fac4560c51c
100041    RIM 9650    d476f3687700442549a83fac4560c51c
100041    RIM 9650    d476f3687700442549a83fac4560c51c
100041    RIM 9650    d476f3687700442549a83fac4560c51c
100042    Apple iPhone 4.2.x    375ad9a0ddc4351536804f1d5d0ea9b9
100042    Apple iPhone 4.2.x    375ad9a0ddc4351536804f1d5d0ea9b9
100042    Apple iPhone 4.2.x    375ad9a0ddc4351536804f1d5d0ea9b9

Remove-AzureHDInsightCluster -Name $clusterName


Benjamin (@benjguin)

Blog Post by: Benjamin GUINEBERTIERE
