Azure Batch Apps Task Dependencies and Specifying Intermediate Output Files as Required Files

The Azure Batch service enables developers to provision compute resources on demand at very high scale and to process large amounts of compute work efficiently. Batch makes it easy for developers to use the cloud and take advantage of its scale and reliability without needing to learn about managing multiple instances, fault domains, error handling, and other concepts that the Batch service handles. There are two developer scenarios for using the Azure Batch service: working with the Batch APIs directly, and Batch Apps.

If you would like to learn more about how to get started with Azure Batch Apps, please read the tutorial in this article (http://azure.microsoft.com/en-us/documentation/articles/batch-dotnet-get-started/).

Azure Batch Apps is a feature of Azure Batch that provides an application-centric way of managing and executing Batch workloads; this article applies to Azure Batch Apps only. It includes a management portal where you can manage jobs, view logs, and download outputs without having to write your own client code.

The job splitter splits a job into tasks. Using the Azure Batch Apps API, some developers may want a single Azure Batch job to run a sequence of consecutive tasks, where some phases are embarrassingly parallel and some are not. These tasks have several interdependencies: some can run in parallel, while others must wait for earlier tasks to produce their outputs. Developers would still like to run the whole pipeline as a single Azure Batch Apps job for simplicity, and to leverage the Azure Batch management portal for submitting jobs and downloading outputs.
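As a rough sketch of that scenario, a job splitter might emit a parallel fan-out phase followed by a merge task that logically depends on the parallel outputs. This is a hypothetical illustration, not code from this article's sample: the `Split` override, `TaskSpecifier`, and the parameter names are written from memory of the Microsoft.Azure.Batch.Apps.Cloud library and may differ in detail, and the pipeline shape itself is invented.

```csharp
// Hypothetical job splitter for a mixed parallel/sequential pipeline.
// Member names are from memory of the Batch Apps Cloud SDK; the
// "resample"/"merge" phases are illustrative only.
public class PipelineJobSplitter : JobSplitter
{
    protected override IEnumerable<TaskSpecifier> Split(IJob job, JobSplitSettings settings)
    {
        // Phase 1: embarrassingly parallel -- one task per original input file
        foreach (var inputFile in job.Files)
        {
            yield return new TaskSpecifier
            {
                Parameters = { { "phase", "resample" } },
                RequiredFiles = { inputFile }
            };
        }

        // Phase 2: a merge step that must wait for the outputs of phase 1.
        // Its real input is an intermediate file that does not exist at
        // submission time, and the API offers no way to declare it as a
        // required file -- which is the problem this post describes.
        yield return new TaskSpecifier
        {
            Parameters = { { "phase", "merge" } }
        };
    }
}
```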

As of today there is no way in the API to specify an intermediate task output file (one that doesn’t exist at job submission time) as one of the required files. I tried a few ways to make this work within the API, but none succeeded. RequiredFiles takes an IFileSpecifier rather than just a string (e.g. “intermediate.jpg”), and no implementation of IFileSpecifier is available in the API, so I implemented the interface on my own, specifying the file name as “intermediate.jpg”.

     public class MyFileSpecifier : IFileSpecifier
     {
       public string Name { get; set; }
       public DateTime Timestamp { get; set; }
       public string OriginalPath { get; set; }
       public string Hash { get; set; }
     } 
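The attempted wiring in the job splitter looked roughly like this (hypothetical; it assumes TaskSpecifier.RequiredFiles is a collection that accepts IFileSpecifier instances):

```csharp
// Hypothetical attempt: declare the intermediate output of an earlier
// task as a required file for the downstream task, using the custom
// IFileSpecifier implementation above.
var mergeTask = new TaskSpecifier
{
    RequiredFiles = { new MyFileSpecifier { Name = "intermediate.jpg" } }
};
```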

But this didn’t work; the task failed with: “An exception occurred processing task 2: One or more errors occurred. The remote server returned an error: (404) Not Found.”

The Batch Apps runtime doesn’t appear to download intermediate files from the job’s container, so Task.RequiredFiles doesn’t contain outputs from previously completed tasks; only the original input file is available. It makes sense for the runtime not to download every intermediate output file, because you don’t always need them in embarrassingly parallel compute. Besides, downloading these files unnecessarily could slow down task execution significantly and waste disk space.

Since the runtime doesn’t download intermediate files, the batch application’s cloud assembly can download just the required files from the job’s container to the task virtual machines (TVMs). The job’s container is located in the Azure storage account associated with the Batch service. There are several ways to achieve this, including the following:

Using the Azure Batch Apps REST API (https://msdn.microsoft.com/en-us/library/azure/dn820126.aspx). This requires OAuth 2.0 authentication of your application with Azure Active Directory (requests must be authenticated using an OAuth 2 bearer token issued by Azure Active Directory), which involves several more steps in the Azure portal and a bit more coding to make it work. I will write another blog post showing this in the future.

Using AzCopy or the Azure Storage library (the Microsoft.WindowsAzure.Storage NuGet package). This is the simplest approach to get this working alongside the Azure Batch Apps .NET API in the cloud assembly.

Using the Azure Storage Library to Get Required Files onto the TVM

A storage account is automatically created when the Azure Batch Apps service is created. It is created in the same region as the Batch Apps service, and its name starts with ‘batchapps’. We need the Azure Storage account name and access key in our cloud assembly in order to download those files onto the task virtual machines.

We need to add the download code to the ParallelTaskProcessor implementation in our cloud assembly.

     protected override TaskProcessResult RunExternalTaskProcess(ITask task, TaskExecutionSettings settings)
     {
         string blobName = "resliced_brain.nii";
         string blobContainerName = "job-" + task.JobId.ToString();
         string inputFile = DownloadFile(blobContainerName, blobName, LocalStoragePath);

         // code to pass this input file as a parameter to the external process
         // task and return its TaskProcessResult
     }

     private string DownloadFile(string blobContainerName, string blobName, string targetFolder)
     {
         string strFileDownloaded = string.Empty;
         string storageAccountName = "[YOUR BATCH STORAGE ACCOUNT NAME]";
         string storageAccountKey = "[YOUR BATCH STORAGE ACCOUNT KEY]";
         string connectionString = string.Format(@"DefaultEndpointsProtocol=https;AccountName={0};AccountKey={1}",
            storageAccountName, storageAccountKey);

         // get a reference to the intermediate blob inside the job's container
         CloudStorageAccount cloudStorageAccount = CloudStorageAccount.Parse(connectionString);
         CloudBlobClient cloudBlobClient = cloudStorageAccount.CreateCloudBlobClient();
         CloudBlobContainer cloudBlobContainer = cloudBlobClient.GetContainerReference(blobContainerName);
         CloudBlockBlob blobSource = cloudBlobContainer.GetBlockBlobReference(blobName);

         if (blobSource.Exists())
         {
             // blob storage uses forward slashes, Windows uses backslashes;
             // do a replace so the local path is valid
             string localDestination = Path.Combine(targetFolder, blobSource.Name.Replace("/", @"\"));

             // if the directories matching the "folders" in the blob name
             // don't exist, create them
             string dirPath = Path.GetDirectoryName(localDestination);
             if (!Directory.Exists(dirPath))
             {
                 Directory.CreateDirectory(dirPath);
             }

             blobSource.DownloadToFile(localDestination, FileMode.Create);
             strFileDownloaded = localDestination;
         }
         return strFileDownloaded;
     }

The source code for the cloud assembly demonstrating this can be found at https://github.com/spsarkar/AzureBatchDependenciesStorage.

Sign up for the preview here ( https://account.windowsazure.com/PreviewFeatures ).

Learn about Batch ( http://azure.microsoft.com/en-us/documentation/services/batch/)