Compressing files on an Azure Storage Account quickly and efficiently.

Currently, I am working on a project that requires zipping and compressing files that exist on a storage account. Unfortunately, unless I am missing something, there is no out-of-the-box way to ZIP files on an Azure Storage Account.

The two major possibilities I've found are:

  • Azure Data Factory – a cloud-based ETL and data-integration service. In my research, I found that this tool can cost quite a lot, since you're paying for the underlying compute and for each task that runs. Data Factory – Data Integration Service | Microsoft Azure
  • Writing a bespoke solution – of course, you've got the flexibility of doing whatever you want, but it probably takes more time to develop, test and maintain.

Anyway, in my case I decided to write my own application; there were other requirements I needed to satisfy, which made it too complex to implement in Azure Data Factory. I've written the following code (some code omitted for brevity):


// Requires Microsoft.Azure.Storage.Blob (CloudBlockBlob), Azure.Storage.Files.DataLake
// (DataLakeDirectoryClient) and ICSharpCode.SharpZipLib.Zip (ZipOutputStream, ZipEntry).
CloudBlockBlob blob = targetStorageAccountContainer.GetBlockBlobReference("zipfile.zip");
blob.StreamWriteSizeInBytes = 104_857_600;   // write 100 MB per block

using (Stream dataLakeZipFile = await blob.OpenWriteAsync())
using (var zipStream = new ZipOutputStream(dataLakeZipFile))
{
    DataLakeDirectoryClient sourceDirectoryClient = dataLakeClient.GetDirectoryClient(sourceDataLakeAccount);
    await foreach (var blobItem in sourceDirectoryClient.GetPathsAsync(recursive: true, cancellationToken: cancellationToken))
    {
        // The recursive listing also returns directories; only files go into the ZIP.
        if (blobItem.IsDirectory == true)
        {
            continue;
        }

        zipStream.PutNextEntry(new ZipEntry(blobItem.Name));

        // ResponseHeadersRead means the body is streamed rather than buffered in full.
        using (var httpResponseMessage = await _httpClient.GetAsync(GetFileToAddToZip(blobItem.Name), HttpCompletionOption.ResponseHeadersRead))
        using (Stream httpStream = await httpResponseMessage.Content.ReadAsStreamAsync())
        {
            await httpStream.CopyToAsync(zipStream);
        }

        zipStream.CloseEntry();
    }

    zipStream.Finish();
}

The code above does the following:

  • Create a reference to the ZIP file that will be created on the Storage Account. I also set StreamWriteSizeInBytes to 100 MB, the largest value the SDK allows; I never experimented with other figures. This controls how much data is written per block.
  • Open a Stream object against the ZIP file. This overwrites any existing file with the same name.
  • Get all the files you need to ZIP. In my case, I am using the Data Lake API because our files are on a Storage Account with hierarchical namespaces activated. This works just as well if your Storage Account doesn't use hierarchical namespaces (you can swap in the CloudBlobContainer API instead; see the sketch after this list).
  • Open a new connection to each source file and fetch it as a stream.
  • Copy the data received from that stream into the ZIP stream. This translates into HTTP requests that upload the data back to the Storage Account.
  • Close down all resources when it's done.
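
For the flat-namespace route mentioned above, here is a minimal sketch of what that swap could look like, sticking to the same classic SDK as the excerpt; sourceContainer is an assumed CloudBlobContainer pointing at the source container, and zipStream is the ZipOutputStream from the excerpt:

// Hypothetical flat-namespace variant: enumerate every blob in the
// container page by page and stream each one into the ZIP.
BlobContinuationToken continuationToken = null;
do
{
    BlobResultSegment segment = await sourceContainer.ListBlobsSegmentedAsync(
        prefix: null,
        useFlatBlobListing: true,
        blobListingDetails: BlobListingDetails.None,
        maxResults: null,
        currentToken: continuationToken,
        options: null,
        operationContext: null);

    foreach (IListBlobItem item in segment.Results)
    {
        if (item is CloudBlockBlob sourceBlob)
        {
            zipStream.PutNextEntry(new ZipEntry(sourceBlob.Name));
            using (Stream blobStream = await sourceBlob.OpenReadAsync())
            {
                await blobStream.CopyToAsync(zipStream);
            }
            zipStream.CloseEntry();
        }
    }

    continuationToken = segment.ContinuationToken;
} while (continuationToken != null);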

Importantly, the code downloads each file from the storage account and immediately streams it back to the storage account as part of the ZIP. No data is stored on physical disk; RAM is used to buffer the data as it's downloaded and uploaded.
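
As an aside, the same streaming behaviour is available without hand-rolling HTTP calls: the Data Lake SDK can hand back a read stream directly. A minimal sketch of the loop body, assuming the dataLakeClient and zipStream from the excerpt above (OpenReadAsync comes from Azure.Storage.Files.DataLake's DataLakeFileClient):

// Alternative to HttpClient: ask the SDK for a read stream.
// blobItem.Name is the file's full path within the file system.
DataLakeFileClient fileClient = dataLakeClient.GetFileClient(blobItem.Name);

zipStream.PutNextEntry(new ZipEntry(blobItem.Name));
using (Stream fileStream = await fileClient.OpenReadAsync())
{
    // Still fully streamed: nothing touches the local disk.
    await fileStream.CopyToAsync(zipStream);
}
zipStream.CloseEntry();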

Of course, this is just an excerpt of the whole system that's needed, but it can be adapted accordingly.

Until the next one!

Security by Obscurity – in real life!

We were discussing security by obscurity in the office today – it's a topic we always end up having a laugh about. If you have no idea what I'm talking about, read about security by obscurity here.

That's all fine and funny, until you witness it. We Maltese just witnessed it last weekend, with a twist. Instead of being in some poorly written software, it was in a shop: a local jewellery shop was robbed by professionals, and they removed/deleted all the security footage in the process!

You might say this is not IT related – but I'm afraid it's very relevant. It got me thinking: how did they get access to the security footage? Was it sitting there, exposed, just waiting for someone to meddle with it and delete it? It seems these people thought so. Although I don't have many details on how it was done, I would assume these shops don't keep a copy of the footage at another site in case accidents like this happen.

So, what do I propose? Simple – it's a bit illogical to keep security footage only at the same site where it's being recorded. Ideally, the footage would be (instantly) moved to some off-site storage, making use of the cloud. Is there any provider doing this? A quick Google search says yes: I've found examples such as CamCloud. Of course, I have no idea what the company actually offers, since I'm not affiliated with it.
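
Just to make the idea concrete, here's a minimal sketch of an "instant off-site copy" against Azure Blob Storage; the folder path, container name and connection string are all hypothetical, and a real system would also need to handle clips that are still being written:

using System.IO;
using Azure.Storage.Blobs;

// Hypothetical: watch the recorder's output folder and copy every
// finished clip off-site the moment it appears on disk.
var container = new BlobContainerClient("<connection-string>", "security-footage");

var watcher = new FileSystemWatcher(@"C:\CCTV\recordings", "*.mp4");
watcher.Created += async (_, e) =>
{
    // Once the clip exists off-site, wiping the local recorder
    // no longer destroys the only copy.
    await container.GetBlobClient(e.Name).UploadAsync(e.FullPath, overwrite: true);
};
watcher.EnableRaisingEvents = true;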

Given that today’s world is moving to the cloud, I can’t help but wonder if incidents like these can be mitigated by using such cloud services.