Get Files in ZIP file stored on Azure without downloading it

Recently, I was working on a task where we had to get the file entries and names out of ZIP files stored on Azure. We had terabytes of data to go through, and downloading them was not really an option. At the end of the day, we solved this in a totally different way, but I remained curious whether this is possible, and it sure is.

The aim is to get all the entry names of ZIP files stored in an Azure Storage Account. Unfortunately, using our beloved HttpClient isn’t possible (or at least, I didn’t research it enough). The reason is that although HttpClient does allow us to read the HTTP response as a Stream, the Stream itself isn’t seekable (CanSeek: false).

This is why we need to use the Azure.Storage.Blobs API – it gives us a seekable Stream over a file stored in an Azure Storage Account. What this means is that we can download just the specific parts of the ZIP file where the entry names are stored, rather than the data itself. Here is a detailed diagram of how ZIP files are laid out, though this is not strictly needed since the libraries handle all the heavy lifting for us – The structure of a PKZip file (jmu.edu)
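
To make the difference concrete, here is a minimal sketch contrasting the two kinds of stream. The account, container and SAS values are placeholders (not from the original project); the point is simply what CanSeek reports in each case:

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

class SeekabilityCheck
{
    static async Task Main()
    {
        var zipUrl = "https://<account>.blob.core.windows.net/<container>/file.zip?<sas-token>";

        // HttpClient: the response content stream is forward-only.
        using var httpClient = new HttpClient();
        using var response = await httpClient.GetAsync(zipUrl, HttpCompletionOption.ResponseHeadersRead);
        using var httpStream = await response.Content.ReadAsStreamAsync();
        Console.WriteLine(httpStream.CanSeek);   // False – ZipArchive would have to read the whole file

        // Azure.Storage.Blobs: OpenReadAsync returns a seekable stream backed by range reads.
        var blobClient = new BlobClient(new Uri(zipUrl));
        using var blobStream = await blobClient.OpenReadAsync();
        Console.WriteLine(blobStream.CanSeek);   // True – ZipArchive can jump straight to the entry names
    }
}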

We will also be using the out-of-the-box ZipArchive class (System.IO.Compression), which lets us open a ZIP file from a Stream. It is also smart enough to know that if the stream is seekable, it can seek straight to the part where the file names are stored rather than reading the whole file.

Therefore, all we need to do is open a stream to the ZIP using Azure.Storage.Blobs, pass it to ZipArchive and read the entries out of it. The whole process ends up being near-instant, even for large ZIP files.

using Azure.Storage;
using Azure.Storage.Blobs;
using System;
using System.IO.Compression;
using System.Linq;
using System.Threading.Tasks;

namespace GetZipFileNamesFromAzureZip
{
    class Program
    {
        private const string StorageAccountName = "xxxxxx";
        private const string StorageAccountKey = "xxxxxxxxxxxxxxx";
        private const string ContainerName = "xxxxxxxxxx";
        private const string FileName = "file.zip";
        private const string Url = "https://" + StorageAccountName + ".blob.core.windows.net";

        static async Task Main(string[] args)
        {
            // Authenticate and get a client for the ZIP blob.
            var client = new BlobServiceClient(new Uri(Url), new StorageSharedKeyCredential(StorageAccountName, StorageAccountKey));
            var container = client.GetBlobContainerClient(ContainerName);
            var blobClient = container.GetBlobClient(FileName);

            // OpenReadAsync returns a seekable stream, so ZipArchive only pulls down
            // the ranges it needs (the central directory) to list the entries.
            using var stream = await blobClient.OpenReadAsync();
            using ZipArchive package = new ZipArchive(stream, ZipArchiveMode.Read);

            Console.WriteLine(string.Join(",", package.Entries.Select(x => x.FullName)));
        }
    }
}

Until the next one!

Compressing files on an Azure Storage Account fast and efficiently.

Currently, I am working on a project that requires zipping and compressing files that live on a storage account. Unfortunately, unless I am missing something, there is no out-of-the-box way to ZIP files on Azure Storage.

The two major possibilities I’ve found are:

  • Azure Data Factory – a cloud-based ETL and data integration service. In my research, I found that this tool can cost quite a lot, since you’re paying for the rented machines and tasks. Data Factory – Data Integration Service | Microsoft Azure
  • Writing a bespoke solution – you get the flexibility of doing whatever you want, but it probably takes more time to develop, test and maintain.

Anyway, in my case I decided to write my own application; there were other requirements I needed to satisfy, which made it too complex to implement in Azure Data Factory. I’ve written the following code (some code omitted for brevity):


// Reference to the ZIP that will be created on the destination container; write in 100 MB blocks.
CloudBlockBlob blob = targetStorageAccountContainer.GetBlockBlobReference("zipfile.zip");
blob.StreamWriteSizeInBytes = 104_857_600;

// Stream the ZIP straight to the storage account while it is being built.
using (Stream dataLakeZipFile = await blob.OpenWriteAsync())
using (var zipStream = new ZipOutputStream(dataLakeZipFile))
{
    // Enumerate every file under the source directory (hierarchical namespace / Data Lake API).
    DataLakeDirectoryClient sourceDirectoryClient = dataLakeClient.GetDirectoryClient(sourceDataLakeAccount);
    await foreach (var blobItem in sourceDirectoryClient.GetPathsAsync(recursive: true, cancellationToken: cancellationToken))
    {
        // Start a new entry in the ZIP, then copy the source file into it straight from HTTP.
        zipStream.PutNextEntry(new ZipEntry(blobItem.Name));
        var httpResponseMessage = await _httpClient.GetAsync(GetFileToAddToZip(blobItem.Name), HttpCompletionOption.ResponseHeadersRead);
        using (Stream httpStream = await httpResponseMessage.Content.ReadAsStreamAsync())
        {
            await httpStream.CopyToAsync(zipStream);
        }

        zipStream.CloseEntry();
    }

    zipStream.Finish();
}

The code does the following:

  • Create a reference to the ZIP file that is going to be created on the Storage Account. I also set StreamWriteSizeInBytes to 100MB, the largest allowed value; I never experimented with other figures. This controls how much data is written per block.
  • Open a Stream object against the zip file. This overwrites any file with the same name.
  • Get all the files you need to ZIP. In my case, I am using the Data Lake API because our files are on a Storage Account with hierarchical namespaces activated. This will work just as well if your Storage Account doesn’t use hierarchical namespaces (you can just swap in the CloudBlobContainer API).
  • Open a new HTTP connection to each source file and fetch it as a stream.
  • Copy the data received from that stream into the zip stream. This translates into HTTP requests that upload the data back to the Storage Account.
  • Close down all resources when it’s done.

Importantly, the code downloads files from the storage account and immediately uploads them back to the storage account as a ZIP. It does not store any data on physical disk; it uses RAM to buffer the data as it’s downloaded and uploaded.

Of course, this is just an excerpt of the whole system, but it can be adapted accordingly.
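
For context, here is a hedged sketch of how the pieces omitted above might be wired up. The package choices (Microsoft.Azure.Storage.Blob for CloudBlockBlob, Azure.Storage.Files.DataLake for the Data Lake client, SharpZipLib for ZipOutputStream) match the types used in the excerpt, but the account names, credentials and the GetFileToAddToZip helper are assumptions of mine, not the original project’s code:

using System;
using System.Net.Http;
using Azure.Storage;                      // StorageSharedKeyCredential
using Azure.Storage.Files.DataLake;       // DataLakeFileSystemClient / DataLakeDirectoryClient
using ICSharpCode.SharpZipLib.Zip;        // ZipOutputStream / ZipEntry
using Microsoft.Azure.Storage;            // CloudStorageAccount
using Microsoft.Azure.Storage.Blob;       // CloudBlockBlob

// Destination container (classic blob API, matching CloudBlockBlob in the excerpt).
var destinationAccount = CloudStorageAccount.Parse("<destination-connection-string>");
var targetStorageAccountContainer = destinationAccount
    .CreateCloudBlobClient()
    .GetContainerReference("zips");

// Source file system (Data Lake API, used to enumerate the files to zip).
var dataLakeClient = new DataLakeFileSystemClient(
    new Uri("https://<source-account>.dfs.core.windows.net/<filesystem>"),
    new StorageSharedKeyCredential("<source-account>", "<source-account-key>"));

// Shared HttpClient used to stream each source file down.
var _httpClient = new HttpClient();

// Hypothetical helper: builds a readable URL (e.g. with a SAS token) for a source path.
string GetFileToAddToZip(string path) =>
    $"https://<source-account>.blob.core.windows.net/<filesystem>/{path}?<sas-token>";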

Until the next one!

Security by Obscurity – in real life!

We were discussing security by obscurity in the office today – it’s always a topic that we end up having a laugh at. If you have no idea what I’m talking about, read about security by obscurity here.

That’s all fine and funny, until you witness it. We Maltese just witnessed it last weekend, with a twist. Instead of being in some poorly written software, this was in a shop. Basically, a local jewellery shop was robbed by professionals, and they removed / deleted all the security footage in the process!

You might say that this is not IT related – but I’m afraid it’s very relevant. This got me thinking – how did they get access to the security footage? Was it there, exposed, just waiting for someone to meddle with and delete the footage? It seems these people thought so. Although I don’t have many details on how this was done, I would assume that these shops don’t have another site where the footage is kept, just in case accidents like these happen.

So, what do I propose? Simple – it’s a bit illogical to keep the security footage at the same site where it’s being recorded. Ideally, this footage would be (instantly) moved to some off-site storage, making use of the cloud. Is there any provider doing this? A quick Google search says yes: I’ve found examples such as CamCloud. Of course, I have no idea what the company offers since I’m not affiliated with it.

Given that today’s world is moving to the cloud, I can’t help but wonder if incidents like these can be mitigated by using such cloud services.