Why a three node Zookeeper cluster is probably not a great idea

Just to get this out of the way: you might find conflicting information online saying that Kafka no longer requires a Zookeeper installation, thanks to a mode referred to as KRaft. At the time of writing, KRaft was marked production ready only a month ago. I still think it might be a bit too early to adopt, especially since the tooling around KRaft may be immature; Confluent still marks KRaft as early access. So, let's focus on Kafka with Zookeeper.

What is Zookeeper? It is a tool that other products build on; on its own it doesn't really do much. If your product requires reliable distributed coordination, Zookeeper is a great fit, and that is exactly why Kafka uses it. Kafka uses Zookeeper to hold multiple data structures, such as:

  • The set of live brokers
  • Topic Configuration
  • In-Sync Replicas
  • Topic Assignments
  • Quotas

Truth be told, you should not be overly concerned about how these work in the background; this is simply information that Kafka stores to remain functional. The main takeaway is that these are all decisions that need to be coordinated across a cluster, and Zookeeper does that coordination on behalf of Kafka.

So, let's dive into why I'm writing this article in the first place. A production-worthy Zookeeper deployment should not be a single node; it should be a cluster of nodes, referred to as an Ensemble. For an Ensemble to be healthy, it requires a majority (over 50%) of the servers to be online and communicating with each other. This majority is also referred to as a Quorum. Refer to this guide for more information – https://zookeeper.apache.org/doc/r3.4.10/zookeeperAdmin.html

Online, you'll find this magic formula – 2×F + 1. F here is the number of servers that can be offline while the ensemble remains functional; 2×F + 1 is the ensemble size needed to tolerate that many failures.

If, for example, you cannot tolerate any servers being offline, 2×0 + 1 = 1 – you only need one server. But this means you can never afford a single failure, which isn't good, so we're not even going to consider that.
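To make the arithmetic concrete, here is a minimal sketch (my own illustration, not from the original article) that computes, for a given ensemble size, the quorum size and how many failures the ensemble can absorb:

using System;

class QuorumMath
{
    // A majority (quorum) of n servers is n / 2 + 1, so an ensemble of
    // 2F + 1 servers keeps a quorum with up to F servers offline.
    static int Quorum(int ensembleSize) => ensembleSize / 2 + 1;
    static int ToleratedFailures(int ensembleSize) => ensembleSize - Quorum(ensembleSize);

    static void Main()
    {
        foreach (var n in new[] { 1, 3, 5, 7 })
            Console.WriteLine($"{n} servers -> quorum of {Quorum(n)}, tolerates {ToleratedFailures(n)} failure(s)");
        // 1 servers -> quorum of 1, tolerates 0 failure(s)
        // 3 servers -> quorum of 2, tolerates 1 failure(s)
        // 5 servers -> quorum of 3, tolerates 2 failure(s)
        // 7 servers -> quorum of 4, tolerates 3 failure(s)
    }
}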

3 Server Ensemble

Let's try it with tolerating one offline server – 2×1 + 1 = 3. Therefore, the Ensemble size should be 3 servers, to allow one failure. In fact, this is what you'll find online – a minimum of three servers. At face value, it might look OK, but I believe it's just a ticking time bomb.

When everything is healthy, the ensemble will look like the below.

When a node goes down, the ensemble will look like the below. It will still function, but cannot tolerate any more losses.

If a network split happens, or another node goes offline, a Quorum cannot be formed; the ensemble goes offline and intervention is needed.

You might say that tolerating one failure sounds GREAT. I'd say it's far from enough. Let's face it, servers need to be patched, and this happens often. What happens if patching fails (or takes an awfully long time), and another server happens to suffer a failure? You may say that this is highly unlikely, but I recently witnessed a server that took almost a day to be patched, fixed and brought back online. If, during that time, a blip had happened in the network, the two remaining servers would have lost contact with each other and neither could have formed a quorum, taking the ensemble offline – the quorum requirement is exactly what protects Zookeeper from the Split Brain problem – https://en.wikipedia.org/wiki/Split-brain_(computing)

In my opinion, there will be times when you operate with one node down, whether for patching or because the bare metal suffers some problem; therefore, 3 nodes aren't enough.

5 Server Ensemble

Here's what I think is a TRUE minimum – allowing 2 nodes to fail before going offline. To tolerate two failures, using the previous formula: 2×2 + 1 = 5. Therefore, the Ensemble size should be 5 servers, to allow for two failures.

When everything is healthy, the ensemble will look like the below.

When a node goes down, the ensemble will look like the below. It will still function and we can tolerate another loss.

When another node goes down, the ensemble will look like the below. It will still function, but cannot tolerate any more losses.

With 3 nodes down, less than 50% of the nodes are available, hence a Quorum cannot be formed.

The ensemble can also be taken offline by a combination of one offline node and a network split.

With a 5 server ensemble, if you are doing maintenance on a particular node, you can still afford to lose another node before the ensemble is taken offline. Of course, this is still vulnerable to the same scenario: if your 5 server ensemble has one node offline and then suffers a network split right down the middle, you end up with two nodes on each side, neither able to serve any requests. Great care needs to be taken over which part of a data centre the servers are provisioned in.

This scaling can continue upwards, e.g. a 7 node ensemble. Of course, at that point you'll need to factor in the higher latency it brings (because more nodes will need to acknowledge writes) and the overall maintenance required. I've also omitted discussing the correct placement of each server node across availability zones; great care must be taken there as well.

Until the next one!

Get Files in ZIP file stored on Azure without downloading it

Recently, I was working on a task where we had to get the file entries and names out of ZIP files stored on Azure. We had terabytes of data to go through, and downloading it all was not really an option. At the end of the day, we solved this in a totally different way, but I remained curious whether this was possible, and it sure is.

The aim is to get all the entry names of ZIP files stored in an Azure Storage Account. Unfortunately, using our beloved HttpClient isn't possible (or at least, I didn't research it enough). The reason is that although HttpClient does let us read the HTTP response body as a Stream, that Stream isn't seekable (CanSeek: false).

This is why we need the Azure.Storage.Blobs API – it gives us a seekable Stream over a file stored in an Azure Storage Account. What this means is that we can download just the specific parts of the ZIP file where the names are stored, rather than the data itself. Here is a detailed diagram of how ZIP files are laid out, though you don't need it since the libraries handle all the heavy lifting for us – The structure of a PKZip file (jmu.edu)
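As a quick illustration of that difference (a sketch I'm adding here, not part of the original post – the URL is a placeholder and assumes a SAS token grants read access), compare the CanSeek flag on both kinds of stream:

using System;
using System.Net.Http;
using Azure.Storage.Blobs;

// Placeholder blob URL; assumes a SAS token is appended for read access.
var url = "https://myaccount.blob.core.windows.net/mycontainer/file.zip?sv=...";

using var httpClient = new HttpClient();
using var httpStream = await httpClient.GetStreamAsync(url);
Console.WriteLine(httpStream.CanSeek);  // false – ZipArchive would have to read the whole blob

var blobClient = new BlobClient(new Uri(url));
using var blobStream = await blobClient.OpenReadAsync();
Console.WriteLine(blobStream.CanSeek);  // true – ZipArchive can jump straight to the entry names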

We will also be using the out-of-the-box ZipArchive class (System.IO.Compression). This allows us to open a ZIP file from a Stream. It is also smart enough to know that when a stream is seekable, it can seek directly to the part of the file where the entry names are stored rather than reading the whole file.

Therefore, all we need to do is open a stream to the ZIP using Azure.Storage.Blobs, pass it to ZipArchive and read the entries out of it. The process ends up being almost instant, even for large ZIP files.

using Azure.Storage;
using Azure.Storage.Blobs;
using System;
using System.IO.Compression;
using System.Linq;
using System.Threading.Tasks;

namespace GetZipFileNamesFromAzureZip
{
    class Program
    {
        private const string StorageAccountName = "xxxxxx";
        private const string StorageAccountKey = "xxxxxxxxxxxxxxx";
        private const string ContainerName = "xxxxxxxxxx";
        private const string FileName = "file.zip";
        private const string Url = "https://" + StorageAccountName + ".blob.core.windows.net";

        static async Task Main(string[] args)
        {
            // Authenticate against the storage account and get a client for the ZIP blob.
            BlobServiceClient client = new BlobServiceClient(new Uri(Url), new StorageSharedKeyCredential(StorageAccountName, StorageAccountKey));
            var container = client.GetBlobContainerClient(ContainerName);
            var blobClient = container.GetBlobClient(FileName);

            // OpenReadAsync returns a seekable stream, so ZipArchive only downloads what it needs.
            var stream = await blobClient.OpenReadAsync();
            using ZipArchive package = new ZipArchive(stream, ZipArchiveMode.Read);

            // Reading the entries only touches the ZIP's central directory, not the file data.
            Console.WriteLine(string.Join(",", package.Entries.Select(x => x.FullName).ToArray()));
        }
    }
}

Until the next one!

Compressing files on an Azure Storage Account fast and efficiently.

Currently, I am working on a project that requires zipping and compressing files that live on a storage account. Unfortunately, unless I am missing something, there is no out-of-the-box way to ZIP files on Azure Storage.

The two major possibilities I've found are:

  • Azure Data Factory – a cloud-based ETL and data integration service. In my research, I found that this tool can cost quite a lot, since you're paying for the rented machines and tasks. Data Factory – Data Integration Service | Microsoft Azure
  • Writing a bespoke solution – of course, you've got the flexibility to do whatever you want, but it takes more time to develop, test and so on.

Anyway, in my case I decided to write my own application; there were other requirements I needed to satisfy, which made it too complex to implement in Azure Data Factory. I've written the following code (some code omitted for brevity):


// Reference to the ZIP file that will be created on the target Storage Account.
CloudBlockBlob blob = targetStorageAccountContainer.GetBlockBlobReference("zipfile.zip");
blob.StreamWriteSizeInBytes = 104_857_600; // write 100MB blocks

// Open a write stream against the target blob and wrap it in a ZIP output stream.
using (Stream dataLakeZipFile = await blob.OpenWriteAsync())
using (var zipStream = new ZipOutputStream(dataLakeZipFile))
{
    DataLakeDirectoryClient sourceDirectoryClient = dataLakeClient.GetDirectoryClient(sourceDataLakeAccount);
    await foreach (var blobItem in sourceDirectoryClient.GetPathsAsync(recursive: true, cancellationToken: cancellationToken))
    {
        // Start a new entry in the ZIP, then stream the source file straight into it.
        zipStream.PutNextEntry(new ZipEntry(blobItem.Name));
        var httpResponseMessage = await _httpClient.GetAsync(GetFileToAddToZip(blobItem.Name), HttpCompletionOption.ResponseHeadersRead);
        using (Stream httpStream = await httpResponseMessage.Content.ReadAsStreamAsync())
        {
            await httpStream.CopyToAsync(zipStream);
        }

        zipStream.CloseEntry();
    }

    zipStream.Finish();
}

The code above does the following:

  • Create a reference to the ZIP file that is going to be created on the Storage Account. I also set StreamWriteSizeInBytes to 100MB, the largest value allowed; I never experimented with other figures. This controls how much data is written per block.
  • Open a Stream object against the zip file. This overwrites any file with the same name.
  • Get all the files you need to ZIP. In my case, I am using the Data Lake API because our files are on a Storage Account with hierarchical namespaces enabled. This works just as well if your Storage Account doesn't use hierarchical namespaces (you can swap in the CloudBlobContainer API).
  • Open a new connection to each source file and fetch it as a stream.
  • Copy the data received from that stream into the ZIP stream. This translates into HTTP requests that upload the compressed data back to the Storage Account.
  • Close down all resources when it's done.

Importantly, the code downloads files from the storage account and instantly uploads them back to the storage account as a ZIP. It does not store any data on physical disk; it uses RAM to buffer the data as it's downloaded and uploaded.

Of course, this part is just an excerpt of the whole system needed, but it can be adapted accordingly.

Until the next one!

Overclocking your Zen 3 / Ryzen 5000 with Precision Boost Overdrive 2 and Curve Optimizer

Ever since I have written my experience using Precision Boost Overdrive 2 and Curve optimizer in my last blog post, I have been asked several questions on how to overclock your Ryzen 5000 CPU. Let’s discuss the basics for overclocking on Ryzen 5000.

Please treat this as a beginner's starting guide – you'll need to spend a lot of time tweaking, especially in the Curve Optimizer. This is not an ultimate overclocking guide, and some people might (and already did) disagree with the values and flow presented here. Having said that, even if other approaches are better, they will be only slightly better – maybe 1-3%, within margin of error. Following this guide WILL net you a performance gain; maybe not the BEST possible gain, but a measurable one.

The following guide should work for the following CPUs:

  • Ryzen 9 5950x
  • Ryzen 9 5900x
  • Ryzen 7 5800x
  • Ryzen 5 5600x

The following should similarly work for Ryzen 3000 series, but you will not have access to the Curve Optimizer. Blame AMD for this.

Ryzen 5000 – Traditional overclocking is dead

Traditional overclocking involved going into the BIOS, typing in a nice voltage and a reasonable clock speed, and you were done. You can still do that, and you will get a nice score in Cinebench, but you'll lose everyday performance. Why?

By locking in a fixed clock, you lose the higher boost clocks achieved in lightly threaded workloads (unless you manage a fixed 5GHz overclock... if you do that, please write a guide for us!). These are the kinds of workloads you go through every day, and they are the most important. If you set a fixed overclock of, say, 4.6 GHz, you won't be able to go over 4.6 GHz in common tasks, which slows them down.

Ryzen’s boost algorithm is smart

On the other hand, Ryzen's boost algorithm is designed to go past the usual clocks and boost as much as possible, as long as there is enough power coming in and the temperatures are in check. Trust the AMD engineers here; my 5900X is easily able to go past 5GHz.

The golden trio – PBO2, Power Settings and Curve Optimizer

In order to achieve an actual “overclock” on Ryzen 5000, we’ll need to dive into three major components – PBO2, Curve Optimizer and Power Settings

Precision Boost Overdrive 2

Precision Boost Overdrive (PBO for short) extends the out-of-the-box parameters that dictate performance on a Ryzen CPU – temperature, SoC (chip) power and VRM current (power delivery). PBO raises the maximum thresholds for these components, allowing faster clock speeds to be sustained for longer. In short, it is AMD's inbuilt overclocking capability baked into your CPU.

PBO Triangle – via https://hwcooling.net

Power Settings

Here, I'm referring to the three major power settings – PPT, TDC and EDC. PPT is the total power the CPU can draw. TDC is the amount of current the CPU is fed under sustained load (thermally and electrically limited). EDC is the amount of current the CPU is fed in short bursts (electrically limited). Allowing the CPU to take more power overall allows it to boost to higher clock speeds. In the PBO triangle analogy, this positively impacts the left and right vertices – SoC power and VRM current – while negatively impacting the top vertex – heat.

Curve Optimizer

The Curve Optimizer allows you to undervolt your CPU. Undervolting means pushing slightly less voltage, which consumes less power and generates less heat. Combined with Precision Boost Overdrive 2, this means you're producing less heat, allowing the CPU to boost to higher clock speeds. In the PBO triangle analogy, this mostly impacts the top vertex – heat.

Striking a balance with your settings and overclocking your Ryzen 5000

Now that we've established our three main players, let's tackle them one by one. To access these settings, you'll need to enter your BIOS – they are typically located under Advanced -> AMD Overclocking -> Precision Boost Overdrive. Here's a sample from my ASRock X570 Steel Legend.

After discussions with my readers, it seems people suggest different priorities when it comes to overclocking. I believe a modest yet stable overclock can be achieved by prioritizing in this order:

  1. Scalar / Max CPU Override
  2. Power Settings
  3. Curve Optimizer

Some readers believe that the best priority is:

  1. Curve Optimizer
  2. Power Settings
  3. Scalar / Max CPU Override

If you are confused like me, pick the easier one and consider following this guide. Both will provide a nice performance gain, and the difference between one method and the other is maybe 1-2%, which is negligible in real life.

Precision Boost Overdrive 2

This should be the easiest – let's just follow AMD's recommendations. Looking at their slides (AMD Precision Boost Overdrive 2: Official Tech Briefing! – YouTube), we can start with the settings that matter for turning on PBO.

  • Precision Boost Overdrive – Advanced
    • Allows us to turn on PBO and make manual adjustments to its settings
  • PBO Scalar – 10X
    • Should allow you to sustain boost clocks for longer.
    • Some readers argue that this value should actually be 1x; I cannot verify this. They argue that setting it to 10x will raise your overall voltage. During my brief testing, I've not observed this, but that may change with more testing.
  • Max CPU Boost Clock Override – 200Mhz
    • Raises your maximum boost frequency by 200MHz. On a 5900X, this translates to a theoretical limit of 5150MHz, which is realistic.
    • Readers tell me that setting a +200MHz Max CPU Boost Clock Override might negatively impact how far you can push the Curve Optimizer. Unfortunately, I have neither the time nor the data to back this up.
    • PURE SPECULATION / MY THOUGHTS AHEAD (no data to back up this claim whatsoever) – by reducing the Max CPU Boost Clock Override, you'll of course lose the highest single core boost clocks, potentially reducing single core performance, but you'll be able to push a higher multi core score, or reach the "lower" max single core boost more regularly. This would require extensive separate testing (and the differences would probably fall within margin of error).

Power Settings

In their slides (link above), AMD suggest using Power Limits = Motherboard. I strongly discourage this as it may limit your power intake (this was noticed both by me and readers in my blog – My Experience with Precision Boost Overdrive 2 on a 5900X – Albert Herd, comment by Julien Galland).

For my 5900X, these are the settings I've applied. If you have a 5950X or 5900X, these values may (or may not) be suitable for you; if you have a 5800X or lower, they are too high and will hinder performance. Apply lower settings to accommodate your CPU – a decent bump over AMD's default values quoted below. Unfortunately, I don't own anything apart from a 5900X, so I cannot vouch for these settings on other models.

  • If you have very good cooling (such as a custom loop or strong cooling in general):
    • PPT – 185W
    • TDC – 125A
    • EDC – 170A
  • If your CPU gets too hot with those settings, try a more conservative set. In my case, these hover around 70-75°C:
    • PPT – 165W
    • TDC – 120A
    • EDC – 150A

You might notice that your CPU runs too "cool" or too hot; in that case, adjust the figures accordingly. In a multi core benchmark, these limits should all hit 100%. In most workloads, it's EDC that plays a role rather than TDC (since most workloads are short bursts). I also noticed that going too low on EDC causes instability.

Leave SoC TDC and SoC EDC at 0; these should not impact us (I believe they mostly apply to APUs).

For completeness' sake, please keep in mind AMD's default values when making adjustments:

  • Package Power Tracking (PPT): 142W for the 5950X, 5900X and 5800X; 88W for the 5600X.
  • Thermal Design Current (TDC): 95A for the 5950X, 5900X and 5800X; 60A for the 5600X.
  • Electrical Design Current (EDC): 140A for the 5950X, 5900X and 5800X; 90A for the 5600X.

Curve optimizer

This is probably the most annoying one. The numbers you’re inputting here will vary significantly from one chip to another, so your mileage may vary. These are my values:

  • Negative 11 for the first preferred cores on CCX 0 (as indicated by Ryzen Master)
  • Negative 15 for the second preferred core on CCX 0 (as indicated by Ryzen Master)
  • Negative 17 for the other cores.

If you want to start safe, you can apply a Negative 10 offset on all cores.

Testing this setting is extremely painful. You'll notice that crashes don't happen under load; they happen under idle conditions, where the CPU undervolts too much. Hopefully, AMD will refine this algorithm in future BIOS updates and provide more stability. In my experience, Geekbench 5 – Cross-Platform Benchmark is a great tool to stress the CPU; it tends to crash when the settings are not right.

Please keep in mind the note I wrote about the Max CPU Boost Clock Override (under the Precision Boost Overdrive 2 header). Some users prefer to keep the override lower and push a more aggressive curve.

In my next post, we will look at how to get the best performance from your RAM, by applying specific DRAM configurations according to the RAM sticks you own. If you feel adventurous and think you can do it on your own:

Thanks for reading!

My Experience with Precision Boost Overdrive 2 on a 5900X

Looking for the TL;DR? These are my everyday settings:

  • PPT – 185W, TDC – 125A, EDC – 170A. To run these power settings, you’ll need a beefy cooler. If the CPU gets too hot with these power settings, try PPT- 165W, TDC – 115A, EDC – 150A
  • Negative 11 for the first preferred cores on CCX 0 (as indicated by Ryzen Master)
  • Negative 15 for the second preferred core on CCX 0 (as indicated by Ryzen Master)
  • Negative 17 for the other cores.
  • These moved my multithreaded Cinebench R20 score from 8250 to around 8800-9000 (6-9% gain) and my single threaded Cinebench R20 score from 630 to 650 (3% gain).

__________________________________________________________________________________________________________________

Recently AMD announced a new algorithm for Precision Boost Overdrive (PBO), aptly named Precision Boost Overdrive 2 (PBO2). You can read more here: AMD Ryzen™ Technology: Precision Boost 2 Performance Enhancement | AMD and here: AMD Introduces Precision Boost Overdrive 2, Boosts Single Thread Performance | Tom's Hardware. This post is not intended to explain the technicalities of this feature, but rather how to take advantage of it.

To get started, you will need to go into the BIOS. Unfortunately, for now you cannot use Ryzen Master to do this, but AMD claims it will be part of Ryzen Master in a future release. In the PBO section, you will need to adjust a few settings.

Navigating to AMD Overclocking in the BIOS

My specs are as following:

  • AMD Ryzen 5900x
  • ASRock X570 Steel Legend
  • 32GB C17 Memory
  • 750w PSU
  • 240MM AIO from BeQuiet.

At first, naively, I set the power limits (PPT, TDC and EDC) to 0, which means unlimited. This has a negative effect: it lets the CPU draw as much power as it can, which translates into unnecessary power consumption and ends up limiting the maximum clock speed achieved. I'd suggest sticking to values that keep the CPU under (or close to) 80°C under full load.

In my case, the maximum power settings I manage to sustain are: PPT – 185W, TDC – 125A, EDC – 170A. The right values for your CPU will vary according to silicon quality and the cooling provided. Cooling 185W is no easy feat; you'll need a good cooler, such as an NH-D15 (noctua.at) or a good AIO (I am using the Pure Loop 240mm from be quiet!).

Setting PPT, TDC and EDC to well balanced values is extremely important; it helps you strike a balance between the power the CPU needs and realistic temperatures. If the CPU gets too hot with these power settings, try PPT – 165W, TDC – 115A, EDC – 150A.

I have set the PBO scalar to manual and 10x. I'll be honest, I am not sure what impact this has, but it looks like a setting worth tweaking. I've tried 1x and honestly did not feel any difference. From what I understand, it controls how long the CPU keeps pumping high voltage / clocks before dialling down. In burst scenarios, it should not have any impact.

Max CPU Boost Clock Override should be set to 200MHz. This allows for higher clock speeds in single threaded workloads. My 5900X can hit 5.15 GHz on a single core with this setting, and 5.15 GHz is not a one-off number – I regularly see it during light workloads.

Navigating to the Curve Optimizer in BIOS

Now, for the most important part: the Curve Optimizer. To begin with, I set the best and second best cores on each CCD to negative 10, and the other cores to negative 15.

The next step is quite difficult to give instructions for, as it depends purely on your silicon quality. In my case, I found the following settings to work for me:

  • Negative 11 for the first preferred cores on CCX 0 (as indicated by Ryzen Master)
  • Negative 15 for the second preferred core on CCX 0 (as indicated by Ryzen Master)
  • Negative 17 for the other cores

It took quite a lot of testing to arrive at these figures. You can identify the first and second preferred cores in Ryzen Master.

Per Clock adjustments in the Curve Optimizer

First, I started with negative 20 on all cores. This resulted in awesome Cinebench R20 scores but poor stability. I then went to negative 15 on all cores. This was not bad, but I was experiencing a crash every now and then, especially when the PC was running cold and able to push higher clocks. It would run all day, but right after boot, pushing it would instantly result in a crash. This told me the algorithm was trying to push for more clocks, but the undervolting was too aggressive.

I then went to negative 10 on all cores, which was fully stable. Finally, I pushed negative 15 for the cores which are not first or second. This remained stable, and eventually I started tweaking the values slightly every day. Sometimes I go too far and get a WHEA BSOD (especially when the PC is cool and under light workloads).

These moved my multithreaded Cinebench R20 score from 8250 to around 8800-9000 (6-9% gain) and my single threaded Cinebench R20 score from 630 to 650 (3% gain). These are small gains, but when they come at no cost, it's good to take advantage of them. And yes, they do not really translate into any tangible performance uplift in everyday computing.

Preferred Cores (Star is 1st, dot is second)

The performance uplift is thanks to higher sustained clocks. With PBO turned off, I was sustaining around 4.1 GHz core clock and with PBO on, I am sustaining between 4.4-4.5 GHz in Cinebench R20.

Cinebench scores with PBO2
Full load under Cinebench R20

Simpler workloads (non AVX) will clock past 4.5 GHz. I suspect that Ryzen calms down the clocks by a bit during AVX workloads, but I cannot confirm this.

Full load under a synthetic load – Memtest 64

Please let me know your experience with PBO2 and whether you found this post useful. If you have better settings than mine, I'd appreciate the feedback! Of course, keep in mind that, as AMD says, no two processors are the same; some need more voltage than others to remain stable. It also depends on power delivery quality, sustained temperatures, the quality of the thermal paste, the overall case temperature and a plethora of other things, as mentioned in the first link to AMD's site.

Automating bank cheque analysis by using Microsoft Cognitive Services

Just looking for code? https://github.com/albertherd/ChequeAnalyser

Microsoft Cognitive Services is a rich set of AI services covering Computer Vision, Speech Recognition, Decision Making and Natural Language Processing. The great thing about these tools is that you don't have to be an AI expert to use them, as the models come pre-trained and production ready. You just feed in your data and let the service work for you.

We'll be looking at one area of Microsoft's Cognitive Services – Computer Vision; more specifically, the handwriting API – you provide the handwriting and the service gives you back the actual text. We've already worked with the Computer Vision API from Microsoft Cognitive Services – we used it to tag our photo album.

Let's look at today's scenario – we're a fictitious bank that processes cheques. These cheques come handwritten from our clients and contain instructions on how to transfer money from one account to the holder's account.

A cheque typically has the following information:

  • Issue Date
  • Payee
  • Amount (in digits)
  • Amount (in words, for cross reference)
  • Payer’s account number
  • Other information, which was omitted for this proof of concept

This is what our fictitious cheque looks like.

Fictitious cheque – credit: https://www.dreamstime.com/royalty-free-stock-image-blank-check-false-numbers-image7426056

This is how our fictitious cheque looks when we highlight the regions we're interested in, represented as bounding boxes.

Cheque Template with bounding boxes representing areas of interest

Let’s consider these three handwritten cheques.

The attached application does the following analysis:

  • Import these cheques as images.
  • Send the images over to Microsoft Cognitive Services
  • Extract all the handwriting / text found in the image
  • Keep only the text we're interested in (as represented by the bounding boxes shown previously) – see the sketch after this list
  • Forward the extracted information to whatever system needs it. In our case, we're just printing it to screen.
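To make the bounding box filtering step concrete, here is a minimal sketch (my own illustration, not code from the linked repository – the types and region coordinates are hypothetical) that keeps only the recognised lines falling inside a region of interest:

using System.Collections.Generic;
using System.Drawing;
using System.Linq;

// A recognised line of text, reduced to what this sketch needs.
record RecognizedLine(string Text, Rectangle BoundingBox);

static class ChequeRegions
{
    // Hypothetical regions of interest on the cheque template, in pixels.
    public static readonly Dictionary<string, Rectangle> Regions = new()
    {
        ["IssueDate"] = new Rectangle(600, 40, 220, 60),
        ["Payee"] = new Rectangle(120, 120, 500, 60),
        ["AmountInDigits"] = new Rectangle(650, 120, 180, 60),
    };

    // Keep only the lines whose bounding box sits fully inside a region of interest.
    public static IEnumerable<(string Field, string Text)> Extract(IEnumerable<RecognizedLine> lines) =>
        from line in lines
        from region in Regions
        where region.Value.Contains(line.BoundingBox)
        select (region.Key, line.Text);
}

class Demo
{
    static void Main()
    {
        var lines = new[]
        {
            new RecognizedLine("01/02/2019", new Rectangle(620, 50, 150, 30)),
            new RecognizedLine("John Doe", new Rectangle(140, 130, 200, 30)),
            new RecognizedLine("Some unrelated scribble", new Rectangle(10, 400, 300, 30)),
        };

        foreach (var (field, text) in ChequeRegions.Extract(lines))
            System.Console.WriteLine($"{field}: {text}");
        // IssueDate: 01/02/2019
        // Payee: John Doe
    }
}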

Below is the resulting information derived from the sample cheques.

Results of Cheque Analysis

Most of the heavy lifting is done by Microsoft Cognitive Services, which makes these AI tools available to the masses. Of course, with a bit more business logic, the information extracted can be greatly improved and made production ready.

As with the previous example, this example uses the TPL Dataflow library, which is an excellent tool for Actor-Based multithreaded applications.

If you want to try this yourself, you'll need an Azure account with a Cognitive Services resource (for the API key and endpoint) and the code from the GitHub repository linked above.

Until the next one!

How to avoid getting your credit card details stolen online (some practical ways)

Let's jump straight to the points, then some back-story.

  • Only submit credit card details to make purchases on well-known shops, such as Amazon, eBay or official product merchants.
  • Use 3rd party paying capabilities where possible, such as PayPal. With this method, the merchant never has access to your credit card details, only to the authorized funds.
  • Ideally, use a debit card rather than a credit card online. If your card details get stolen, the thief can only spend the available funds, and you don't end up in debt.
  • Also, only transfer the money you're going to use when making the purchase; the card should be empty in general.
  • Use applications such as Revolut which have capabilities to enable / disable online transactions on demand.
  • Revolut also allows for disposable credit cards – the number changes every time you make a transaction. This means that even if your credit card is stolen, the card is now dead and worthless. You’ll need to pay a monthly fee, though.
  • Avoid saving credit card details in your browser. Although the CVV is usually not saved, typing the details in again won't take long, and this avoids the possibility of a virus sniffing out your saved credit card details.

Why am I writing this? I just stumbled on a cyber security episode by MITA / the Maltese Police (a great initiative, by the way). Although the content made sense, I felt that some practical points were missing (original video here). The premise of the video is to only buy from sites using HTTPS, since it's secure and you'll know who the seller is. Some points on that premise:

  • Running HTTPS ONLY GUARANTEES that the transmission between you and the merchant is secure. It does NOT mean you truly know who the merchant is. There are ways to "try" and fix this (through EV certificates), but EV certificates are probably dead as well.
  • What happens after your credit card details are securely transmitted is unknown. The following may occur:
    1. The merchant might be an outright scammer running a site with an SSL certificate (which nowadays can be obtained for free – https://letsencrypt.org/)
    2. The merchant might store your credit card details insecurely; they may end up getting hacked, and the details will get stolen.

 

Store your MySQL Docker database info in a Docker Volume

If you spin up a MySQL Docker container, you'll notice that once the container is removed, all the information is lost! To avoid losing any information from your MySQL Docker database, a volume needs to be attached to the container. Let's do that!

Let's create a new volume – this will be used to store all your database information.

docker volume create mysql

Once the volume is created, it can be attached to a newly created MySQL container:

docker run --name mysql -e MYSQL_ROOT_PASSWORD=albert --mount source=mysql,target=/var/lib/mysql -d mysql

Any databases created on this MySQL instance are now preserved! Let's test it out by connecting to the container and creating a new database:

docker exec -it mysql mysql -uroot -p

Supply your password (in this example, the password is the value we supplied for MYSQL_ROOT_PASSWORD – albert).

Once you’re connected, let’s see the current databases. The default installation will have 4 default databases

SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+

Let’s go ahead and create a new database!

CREATE DATABASE test;

If you list the databases now, you should see the following:

SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
| test               |
+--------------------+

Detach from MySQL by typing exit. Let's now remove the current container, create a new one and re-attach the previously created volume.

docker container rm mysql -f
docker run --name mysql -e MYSQL_ROOT_PASSWORD=albert --mount source=mysql,target=/var/lib/mysql -d mysql

Let's reconnect and get the list of databases (don't forget to supply the password):

docker exec -it mysql mysql -uroot -p
SHOW DATABASES;

The output should now read as follows – the database test still exists, even after deleting and re-creating the container

+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sys                |
| test               |
+--------------------+

How to get your first Docker container up and running on Ubuntu in 2 minutes or less

Looking to install Docker and play around a bit? If you follow the main guide on the official Docker site, you'll be surprised how many steps it takes to get Docker installed.

Thankfully, if you keep scrolling, you’ll notice that there is a convenience script! That’s great! Let’s get Docker installed!

Make sure you have cURL installed:

sudo apt-get install curl

Then let’s use the convenience script to install Docker!

sudo curl -sSL https://get.docker.com/ | sh

Once the installation is complete, let’s get your first container running!

sudo docker run hello-world

This should show something along the lines of:

Hello from Docker!
This message shows that your installation appears to be working correctly.

If you see that, your installation is complete!

Tag your photo album using Microsoft Cognitive Services

Just interested in the source code? – https://github.com/albertherd/PhotoTagging

Computer vision is one of the key areas that has seen huge growth in both capability and popularity, though it still seems out of reach to many; I honestly felt lost when I first tried to play around in this field. It feels like we keep trying to solve problems that other companies have already solved. Microsoft seems to share this view, as they've introduced machine learning features in the form of SaaS.

I stumbled upon Microsoft Cognitive Services through a presentation and was genuinely amazed. What's amazing isn't the results the service yields – I expected nothing less than excellent results from such tools. What amazed me is how EASY it is to get involved – there is no following pages and pages of guides just to download, install and play around with some software.

Microsoft Cognitive Services enables a huge array of machine-learning powered applications, ranging across vision, decision making, natural language processing and other areas. Let's play around with vision – can we use Microsoft's Cognitive Vision Services to help us organise our photo library?

The idea is that I have many photos, with subjects ranging from food and vacations to family, friends and whatnot. What if my photos contained proper EXIF metadata, such as a subject and tags? That would allow me to classify my photos by subject and search through them. What if I could find my photos instantly instead of sifting manually through thousands of them? I presume it's not just me – everyone has a smartphone nowadays, so this is everyone's pain.

Great – now we have an objective! Let's make the tools work for us. The process is simple – upload a photo to Microsoft's Cognitive Vision Services, get the tags and a nice description back, and slap them onto the actual file. Oh, and when I say EXIF tags, these can be viewed in File Explorer like below (Windows 10 dark theme in File Explorer here).

ExifInWindowsExplorer
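As a rough illustration of the "slap them onto the actual file" step (my own sketch, not the repository's code – the file name and description are made up), writing a description into a photo's EXIF data with ImageSharp looks roughly like this:

using SixLabors.ImageSharp;
using SixLabors.ImageSharp.Metadata.Profiles.Exif;

// Load the photo, make sure it has an EXIF profile, and write the description into it.
using var image = Image.Load("pizza.jpg");                      // hypothetical input file
var exif = image.Metadata.ExifProfile ??= new ExifProfile();
exif.SetValue(ExifTag.ImageDescription, "a pizza on a plate");  // text returned by the vision API
image.Save("pizza-tagged.jpg");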

Ready to tag your photo library? Let’s go!

Get a Microsoft Cognitive Services Account

Since this is an online service, you’ll need to have an active account with Microsoft Azure. Get your free account from here. Don’t worry, the free service is more than enough to get you playing around. I’ve used the free tier to develop, test and write this blog and I still have plenty of free capacity left.

Create a new Cognitive Services Resource and get the API key

Now that you have an active Azure account, navigate to the Azure Portal and create a new Cognitive Services Resource. Follow the wizard and get the service created – choose whatever region works best for you. I’ve chosen West Europe and the free tier in my case. Once it’s created, we’ll need two things – the URL to our endpoint and our API Key. From the quick start page, get API endpoint and the API Keys.

Get your photos tagged!

Okay, we've got all the resources needed; it's time to get some work done! I've created an application that takes a photo, uploads it to our new Cognitive Services resource, gets tags and a description back, and applies them to the photo. Follow these steps to get your photo tagging game going!

  1. Download / Clone my application from GitHub
  2. Open the application and navigate to PhotoAnalyser.cs. Change the subscriptionKey and uriBase to the ones you got previously. The keys in the solution are placeholder keys only.
  3. Run the application – have your photo directory ready as this is asked for at runtime.
  4. Let it do its magic!

In the example below, photo analysis tells us that it's a pizza on a plate, and it also gives us some appropriate tags. Try downloading and viewing the pizza photo – the tags and title are preserved as EXIF data.

Keep in mind that the code in the provided solution is not production ready – it’s merely meant as a playground.

Explore the solution

What's the fun of having a piece of software work without knowing how it works underneath? Here are some points about the application, in no particular order:

  1. It makes use of the excellent TPL Dataflow library from Microsoft – this enables the application to scale with ease and to work around the pesky throttling that the free tier carries with it (see the sketch after this list).
  2. It resizes the images, since they don't need to be large; this also speeds the process up.
  3. It uses ImageSharp to resize the images and add EXIF tags to them.
  4. Given that the application manipulates images, it is memory intensive; I've seen it hit close to 4GB of memory usage.
  5. It's split into a library and a consumer, just in case.
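Here is a minimal sketch of the kind of TPL Dataflow pipeline mentioned in point 1 (my own illustration, not the repository's actual code – the analysis step is stubbed out, and the System.Threading.Tasks.Dataflow package is assumed to be referenced): a bounded TransformBlock with limited parallelism feeds an ActionBlock, which keeps requests to the API at a pace the free tier can handle.

using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class PipelineSketch
{
    static async Task Main()
    {
        // Stand-in for the call to the Cognitive Services vision API.
        static async Task<string> AnalysePhotoAsync(string path)
        {
            await Task.Delay(100); // simulate network latency
            return $"{path}: a pizza on a plate";
        }

        // Limit parallelism and buffering so the free tier's throttling isn't tripped.
        var analyse = new TransformBlock<string, string>(AnalysePhotoAsync,
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2, BoundedCapacity = 4 });

        // Apply the result to the photo (here we just print it).
        var apply = new ActionBlock<string>(Console.WriteLine);

        analyse.LinkTo(apply, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var photo in new[] { "pizza.jpg", "beach.jpg", "family.jpg" })
            await analyse.SendAsync(photo); // SendAsync waits when the bounded buffer is full

        analyse.Complete();
        await apply.Completion;
    }
}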

Continue exploring the Microsoft Cognitive Services stack

Computer vision is just one of the areas in the Microsoft Cognitive Services stack; there are other excellent services to enrich your applications. They also have excellent documentation; I followed it to build my application.

That’s it for today! This was an extremely fun project to learn and experiment with new technologies! Until the next one.