And we’re back to action!

Now, I don’t know if anyone is actually reading these yet; the stats would suggest not…but hey ho, this was never intended to be that type of page. More just aimless ramblings.

With that being said, yes, I’ve been quiet of late. Three months too many, to be exact…stuff’s been going on, life is busy, mental health has been shit. So I took a break from most things, albeit not an intentional one.

But, to kick back off, we’re going for a technical post. This time it’s Azure Local, otherwise known as Azure Stack HCI: Microsoft’s answer to hyperconvergence. We’ll go over what it is, what caused me to spend five days building one…where to avoid going wrong…what works and what doesn’t…and everything in between.

Azure Local? What’s that?!

So Azure Local, otherwise known as Azure Stack HCI, is Microsoft’s hyperconvergence technology.

“Hyperconvergence? What’s that?” I hear you ask. Maybe, I don’t know if you’re asking that. Wikipedia will tell you that it’s “a software-defined IT infrastructure that virtualises all elements of the conventional hardware-defined systems”…

So what does that actually mean? Basically, with traditional infrastructure you have your compute side of things (your servers), then your storage, which most of the time is a SAN device of some description.

Hyperconvergence essentially combines all of that into multiple nodes, so your compute and storage are shared with no additional SAN hardware, and the storage is handled at a software level by the hyperconverged operating system.

Hyperconverged infrastructure generally also involves some sort of cloud-based control plane. For Azure Local, weirdly enough, that’s Azure; Nutanix has its own offering too, and they all work in much the same way.

So how does it work?

Imagine a traditional infrastructure setup: you get a couple of servers/nodes from your vendor of choice, then you get a SAN/block storage device, whether that’s direct-attached or networked, it doesn’t matter. You set up your nodes in a cluster (Microsoft Hyper-V’s Failover Clustering, VMware vMotion/clustering) and, as part of that, you create your storage volumes on your block storage device, which talks to your cluster to share the storage between the nodes, and you build your stuff. Simple stuff? Ignoring all the iSCSI setup and weird stuff like that.

Hyperconvergence does away with that block storage aspect. You still have your nodes, and these are generally off the shelf for the most part (but do need to be verified by the vendor…more on that later), and you still create a cluster between the nodes. But in a “traditional” cluster, you generally don’t have a lot of storage on the nodes themselves. Hyperconvergence is the opposite: all your storage is on the nodes.

Say you want a three-node cluster: you’d populate each node with your normal RAID 1 for the OS, then populate it with some nice big solid-state/NVMe drives as storage. When you build the cluster, that storage is combined at the software level and presented to the cluster. In Microsoft land, this uses Storage Spaces Direct, an offshoot of the Storage Spaces technology.

When the cluster is built, the HCI setup automatically (most of the time) groups all your main storage together into one big pool. You then carve storage out of that pool into different volumes, and that’s used in the cluster as normal.

When you create a volume you specify your level of redundancy; for Azure Local, that’s generally a two-way mirror or a three-way mirror. They do what they say on the tin, really. Obviously, for redundancy, you want a three-way mirror, but that effectively means your storage efficiency is 33%, or thereabouts. So let’s say you have three nodes with four 1TB disks each and you make a three-way mirror: you’ll only be able to use about 2.95TB of storage.

This does mean you can lose two disks or even two nodes in this situation. So the redundancy is very useful.
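To make that maths concrete, here’s a back-of-the-envelope sketch. It’s just arithmetic, not how Azure Local actually reports capacity; the real usable figure comes out a bit lower once reserve capacity and pool metadata are taken into account, which is how 12TB of raw disk ends up nearer the 2.95TB figure above.

```python
# Rough capacity maths for a mirrored storage pool. Purely illustrative;
# real-world numbers are lower once reserve capacity and metadata overhead
# are accounted for.

def usable_capacity_tb(nodes: int, disks_per_node: int, disk_tb: float,
                       mirror_copies: int) -> float:
    """Naive usable capacity: raw pool size divided by the number of mirror copies."""
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb / mirror_copies

# Three nodes, four 1TB disks each = 12TB raw.
print(usable_capacity_tb(3, 4, 1.0, mirror_copies=3))  # ~4.0TB ceiling with a three-way mirror
print(usable_capacity_tb(3, 4, 1.0, mirror_copies=2))  # ~6.0TB ceiling with a two-way mirror
```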

Why use HCI over traditional convergence?

Well, that’s the million-dollar question, isn’t it? Why start using HCI over your normal converged tech?

HCI is a great option for virtualised technologies, and the fact you can link it to the cloud, especially Azure (or AWS for Nutanix), means you can expand the capabilities massively. Want to restore your on-premises VMs to the cloud? You can do that. Want to spin up normal serverless tech from Azure on your on-premises hardware? You can do that (kinda).

HCI is also generally more cost-efficient and more flexible, especially for modern virtualisation workloads. SANs are great, but they’re pretty restrictive; HCI gets rid of that. BUT, as much as it’s more cost-efficient, Azure Local specifically is all linked to Azure, which means you get some monthly costs from Microsoft on top of everything else. It’s not massive, and much cheaper than full cloud. But it’s there.

Sounds great, right?

Generally, yeah, it is. One of the main issues is when you want to expand your capabilities. Say you’re running out of storage: with converged tech, you’d chuck in a couple more disks, or even add a new shelf to your SAN. With HCI, you can’t do that. If you want more storage, that’s a new node you need, and it has to be identical to your existing nodes. Regardless of whether you need the extra compute, you’re going to have to take it with HCI, like it or not. So while the initial outlay is less, expansion can be more expensive.

It brings some other weird and wonderful stuff with it. You can’t just use a standard server with hyperconvergence. They have to be specific models with specific hardware and firmware versions that are all signed off by your HCI vendor of choice. If not, you’re going to have a bad time with compatibility between the nodes, and ultimately you’re going to get errors or worse.

Now, this is also a good thing. You’re guaranteed to get kit that works with the HCI setup, but obviously that comes with a bit of a premium as a result, so it’s a bit of a mixed bag.

Management is generally a lot nicer though, and opens up options for granular control in certain areas. Got a couple of devs who need to manage their own VMs? Give them access to the Azure Portal and let them run wild – easy enough. That can be done on normal converged tech, but it requires some sort of investment into something like SCVMM – and all the fucking about that causes as a result (thanks, Microsoft).

So why the woes?

The main point of this post, after all.

HCI is marketed as this all-singing, all-dancing piece of technology: easy to use, etc.

It’s a fucking pain in the arse to set up if something isn’t quite right. And getting support on it is a nightmare at times.

I shit you not with that opening paragraph: it took me a week to get a fully functional Azure Local cluster configured, using validated hardware and switches from HPE. All of that worked fine. The issue is with Azure itself.

You have to create the cluster via Azure (generally with the help of Windows Admin Center), and while the help articles and guides aren’t terrible, if something goes weird then finding an answer is just about impossible.

Generally, these issues manifest as weird and wonderful networking issues.

Networking in Azure Local is based on “intents” – as in, you specify a couple of NICs and their intent could be compute, management or storage (or all three, or some combination).

Generally, you would have compute and management sharing a set of NICs, then storage on its own NICs.

These are separated off into VLANs, as expected, which is fine. However, if you want multiple NICs for your storage, that’s multiple VLANs – you can’t share a VLAN. I also found that using anything other than Microsoft’s default VLANs (711 – 719) caused issues that I just couldn’t be bothered to diagnose after days of trying. You then dedicate a subnet to each VLAN and Azure Local uses it as needed. Storage NICs? Different VLANs – make sure they’re trunked together – that’s not mentioned anywhere in the guides.
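To picture the relationship, here’s a toy model of an intent layout. This is purely illustrative – the real configuration happens through the Azure portal/Network ATC, not code like this – and the adapter names, VLANs and subnets below are made-up examples of the kind of values involved.

```python
# Toy model of Azure Local network intents. Purely illustrative: the real thing
# is configured through the Azure portal / Network ATC. Adapter names, VLANs and
# subnets are made-up example values.
from dataclasses import dataclass, field

@dataclass
class Intent:
    name: str
    roles: list              # any combination of "management", "compute", "storage"
    adapters: list           # physical NICs assigned to this intent
    vlans: dict = field(default_factory=dict)    # storage: one VLAN per adapter, no sharing
    subnets: dict = field(default_factory=dict)  # one dedicated subnet per storage VLAN

intents = [
    Intent("mgmt_compute", ["management", "compute"], ["NIC1", "NIC2"]),
    Intent("storage", ["storage"], ["NIC3", "NIC4"],
           vlans={"NIC3": 711, "NIC4": 712},                     # from the default 711+ range
           subnets={711: "10.71.1.0/24", 712: "10.71.2.0/24"}),  # example subnets only
]

# Sanity check: every storage adapter needs its own VLAN.
storage = next(i for i in intents if "storage" in i.roles)
assert len(set(storage.vlans.values())) == len(storage.adapters), \
    "storage adapters must not share a VLAN"
```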

The whole wizard is a nightmare. You essentially have to finish the initial setup of the cluster before doing anything else, because if you click away from it for even a moment, the entire process craps itself. You get left with resources half-registered in Azure, but they then won’t work in a fresh run of the wizard either, so you have to completely strip them out of Azure – at which point it’s generally easier to just rebuild the lot.

So, you finish the wizard and you get an Azure Local cluster in Azure. Great, right?

Nah. You still get weird issues from this point. I got weird and wonderful firewall-related issues: basically, the nodes complained they couldn’t talk to each other’s storage networks over the cluster ports (3516, I think).

They could, 100% they could.
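If you hit the same thing, a quick way to convince yourself of that is a plain TCP reachability test from each node to its peers’ storage IPs on whatever port the validation is moaning about – Test-NetConnection on the nodes does the job, or something like this generic sketch (the IPs and port below are placeholders, not anything from the real setup):

```python
# Quick-and-dirty TCP reachability check between storage NICs. Illustrative only:
# peer IPs and the port are placeholders for whatever the cluster validation flags.
import socket

PEER_STORAGE_IPS = ["10.71.1.12", "10.71.2.12"]  # example peer storage addresses
PORT = 3516                                      # whatever port the validation complained about

for ip in PEER_STORAGE_IPS:
    try:
        with socket.create_connection((ip, PORT), timeout=3):
            print(f"{ip}:{PORT} reachable")
    except OSError as err:
        print(f"{ip}:{PORT} NOT reachable ({err})")
```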

I rebuilt the cluster twice before I figured this out…

Considering the wizard was left mostly at defaults – the subnets and VLANs are the ones from Microsoft – it still managed to stick IPs from one storage VLAN onto the second storage VLAN, meaning it tried to ping the wrong subnet, which caused problems, and there was no easy way to figure this out…
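One way to spot that sort of mix-up is to check whether each storage adapter’s IP actually sits inside the subnet you expect for its VLAN – a rough sketch of the idea below (the adapter names, VLANs and subnets are made-up examples; in practice you’d be comparing Get-NetIPAddress output against your planned addressing):

```python
# Sketch: flag storage adapters whose IP doesn't belong in the subnet expected
# for their VLAN. Adapter names, VLANs and subnets are made-up examples.
import ipaddress

EXPECTED_SUBNETS = {711: "10.71.1.0/24", 712: "10.71.2.0/24"}  # VLAN -> planned subnet

# What the node actually ended up with: (adapter, VLAN, assigned IP).
adapters = [
    ("SMB1", 711, "10.71.1.11"),
    ("SMB2", 712, "10.71.1.12"),  # an IP from VLAN 711's subnet stuck on the VLAN 712 adapter
]

for name, vlan, ip in adapters:
    subnet = ipaddress.ip_network(EXPECTED_SUBNETS[vlan])
    if ipaddress.ip_address(ip) not in subnet:
        print(f"{name}: {ip} is not in {subnet} (VLAN {vlan}) - looks misassigned")
```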

It was just a nightmare from start to finish in all honesty.

Would it have been easier if we’d had support from Microsoft? Maybe. But that’s a decent expense in all honesty, and considering a traditional failover cluster takes an afternoon to get to something usable, it’s pretty poor.

So do you hate HCI?

Nah, not really. I think it’s a great and useful technology. It’s just that the process, if you’re going in fresh, is a nightmare to figure out at times, with no real troubleshooting steps (no, Copilot doesn’t fucking help at all).

Anyway, ramble over, rant over…for now…

Next post – soon…

By Delo
