I wanted to name this post “I built NAT between Azure Virtual Machines, this makes me an Azure Networking God”, but that title seemed a little wordy. I was working with a client recently with an interesting Azure setup and some interesting requirements. The database tier for this application is using Azure SQL Managed Instance and will be part of a Failover Group which allows the SQL Managed Instance to be failed over between regions as needed. The web and application tier of the application will be within Azure Virtual Machines which will be protected with Azure Site Recovery (ASR).
For Disaster Recovery Tests and for an actual Disaster Recovery Failover we wanted to have to make as few networking changes as possible.
In general, everything was pretty easy to get set up. We setup ASR, and ASR created an Azure virtual network (vNet) automatically as part of the Replication Setup process. We then set up our production Managed Instance and connected it to the production vNet. When we went to set up the Disaster Recovery SQL Managed Instance we ran into the root of our problem. In order to put the two Managed Instances within a Failover Group so that they could move data between them, the Managed Instances needs to be able to talk to each other. But the vNets had the same IP space, which means that we couldn’t peer the networks, as one of the requirements for connecting vNets to each other is that the networks need to have different IP Spaces.
In order to get around this requirement we configured a separate Azure vNet for the Disaster Recovery Managed Instance, which we could them peer to the production vNet. This then allowed the Production and the Disaster Recovery Managed Instances to be able to talk to each other.
The Big Problem
The next requirement was that the Disaster Recovery Virtual Machines needed to be able to connect to the Disaster Recovery Managed Instance with as few network changes as possible. We couldn’t just peer the ASR-created vNet to the Disaster Recovery Managed Instance vNet, as we’d have an IP conflict as the Disaster Recovery Managed Instance vNet can’t connect to two vNets with the same IP ranges. The solution to this was to create an additional vNet, with a machine in that vNet with the Windows Routing and Remote Access Service (RRAS) installed on it. This gives me a total of 4 Azure vNets peered together as shown in the picture below.
In order to make Routing and Remote Access Services (RRAS) happy it required having two subnets within the NatvNet and that machine needed to have two vNICs, with each vNIC in a separate subnet. I set up two Subnets in this vNet which were 192.168.21.0/26 and 192.168.21.64/26. The first vNIC gets the IP address of 192.168.21.4 and the second vNIC gets the IP address of 192.168.21.68. When RRAS is setup you have to specify an IP subnet range that will be used for the NAT. This range should be part of the IP range of the second vNIC. In this case, I used the range of 192.168.21.70-192.168.21.75. In order to make Azure happy (I’ll explain why in a minute) I needed to add the IP addresses that were being used for the NAT to the vNIC that was in the same subnet that they were from.
The final change in Azure that needed to be made was to create a routing table that directs the network traffic to the 192.168.20.0 network and use the next network hop for that being 192.168.21.4. This routing table is then applied to all the Subnets within the vNET that ASR created (called ProductionAsr in the image above). The network design and the VMs layout is shown below and should give a good view of the design.
The IP Requirement
One of the things to keep in mind that Azure network is a bit stubborn. When IPs are being used and those IP addresses aren’t known to the Azure environment, then the Azure environment doesn’t know which physical machine on the network to send the traffic to. I could easily see this in Wireshark when I ran Wireshark on the NAT server (the machine in the NatvNet subnet that was running.
I was using a machine in my ProductionAsr network with an IP address of 192.168.4.5. When I accessed 192.168.20.4 (the IP addresses of my Managed Instance) and looked at the network traffic (without having the IP addresses 192.168.21.70-192.168.21.75 assigned to the vNIC) I could see the network traffic going from 192.168.4.5 to 192.168.21.4, and I could see the network traffic going from 192.168.21.68 to the Managed instance IP address of 192.168.20.4, but the network traffic wasn’t coming back from the Managed Instance.
I did some additional testing by putting a Virtual Machine in the Managed Instance network and gave it an IP address of 192.168.20.5. When I installed Wireshark on this machine I could see the traffic coming to the machine, and I could see the machine sending the traffic back to 192.168.21.68, but it was never arriving at that machine. I noticed that it wasn’t actually using 192.168.21.68, but it was instead sending the traffic from the NAT machine to the Managed Instance network using 192.168.21.70. So on a hunch, I added 192.168.21.70 as an IP address which the server on the NatvNet could use. About 15 seconds after the Azure Portal said that it had made the change, I was able to make my connection to the Managed Instance as I needed to.
The final setup of this was a bunch of Virtual Machines in a vNet. These machines were all protected with ASR. For the database we’re using an Azure SQL Managed Instance which is setup in a Failvoer group for Disaster Recovery.
With this NAT configuration we were able to have our SQL Managed Instance data movement between the two instances, while allowing the production servers to access either SQL Managed Instance. The Disaster Recovery Virtual Machines were able to access the SQL Managed Instance when doing a DR test, or in a DR failover. In the DR test requires that we need to be able to write to the DR Managed Instance we can do this by simply removing the DR Managed Instance from the failover group and putting a host file entry so that the production server name resolves to our DR Managed Instance IP address. This connects all the Disaster Recovery VMs to the Disaster Recovery Managed Instance with just a couple of simple changes.
A Failover Event
For an actual failover event the failover process is even easier. The VMs are failed over using ASR. The SQL Database Managed Instance failover group is failed over. And that’s is. At that point everything is failed over and working.
Recently I was moving a client from their existing CoLo to Azure. Right now they are going with IaaS for their web servers and PaaS (Azure SQL Managed Instance) for their databases. This means that we needed to set up a Site-to-Site VPN between their existing on-prem environment at the CoLo and their Azure vNet so that they could transfer everything over, replicate AD, etc. Their on-prem router was a Cisco ASA running version 8.6 of the Cisco IOS.
Our plan was to enable BGP on the connection as we had several vNet’s that we wanted to have connected to the VPN, and we only wanted one VPN gateway. However, with a Cisco ASA, we found that BGP wasn’t an option as Azure has a new feature available within the configuration of the connect called “Policy Based Traffic Selector” which required that BGP be disabled.
The problem that we were facing with our VPN connection was that even though all of the IKE Phase 1 and IKE Phase 2 settings were connect (we verified them against the Cisco ASA settings probably 10 times) we were still getting protocol mismatch when the VPN connection would attempt to come online.
We eventually tried turning on the “Use policy based traffic selector” option on the connection (which made us disable BGP) and the connection came up on the first try. For this, we were using a policy-based VPN Endpoint, and these settings are pretty new (it’s March 2021) as I don’t remember seeing them on previous VPN connections.
As you can see from the screenshot above, we specified the IKE Phase 1 and IKE Phase 2 settings, but this doesn’t need to be done as using any of the settings that Microsoft has will work, but we wanted to take the guesswork out of the migration process.
The end result was one of success, the tunnel came up and we could set up things like AD Replication so that we could continue with the migration process.
Transparent Data Encryption (TDE) with a Customer Provided Key, Azure SQL Managed Instance (MI), and Failover Groups should all be easy to setup. Azure SQL MI with either of the other two is pretty straightforward to setup. All three of them at the same time require a little doing in order to get them all working correctly together. This is because you as the admin need to manually move the key which TDE is using to encrypt the databases to the secondary region manually, and you need to tell the Azure SQL MI in the secondary region where to find the key.
Having a customer-managed key for TDE requires some special setup to ensure that the key is available to both instances. This is done by placing the key within Azure Key Vault. In order to set up a failover group you need to have not just the Azure Key Vault in the primary site, but you also need to set up one in the secondary site which will be hosting your Azure SQL MI.
Once you have an Azure Key Vault in both sites, you can use the GUI to export the key from the primary site and restore it to the Azure Key Vault in the secondary site. After that is done you’ll need to give the Azure SQL MI in the secondary site access to the key, which is done through the Access Policies section of Azure Key Vault. The managed instance needs three permissions to access keys; which are Get, Unwrap Key, and Wrap Key. No other rights need to be granted to the MI.
To grant rights to the Managed Instance click the Add Access Policy link.
Then select the three rights needed in the Key Permissions drop down.
Then click on “None Selected” next to “Select principal”. This will bring up a list of all users and applications which are registered with Azure Active Directory. Find the Managed Instance for the secondary environment and click “Select” at the bottom of the screen, then click “Add” on the prior screen (where you selected the permissions).
Once this is done, go into the setting for the Managed Instance in the secondary site, and select Transparent data encryption from the menu. On this screen change the setting from “Service-managed Key” to “Customer-managed Key”. Then either select a key or enter a key identifier of the key which will be used for new databases from the key vault (if all your databases are protected with one key, just select this key).
Save these settings changes and edit the properties of your Azure SQL MI in the primary site. On this server select Failover Groups and click the “Add Group” button at the top of the blade that opens. Select all the correct information for the secondary instance and the group name and save the settings.
It’ll take a while, but when it is done the map should look something like this (depending on what regions you selected of course). Once you’ve gotten to this point you’ve set up your Azure SQL Managed Instance using TDE with a Customer-managed Key and a Failover group.
You should have a beer at this point, you’ve earned it.
When creating resources and resource groups, the locations in which the resource groups are created rarely get a thought. Resource group locations matter, not when everything is working but when there is a failure at an Azure region.
When you select the location for your resource group, you aren’t limited to putting resources from that location in the resource group. Resources from any Azure region can be placed in an Azure resource group which is stored in any Azure region. When the Azure region that is hosting the resource group fails, no changes can be made to the resources within that resource group, or to the resource group itself.
Firewalls as an Example
Let’s look at an example that I was working on for a client. Production for this client was is North Central US and the Disaster Recovery environment for this client is in South Central US. This client wanted to use Azure Firewall and have that deployed for both their North Central US and South Central US environments. As we were discussing the firewall rules that we were going to put in place we wanted to have a common rule set for outbound rules so that the exact same rules would be in place both for both Azure regions. This meant that we needed to place all the Azure Firewalls in the same resource group (as Azure Firewalls Policies can only inherit rules from another policy, when that policy is stored in the same region the parent policy).
This presented a problem. If we stored the resource group and the policies in North Central US and North Central US failed, then we couldn’t edit the Disaster Recovery policy. If we stored the policies and the resource group in South Central US and South Central US failed then we couldn’t edit or change the production firewall policy until our DR site came back up. The same applied to the resource group location. If we put the resource group in North Central US and the policies somewhere else, and production failed, there’s no guarantee that we’d be able to make firewall changes if those changes required being able to update the resource group itself.
What we ended up going in this case was putting the resource group and the Firewall Policies in a third Azure region, US West 2 in this case. This was if North Central US fails we can still edit our Azure Firewall Policies for our Disaster Recovery environment, and if South Central US fails we can still edit our Azure Firewall Policies for our Production environment.
This same process applies to a web farm that is spread out across multiple Azure regions. If for example, you had a WebApp in two Azure regions with one in Central US and the second one in North Europe you wouldn’t want these to both be in the same resource group. If they were in the same resource group, you’d want that resource group to be in a third site. Because whatever site holds the resource group needs to be online for you to be able to make changes to the resources within the resource group, provided that those changes need to update the resource group directly.
The solution to this is to either have one resource group for each region or put the resource group itself in a third region so that a failure of either region won’t affect it.
As Microsoft MVP’s and Partners as well as VMware experts, we are summoned by companies all over the world to fine-tune and problem-solve the most difficult architecture, infrastructure and network challenges.
And sometimes we’re asked to share what we did, at events like Microsoft’s PASS Summit 2015.
Necessary cookies are absolutely essential for the website to function properly. This category only includes cookies that ensures basic functionalities and security features of the website. These cookies do not store any personal information.
Any cookies that may not be particularly necessary for the website to function and is used specifically to collect user personal data via analytics, ads, other embedded contents are termed as non-necessary cookies. It is mandatory to procure user consent prior to running these cookies on your website.