Morning, fellow gurus!
We have a VM-300 in Azure on a host that actually exceeds the recommended spec. However, we have moved around 150-200 users onto the VM, and very recently we are seeing latency and high dataplane spikes.
The VM was on 9.1.3-h1; I have upgraded to 9.1.5 and enabled Azure Accelerated Networking on the untrust (egress) interface.
I have been told this is possibly an issue with GlobalProtect and IPsec, and that I should switch it off and go SSL. Previously we had two PA-3020s in HA that never suffered this problem.
So do I switch to SSL, or does anyone know if the upgrade to 9.1.5 will address the issue?
Thanks guys!!
Darren
Perhaps the fix is PAN-153705: "Fixed an issue where packets were not evenly distributed among a process (pan_tasks), which caused latency and poor performance." As usual, it's not a lie, but it's hard to tell...
Reaper has already mentioned this, but I found the following knowledge base article and will share it with you. https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000PP8rCAG I heard a rumor that in the latest PAN-OS (9.1.8), with AES-GCM selected as the encryption algorithm, an IPsec tunnel session can be processed across multiple cores, so the above problem may be solved in a while (at the very least, I think we need to make sure the AES-GCM algorithm is selected).
@Darren Bisbey
Please don't be offended by my asking.
To begin with, did you review the performance and capacity figures for the VM-300 on Azure when designing this deployment?
The reason I ask is that I failed at this too (in my case, it was worse).
https://docs.paloaltonetworks.com/vm-series/9-1/vm-series-performance-capacity/vm-series-performance-capacity/vm-series-on-azure-performance-and-capacity.html
https://docs.paloaltonetworks.com/vm-series/10-0/vm-series-performance-capacity/vm-series-performance-capacity/vm-series-on-azure-performance-and-capacity.html
With reference to the above, you should expect the VM-300 data plane to flatline (100%) once IPsec throughput reaches about 900 Mbps.
If you sized your capacity from a private-cloud-oriented datasheet like the one below, you are very much mistaken.
That is, you will get roughly half of what you assume. The reason has to do with hyperthreading and the hypervisor's CPU scheduling.
The hypervisor usually allocates logical cores, not physical cores, to the VMs, and hyperthreading makes one physical core behave as two logical cores. Hence, if two vCPUs land on the same physical core, capacity is roughly halved.
In other words, if CPU pinning (CPU affinity) is not properly set up and working, the firewall will not perform like the datasheet.
Also, you can't change these settings unless you run the hypervisor yourself, which of course you can't do in a public cloud.
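The halving rule above can be sketched with hypothetical numbers (the 1800 Mbps figure below is illustrative, not a published datasheet value):

```python
# Rough sizing sketch: without CPU pinning, two logical (hyperthreaded)
# cores may share one physical core, so effective capacity is roughly
# half of a datasheet figure measured with pinning in place.
def effective_throughput_mbps(datasheet_mbps: float, pinned: bool) -> float:
    """Halve the datasheet figure when vCPUs share physical cores."""
    return datasheet_mbps if pinned else datasheet_mbps / 2

print(effective_throughput_mbps(1800, pinned=False))  # 900.0
```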
What if you increase the VM size (the number of allocated CPUs)? Unfortunately, the extra CPUs are allocated to the management plane. Well, if the management plane is rarely used, hyperthreading might work out and performance might improve. Incidentally, when I asked TAC, they didn't rule out that possibility, but they said they don't support that configuration.
Honestly, it would be extremely difficult to operate, as achieving datasheet-like performance requires in-depth knowledge of the CPU and the hypervisor.
https://www.paloaltonetworks.com/apps/pan/public/downloadResource?pagePath=/content/pan/en_US/resources/datasheets/vm-series-kvm
In addition, capacity design for SSL-VPN and SSL decryption is honestly even more difficult.
If the following test results are to be believed, we should assume that it is less than 18% of Threat Prevention throughput.
https://eantc.de/fileadmin/eantc/downloads/News/2019/EANTC-TestReport-PaloAlto-v1.0.pdf
That is, I believe roughly 234 Mbps (18% of 1.3 Gbps) is the throughput limit for SSL-VPN and SSL decryption on a VM-300 in Azure.
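The arithmetic behind that estimate is just the EANTC ratio applied to the VM-300 figure:

```python
# Back-of-envelope check: 18% (the EANTC SSL ratio) of the VM-300's
# ~1.3 Gbps Threat Prevention throughput.
threat_prevention_mbps = 1300
ssl_fraction_pct = 18
ssl_limit_mbps = threat_prevention_mbps * ssl_fraction_pct / 100
print(ssl_limit_mbps)  # 234.0
```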
Another important question is whether there is evidence showing that the high load actually comes from IPsec.
It's good to formulate a hypothesis and take action, but if you don't find the evidence to support your hypothesis as a result of your actions, then you should stop and think about it.
What is the throughput of IPsec?
Does "debug dataplane pow performance" show that "tunnel_encap" and "tunnel_decap" are overloaded?
If, after checking, it is not what you expected, you should stop and think, because the IPsec load may be irrelevant.
After confirmation, if IPsec is still a factor, then you're at the limit of your capacity.
The best thing you can do is reduce the offending traffic itself, but if you can't, I think you can use an Azure VPN Gateway and GRE.
https://azure.microsoft.com/en-us/pricing/details/vpn-gateway/
If you terminate the IPsec on the VPN Gateway and carry the traffic over a GRE tunnel, you should be able to continue your routing operations as before.
If your IPsec counterpart is a Palo Alto running PAN-OS 8.1, you will need to terminate the GRE tunnel on a device behind that firewall (GRE tunnels are only supported from PAN-OS 9.0 onward).
However, I'm sure it's a tricky configuration, so I think it's a last resort.
Finally, I'll share the Resource List on Performance and Stability.
https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g0000008TzwCAE&lang=en_US
Further investigation: we seem to have a DC spamming our other DCs, and the offending DC is in Azure, so there are high levels of traffic going over the VPN. Without the PA we would not have found this, but it's disappointing that it's flatlining the PA dataplane.
Must be the amount of data and the various security running...maybe?
Update:
I have applied split tunneling for Teams traffic, which has massively reduced DP usage, and we have also reduced the amount of decryption we were doing.
Finally, IPsec is back on as well.
Thanks to all that helped. Fast responses. I love this group!!!
Darren
@Darren Bisbey
Does the spike occur continuously?
Otherwise, CPU utilization may be rising due to hypervisor delays (CPU steal) caused by platform updates.
This happens at least once or twice a month and lasts 10 to 30 seconds (basically at times of day when VM utilization is low).
Also, IPsec should not be disabled: by design, GlobalProtect proxies (i.e., decrypts) SSL-VPN traffic.
Enabling SSL-VPN increases the load on the data plane because of the increased load on ssl_proxy_proc and ssl_encode.
If you continue to have problems, it is recommended to use "debug dataplane pow performance" to find out which operations are more demanding.
https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000CmV2CAK
If IPsec load is a factor, we can see that the load on tunnel_encap and tunnel_decap has risen as shown in the following example.
https://knowledgebase.paloaltonetworks.com/KCSArticleDetail?id=kA10g000000CmQvCAK
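As a quick checklist, the diagnostic commands mentioned in this thread are run from the PAN-OS CLI (output formats vary by version; interpret the results against the KB articles above):

```shell
# Overall dataplane/management-plane CPU and memory
show system resources

# Per-core dataplane utilization over time
show running resource-monitor

# Breakdown of dataplane work by operation; if IPsec is the bottleneck,
# expect tunnel_encap / tunnel_decap to dominate
debug dataplane pow performance
```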
In that case, you need to reduce the IPsec traffic itself, so make sure to split the splittable targets, such as o365 traffic.
Personally, I recommend splitting by IP address only. The reason is that there are a lot of black boxes and glitches in the movement of GP clients.
Splitting by IP address only adds the split target to the routing table of the client terminal and makes it easier to isolate the problem.
Also, in the case of o365, the IP address alone is sufficient to control major traffic (as a rule of thumb, it's about 70% of o365 traffic).
For splitting o365 traffic, follow the example below to extract and split the address bands that belong to Optimize.
https://docs.microsoft.com/ja-jp/microsoft-365/enterprise/microsoft-365-vpn-split-tunnel?view=o365-worldwide
https://techcommunity.microsoft.com/t5/office-365-blog/how-to-quickly-optimize-office-365-traffic-for-remote-staff-amp/ba-p/1214571
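The Microsoft articles above document an endpoints web service whose JSON records carry a `category` field ("Optimize", "Allow", "Default") and an optional `ips` list. A minimal sketch of pulling out only the Optimize IPv4 ranges for an IP-based split-tunnel list (the sample records below are illustrative, not the live feed, which you would fetch from https://endpoints.office.com/endpoints/worldwide?clientrequestid=<guid>):

```python
# Extract "Optimize"-category IPv4 ranges from Microsoft 365 endpoint
# records, as suggested above for IP-address-only split tunneling.
def optimize_ipv4_ranges(records):
    """Return sorted IPv4 CIDR ranges from records with category 'Optimize'."""
    ranges = set()
    for rec in records:
        if rec.get("category") != "Optimize":
            continue
        for cidr in rec.get("ips", []):
            if ":" not in cidr:  # skip IPv6 ranges
                ranges.add(cidr)
    return sorted(ranges)

# Illustrative sample mimicking the web service schema
sample = [
    {"serviceArea": "Exchange", "category": "Optimize",
     "ips": ["13.107.6.152/31", "2603:1006::/40"]},
    {"serviceArea": "Skype", "category": "Optimize",
     "ips": ["13.107.64.0/18"]},
    {"serviceArea": "Common", "category": "Default",
     "ips": ["13.107.6.171/32"]},
]
print(optimize_ipv4_ranges(sample))
# ['13.107.6.152/31', '13.107.64.0/18']
```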
Sadly no; we got a high DP load and then it would drop packets, we think...
There's no mention of GlobalProtect performance issues in the release notes so I wouldn't count on 9.1.5 fixing your issue.
For the second option, however: IPsec is inherently 'faster' than SSL, as there's less overhead both in the tunnel and on the firewall, so my instinct tells me switching to SSL would not improve latency. Since this is a cloud solution, though, my instinct might be wrong, so you could give it a try; it's an easy toggle to revert if the latency isn't addressed.
Are there any indications of what might be causing the latency, e.g. processes that are consuming more resources than previously? (show system resources and show running resource-monitor)