High availability/cluster problem with master node?

elf4o

Hi everyone, I have SUSE Linux Enterprise 15 SP1.
My issue: I have two servers, both the same version and the same kind of machine.
The two servers run SAP HANA in high availability on Azure VMs on SLES. We cannot make Server01 the primary node, possibly due to some underlying issue in cluster management. HANA services are running fine on both nodes, so ideally either node should be able to act as the master node. Since Server01 is not working as expected under the cluster, we may have a problem if Server02 becomes unavailable: the service would then be down.

Here is some more information.
We had a disk issue: the disks on both servers were full of logs. That is fixed now and free space has been added.

Before the disks filled up, server01 (node 1) was working as the master node; it seems the cluster failed over to node 2 during this issue. We saw server01 acting as primary during our last patching activity, and the database had been running with node 01 as primary since then. So this needs more investigation into why it is not happening now. We have already checked that the HANA databases are running as expected, but the cluster service still does not make server01 the primary.

We have no evidence of a manual failover, as none has been performed yet.

Please, good people, advise what to do in order to fix this.
 


Contact SUSE support. You seem to be managing servers for a company, and since this is an enterprise product the company should have a contract with them; when you have a contract, you are paying for support as well.
 
Are those company servers you are managing and responsible for?
 
I would start with the logs to see if there’s anything that is not working correctly.

It is hard to give you advice without seeing the actual configuration of your HANA cluster. You did provide the high-level details, but not enough information to answer on the fly.

What happens in the logs when you try to fail over from one node to the other? Is there a test environment where you can watch the failover behaviour, so that you have something to compare against? Are you using something like Host Auto-Failover, or some type of replication? What are the issues you're observing with node 1?
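For the log and configuration questions above, here is a minimal read-only sketch of what could be collected on each node. It assumes a Pacemaker/Corosync cluster managed with crmsh, as is usual on SLES. `run_if_present` and `collect_cluster_info` are helper names made up for this example, and `SAPHanaSR-showAttr` is only available if the SAPHanaSR package is installed.

```shell
#!/bin/sh
# Read-only diagnostics for a Pacemaker/Corosync cluster.
# Nothing here changes cluster state; tools that are missing are skipped.

# Run a command only if its binary exists; otherwise report what was skipped.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "=== $* ==="
        "$@" 2>&1
    else
        echo "skip: $1 not installed"
    fi
}

# Hypothetical helper: gather status, configuration and recent logs.
collect_cluster_info() {
    run_if_present crm_mon -1 -r -f      # one-shot status, inactive resources, fail counts
    run_if_present crm configure show    # cluster configuration as crmsh sees it
    run_if_present corosync-cfgtool -s   # ring status of the local node
    run_if_present SAPHanaSR-showAttr    # HANA replication attributes (SAPHanaSR package)
    run_if_present journalctl -u pacemaker -u corosync --since yesterday --no-pager
}
```

Running `collect_cluster_info > cluster-info.txt` as root on both nodes gives two files that can be compared or pasted into the thread.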
 

Hi, I am a Windows-certified admin.
The servers are a DR/testing environment.



Can you tell me some commands, or what information you need and how to provide it?
How do I see/check the configuration of the HANA cluster? Is there a command or something?

Basically, I need to switch node 1 to be the master node, but I don't see "master node" written anywhere. Right now the 2nd server is the first,
according to the crm cluster status:

name: hal cluster

services: corosync active/running/disabled,

pacemaker active/running/enabled.


printing ring status, local node ID 2,
ring ID 0
id - IP address....

status: ring 0 is active with no faults...
 
I don't really know Hana all that well to be able to provide you with some recommendations, but if I were you, I would look at the documentation first, here's where I would start:

https://documentation.suse.com/sles-sap/15-SP1/html/SLES4SAP-guide/cha-s4s-cluster.html

There is no shortcut: if you're going to support HANA clustering, you need to understand how it works. The link I shared will give you a good understanding of clustering, and I am sure it will help with troubleshooting.
 
Hi, I already checked the documentation before posting here,
but I was thinking my clustering is Linux clustering, not HANA clustering.
 
1. Always have a support contract for the servers you are running, for the OS and the hardware.
2. Hire people capable of managing an OS if you have no experience with that specific OS.
So the company you work for doesn't have a support contract with SUSE and hasn't hired Linux sysadmins to manage the Linux systems and fix the issues that may occur on them. And now they are asking you, a Windows admin, to fix the issue, and you are asking other people on a forum to fix it. Seems like the company doesn't want to spend any money on its IT department but expects things to work.

In short, you can have active/active clusters and active/passive clusters. With active/active clusters, GlusterFS or another distributed filesystem will usually be used. With active/passive, one node is usually active and the other passive. If the active one becomes unavailable, the cluster software transfers the cluster resources to a node that is available.

What you have to do is find out which resource group the resource services are part of and then move them to the other node. From my understanding of your explanation, your setup is an active/passive cluster, since it switched to the other node. And you want the cluster services to stay on the secondary node and not switch back to the broken node, making it what you call the master node because it is currently the active one. You will have to do something with configuring resource and location constraints. But knowing you are a Windows admin, you will most likely not know how to do that, not that it is your fault. Tell your manager to hire the needed Linux admins to solve problems that are beyond your expertise.
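The move-then-clean-up idea can be sketched as a dry run that only prints commands. This assumes crmsh (`crm`) as the cluster shell on SLES. `plan_master_move` is a hypothetical helper, and the resource name is a placeholder to be replaced with the promotable resource shown in the cluster status. `crm resource move` works by adding a location constraint, which stays in the configuration until it is cleared. For SAP HANA system replication clusters, SUSE's setup guide describes its own takeover procedure and should be checked first, so treat this as the generic Pacemaker pattern only.

```shell
#!/bin/sh
# Dry run: print the crmsh commands for moving the master role to another
# node. Nothing is executed. All names passed in are placeholders.

plan_master_move() {
    rsc="$1"     # promotable (master/slave) resource name from the cluster status
    target="$2"  # node that should take over the master role

    # 1. Ask the cluster to run the resource on the target node.
    #    This adds a location constraint to the configuration.
    echo "crm resource move $rsc $target"
    # 2. Check the result before touching anything else.
    echo "crm_mon -1 -r"
    # 3. Remove the constraint again so the cluster is free to fail over later
    #    (assumption: older crmsh versions name this subcommand unmigrate).
    echo "crm resource clear $rsc"
}

# Example with placeholder names:
plan_master_move "RESOURCE_NAME" "server01"
```

Printing the plan first makes it easy to have someone else review it before anything is run on a live cluster.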

About SUSE Linux Enterprise Server: you probably need a license, activated on the servers, to receive updates. So are they currently not able to receive updates? That's my guess; I work with RHEL systems, and I would think SLES has a similar structure when it comes to receiving updates and support.

Lastly, one question: do the company's Windows servers also lack an active Microsoft support contract, and are they running unactivated/unlicensed?
 

I am an external hire via an agency.
I am not the company owner; I can't hire specific people.
They want to manage all servers with one person only.
It's called cost cutting.
Yes, they don't spend on the IT department.
I fixed all of the nodes, so nothing is broken; I just need to change the master from node 2 to node 1.
I don't have the specific commands to find out how...

We don't have any contracts with SUSE, Microsoft or anyone else; I am the person.

I am all-inclusive for everything, even things I don't understand. I can't change their minds or budgets.
Every machine we own has a digital license, so no worries about that.
That is from the Windows side; I don't know how to check on Linux. The SUSE I see is an old version, so I guess no updates... I am from Europe, a small country; things are difficult here, it's not like the US.
Do you think rebooting the servers will fix the issue on both machines?
What is the command to check the cluster configuration etc.? Maybe it's a simple fix?
 
It's called wanting to sit in the front row for a penny. I know it's not your company; that's why I was advising you to tell your manager or the person above you to hire the needed expertise. I'm from Europe as well. SUSE 15 is still supported, although SP4 seems to be the current service pack.
If you are actually updating the OS then you most likely have an active license, because otherwise you wouldn't be able to install updates. I'm still not quite getting your cluster setup, since after rereading it seems like all of the nodes are active. I have a day job and I'm not about to fix your problem for you, but you can use the following command to view the cluster status, and that would give me a better idea of what is actually running.
Code:
pcs status
And I don't actually understand the full problem you are having. Are you worried about not being able to run the services if the master node fails? In the other reply you shared I saw corosync, which usually goes hand in hand with Pacemaker. With all the cluster setups I've had to deal with, you can manage the cluster from any of the nodes using the pcs command. However, I don't know your exact setup, and I'm not about to try to solve this issue through a forum topic, since there is a lot of information that could be missed or miscommunicated. As said before, the best way to solve this would be to hire a Linux sysadmin with clustering experience who can log in to the system and view the setup for themselves.
 


I can't change upper management's budgets; they denied all my requests.
Currently the system is on SP1. I don't do any updates...

I typed the command pcs status as root, but I got nothing back. Are you sure that's a SUSE command, or is it for other distros only?
I am not worried about anything; my upper management is worried and I need to resolve their issues, because for them this is critical... I never set up the system. I'm fairly new, so I don't know who did the setup; they are probably no longer here...
When I type pcs I get the message: if pcs is not a typo you can use command-not-found to look up the package that contains it, like this: cnf pcs
When I enter cnf pcs: command not found.
 
I work with RHEL, so if that command can't be found, there must be a different variant or format of the Pacemaker command on SLES. Tell them it's not your area of expertise, so you can't solve something you have no expertise or experience in. Basically tell them to go to hell in such a way that they will enjoy the trip there. I would do the same if the person above me came to me stressing out about how something broke on a Windows server. It's also very unrealistic to ask someone with no expertise in a certain area to fix a problem on something they know nothing about. It's management's problem, not yours. If I wanted to, I could probably figure out eventually how to do it on SUSE, but I'm not about to spend a load of time on a problem that the management of a company created. You may have better luck on the openSUSE forums.

I sincerely mean this: good luck, and I hope the management there pulls their heads out of their asses (but that's not likely to happen).
 

They are aware that it's not my area of expertise, but they simply don't care; they think of me as an all-inclusive hotel... I will try to resolve it if possible... Can a simple reboot resolve this issue?
Can you point me to the openSUSE forums? What are they called? On the official SUSE forum nobody is around.
 
Sounds like typical management. As I said before, from my understanding you are worried about the cluster services/resources not running if the master node fails, and it was acting strange, so the cluster services were transferred over to another node. Am I understanding that correctly?

If yes, then I can tell you from my experience with Corosync/Pacemaker clusters on RHEL that you can manage the cluster from any node that is a member of the cluster, and if one node fails completely, the resources will automatically be shifted to a node that is healthy. On RHEL the command for Pacemaker is pcs, but it seems another command is used on SLES. Corosync/Pacemaker should work the same whether it runs on RHEL, SLES or another Linux distribution: the cluster services can be managed from every node in the cluster, and if one node fails, the cluster resources/services shift over to a healthy node.

I just did a quick search for you and found this.
And it looks like the command crm is used on SLES. So try running the following as root on each of the nodes.
Code:
crm status
And from all nodes you should get back the status of the cluster. Try it and see what happens.
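To answer "which node is master right now" without reading the whole status by eye, the text can be filtered. A small sketch, assuming the `Masters: [ node ]` line format that Pacemaker 2.0.x prints for promotable resources; `masters_from_status` is a made-up helper name.

```shell
#!/bin/sh
# Print the node(s) currently holding the master role.
# Reads `crm status` or `crm_mon -1` text on stdin.
# Assumes lines of the form "Masters: [ node1 node2 ]".
masters_from_status() {
    awk '/^[ \t]*Masters: \[/ {
        gsub(/.*\[ */, "")   # drop everything up to the opening bracket
        gsub(/ *\].*/, "")   # drop the closing bracket and anything after it
        print
    }'
}

# Usage on a cluster node (as root):
#   crm status | masters_from_status
```

Running it on both nodes should print the same node name if the nodes agree on the cluster state.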
 
Thank you so much for the command.
The crm status command works.

I will try to explain again.
Because SAP HANA filled the disk with logs, the system decided to switch: it made server02 the master and server01 the slave.
All of the disk issues are fixed now, but I don't know how to revert this.
I am not worried; upper management is worried and I am there to help.
I really love IT stuff; that's why I also work out of hours like this for free, just for the sake of the technology, to learn something new.

So you are telling me that right now, if server02 (the current master) fails for some reason, the slave server01 will take its place? Can you double-confirm?



When I enter the command I get back:

Stack: corosync

Current DC: server01, version 2.01, some numbers, partition with quorum
Stack: corosync
Current DC: server01, some similar number, partition with quorum

Last change: 2022, by root via crm_attribute on server02

2 nodes configured
7 resources configured

Online: server01, server02

Full list of resources:
rsc_st_azure, stonith, fence_azure arm, started server01
clone set: cln _saphana topology h71 hdb00, rsc_sap hana topology... numbers...

started: server01, server02
clone set: msl_saphana_h71 numbers... rsc_sap_hana h71 numbers... promotable?

Masters: server02
Slaves: server01
resource group: g ip numbers... and some more stuff under that.
After that I get
Failed resource actions:
rsc_sap hana numbers promote 0, on server01, unknown error, 1, call=106, status complete, exitreason =, last rc change Monday 25 ...


Maybe the unknown error is the reason; that is simply a guess. At least I have the right command now.

Can you tell me if this cluster is a Linux cluster or a special SAP HANA cluster? How do I tell the difference?
 
Looking at the first part of your output (up to the topology clone set), it looks like there are two nodes in this cluster: server01 and server02. It looks like server02 is the active one, and I see a fence device as a resource and a saphana resource. Normally with physical clusters you would have a fence device running on each node. Since there is only one fence device and it has azure in its name, it makes me think this is a virtual cluster.
Looking at the second part of your output (from the msl clone set down), the only thing that looks familiar is the "failed resource actions" section. That makes me think this is a Corosync/Pacemaker cluster with HANA-specific stuff running on it, which I have no experience with. Having run the wrong command (pcs) will not have affected the resources, since that command was not available on the system. You would have to run a correct command together with a correct subcommand to have an effect on the cluster.

Looking at the failed resource actions, it looks like an error happened on server01, which probably caused the resources to shift to server02. If I were to guess, the HANA part does work with a master and a slave server, and currently the master is server02 and the slave is server01. I would say I am about 75% sure it behaves the way I would expect, but I don't know how the HANA part works, since that seems to be really specific.
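A failed action like this generally stays listed until it is cleaned up or a configured failure timeout expires. Below is a sketch that turns each failed action into the matching `crm resource cleanup` command and only prints it, assuming the Pacemaker 2.0.x line format `* <resource>_<action>_<interval> on <node> ...`. `cleanup_cmds_from_status` is a hypothetical helper, and the real output on this cluster may be laid out differently from the retyped version in the thread.

```shell
#!/bin/sh
# Print (not run) a cleanup command for every failed resource action.
# Reads `crm status` or `crm_mon -1` text on stdin.
cleanup_cmds_from_status() {
    awk '/^[ \t]*\* .* on / {
        op = $2      # e.g. <resource>_promote_0
        node = $4    # node the action failed on
        sub(/_[a-z]+_[0-9]+$/, "", op)   # strip the "_<action>_<interval>" suffix
        print "crm resource cleanup " op " " node
    }'
}

# Usage on a cluster node (as root):
#   crm status | cleanup_cmds_from_status
```

Whether cleaning up is enough for the promote to succeed afterwards depends on why it failed, which the Pacemaker and HANA logs on server01 should show.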

It sounds like you want to switch the resources back to server01 now that the disk problems are solved. I'm not sure what type of disk it is, but since it seems to be an active/passive setup I would expect a network or SAN disk. With a Corosync/Pacemaker cluster, switching the resources to another node basically means telling the resource or the resource group to move over to that node.

However, I don't see a group configured here, so chances are you will have to move the resources one by one, and I'm not sure what effect that will have on the services running there. So I would still get someone on location who knows more about the HANA part. Also, since the resources are running on server02 now and the services are available, I wouldn't worry about switching the master role back to the other node. That gives the company time to find someone who actually knows and understands how the HANA part of the cluster setup works. That's my opinion of the whole setup. Good luck!
 
Hi, can you tell me: do you think these are Linux clusters,
or are they specific SAP HANA clusters? Which of the two is the case? All of the servers are virtual machines, yes; nothing is physical. I guess it's a virtual cluster. The server disks are from Azure, so possibly they are virtual disks as well:
premium SSD storage.
 
Yes, it's a Linux cluster: a Pacemaker/Corosync setup, but it seems a bit different from the clusters I have set up before. I haven't used the promote/demote settings, and I'm getting a strong impression that SAP HANA is just the software running on top of the clustering setup, since the resource is called saphana.
So it seems the failed resource (the promote action in your output) was trying to start up again, and because of the ordering the cluster tried it on server01, where it failed, so it then probably started the resources on server02.
 

Is this cluster more a matter for an SAP HANA person or for a Linux expert? Where would you put it?
 
