CLEANACCESS Archives

October 2005

CLEANACCESS@LISTSERV.MIAMIOH.EDU

Subject:
From:
"Rajesh Nair (rajnair)" <[log in to unmask]>
Reply To:
Perfigo SecureSmart and CleanMachines Discussion List <[log in to unmask]>
Date:
Tue, 11 Oct 2005 11:22:44 -0700
Content-Type:
text/plain
Parts/Attachments:
text/plain (228 lines)
Jason,

I am one of the "Perfigo" people.  I hope whatever I have to say is more
useful.  Else, I have to go hang my head in shame... :-)

I am glad to hear that there have been no "packages/applications" that
were installed on the CAMs. 

Jason, I don't believe that the problem is that the read-only accounts
have been created or destroyed.  It's that the "ReadOnly" group in
pg_catalog.pg_group is not there on the standby machine.  Here's what
happens:

1) The standby machine comes up (either a reboot or a service perfigo
start). 
2) It detects that the peer is alive and active. It detects that it can
talk to the peer and get its database.
3) It saves its own db in a pre-failover snapshot and gets a snapshot
from its peer. 
4) Owing to the changes in the db schema on the peer, the snapshot
contains "unexpected" SQL statements (by unexpected, I mean directives
that are not generally in any out-of-the-box CCA installation's db
snapshot, e.g. GRANT SELECT ON.... TO ReadOnly).
5) The standby machine unzips the snapshot and proceeds to try and
create the database from the uncompressed snapshot. 
6) In doing so, it encounters the "GRANT .... " statement. 

Now, if the group "ReadOnly" exists in pg_catalog.pg_group, this
statement would be executed and generate no errors and CCA's db sync
process would be none the wiser.  However, on the standby, this group
does NOT exist and that is what creates the errors.  
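To make steps 4 to 6 concrete, here is a minimal sketch of how one might check an uncompressed snapshot for group grants before the standby tries to restore it.  It assumes the snapshot is plain pg_dump-style SQL with one statement per line and that group grants are written with the GROUP keyword; the function names and the sample dump are made up for illustration and are not part of CCA.

```python
import re

# Matches the grantee of single-line statements such as:
#   GRANT SELECT ON TABLE foo TO GROUP "ReadOnly";
# Assumption: group grants carry the GROUP keyword, as in older pg_dump output.
GROUP_GRANT = re.compile(
    r'^\s*GRANT\b.*?\bTO\s+GROUP\s+"?([A-Za-z_][A-Za-z0-9_$]*)"?',
    re.IGNORECASE,
)


def groups_referenced(snapshot_sql):
    """Return the set of group names that GRANT statements in the dump refer to."""
    found = set()
    for line in snapshot_sql.splitlines():
        match = GROUP_GRANT.match(line)
        if match:
            found.add(match.group(1))
    return found


def missing_groups(snapshot_sql, groups_on_standby):
    """Groups the snapshot needs that the standby's pg_group does not list."""
    return groups_referenced(snapshot_sql) - set(groups_on_standby)


# Hypothetical snapshot fragment; table and group names are illustrative.
example_dump = '''
CREATE TABLE foo (id integer);
GRANT SELECT ON TABLE foo TO GROUP "ReadOnly";
GRANT ALL ON TABLE foo TO postgres;
'''

# An empty list stands in for a standby whose pg_group has no entries.
print(sorted(missing_groups(example_dump, [])))
```

Any name this reports is a group that the restore on the standby would trip over.  Statements that span several lines are not handled by this sketch.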

Hence, my earlier statement that if you manually created the entry in
pg_group on the standby machine (exactly the same entry as the active
machine has), things might just work.  

At some point, this information (pg_group entry or entries) was lost on
the standby.  Now, I can say with certainty that the failover process
does not alter any pg_catalog information.  I can also say that the
3.5.5 or 3.5.6 upgrades do not alter that info.  Of course, a
re-install will remove that information.  So, I can understand why it
doesn't work now.  During the initial upgrade process, did you follow
all the instructions for a failover upgrade or was there some variation?
The reason I ask is because a normal upgrade should not cause you to
lose pg_catalog information simply because we don't change that.  And
these particular upgrades did not even have any postgres
patches/upgrades/etc. 

One thing I do want to mention - in the future, it is entirely possible
that we upgrade postgres versions or apply postgres patches.  We can
only test that such version upgrades/patches do not affect anything that
we do to the database.  I am sure you can understand that we cannot
possibly know or test for such changes affecting other changes to the
database that we are unaware of.  

Hence, I would request that you use the API
(https://<cam>/admin/cisco_api.jsp) to interact with the database if
possible.  I am also aware that the API might have shortcomings w.r.t.
missing functionality, etc.  Please make us aware of that.  Let us know
what else might be missing.  When you talk to Cisco TAC, make sure you
ask them to document enhancement requests.  We can and do add additional
API functions regularly.  

HTH,
Rajesh.


-----Original Message-----
From: Perfigo SecureSmart and CleanMachines Discussion List
[mailto:[log in to unmask]] On Behalf Of Jason Richardson
Sent: Monday, October 10, 2005 7:30 PM
To: [log in to unmask]
Subject: Re: Problems with database sync between CAMs after upgrade to
ver. 3.5.5

Hi Rajesh, we have been having that same conversation with Cisco SEs
since last Friday, and all of the sync problems started somewhere
between 5AM and 5:10AM last Wed. when we did the upgrade to 3.5.5. 
Prior to that the CAMs were synced just fine so the upgrade was at least
the catalyst for our problems.

One of the problems has been differences in terminology as the SEs have
also referred to the "package" or "app" that we installed as reason for
telling us that our config is unsupportable and that we have to reload
everything from scratch.  My tech, however, insists that he didn't
install an app or package at all, he simply edited the pg_hba.conf file
to allow read-only access to the postgres database by a few other
machines, and created a few read-only database accounts for doing so,
following instructions that have been posted on this listserv, although
not posted by anyone at Cisco.  We have since removed those accounts to
restore the system to plain vanilla, but the problem remains.  Maybe the
differences in terminology are irrelevant and we're talking about the
same thing.  The end result is that we ended up reloading the OS on the
back-up CAM at the SE's suggestion, but we are not inclined to take our
primary down to reload it without talking to someone who knows postgres
better than the L2 that we talked to today.  I'm assuming that some of
the Perfigo people came over when Cisco acquired the company so I am
hopeful that we will eventually be put in touch with one if we hold on. 
We will try what you suggested (with all caveats taken into account) and
continue to work with the L2s and hope that someone here has something
to add.

Thanks,

---
Jason Richardson
Manager, IT Security and Client Development Enterprise Systems Support
Northern Illinois University


>>> [log in to unmask] 10/10/05 9:10 PM >>>
Hi Jason,

I am from Cisco and I know about this case.  I don't think this has
anything to do with the upgrade to ver 3.5.5. 

I believe what happened on your CAMs is that the package that you
installed (pgadmin or something similar) modified the database system
tables.  For instance, one of the differences I noticed was that there
were entries in pg_group whereas in an "unmodified" CAM, there are no
entries in the pg_group.  

The package that you installed on the CAM modified NOT ONLY the system
tables but also modified the CCA database (controlsmartdb).  Try the
following on your CAM that is up and running:
# psql -h 127.0.0.1 controlsmartdb postgres
controlsmartdb=# \dp

What you will see is that each of the tables has some additional access
privilege information.  This information, as you can see, has been
written into the controlsmartdb database. 
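For anyone who wants to pick the group grants out of that output, here is a small sketch that parses one "Access privileges" value.  It assumes the ACL text format documented for PostgreSQL of that era, in which a group grantee appears as "group NAME=privileges/grantor"; whether the CAM's PostgreSQL prints exactly this format is an assumption, and the sample string is invented.

```python
def group_grants(acl_text):
    """Return {group_name: privilege_letters} found in one ACL string."""
    grants = {}
    inner = acl_text.strip().strip("{}")
    # Simple comma split; assumes role and group names contain no commas.
    for item in inner.split(","):
        item = item.strip().strip('"')
        if not item.startswith("group "):
            continue  # plain user or PUBLIC entry
        grantee, _, rest = item[len("group "):].partition("=")
        # Privilege letters come before the "/grantor" suffix.
        grants[grantee] = rest.split("/", 1)[0]
    return grants


# Hypothetical value from the access privileges column for one table.
sample_acl = '{postgres=arwdRxt/postgres,"group ReadOnly=r/postgres"}'
print(group_grants(sample_acl))
```

A table whose entry yields a non-empty result here is one for which the snapshot will carry a group GRANT.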

Hence, what happens is that when the inactive box is rebooted/restarted
and gets the database snapshot from its peer for synchronization, the
database snapshot will contain information about these privileges (e.g.
it will contain instructions to grant read access to <foo_table_name>
for the "ReadOnly" group).  However, such access privilege information
is not valid on this inactive machine because those groups (e.g.
ReadOnly) do not exist for whatever reason.  You will have to consult
the documentation for the package you installed for more information. 

I suspect that the only issue would be the group information and the
entries in the pg_group system table.  However, I am unsure as I don't
know what package was installed and may not be able to comment about the
package because we won't necessarily know how it functions. 

Hence, what the Cisco TAC engineer told you is entirely reasonable from
his/her point-of-view.  Since they don't know what package you
installed, nor could they be expected to know how that package affects
the system, they would not be able to hazard any suggestions.  

I can offer a suggestion - but please note that this is only a
suggestion and may not work at all and should be taken with more than a
pinch of salt because I am totally unfamiliar with the package you tried
to install on the CAM.  :-) Sorry, I have to provide the disclaimer up
front. 

I suspect that if you try to replicate the entries from pg_group on the
working system to the pg_group on the inactive system and then try the
failover, it might work.  If the only issue with the database restore is
that the appropriate pg_group entries (i.e. ReadOnly) are not available,
this might work.  However, you might very well run into other issues
(i.e. other changes to pg_catalog system that I am unaware of at this
point) and this might only be the first one. 

Please let me know how things proceed.

Regards and hope this helps,
-Rajesh.

-----Original Message-----
From: Perfigo SecureSmart and CleanMachines Discussion List
[mailto:[log in to unmask]] On Behalf Of Jason Richardson
Sent: Monday, October 10, 2005 5:49 PM
To: [log in to unmask]
Subject: Problems with database sync between CAMs after upgrade to ver.
3.5.5

Hi all, ever since upgrading our two CAS and CAM servers from 3.5.3.1 to
3.5.5, and the agent to 3.5.8, (the Cisco SE that we trust to give us
good advice was not comfortable with 3.5.6 or 3.5.8 yet), we have been
unable to get our CAMs to sync the database.  We have two for HA, but we
have only been running with our primary since last Wed. AM when we
completed the upgrade.  I've pasted my tech's explanation of the issue
below.  Please let us know if you have experienced the same or anything
like it because we have pretty much exceeded the knowledge of the Cisco
L2 who has been working with us.  The current status is that the back-up
CAM has been reinstalled, but it will not sync with the primary because
it hangs on a non-existent postgres user group named "read_only".  The
accounts that we created were read only but they have been removed.

TIA,

---
Jason Richardson
Manager, IT Security and Client Development Enterprise Systems Support
Northern Illinois University

 
We had a bit of a meltdown with the backup CAM. We upgraded to version
3.5.5 last Wednesday and after the patch the failover stopped syncing
with the main database. Our upgrade happened at about 5 AM Wednesday
morning and the backup had a copy of the database until 5:11 AM. The
standby was still sending the heartbeat, just the data wasn't in sync. I
had made some changes to the CAMs a while back to allow read only access
to the database, but after the upgrade all the changes had reverted to
original configuration. 
 
What I had done before the upgrade: 
Added IP addresses to pg_hba.conf to allow access to the database
Created read-only account so as not to use the admin account. 
 
With these changes, the main and failover were syncing fine until the
upgrade. Thursday I realized that the changes I had made had been
reverted to defaults so I added them back in. After doing so, I was able
to read the data in the backup and noticed that there was no data since
5:11 AM Wednesday morning. 
 
Our Network Engineers contacted Cisco and were told that because of what
I had done, they were unable to help and therefore need to re-install
the standby. This is where we are now. 
 
I would really like to know what may have caused this loss of
communication between the databases. I'm fairly positive the changes I
made would not have done it as it was syncing fine after I had made
those and the problem arose after the upgrade which set it to defaults.
 
