
Re: [ProgSoc] Trogdor hates me.



> What happened? Trogdor was up as of 5pm tonight, but was dead by the
> time I got home at 7pm.
>
> If someone could let me know what the heck is going on, I'd be most
> appreciative.

Okay, this is driving me nuts.  I've installed CVS on sutekh as well so
that access to it isn't interrupted, and I've returned sutekh to the MX
rotation, but right now it feels like a game of whack-a-mole.  To
reduce the problem slightly I've added a service alias,
cvs.progsoc.uts.edu.au, for use with CVS.  That way users only need to put
that name into their client, and if a problem pops up with the server it
points to, I can make a single change at the server end rather than every
client having to change its settings.
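
As a rough sketch of what this looks like from the client side: the
snippet below builds a CVSROOT string around the alias instead of the name
of whichever machine currently hosts the repository.  It assumes the usual
:ext: access method, and the helper name, the username and the repository
path are placeholders for illustration, not actual ProgSoc values.

```python
# Hypothetical helper: build a CVSROOT that points at the service alias
# rather than at a specific server such as trogdor or sutekh.

CVS_ALIAS = "cvs.progsoc.uts.edu.au"  # the service alias described above


def build_cvsroot(user, repo_path, host=CVS_ALIAS, method="ext"):
    """Return a CVSROOT of the form :method:user@host:/path."""
    if not repo_path.startswith("/"):
        raise ValueError("repo_path must be an absolute path on the server")
    return f":{method}:{user}@{host}:{repo_path}"


if __name__ == "__main__":
    # "jbloggs" and "/cvsroot" are placeholders, not real values.
    print(build_cvsroot("jbloggs", "/cvsroot"))
```

Because only the alias appears in the client's settings, repointing it at
a different machine needs no change on the client.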

While trogdor has been running for almost three weeks straight without
issues, if this is a sign of unreliability on trogdor's part (and not
just someone pulling a power cord or something), then we're in some
really deep trouble, because it means we have *no* reliably working
servers, since all we have then is:

-yeenoghu: busted since something unknown went horribly wrong while Anand
was performing software upgrades.  It now looks like the filesystem is
corrupt to the extent that it can't boot properly.  Worse than that, the
CD-ROM drive isn't bootable and the floppy drive seems to be broken,
meaning there's no way to boot into anything else.  I *think* I can get
things working, but it's going to take quite some time.

-sutekh: has been intermittently crashing for a few weeks now.  In fact
it's been quite unreliable ever since the power outage a little while ago.
While the cause isn't clear, the complete randomness of the crashes, which
are almost never the same twice, seems to suggest hardware problems.

-trogdor: well, looks like it's just gone down now, but I'll have to see.

-geryon: nowhere near powerful enough to do anything anyway.

As for the two new servers we've received, the current Debian bootable
install CDs don't work properly on them.  While I expect I can get around
this problem, it'll still be a while.  About the only working machines we
have are orgo and medusa, which is fortunate, since the entire ProgSoc
network relies on them being up.  I'm not going to so much as touch them
for now, since I don't want to risk making things worse than they already
are.

While normally I would never ask this, if anyone has a reliable, working,
low-to-mid-end x86 system (400 MHz/128 MB is about all that's needed) to
spare that they can donate or lend to ProgSoc for a while, it'd be quite
welcome.  At a rush I can get basic services onto a machine of that
power in less than a day, and be pretty well certain that it'll work
until I can figure out what the hell is wrong with everything else.  I
hate having services down constantly like this, but unfortunately the
hardware we currently have seems to be falling apart, or is at least
related to our current problems.  Setting up a basic x86 system seems to
be the best way to guarantee services, since there's not much else I can
realistically do for now.

I'll post an update when I have more information, but that's all I have
for now; I only just found out trogdor went down.  If any exec are reading
this and are in tomorrow, could you please unplug geryon's serial
connection from the new Sun Blade it's plugged into and plug it back into
trogdor?  At least then I can do some remote admin on it.

Thanks for your patience,
David

PS: I've also fixed mail on sutekh.  It looks like when Anand upgraded it
to exim4 he used trogdor's config file, which pointed the mailing lists
to the SPARC binary for majordomo.  That obviously doesn't work too well
on sutekh, as I discovered when I first tried to send this mail out.
