Re: [Hampshire] Repeated server crash overnight

Author: Brad Macpherson via Hampshire
To: hampshire
CC: Brad Macpherson
Subject: Re: [Hampshire] Repeated server crash overnight

gpg: Signature made Mon Mar 13 16:05:38 2023 GMT
G'day all,

On 13/03/2023 14:32, James Dutton via Hampshire wrote:
> On Mon, 13 Mar 2023 at 08:03, rmluglist2--- via Hampshire
> <hampshire@???> wrote:
>     Hi all
>
>     I have an Ubuntu box which is on 24/7/365. It has ufw running
>     allowing nothing from outside my lan.
>
>     A couple of times recently, I’ve come in to find the machine locked
>     up with a lot of disk access (it can be ping’d but I can’t ssh into
>     it and it doesn’t respond to mouse or keyboard on the console – only
>     power cycling brings it back). As I say, this has now happened
>     twice in the last 3-4 nights.
>
> I have seen this behaviour sometimes.
> By default, Linux can stall all interactive sessions while disk access
> is heavy.
> High disk access can be caused by a number of things:
> 1) Some app actually needs the disk.
> 2) Faults on the disk, causing many retries.
> 3) Swap file access.
> After a reboot, you can look for faults on the disk with
> "smartctl -a /dev/sda" and see if there are any log messages there about
> failed sectors, sector reallocation counts increasing, etc.
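
For anyone who wants to check those counters regularly rather than by eye, here is a rough Python sketch. It assumes an ATA disk, where "smartctl -A" prints the usual ten-column attribute table; NVMe drives print a different layout that this will not parse. The function names and the list of watched attributes are my own choices for illustration, not anything smartmontools defines.

```python
import subprocess

# Attribute names as printed by "smartctl -A" for ATA disks. Which ones are
# worth watching is an assumption made for this sketch.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for the watched attributes."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Rows look like: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                found[fields[1]] = int(fields[9])
            except ValueError:
                pass  # raw value was not a plain integer; skip it
    return found


def read_smart(device="/dev/sda"):
    """Run smartctl (normally needs root) and parse its attribute table."""
    # smartctl's exit status is a bitmask and is often non-zero, so it is ignored.
    result = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return parse_smart_attributes(result.stdout)
```

Calling read_smart("/dev/sda") as root once a day and appending the result to a file should make a rising reallocation count easy to spot.
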
> If an app needs the disk, it is probably something kicked off by cron.
> You can force these apps to use a lower I/O priority with "ionice".
> Search for ionice to find suitable ways to run it.
> But I think a good diagnostic step is to disable cron altogether for,
> say, a week, and see if the problem disappears.
> Then at least you will know whether cron and the apps it runs are the
> problem.
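
In case it helps, a minimal sketch of the ionice idea, written as a small Python wrapper that a cron entry could call. "ionice -c 3" asks for the idle I/O class and "nice -n 19" for the lowest CPU priority. As far as I know the I/O class is only honoured by some I/O schedulers (BFQ, or CFQ on older kernels), so treat this as something to try rather than a guaranteed fix. The wrapper names and the example job are hypothetical.

```python
import subprocess


def low_priority(cmd):
    """Prefix cmd so it runs in the idle I/O class at the lowest CPU priority."""
    # ionice -c 3 selects the "idle" class; nice -n 19 is the lowest CPU priority.
    return ["ionice", "-c", "3", "nice", "-n", "19", *cmd]


def run_low_priority(cmd):
    """Run cmd at low priority and return its exit status."""
    return subprocess.run(low_priority(cmd), check=False).returncode
```

For example, low_priority(["updatedb"]) gives the argument list for running updatedb (picked only as an example of a disk-heavy cron job) at idle priority.
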

I've seen this behaviour with ClamAV; in the end I had to remove it. The
database gets to a certain point where it won't fit in memory along with
the rest of the system. Swap doesn't help; you'd need to add RAM to
accommodate it.

> Another possible cause is an app making the machine run low on memory,
> which results in unpredictable behaviour when memory allocation fails;
> it seems a lot of programs don't behave well when that happens. This
> might also cause excessive swap file access.
> These are all problems that are difficult to diagnose while they are
> happening, so the trick is to set up monitoring to watch for each of the
> cases.
> E.g. take metrics of free RAM and, when the fault happens, look at the
> metrics graph to see if that is the problem.
> Also take metrics of the disk access on a per-app basis.
> Normally the lock-up will not be immediate; it will get slow first and
> then eventually lock up. So at least some metrics are written before the
> lock-up.
> Kind Regards
> James
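
If a full monitoring stack is more than you want, the free-RAM metric above can be approximated with a few lines of Python. This is only a sketch: it reads /proc/meminfo, appends one line per sample and calls fsync after each write so the last samples should survive a hard lock-up. The log path, the interval and the function names are arbitrary choices of mine.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: value} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def sample_line(meminfo, now=None):
    """Format one log line: timestamp, available RAM and swap in use (kB)."""
    now = time.time() if now is None else now
    swap_used = meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
    available = meminfo.get("MemAvailable", 0)
    return f"{stamp} mem_available_kb={available} swap_used_kb={swap_used}"


def log_forever(path="/var/tmp/memlog.txt", interval=10):
    """Append a sample every `interval` seconds until interrupted."""
    with open(path, "a") as log:
        while True:
            with open("/proc/meminfo") as f:
                log.write(sample_line(parse_meminfo(f.read())) + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the sample to disk before a possible lock-up
            time.sleep(interval)
```

After the next lock-up, the tail of the log should show whether available memory was collapsing and swap use climbing beforehand. For the per-app disk figures, /proc/<pid>/io has read_bytes and write_bytes counters that could be sampled the same way, though reading them for other users' processes needs root.
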


