Re: [Hampshire] Repeated server crash overnight

Author: James Dutton via Hampshire
To: rob, Hampshire LUG Discussion List
CC: James Dutton
Subject: Re: [Hampshire] Repeated server crash overnight
On Mon, 13 Mar 2023 at 08:03, rmluglist2--- via Hampshire <hampshire@???> wrote:

> Hi all
> I have an Ubuntu box which is on 24/7/365. It has ufw running allowing
> nothing from outside my lan.
> A couple of times recently, I’ve come in to find the machine locked up
> with a lot of disk access (it can be ping’d but I can’t ssh into it and it
> doesn’t respond to mouse or keyboard on the console – only power cycling
> brings it back). As I say, this has now happened twice in the last 3-4
> nights.

I have seen this behaviour sometimes.

By default, Linux can stall all interactive sessions while the disk is
under heavy load.
Heavy disk access can be caused by a number of things:
1) Some app actually needs the disk.
2) Faults on the disk, causing many retries.
3) Swap file access.

After a reboot, you can look for faults on the disk with "smartctl -a
/dev/sda" and see if there are any log messages there about failed sectors,
or sector reallocation counts increasing etc.
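
As a rough sketch of that check, the filter below picks the sector-related
attributes out of the "smartctl -A" table and prints only those with a
non-zero raw value. It assumes an ATA/SATA disk that reports the usual
attribute names (NVMe drives report different fields), and the function
name smart_bad_sectors is made up for illustration.

```shell
# Read a "smartctl -A" attribute table on stdin and print the attributes
# that usually point at failing sectors, but only when the raw value
# (10th column) is greater than zero.
smart_bad_sectors() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ \
         && $10 + 0 > 0 { print $2 "=" $10 }'
}

# Usage (needs root and the smartmontools package; /dev/sda is an example):
#   sudo smartctl -A /dev/sda | smart_bad_sectors
```

No output means none of those three counters has moved; it does not prove
the disk is healthy, so the full "smartctl -a" log is still worth reading.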

If an app needs the disk, it is probably something kicked off by cron.
You can force these apps to use a lower I/O priority with "ionice".
Google ionice for suitable ways to run it.
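
One possible way to run it, as a sketch: wrap the command so it runs in the
idle I/O class (-c 3) and at the lowest CPU priority. The wrapper name
run_low_priority is hypothetical, and as far as I know the I/O class only
has an effect with I/O schedulers that honour priorities (BFQ, for
example), so treat this as something to try rather than a guaranteed fix.

```shell
# Run the given command with idle I/O priority and the lowest CPU priority.
# "ionice -c 3" selects the idle class: the command only gets disk time
# when nothing else is asking for it.
run_low_priority() {
    ionice -c 3 nice -n 19 "$@"
}

# Usage, e.g. in a script that cron calls (the command is just an example):
#   run_low_priority updatedb
```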
But I think a good diagnostic step is to disable cron altogether for,
say, a week and see if the problem disappears.
Then at least you will know whether cron and the apps it runs are the cause.
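
Before disabling cron it may help to see what it actually ran in the hours
before a lock up. The sketch below assumes the traditional syslog line
format and that cron logs to /var/log/syslog, as on a stock Ubuntu with
rsyslog; the function name cron_jobs_run is made up for illustration.

```shell
# Keep only the syslog lines where cron started a command, e.g.
#   Mar 13 03:10:01 host CRON[1234]: (root) CMD (some-command)
cron_jobs_run() {
    grep -E 'CRON\[[0-9]+\]: \([^)]+\) CMD '
}

# Usage:
#   cron_jobs_run < /var/log/syslog
#
# To disable cron for the test period and re-enable it afterwards
# (the unit is called "cron" on Ubuntu; other distros may differ):
#   sudo systemctl disable --now cron
#   sudo systemctl enable --now cron
```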

Another possible cause is an app making the machine run low on memory.
A lot of programs don't behave well when memory allocation fails, so the
result can be unpredictable behaviour. This might also cause excessive
swap file access.
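
If the kernel itself ran out of memory, it normally logs that when the OOM
killer runs, so after the reboot the kernel log from the previous boot is
worth a look. A sketch, assuming the log survived the power cycle (either a
persistent journal or /var/log/kern.log); the function name oom_events is
made up for illustration.

```shell
# Keep only kernel log lines that mention the out-of-memory killer.
oom_events() {
    grep -Ei 'out of memory|oom-kill'
}

# Usage (either source, depending on how the box logs):
#   journalctl -k -b -1 | oom_events
#   oom_events < /var/log/kern.log
```

No matches does not rule out a memory problem: a machine that is thrashing
swap can lock up without the OOM killer ever firing.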

These are all problems that are difficult to diagnose while they are
happening, so the trick is to set up monitoring to watch for each of the
possible causes.
E.g. take metrics of free RAM; when the fault happens, you can look at
the metrics graph to see if that is the problem.
Also take metrics of the disk access on a per-app basis.
Normally the lock up will not be immediate; it will get slow first and then
eventually lock up. So at least some metrics are written before the lock up.
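
If you don't already have a metrics system, even a crude logger gives you
something to look at after the reboot. The sketch below records available
RAM and swap in use from /proc/meminfo once per call. The function names
and the log path are made up for illustration, and a proper monitoring
tool would do this better.

```shell
# Print one line of memory metrics parsed from a /proc/meminfo-style file
# (defaults to /proc/meminfo itself). Values are in kB.
mem_snapshot() {
    awk '$1 == "MemAvailable:" { avail  = $2 }
         $1 == "SwapTotal:"    { stotal = $2 }
         $1 == "SwapFree:"     { sfree  = $2 }
         END { printf "mem_available_kB=%d swap_used_kB=%d\n", avail, stotal - sfree }' \
        "${1:-/proc/meminfo}"
}

# Append a timestamped snapshot to a log file (hypothetical default path;
# writing under /var/log needs root).
log_metrics() {
    logfile=${1:-/var/log/lockup-metrics.log}
    printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$(mem_snapshot)" >> "$logfile"
}

# Usage: sample once a minute in the background.
#   while true; do log_metrics; sleep 60; done &
#
# For per-app disk access, one option (assuming the sysstat package is
# installed) is to log the output of:
#   pidstat -d 60
```

If available memory falls steadily and swap use climbs in the last entries
before the lock up, memory is the likely culprit; if both stay flat, look
at the disk figures instead.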

Kind Regards
