inn 2.7.0, expire takes ages and hangs

Discussion:

(too old to reply)

Adam W.

2024-10-29 13:47:54 UTC

Hi,

Users started complaining that they're getting "400 Expiring process ..."
errors on my server. I looked and it's true, expire was running since it
was started from the cron at night, but it was hanged. strace showed that
expire process (expire -v1, called from news.daily) was waiting for data
on a socket:

$ strace -f -p 28158
strace: Process 28158 attached
select(6, [5], NULL, NULL, {tv_sec=28, tv_usec=42890}

It just waits on select, and does nothing else.

$ ls -l /proc/28158/fd/
total 0
lr-x------ 1 news news 64 Oct 29 14:23 0 -> 'pipe:[318604403]'
l-wx------ 1 news news 64 Oct 29 14:23 1 -> 'pipe:[318607484]'
lrwx------ 1 news news 64 Oct 29 14:23 10 -> /usr/local/news/spool/cnfs/cnfs1
lrwx------ 1 news news 64 Oct 29 14:23 11 -> /usr/local/news/spool/cnfs/cnfs2
lrwx------ 1 news news 64 Oct 29 14:23 12 -> /usr/local/news/spool/cnfs/cnfs3
lrwx------ 1 news news 64 Oct 29 14:23 13 -> /usr/local/news/spool/cnfs/cnfs4
lrwx------ 1 news news 64 Oct 29 14:23 14 -> /usr/local/news/spool/cnfs/cnfs5
lrwx------ 1 news news 64 Oct 29 14:23 15 -> /usr/local/news/spool/cnfs/junk1
lrwx------ 1 news news 64 Oct 29 14:23 16 -> /usr/local/news/spool/cnfs/ctrl1
lrwx------ 1 news news 64 Oct 29 14:23 17 -> /usr/local/news/spool/cnfs/priv1
lrwx------ 1 news news 64 Oct 29 14:23 18 -> /usr/local/news/spool/cnfs/big81
lrwx------ 1 news news 64 Oct 29 14:23 19 -> /usr/local/news/spool/cnfs/alt1
l-wx------ 1 news news 64 Oct 29 14:23 2 -> 'pipe:[318607484]'
lrwx------ 1 news news 64 Oct 29 14:23 20 -> /usr/local/news/spool/cnfs/spam1
lrwx------ 1 news news 64 Oct 29 14:23 21 -> /usr/local/news/spool/cnfs/spam2
l-wx------ 1 news news 64 Oct 29 14:23 22 -> /usr/local/news/db/history.n
lr-x------ 1 news news 64 Oct 29 14:23 23 -> /usr/local/news/db/history.n
lr-x------ 1 news news 64 Oct 29 14:23 24 -> /usr/local/news/db/history
lrwx------ 1 news news 64 Oct 29 14:23 3 -> '/tmp/#7734312 (deleted)'
lrwx------ 1 news news 64 Oct 29 14:23 4 -> 'socket:[318607485]'
lrwx------ 1 news news 64 Oct 29 14:23 5 -> 'socket:[318607492]'
lr-x------ 1 news news 64 Oct 29 14:23 6 -> /usr/local/news/db/history
lrwx------ 1 news news 64 Oct 29 14:23 7 -> /usr/local/news/db/history.n.dir
lrwx------ 1 news news 64 Oct 29 14:23 8 -> /usr/local/news/db/history.n.index
lrwx------ 1 news news 64 Oct 29 14:23 9 -> /usr/local/news/db/history.n.hash

$ cat /proc/28158/fdinfo/5
pos: 0
flags: 02
mnt_id: 9
scm_fds: 0

I killed the process, restarted the server, and it seems to be fine
(unless it's not?).

So now questions:

1. Is it possible that I messed anything up by forcefully killing it?

2. Why did it hang / how can I diagnose it / can I diagnose it?

3. I added noexpire to news.daily cron line to avoid it in the future. I
guess there will be no consequences in my setup, as:

a) I'm using huge CNFS buffers for bulk storage (around 100 GB),
b) I'm using timehash for low-volume local groups (currently 53 MB),
c) my expire.ctl is set to never expire ("*:A:never:never:never").

Am I correct?

4. What is responsible for expiring history? Expireover or expire? I'd
guess the former (which I still keep enabled), but now I'm not sure
anymore.

BTW, I remember that it happened before, here's the sample report, just
nobody complained then (which is strange, because people are using this
server constantly, maybe it wasn't throttled then?). Note the time it took
expire to run:

expire begin Sat Oct 19 03:08:38 CEST 2024: (-v1)
Article lines processed 6057586
Articles retained 6051793
Entries expired 5793
expire end Sat Oct 19 16:15:51 CEST 2024

One earlier run was even longer:

expire begin Thu Oct 10 03:10:09 CEST 2024: (-v1)
Article lines processed 6012924
Articles retained 6007938
Entries expired 4986
expire end Fri Oct 11 19:58:42 CEST 2024

Most of the runs take at most two minutes, like this one:

expire begin Mon Oct 7 03:06:06 CEST 2024: (-v1)
Article lines processed 5986522
Articles retained 5979903
Entries expired 6619
expire end Mon Oct 7 03:07:51 CEST 2024

Another question: why does it expire anything if I have expiration
disabled? I guess "Entries expired" should be 0 in all cases? Or is it
about removing history entries for articles that are no longer in CNFS (or
that are too old, I have "remember" set to 721: "/remember/:721")?

Thanks.

Adam W.

2024-10-29 14:07:59 UTC

Permalink

Post by Adam W.
2. Why did it hang / how can I diagnose it / can I diagnose it?

I'll answer myself... this might be some hint:

expire begin Tue Oct 29 03:08:30 CET 2024: (-v1)
expire: cannot reserve server
Article lines processed 0
Articles retained 0
Entries expired 0
So it tried to reserve the server, and hanged there.

I'm looking at ICCcommand code and I think this condition is wrong:

#v+
/* No data -- if we timed out, return. */
if (ICCtimeout) {
#v-

If ICCtimeout is zero (and it is, because expire never calls ICCsettimeout
and 120 second timeout is used then), then this block will never be
entered. I guess the block was there, someone added a default timeout to
the line before select (T.tv_sec = ...) to prevent select() from hanging
forever if the app didn't specify the timeout, but forgot to remove this
"if" (the block should execute always, not be guarded by "if")?

So I guess we have two issues here. First: why there's no response to the
command, and how to diagnose it? Second: the bug (if I understand it
correctly) in the above code.

The problem is that I can't really do any disruptive tests as it's the
production server, it has to work (currently it's the most used news
server in Poland).

Matija Nalis

2024-11-02 16:27:57 UTC

Permalink

Post by Adam W.
$ strace -f -p 28158
strace: Process 28158 attached
select(6, [5], NULL, NULL, {tv_sec=28, tv_usec=42890}
It just waits on select, and does nothing else.

Usually I'd use lsof(8) to see what is on the other side of that socket, and
try to continue debugging there.

Post by Adam W.
1. Is it possible that I messed anything up by forcefully killing it?

Was it regular TERM or KILL (-9) signal?
But I guess it should be fine in either case, since it was paused and was
not updating anything on the disk.

Post by Adam W.
2. Why did it hang / how can I diagnose it / can I diagnose it?

you can add "-v level" to increase verbosity of expire(8).

Post by Adam W.
3. I added noexpire to news.daily cron line to avoid it in the future. I

<snip>

well, yes in the short term, but see below...

Post by Adam W.
4. What is responsible for expiring history? Expireover or expire? I'd
guess the former (which I still keep enabled), but now I'm not sure
anymore.

According to expire(8), it should be the one expiring the history.

So, if you added "noexpire" to news.daily (which avoids calling expire(8)
AFAICT), your history database would likely grow unbounded (unless you trim
it yourself elsewhere)

Post by Adam W.
Another question: why does it expire anything if I have expiration
disabled?

And how exactly did you have "expiration disabled"?
You still seemed (at the time) to be running expire(8) via cron or
something (news.daily?)

Now that you have added "noexpire" to cron.daily, I would expect that
expire(8) would not be called at all.
Or do you have reason to believe otherwise?

--
Opinions above are GNU-copylefted.

Adam W.

2024-11-04 22:38:12 UTC

Permalink

Post by Matija Nalis

Post by Adam W.
It just waits on select, and does nothing else.

Usually I'd use lsof(8) to see what is on the other side of that socket, and
try to continue debugging there.

Good idea, I forgot about this possibility, and I wanted to bring server
back up quickly.

Post by Matija Nalis

Post by Adam W.
1. Is it possible that I messed anything up by forcefully killing it?

Was it regular TERM or KILL (-9) signal?

TERM, expire process accepted it.

Post by Matija Nalis

Post by Adam W.
2. Why did it hang / how can I diagnose it / can I diagnose it?

you can add "-v level" to increase verbosity of expire(8).

Nice. I just read the manual (should've done it earlier...), now it's more
clear.

Post by Matija Nalis

Post by Adam W.
Another question: why does it expire anything if I have expiration
disabled?

And how exactly did you have "expiration disabled"?

I mean the article expiration, not history. I use CNFS, so articles expire
only when buffers roll over, and I don't forcefully expire my timehash (at
least for now). So the only expiration I'd expect is cleaning the history
(but probably these statistics are also about it).

For some reason I was convinced that expire only expires articles, and
expireover takes care of the overview AND history. Which doesn't make
sense, as overview can be disabled on transit servers, but they still need
history...

Post by Matija Nalis
Or do you have reason to believe otherwise?

No, it's clear now.

So, it seems I still need expire to trim the history.

I think I'll make the copy of the server (I'll just copy the running
instance, it might be inconsistent but it won't matter), reconfigure its
feeds (so my server feeds the copy), and experiment on a copy. Lots of
data to transfer over the Internet, but I don't have a better idea on how
else to diagnose it...

Adam W.

2024-11-05 15:50:47 UTC

Permalink

Post by Matija Nalis
Usually I'd use lsof(8) to see what is on the other side of that socket, and
try to continue debugging there.

What I know now is that the socket is:

/usr/local/news/run/ctlinndcKoXl2

It appears when expire process starts and disappears when it terminates.
In strace I see that this path is sent to the /usr/local/news/run/control
socket, owned by innd.

I guess that's how ctlinnd works...

So it's possible that something went wrong between expire and innd,
probably in innd itself... some race?

I see that innd is waiting on four sockets now, one of them being
/usr/local/news/run/control, another /usr/local/news/run/nntpin, and
two other being IPv4 and IPv6 listening sockets.

Maybe there's some race in that select loop, maybe if one of these sockets
is readable in just the wrong time, the other is ignored... I'd have to
dig deeper into inn's source.

What I'm surprised is that I'm the first person to stumble upon this bug,
and my setup is pretty typical -- normal modern Debian system, vanilla inn
without any modifications, etc.

Post by Matija Nalis
you can add "-v level" to increase verbosity of expire(8).

Unfortunately all it seems to be doing is printing history lines and what
it's doing with them. Might be useful, but I don't know if in this case.

I'm testing it, but it doesn't want to hang yet. Hopefully it will. I'm
also thinking of stress-testing just the ctlinnd part of it.

Julien ÉLIE

2024-11-02 20:41:58 UTC

Permalink

Hi Adam,

Post by Adam W.
3. I added noexpire to news.daily cron line to avoid it in the future. I
guess there will be no consequences in my setup

expire rebuilds the history file, and makes the database more efficient
because it resizes it depending on the number of stored articles. I
would therefore suggest to keep running expire.

Post by Adam W.
4. What is responsible for expiring history? Expireover or expire?

The expire program cleans up and removes old entries in the history
database.

Post by Adam W.
expire begin Sat Oct 19 03:08:38 CEST 2024: (-v1)
Article lines processed 6057586
Articles retained 6051793
Entries expired 5793
expire end Sat Oct 19 16:15:51 CEST 2024

I don't know why it took so much time. It definitively shouldn't have.

Post by Adam W.
Another question: why does it expire anything if I have expiration
disabled? I guess "Entries expired" should be 0 in all cases? Or is it
about removing history entries for articles that are no longer in CNFS (or
that are too old, I have "remember" set to 721: "/remember/:721")?

expire indeed removes entries for articles no longer present in your
spool because they were cancelled or superseded for instance for your
timehash spool, or your CNFS buffers have wrapped. Also, the history
keeps track of Message-IDs rejected by filters like cleanfeed, pyClean
or like. These Message-IDs are kept during /remember/ days.

--
Julien ÉLIE

« Ex nihilo nihil. » (Perse)

Adam W.

2024-11-04 22:39:51 UTC

Permalink

Post by Julien ÃLIE
expire rebuilds the history file, and makes the database more efficient
because it resizes it depending on the number of stored articles. I
would therefore suggest to keep running expire.

Thanks Julien. It's clear now. I'll keep digging then...