Newsgroups: news.software.b
Path: utzoo!henry
From: henry@utzoo.uucp (Henry Spencer)
Subject: Re: Lots of dups
Message-ID: <1989Oct26.164042.4692@utzoo.uucp>
Organization: U of Toronto Zoology
References: <1989Oct25.164024.14894@ctr.columbia.edu> <1989Oct25.205129.16397@brutus.cs.uiuc.edu>
Date: Thu, 26 Oct 89 16:40:42 GMT

In article <1989Oct25.205129.16397@brutus.cs.uiuc.edu> coolidge@cs.uiuc.edu writes:
>>I run a Cnews machine with a few high-speed (NNTP) feeds.  My problem is that
>>two of them have excessive (> 80%) duplication.
>
>The problem is that with REALLY fast feeds, even processing articles
>once a minute is not fast enough. The problem lies in nntpd accepting
>multiple copies because the queue hasn't been run yet...

The fundamental problem, however, goes even deeper.  There are two inherently
conflicting desires:

1. Minimum processing latency, so that an article which has arrived is known
	as soon as possible, e.g. so that it need not be brought in again.
	This pushes you towards processing individual articles the instant
	they show up.

2. Minimum processing overhead, so that machine resources devoted to news
	are minimized.  One major way of doing this -- one of C News's
	biggest performance wins over B News -- is to amortize setup and
	teardown overhead over multiple articles.  This means delaying
	processing until several articles can be run as a batch.

There is NO WAY to satisfy both of these desires simultaneously.  All you
can do is strike some sort of compromise, depending on your own priorities.
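
To put rough numbers on the conflict, here is a minimal sketch in Python.
The function names and the cost figures in the usage note are invented for
illustration; they are not measurements of C News or B News.

```python
def overhead_per_article(setup_cost, per_article_cost, batch_size):
    """Average cost per article when setup/teardown is paid once per batch."""
    return setup_cost / batch_size + per_article_cost


def worst_case_wait(arrival_interval, batch_size):
    """How long the first article of a batch waits for the batch to fill,
    assuming articles arrive at a steady interval (an idealization)."""
    return arrival_interval * (batch_size - 1)
```

With a hypothetical setup cost of 10 units and a per-article cost of 1,
a batch of one costs 11 units per article and waits for nothing, while a
batch of ten costs 2 units per article but leaves the first article
unprocessed for nine arrival intervals.  Pushing either number down pushes
the other one up.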

In particular, if you have very fast feeds and optimize for minimum overhead,
you will inevitably receive lots of articles more than once, although C News
will throw away the duplicates quite efficiently.  (I don't really see that
there is cause for great alarm about efficiently-discarded duplicates.)
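
The mechanism can be modelled in a few lines of Python.  This is a toy
under stated assumptions, not C News code: the names are hypothetical, the
receiver refuses an offered article only if its message ID is already in
the history, and the history is updated only when the queue is run.

```python
def run_queue(queue, history):
    """Process queued articles; return how many were discarded as duplicates."""
    discarded = 0
    for msgid in queue:
        if msgid in history:
            discarded += 1        # already known: thrown away cheaply
        else:
            history.add(msgid)    # now positively known to have arrived
    queue.clear()
    return discarded


def count_duplicates(offers, batch_interval):
    """offers: (time, message_id) pairs from all feeds, sorted by time.
    The queue is run every batch_interval time units.
    Returns the number of duplicate copies accepted and later discarded."""
    history = set()
    queue = []
    next_run = batch_interval
    duplicates = 0
    for when, msgid in offers:
        # Run any batches that fell due before this offer arrived.
        while when >= next_run:
            duplicates += run_queue(queue, history)
            next_run += batch_interval
        # The receiver can only refuse what the history already knows about.
        if msgid not in history:
            queue.append(msgid)
    duplicates += run_queue(queue, history)   # final run at end of input
    return duplicates
```

If three feeds offer the same article within a couple of seconds and the
queue is run once a minute, all three copies are accepted and two are
discarded at processing time.  Run the queue every second and the later
copies are refused outright, at the price of a queue run per article.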

C News is generally slanted towards minimum overhead, given our observation
that B News was increasingly eating our machines alive doing one article at
a time.

(Incidentally, the oft-seen suggestion of running relaynews as a daemon
doesn't really help very much.  An article is not really received until
the article itself, its history-file line, and the update to the relevant
active-file line(s) have all been flushed out to disk.  Flushing data to
disk is a big part of the setup/teardown overhead.  If you do it once per
article, the overhead goes way up.  If you batch the disk flushing, you're
back to having a significant window in which the article has been received
but this fact is not universally and positively known yet.  The relaynews
daemon might be a net improvement, but it is *not* an escape from the
fundamental dilemma.)
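
The flushing tradeoff looks roughly like this in Python.  It is a sketch
of the general idea, not of how relaynews does its writes; append_history
and flush_every are hypothetical names, and os.fsync stands in for
whatever the system needs to force data out to disk.

```python
import os


def append_history(path, lines, flush_every):
    """Append lines to the file at path, forcing them to disk after every
    flush_every lines.  Returns the number of forced flushes performed."""
    syncs = 0
    pending = 0
    with open(path, "a") as f:
        for line in lines:
            f.write(line + "\n")
            pending += 1
            if pending >= flush_every:
                f.flush()              # push Python's buffer to the OS
                os.fsync(f.fileno())   # ask the OS to push it to disk
                syncs += 1
                pending = 0
        if pending:
            # Lines written since the last flush are the "window": they
            # have been received but are not yet safely recorded.
            f.flush()
            os.fsync(f.fileno())
            syncs += 1
    return syncs
```

With flush_every set to 1 there is one forced flush per line and no
window.  With a larger value the flush count drops, and up to
flush_every - 1 lines at a time sit in the window.
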
-- 
A bit of tolerance is worth a  |     Henry Spencer at U of Toronto Zoology
megabyte of flaming.           | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
