[HN Gopher] Compressing data with sample points and polynomial i...
___________________________________________________________________
Compressing data with sample points and polynomial interpolation
Author : speckx
Score : 57 points
Date : 2024-08-21 14:45 UTC (5 days ago)
(HTM) web link (www.johndcook.com)
(TXT) w3m dump (www.johndcook.com)
| jonathan_landy wrote:
| Love this. Very interesting that the same amount of compression
| (samples) can give ever more accuracy if you do a bit more work
| in the decompression -- by taking higher order fits to more of
| the sample points.
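|
| A minimal sketch of that idea (not from the article; the sample
| spacing and test point are illustrative): reconstruct sin(x)
| from the same coarse table with local fits of increasing order
| and watch the error shrink.
|
|       # Sketch: same tabulated samples, higher-order local fits
|       import numpy as np
|
|       xs = np.arange(0, np.pi / 2 + 1e-9, np.pi / 16)  # samples
|       ys = np.sin(xs)                          # tabulated values
|
|       def reconstruct(order, x):
|           # fit a degree-`order` polynomial through the order+1
|           # tabulated points nearest x, then evaluate it at x
|           idx = np.argsort(np.abs(xs - x))[:order + 1]
|           return np.polyval(np.polyfit(xs[idx], ys[idx], order), x)
|
|       x = 0.3
|       for order in (1, 2, 3):
|           print(order, abs(reconstruct(order, x) - np.sin(x)))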
| nerdponx wrote:
| This is pretty much the core principle underlying modern
| machine learning. More parameters mean a more faithful fit to
| the data, at the cost of over-fitting and generalizing poorly
| on unseen data outside the range used to tune the parameters.
| In this particular application, we aren't
| that worried about overfitting because we know the actual
| function used to compress the data in the first place, so we
| know that our decompression function is "correct" _and_ we know
| the range of the data. So we can keep adding parameters to
| reduce reconstruction error. Meanwhile in applied ML and stats,
| cubic and even quadratic models should be used and interpreted
| only with extreme caution and detailed knowledge of the data
| (how it was prepared, what the variables mean, what future data
| might look like, etc).
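|
| A hedged sketch of that trade-off (toy noisy data, not from the
| article): as the polynomial degree grows, the fit to the noisy
| training sample keeps improving while the error against the
| underlying function eventually blows up.
|
|       # Sketch: over-fitting noisy data with high-degree fits
|       import numpy as np
|
|       rng = np.random.default_rng(0)
|       f = lambda x: np.sin(2 * np.pi * x)      # "true" function
|       x_train = np.linspace(0, 1, 12)
|       y_train = f(x_train) + rng.normal(0, 0.1, x_train.size)
|       x_test = np.linspace(0, 1, 200)
|
|       for deg in (1, 3, 9, 11):
|           c = np.polyfit(x_train, y_train, deg)
|           train = np.abs(np.polyval(c, x_train) - y_train).max()
|           test = np.abs(np.polyval(c, x_test) - f(x_test)).max()
|           print(f"deg {deg:2d}  train {train:.3f}  test {test:.3f}")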
| patrick451 wrote:
| This also seems to be a difference between interpolation and
| extrapolation. The table doesn't just fit a polynomial to
| theta between 0 and pi/8 and expect you to extrapolate for
| theta > pi/8. That would have catastrophic results. It has
| always seemed to me like one of the big problems with ML is
| knowing whether a given inference is an interpolation or an
| extrapolation.
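|
| A quick illustrative check (assumed fit and numbers, not from
| the table): fit a cubic to sin on [0, pi/8] and evaluate it
| both inside and well outside that range.
|
|       # Sketch: interpolation in range vs. extrapolation beyond it
|       import numpy as np
|
|       xs = np.linspace(0, np.pi / 8, 8)
|       c = np.polyfit(xs, np.sin(xs), 3)    # cubic fit on [0, pi/8]
|
|       for x in (np.pi / 16, np.pi / 8, np.pi / 4, np.pi / 2, np.pi):
|           err = abs(np.polyval(c, x) - np.sin(x))
|           print(f"x = {x:.3f}  error = {err:.1e}")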
| pyinstallwoes wrote:
| The way I make sense of it: interpolation derives its value
| from the surrounding results, while extrapolation comes with
| no such guarantee about the inference.
| KolenCh wrote:
| In that sense, extrapolation should never be used in
| "production". At best it is for exploration.
|
| One characteristic of ML is that this distinction is often
| not clear. (Hallucination, generalization, etc.)
| SideQuark wrote:
| This article makes the mistake of treating the naive error
| bound as the actual error on this specific problem. The
| functions here are very smooth, infinitely differentiable, and
| their derivatives are themselves tabulated. The error bound he
| states doesn't use this information.
|
| The correct answer is to check this problem specifically, not
| apply much looser bounds.
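|
| One way to "check this problem specifically" (a sketch, with an
| assumed node spacing): evaluate the Lagrange remainder for sin,
| whose derivatives are all bounded by 1, and compare it with the
| error a cubic interpolant actually attains.
|
|       # Sketch: problem-specific remainder bound vs. attained error
|       import numpy as np
|       from math import factorial
|
|       nodes = np.linspace(0, np.pi / 8, 4)     # 4 nodes -> cubic
|       c = np.polyfit(nodes, np.sin(nodes), 3)  # interpolating cubic
|
|       x = np.linspace(0, np.pi / 8, 1001)
|       attained = np.abs(np.polyval(c, x) - np.sin(x)).max()
|
|       # |sin''''| <= 1, so the remainder bound is fully explicit
|       w = np.abs(np.prod([x - xi for xi in nodes], axis=0)).max()
|       bound = w / factorial(4)
|
|       print(f"bound {bound:.1e}  attained {attained:.1e}")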
| KolenCh wrote:
| New readers of John Cook may not know this: his posts are
| pedagogical, often illustrating the gist of an idea or message
| rather than including all the details or the optimal analysis.
| This post mentions its motivation too.
|
| Also, he is a consultant, and the blog generates interest in
| that work. He often works on clients' data, performs EDA, and
| provides solutions such as statistical reasoning or methods.
| I.e. while he writes code, it's often not for production. He
| points people in the right direction, so to speak.
| SideQuark wrote:
| Except this is not pointing in the right direction. It's
| demonstrably wrong. He then criticizes the author of the
| table, calling that person wrong.
|
| If I were hiring a consultant claiming math analysis skills
| and they posted such a fundamental mistake, I'd honestly move
| on to someone else.
|
| I've read Cook's posts for a long time, and he doesn't often
| make such logic errors. I suspect he'll retract his claims once
| he realizes the error.
| fuzzfactor wrote:
| >Fifty years ago scientists were concerned with a different
| application of compression: reducing the size of mathematical
| tables.
|
| Guilty as charged.
|
| >Books of tabulated functions are obsolete now, but the
| principles used in producing these tables are still very much
| relevant.
|
| Even more relevant back then in reverse, after computers came
| within reach.
|
| But you have to get there first.
|
| Going back _more_ than 50 years, to arrive at the tables to
| begin with, a formula was evaluated across a series of values,
| and the output tabulated at increments that may or may not have
| matched those of the original data collection. This could be
| especially true of physical properties or natural phenomena.
| OTOH with pure math functions which are well-defined to begin
| with, somebody has to pick the range and increments. Either way
| this needs to be done so that the table serves its intended
| purpose.
|
| Frequently one of the important features was the accuracy beyond
| what could be obtained from an engineering slide rule. In cases
| where a slide rule delivered adequate significant figures for the
| job, people were not motivated to generate a half-kilo book of
| numbers. But when more decimal places were needed it was
| tables; 4 to 6 decimal places or significant figures often
| made them worth it.
|
| Although sometimes a "table or two" had been calculated using a
| slide rule in a very definitive way, very close to the highly
| valid raw data, the most trusted way available at the time. The
| leading author of the table, numerically characterizing the
| property for publication, would make a best-guess estimate of
| significant figures beyond what the slide rule was reliably
| capable of, simply because more decimal places were needed;
| that was the purpose of the table to begin with.
|
| Publish them and they become the best reference there is,
| regardless of their certainty.
|
| Further, if tables find their way into an institutional
| workflow, they can be made official, and therefore required, to
| the full number of significant figures supported by the
| recognized weakest link in the calculation chain, often the
| table(s) itself. That brings any deviation in precision to the
| forefront for scrutiny every time, even when the uncertainty is
| well within the accuracy of the original slide rule, which was
| itself too physically sloppy for the original task. Too bad;
| the table's been official forever, and you can't use a computer
| to replace it unless it outputs the exact same values as the
| table, to the full number of decimal places. The accuracy of a
| table can even be recognized as questionable across the board,
| and that's OK if everybody agrees to use it anyway, so there.
|
| One of the things that helped make progress was doing
| everything in integers that would otherwise have been done in
| floating point, something like working a slide rule, where you
| had to keep good track of the decimal point your own self.
|
| It is all too common for floating point numbers to give the
| illusion that there are unlimited decimal places in everything.
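|
| A small sketch of that integer style (illustrative scale
| factor, not anyone's actual code): store values scaled by a
| power of ten and keep track of the decimal point explicitly,
| the way a slide-rule user had to.
|
|       # Sketch: fixed-point arithmetic, decimal point tracked by hand
|       SCALE = 10_000                    # four decimal places
|
|       def to_fixed(x):
|           return round(x * SCALE)
|
|       def fixed_mul(a, b):
|           # the product carries SCALE twice; rescale with rounding
|           return (a * b + SCALE // 2) // SCALE
|
|       a, b = to_fixed(1.4142), to_fixed(2.7183)
|       print(fixed_mul(a, b) / SCALE)    # 3.8442, every digit tracked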
|
| Getting back to the article, when _building_ a table from an
| appropriate function using an adequate calculating device,
| ideally the number of significant figures that can repeatably
| be published with complete certainty, combined with the size of
| the increment between entries, will allow interpolation between
| published values to be almost as trustworthy as the published
| values themselves.
|
| Even if you are not "publishing" the values, the number of
| decimal places _accurately_ retained during critical calculations
| will be the same limitation, and the increment comes into play
| whether or not the granularity going into a "database" is the
| same as that coming out of it.
___________________________________________________________________
(page generated 2024-08-26 23:00 UTC)