[HN Gopher] Compressing data with sample points and polynomial i...
___________________________________________________________________
Compressing data with sample points and polynomial interpolation
Author : speckx
Score : 57 points
Date : 2024-08-21 14:45 UTC (5 days ago)
(HTM) web link (www.johndcook.com)
(TXT) w3m dump (www.johndcook.com)
| jonathan_landy wrote:
| Love this. Very interesting that the same amount of compression
| (samples) can give ever more accuracy if you do a bit more work
| in the decompression -- by taking higher order fits to more of
| the sample points.
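|
| A minimal sketch of that idea (not from the article; the sample
| spacing and test point are illustrative): reconstruct sin(x)
| from the same coarse table with local fits of increasing order
| and watch the error shrink.
|
|       # Sketch: same tabulated samples, higher-order local fits
|       import numpy as np
|
|       xs = np.arange(0, np.pi / 2 + 1e-9, np.pi / 16)  # samples
|       ys = np.sin(xs)                          # tabulated values
|
|       def reconstruct(order, x):
|           # fit a degree-`order` polynomial through the order+1
|           # tabulated points nearest x, then evaluate it at x
|           idx = np.argsort(np.abs(xs - x))[:order + 1]
|           return np.polyval(np.polyfit(xs[idx], ys[idx], order), x)
|
|       x = 0.3
|       for order in (1, 2, 3):
|           print(order, abs(reconstruct(order, x) - np.sin(x)))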
| nerdponx wrote:
| This is pretty much the core principle underlying modern
| machine learning. More parameters mean a more faithful fit to
| the data, at the cost of over-fitting and generalizing poorly
| on unseen data outside the range used to tune the parameters.
| In this particular application, we aren't
| that worried about overfitting because we know the actual
| function used to compress the data in the first place, so we
| know that our decompression function is "correct" _and_ we know
| the range of the data. So we can keep adding parameters to
| reduce reconstruction error. Meanwhile in applied ML and stats,
| cubic and even quadratic models should be used and interpreted
| only with extreme caution and detailed knowledge of the data
| (how it was prepared, what the variables mean, what future data
| might look like, etc).
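|
| A hedged sketch of that trade-off (toy noisy data, not from the
| article): as the polynomial degree grows, the fit to the noisy
| training sample keeps improving while the error against the
| underlying function eventually blows up.
|
|       # Sketch: over-fitting noisy data with high-degree fits
|       import numpy as np
|
|       rng = np.random.default_rng(0)
|       f = lambda x: np.sin(2 * np.pi * x)      # "true" function
|       x_train = np.linspace(0, 1, 12)
|       y_train = f(x_train) + rng.normal(0, 0.1, x_train.size)
|       x_test = np.linspace(0, 1, 200)
|
|       for deg in (1, 3, 9, 11):
|           c = np.polyfit(x_train, y_train, deg)
|           train = np.abs(np.polyval(c, x_train) - y_train).max()
|           test = np.abs(np.polyval(c, x_test) - f(x_test)).max()
|           print(f"deg {deg:2d}  train {train:.3f}  test {test:.3f}")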
| patrick451 wrote:
| This also seems to be a difference between interpolation and
| extrapolation. The table doesn't just fit a polynomial to
| theta between 0 and pi/8 and expect you to extrapolate for
| theta > pi/8. That would have catastrophic results. It has
| always seemed to me like one of the big problems with ML is
| knowing whether a given inference is an interpolation or an
| extrapolation.
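|
| A quick illustrative check (assumed fit and numbers, not from
| the table): fit a cubic to sin on [0, pi/8] and evaluate it
| both inside and well outside that range.
|
|       # Sketch: interpolation in range vs. extrapolation beyond it
|       import numpy as np
|
|       xs = np.linspace(0, np.pi / 8, 8)
|       c = np.polyfit(xs, np.sin(xs), 3)    # cubic fit on [0, pi/8]
|
|       for x in (np.pi / 16, np.pi / 8, np.pi / 4, np.pi / 2, np.pi):
|           err = abs(np.polyval(c, x) - np.sin(x))
|           print(f"x = {x:.3f}  error = {err:.1e}")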
| pyinstallwoes wrote:
| The way I make sense of it: interpolation derives its value
| from the surrounding results, while extrapolation comes with
| no such guarantee about the inference.
| KolenCh wrote:
| In that sense, extrapolation should never be used in
| "production". At best it is for exploration.
|
| One characteristic of ML is that this distinction is often
| not clear. (Hallucination, generalization, etc.)
| SideQuark wrote:
| This article makes the mistake of treating the naive error
| bound as the actual error on this specific problem. The
| functions here are very smooth, infinitely differentiable, and
| their derivatives are themselves tabulated. The error bound he
| states doesn't use this information.
|
| The correct answer is to check this problem specifically, not
| apply much looser bounds.
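|
| One way to "check this problem specifically" (a sketch, with an
| assumed node spacing): evaluate the Lagrange remainder for sin,
| whose derivatives are all bounded by 1, and compare it with the
| error a cubic interpolant actually attains.
|
|       # Sketch: problem-specific remainder bound vs. attained error
|       import numpy as np
|       from math import factorial
|
|       nodes = np.linspace(0, np.pi / 8, 4)     # 4 nodes -> cubic
|       c = np.polyfit(nodes, np.sin(nodes), 3)  # interpolating cubic
|
|       x = np.linspace(0, np.pi / 8, 1001)
|       attained = np.abs(np.polyval(c, x) - np.sin(x)).max()
|
|       # |sin''''| <= 1, so the remainder bound is fully explicit
|       w = np.abs(np.prod([x - xi for xi in nodes], axis=0)).max()
|       bound = w / factorial(4)
|
|       print(f"bound {bound:.1e}  attained {attained:.1e}")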
| KolenCh wrote:
| New readers of John Cook may not know this: his posts are
| pedagogical, often illustrating the gist of an idea or message
| rather than including all the details or the optimal analysis.
| This post mentions its motivation too.
|
| Also, he is a consultant, and the blog generates interest in
| that work. He often works on clients' data, performs EDA, and
| provides solutions such as statistical reasoning or methods.
| I.e. while he writes code, it's often not for production. He
| points people in the right direction, so to speak.
| SideQuark wrote:
| Except this is not pointing in the right direction. It's
| demonstrably wrong. He then criticizes the author of the
| table, calling that person wrong.
|
| If I were hiring a consultant claiming math analysis skills
| and they posted such a fundamental mistake, I'd honestly move
| on to someone else.
|
| I've read Cook's posts for a long time, and he doesn't often
| make such logic errors. I suspect he'll retract his claims once
| he realizes the error.
| fuzzfactor wrote:
| >Fifty years ago scientists were concerned with a different
| application of compression: reducing the size of mathematical
| tables.
|
| Guilty as charged.
|
| >Books of tabulated functions are obsolete now, but the
| principles used in producing these tables are still very much
| relevant.
|
| Even more relevant back then in reverse, after computers came
| within reach.
|
| But you have to get there first.
|
| Going back _more_ than 50 years, to arrive at the tables to
| begin with, a formula was evaluated across a series of values,
| and the output tabulated at increments that may or may not have
| matched those of the original data collection. This could be
| especially true of physical properties or natural phenomena.
| OTOH with pure math functions which are well-defined to begin
| with, somebody has to pick the range and increments. Either way
| this needs to be done so that the table serves its intended
| purpose.
|
| Frequently one of the important features was the accuracy beyond
| what could be obtained from an engineering slide rule. In cases
| where a slide rule delivered adequate significant figures for the
| job, people were not motivated to generate a half-kilo book of
| numbers. But when more decimal places were needed it was
| tables; 4 to 6 decimal places or significant figures often
| made them worth it.
|
| Although sometimes a "table or two" had been calculated using a
| slide rule in a very definitive way, very close to the highly
| valid raw data, the most trusted way available at the time. The
| leading author of the table, numerically characterizing the
| property for publication, would make a best-guess estimate of
| significant figures beyond what the slide rule was reliably
| capable of, simply because more decimal places were needed;
| that was the purpose of the table to begin with.
|
| Publish them and they become the best reference there is,
| regardless of their certainty.
|
| Further, if tables find their way into an institutional
| workflow, they can be made official, and therefore required, to
| the full number of significant figures supported by the
| recognized weakest link in the calculation chain, often the
| table(s) itself. That brings any deviation in precision to the
| forefront for scrutiny every time, even when the uncertainty is
| well within the accuracy of the original slide rule, which was
| itself too physically sloppy for the original task. Too bad;
| the table's been official forever, and you can't use a computer
| to replace it unless it outputs the exact same values as the
| table, to the full number of decimal places. The accuracy of a
| table can even be recognized as questionable across the board,
| and that's OK if everybody agrees to use it anyway, so there.
|
| One of the things that helped make progress was doing
| everything in integers that would otherwise have been done in
| floating point, something like working a slide rule, where you
| had to keep good track of the decimal point your own self.
|
| It is all too common for floating point numbers to give the
| illusion that there are unlimited decimal places in everything.
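|
| A small sketch of that integer style (illustrative scale
| factor, not anyone's actual code): store values scaled by a
| power of ten and keep track of the decimal point explicitly,
| the way a slide-rule user had to.
|
|       # Sketch: fixed-point arithmetic, decimal point tracked by hand
|       SCALE = 10_000                    # four decimal places
|
|       def to_fixed(x):
|           return round(x * SCALE)
|
|       def fixed_mul(a, b):
|           # the product carries SCALE twice; rescale with rounding
|           return (a * b + SCALE // 2) // SCALE
|
|       a, b = to_fixed(1.4142), to_fixed(2.7183)
|       print(fixed_mul(a, b) / SCALE)    # 3.8442, every digit tracked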
|
| Getting back to the article, when _building_ a table from an
| appropriate function using an adequate calculating device,
| ideally the number of significant figures that can repeatably
| be published with complete certainty, combined with the size of
| the increment between entries, will allow interpolation between
| published values to be almost as trustworthy as the published
| values themselves.
|
| Even if you are not "publishing" the values, the number of
| decimal places _accurately_ retained during critical calculations
| will be the same limitation, and the increment comes into play
| whether or not the granularity going into a "database" is the
| same as that coming out of it.
___________________________________________________________________
(page generated 2024-08-26 23:00 UTC)