[HN Gopher] Intel Distribution for Python
       ___________________________________________________________________
        
       Intel Distribution for Python
        
       Author : EntICOnc
       Score  : 90 points
       Date   : 2021-07-21 06:29 UTC (16 hours ago)
        
 (HTM) web link (software.intel.com)
 (TXT) w3m dump (software.intel.com)
        
       | RocketSyntax wrote:
       | is there a pip package?
        
       | mushufasa wrote:
       | There is a longstanding issue around MKL and OpenBLAS
       | optimization flags making intel systems artificially faster than
       | amd ones for numpy computations.
       | https://stackoverflow.com/questions/62783262/why-is-numpy-wi...
       | 
        | If there are true optimizations to be had, wonderful. But those
        | should be added to the core binaries on PyPI / conda. I am
        | worried that Intel may again be trying to artificially segment
        | their optimization work on their math libraries for business
        | rather than technical reasons.
        
         | jxy wrote:
         | That SO performance benchmark would be so much more useful if
         | the OP had also run OpenBlas on the xeon.
        
         | mistrial9 wrote:
         | what, no Debian/Ubuntu ? _sigh_
        
           | zvr wrote:
            | Of course:
            | 
            |   echo "deb https://apt.repos.intel.com/oneapi all main" \
            |     | sudo tee /etc/apt/sources.list.d/oneAPI.list
            | 
            | You can read the "apt" section of the package manager
            | documentation, if that's what you prefer.
            | https://software.intel.com/content/www/us/en/develop/documen...
        
         | dsign wrote:
          | Thanks for bringing up that link. I'd had that nagging
          | question about how specific Intel's performance libraries are
          | to Intel hardware. At least in this case, it seems not much.
        
         | gnufx wrote:
         | At least single-threaded "large" OpenBLAS GEMM has always been
         | similar to MKL once it has the micro-architecture covered. If
         | there's some problem with the threaded version (which one?),
          | has it been reported, as it would be for use in Julia? Anyway,
          | on AMD, why wouldn't you use AMD's BLAS (just a version of
          | BLIS)? That tends to do well multi-threaded, though I'm
         | normally only interested in single-threaded performance. I
         | don't understand why people are so obsessed with MKL,
         | especially when they don't measure and understand the
         | measurements.
        
         | vitorsr wrote:
         | > those should be added to core binaries pypi / conda
         | 
         | They have.
         | 
         | PyPI:
         | 
         | https://pypi.org/user/Intel-Python
         | 
         | https://pypi.org/user/IntelAutomationEngineering
         | 
         | https://pypi.org/user/sat-bot
         | 
         | Anaconda:
         | 
         | https://anaconda.org/intel
        
         | mhh__ wrote:
         | Do AMD even have optimized packages available? Don't get me
         | wrong, I'm not a huge fan of what Intel get up to but AMD's
         | profiling software is dreadful so I'm not exactly surprised
         | that Intel don't even entertain the option.
        
         | thunkshift1 wrote:
         | What do you mean by 'artificially faster'?
        
           | jchw wrote:
            | Intel libraries whitelist their own CPUs for using certain
            | extended instruction sets, instead of using the relevant
            | CPUID feature flag for that feature, as their own
            | documentation tells you to.
        
             | jeffbee wrote:
             | CPUID is insufficient. CPUID can tell you that a CPU has a
             | working PDEP/PEXT, but it can't tell you that a CPU's PDEP
             | _sucks_ like the one on all AMD processors prior to Zen3.
        
               | sitkack wrote:
               | The real answer is to do feature probing and benchmarking
               | the underlying implementation. In the cloud you never
               | really know the hardware backing your instance.
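That probe-and-benchmark dispatch can be sketched in a few lines of stdlib Python. The two kernels below are hypothetical stand-ins for a "fast path" and a "fallback", not anything from MKL:

```python
import timeit

# Two interchangeable implementations of the same operation
# (hypothetical stand-ins for vendor-tuned vs generic kernels).
def sum_squares_loop(xs):
    total = 0
    for x in xs:
        total += x * x
    return total

def sum_squares_builtin(xs):
    return sum(x * x for x in xs)

def pick_fastest(candidates, sample, repeats=100):
    """Probe each candidate on a sample workload and return the fastest."""
    timings = {
        fn: timeit.timeit(lambda fn=fn: fn(sample), number=repeats)
        for fn in candidates
    }
    return min(timings, key=timings.get)

best = pick_fastest([sum_squares_loop, sum_squares_builtin],
                    list(range(1000)))
assert best(range(4)) == 14  # 0 + 1 + 4 + 9
```

Real libraries do the same thing at a lower level (probing cache sizes, picking GEMM block sizes), but the shape of the idea is identical: measure on the hardware you actually got, rather than trusting the vendor string.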
        
               | jchw wrote:
               | This argument crops up every time but it's irrelevant;
               | MKL does and always has worked absolutely fine on AMD
               | processors with the checks disabled, and no,
               | reproducibility is not a feature of MKL that is enabled
               | by default and it never was. Intel even had to add a
               | disclaimer that MKL doesn't work properly on non-Intel
               | processors after legal threats, and they still ran with
               | that for literally years despite knowing it could just be
               | fixed.
               | 
               | When this first cropped up, I was using _Digg_.
               | 
               | Edit: removed note that they fixed the cripple AMD
               | function; they didn't, they actually just removed the
               | workaround that made it easier to disable the checks; I
               | was misinformed. Apparently now some software does
               | runtime patching to fix it, including Matlab...
        
               | jeffbee wrote:
               | Yeah I don't think all the hacks are out, yet. But my
               | point is only that the availability of some feature is
               | not the only input to the decision to use that feature at
               | runtime. Some of these conditions may look suspiciously
               | like shorthand for IsGenuineIntel(), even if they are
               | legit, like blacklisting BMI2 on AMD, because BMI2 on AMD
               | was useless over most of its history.
        
               | gnufx wrote:
               | Recent MKL will generate reasonable code for Zen if you
               | set a magic environment variable, but it was very limited
               | (possibly only sgemm and/or dgemm when I looked). Once
               | you've generated AVX2 with a reasonable block size,
               | you're most of the way there. But why not just use a free
               | BLAS which has been specifically tuned for your AMD CPU
               | (and probably your Intel one)?
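For what it's worth, checking which BLAS a given NumPy build is actually linked against is one call, and a rough GEMM probe is a few more (assuming NumPy is installed; treat the throughput number as indicative only):

```python
import time
import numpy as np

# Names the BLAS/LAPACK implementation (MKL, OpenBLAS, BLIS, ...)
# that this NumPy was built against.
np.show_config()

# A rough dgemm throughput probe; absolute numbers are machine-dependent.
n = 512
a = np.random.rand(n, n)
b = np.random.rand(n, n)
start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start
print(f"{n}x{n} matmul: {elapsed * 1e3:.1f} ms "
      f"(~{2 * n ** 3 / elapsed / 1e9:.1f} GFLOP/s)")
```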
        
               | user5994461 wrote:
               | Nope, they removed support for the magic environment
               | variable in the latest MKL release.
        
         | pletnes wrote:
          | From a practical perspective you have to use _some_ BLAS
          | library. If there is a working alternative from AMD, it would
          | be great if you could share it. They did have one in the past,
          | although I don't recall its name.
        
       | rshm wrote:
        | Looks like a recompilation. I am guessing the gains are on numpy
        | and scipy. For a Python-heavy code base, I doubt it can be more
        | performant than PyPy.
        
       | bananaquant wrote:
       | Quite unsurprisingly, this distribution has no support for ARM:
       | https://software.intel.com/content/www/us/en/develop/article...
       | 
       | I once was excited about Intel releasing their own Linux distro
       | (Clear Linux), but it has the same problem. It looks like Intel
       | is trying to make custom optimized versions of popular open-
       | source projects just to get people to use their CPUs, as they
       | lose their leadership in hardware.
        
         | smoldesu wrote:
         | "Their" CPUs meaning x86 platforms, in this case.
         | 
         | Plus, who's surprised? This is how Intel _makes money_. The
         | consumer segment is a plaything for them, the real high-rollers
         | are in the server segment, where they butter them up with fancy
         | technology and the finest digital linens. Is it dumb? A little,
          | but it's hardly a "problem" unless you intended to ship this
         | software on first-party hardware which, hint-hint, the license
         | forbids in the first place.
         | 
         | At the end of the day, this doesn't really irk me. I can buy a
         | compatible processor for less than $50, that's accessible
         | enough.
        
           | mistrial9 wrote:
            | the capital model for cost recovery and earnings is one
            | thing, but in modern times the amount of money that flows
            | through Intel Inc. is not the same thing. Intel played dirty
            | for long years to crush competitors, not to "make money"
            | like they needed it. "Greed is good", remember that? So,
            | no. Apologists can count their quarterly dividends, but you
            | have no platform for social advocacy here, IMO.
        
           | stonemetal12 wrote:
           | No, Their CPUs as in ones from Intel. Intel has long done a
           | thing in their compilers where they detect the CPU model, and
           | run less optimized code if it isn't Intel. They claim it is
           | because they can't be sure "Other" processors have correctly
           | implemented SSE and other extensions. So Intel Linux is going
           | to run faster on an Intel CPU because it was compiled with
           | ICC.
        
             | zorgmonkey wrote:
              | I don't know much about it, but Intel's Clear Linux does
              | not use icc; this is in their FAQ:
              | https://docs.01.org/clearlinux/latest/FAQ/index.html#does-
              | it...
        
             | Sanguinaire wrote:
             | This is trivially easy to defeat, just so you know. If
             | anyone reading is ever in need of optimized math library
             | performance on AMD, just speak to your hardware/cloud
             | vendor; they all know the tricks.
        
           | klelatti wrote:
           | Link says Core Gen 10 or Xeon so you may be out of luck on
           | AMD or at less than $50.
           | 
           | I think this is more likely aimed at AMD than Arm - don't
           | think Arm is yet a threat in this space - and whilst they're
           | entitled to do what they want it does make me less enthused
           | about Intel and frankly more likely to support their
           | competitors.
        
             | mumblemumble wrote:
             | AMD has their own equivalent:
             | https://developer.amd.com/amd-aocl/
             | 
             | I'm not sure it's a sin for hardware manufacturers to
             | support their products? In the days of yore, we even
             | expected it of them.
        
               | gnufx wrote:
                | Yes. The difference is that it may be "theirs", but I think
                | it's all free software. At least the linear algebra stuff
               | is. They supply changes for BLIS (which seem not to get
               | included for ages). Their changes may well be relevant to
               | Haswell, for instance. I don't remember what the
               | difference in implementation was for Zen and Haswell, but
               | they were roughly the same code at one time.
        
               | klelatti wrote:
                | Not a sin, but it's not really just about supporting (or
                | optimising) their products; it's about doing so whilst
                | trying to increase the lock-in beyond what is achieved on
                | performance grounds alone.
                | 
                | I may be wrong, but my experience is that AMD has been a
                | bit better on this in the past, e.g. their OpenCL
                | libraries supported both Intel and AMD whereas Intel's
                | were Intel only.
        
               | mumblemumble wrote:
               | I would assume that's not entirely a fair comparison,
               | though. Intel's 3D acceleration hardware only ever
               | appears in Intel-manufactured chipsets, which only ever
               | contain Intel-manufactured CPUs.
               | 
               | AMD, on the other hand, also supplies Radeon GPUs for use
               | with Intel CPUs. For example, that's the setup in the
               | computer on which I'm typing this.
               | 
               | So I have a hard time seeing anything nefarious there.
               | The one is obviously a business necessity, while the
               | other would obviously be silly. Perhaps that changes with
               | the new Xe GPUs?
        
               | klelatti wrote:
               | Sorry, should have been clearer - Intel's CPU OpenCL
               | drivers only supported Intel and not AMD whereas the
               | AMD's CPU OpenCL drivers supported both - so GPUs not
               | relevant in this case.
               | 
                | I can see how, if you've invested a lot in software,
                | you'd like to get a competitive advantage over your
                | nearest rival, so maybe it's a price we have to pay.
        
             | vel0city wrote:
             | I wonder what features are missing from a Comet Lake
             | generation Pentium, those can be had for ~$70 these days.
             | Other than the feature of the box says "Core" on it instead
             | of "Pentium".
             | 
             | EDIT: Ah, I found it, AVX2.
        
         | mumblemumble wrote:
         | I'm not sure I see why you would expect anything different? The
         | entire point of this framework is to provide a bunch of tools
         | for squeezing the most you can out of SSE, which is specific to
         | x86.
         | 
         | I don't know if there's an ARM-specific equivalent, but, if you
         | want to use TensorFlow or PyTorch or whatever on ARM, they'll
         | work quite happily with the Free Software implementations of
         | BLAS & friends. If you code at an appropriately high level, the
         | nice thing about these libraries is that you get to have
         | vendor-specific optimizations without having to code against
         | vendor-specific APIs. Which is _great_. I sincerely wish I had
         | that for the vector-optimized code I was writing 20 years ago.
         | In any case, if ARM Holdings or a licensee wants to code up
         | their own optimized libraries that speak the same standard APIs
          | (and assuming they haven't already), that would be awesome,
         | too. The more the merrier. How about we all get in on the
         | vendor-optimized libraries for standard APIs bandwagon. Who
         | doesn't want all the vendor-specific optimizations without all
         | the vendor lock-in?
         | 
         | Alternatively, if you would rather get really good and locked
         | in to a specific vendor, you could opt instead to spam the CUDA
         | button. That's a popular (and, as far as I'm concerned, valid,
         | if not necessarily suited to my personal taste) option, too.
        
         | mhh__ wrote:
         | Alder Lake looks seriously impressive if the rumoured
         | performance is even close to accurate, so I wouldn't count them
         | out just yet - that being said, they will never get a run like
         | they did over the last 10 years again.
        
         | gnufx wrote:
         | Clear Linux looked unconvincing to me. When I looked at their
         | write-up, the example of what they say they do with
         | vectorization was FFTW. That depends on hand-coded machine-
         | specific stuff for speed, and the example was actually for the
         | testing harness, i.e. quite irrelevant. I did actually run the
         | patching script for amusement.
        
       | agloeregrets wrote:
       | I wonder who the person is who saw python and was like "You know
       | what this needs? INTEL."
        
       | amelius wrote:
       | Maybe I'm missing something but it seems to me that this can only
       | cause fragmentation in the Python space.
       | 
       | Why not use the original distributions?
        
         | gnufx wrote:
         | Mystique (PR)?
        
         | lbhdc wrote:
          | There are a number of alternate interpreters available. The
          | selling point is typically that they are faster, and that
          | seems to be the value proposition of Intel's.
          | 
          | One use might be improving the throughput of a compute-bound
          | system, like an ETL written in Python, with little effort;
          | ideally by just downloading the new interpreter.
        
           | amelius wrote:
           | Ok. If they offer Python without the GIL then I'm all ears :)
        
             | gautamdivgi wrote:
              | I don't think Python is ever going to get rid of the GIL.
              | I haven't looked, but there are two things that may speed
              | things up quite a bit:
              | 
              | - Use native types
              | 
              | - Provide the ability to turn "off" the GIL if you know
              | you will not be using multi-threading within a process.
              | 
              | I guess that is my naive wish list for a short-term speed
              | up :)
        
               | TOMDM wrote:
                | A Pythonic language that included something analogous to
                | Go's channels/goroutines would be my ideal.
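Something channel-like can already be approximated in today's Python with the stdlib (a sketch, not a language feature): a bounded `queue.Queue` plays the channel and threads play the goroutines, though the GIL means this only helps with I/O-bound concurrency, not CPU-bound parallelism.

```python
import queue
import threading

def producer(ch, n):
    """Send n values down the 'channel', then a sentinel to close it."""
    for i in range(n):
        ch.put(i)
    ch.put(None)  # poor man's close()

def consumer(ch, results):
    """Receive until the channel is 'closed'."""
    while (item := ch.get()) is not None:
        results.append(item * item)

ch = queue.Queue(maxsize=4)   # bounded, like a buffered channel
results = []
t1 = threading.Thread(target=producer, args=(ch, 5))
t2 = threading.Thread(target=consumer, args=(ch, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 1, 4, 9, 16]
```

The ergonomics are clearly worse than Go's `ch <- v` / `<-ch`, which is presumably the point being made.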
        
               | borodi wrote:
                | Julia does have channels similar to those of Go, though
                | whether you want to call them Pythonic is up to you.
        
               | TOMDM wrote:
               | I've seen hype for Julia over and over, but this is the
               | first piece of information that's made me genuinely
               | interested.
               | 
               | Thanks for the heads up!
               | 
               | EDIT: Oh god it's 1 indexed
        
               | borodi wrote:
                | While people discuss it a lot, in the end 1-based
                | indexing doesn't really matter. I think it comes from
                | Fortran/MATLAB.
        
               | TOMDM wrote:
               | I agree, it doesn't really matter, but I've been
               | programming long enough that I can see it being that top
               | step that's always half an inch too tall that I'm going
               | to stub my toe on.
        
               | borodi wrote:
                | For sure. I switch between Python, C/C++, and Julia a
                | lot, and, let's say, bounds errors are pretty common
                | for me.
        
               | oscardssmith wrote:
               | My advice would be to use begin and end. Then you don't
               | have to think about the indexing.
        
               | dec0dedab0de wrote:
                | Jython doesn't have a GIL, but it doesn't support
                | Python 3, and I've never used it.
        
               | gautamdivgi wrote:
                | Jython would also have issues with the many C libraries
                | that Python code relies on today.
        
               | [deleted]
        
               | shepardrtc wrote:
               | Numba might be what you're looking for:
               | http://numba.pydata.org/
        
       | _joel wrote:
        | Why are they making their own distro and not putting the code
        | back into mainline if it's useful? Do they have some particular
        | IP that makes this impossible?
        
         | SkipperCat wrote:
         | I think there is a pretty big base of people who do big data
         | work using Numpy and Pandas (Fintech, etc). They want to
         | squeeze every bit of computing power out of the specific Intel
         | chipset, GPUs, etc and Intel's distro really helps them out.
         | 
          | A 10% speed improvement on thousands of jobs could in theory
          | save you a nice chunk of time. This becomes very important in
         | financial market where you need batch jobs to be finished
         | before markets open, or you just want to save 10% on your EC2
         | bill.
        
           | gnufx wrote:
            | 10% is around the noise level for HPC, especially for
            | throughput, depending on scheduling. I rather doubt you
            | couldn't do the same with free software.
        
             | Sanguinaire wrote:
              | You are correct; nothing Intel provides in their Python
              | distro can't be obtained elsewhere. This is just a nice
              | wrapper.
        
         | LeifCarrotson wrote:
         | Here's the list of CPUs which incorporate the AVX2 instructions
         | that enable some of these optimizations:
         | 
         | https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPU...
         | 
         | You could write your distro to check for flags that will tell
         | it whether or not you have these using flags from
         | /proc/cpuinfo. Or you could check whether it's in the Intel
         | half of the list or the AMD half of the list. Or you could
         | write your own distro that only runs on the first half of the
         | list.
         | 
         | I get that Intel's contributions aren't purely altruistic.
         | There are likely to be subtle tuning problems that require
         | slight changes to optimize on different platforms, and they
         | can't really be expected to do free work for AMD. But it looks
          | to me like they're being unnecessarily anticompetitive.
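The feature-flag check described above is only a few lines on Linux (a sketch; /proc/cpuinfo is Linux-specific):

```python
def cpu_flags(cpuinfo_text):
    """Extract the feature-flag set from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

def has_avx2(cpuinfo_text):
    # The right check: ask for the feature, not the vendor string.
    return "avx2" in cpu_flags(cpuinfo_text)

# Works identically for Intel and AMD parts that report the flag.
sample = "vendor_id : AuthenticAMD\nflags : fpu sse sse2 avx avx2 bmi2\n"
assert has_avx2(sample)
```

On a real machine you would pass `open("/proc/cpuinfo").read()`; the contrast is with dispatching on the `vendor_id` line instead of the `flags` line.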
        
           | falcor84 wrote:
            | >being unnecessarily anticompetitive
           | 
           | Isn't setting up barriers to entry generally considered to be
           | a part of healthy competition? I'd hazard to say that as long
           | as a company is playing within the boundaries of what's
           | allowed, there's nothing they could do that's
           | anticompetitive; at the most, you could accuse them of being
           | somewhat unsportsmanlike.
        
             | dec0dedab0de wrote:
              | _Isn't setting up barriers to entry generally considered
              | to be a part of healthy competition?_
             | 
             | No, it is not. This is better described as vendor lock-in,
             | than a barrier to entry. But vendor lock-in is also against
             | healthy competition.
             | 
             | Healthy competition means that users choose your product
             | because it suits their needs the best, not because they are
             | somehow forced to choose your product.
        
             | DasIch wrote:
             | Competition is desirable because it aligns with society's
             | goals of innovation and progress which also imply increased
             | productivity and lower prices.
             | 
             | Artificial barriers to entry are contrary to that and if
             | they're not illegal they should be.
        
               | falcor84 wrote:
                | Where do you draw the line at which a barrier becomes
                | 'artificial'?
        
               | LeifCarrotson wrote:
               | It's artificial when the vendor expends additional time,
               | effort, or funds to construct a barrier, or chooses an
               | equally-priced non-interoperable design that a rational,
               | informed consumer with a choice would reject. If you're
               | expending great effort to write custom DRM or to reinvent
               | open industry standards that you could have installed
               | cheaply, that's artificial.
               | 
               | I fully admit that there are natural barriers that occur
               | at times. I don't think that you should be expected to
               | reverse-engineer your competitor's products and bend over
               | backwards to make them work better.
               | 
               | Here, for a concrete example, Intel had a clear choice to
               | test whether a processor supported a feature by checking
               | a feature flag - It's in the name, they're literally
               | implemented for that exact purpose - or they could expend
               | extra effort in building their own feature flag database
               | by checking manufacturer and part number. They could have
               | either expended extra effort to launch and distribute
               | their own entire custom Python distribution, or submitted
               | pull requests to the existing main distribution. For
               | another example, Apple could have used industry-standard
               | Phillips or Torx screws in their hardware: Manufacturers
               | had lines to produce them, distributors had inventory of
               | the fasteners, users had tools to turn them. Instead,
               | they went to great expense to build their own
               | incompatible tri-lobe screws, requiring probably millions
               | of dollars in investment in custom tooling and production
               | lines, all for the sake of creating an artificial
               | barrier.
        
               | Sanguinaire wrote:
               | We could start with something similar to the concept of
               | Pareto optimality; Intel could have delivered their
               | maximum performance without preventing optimizations from
               | being applied equally on AMD hardware, but instead they
               | choose to disadvantage AMD without providing anything
               | extra on top of what they could do while remaining
               | "neutral".
        
       | gnufx wrote:
       | I don't know what Intel did for the proprietary version, but the
       | first thing you should do for Python is to compile with GCC's
       | -fno-semantic-interposition. I don't know if there's a benefit
       | from vectorization, for instance, in parts of the interpreter, or
       | whether -Ofast helps generally if so, but I doubt there's
       | anything Intel CPU-specific involved if there is. I've never
       | looked at it, has the interpreter not been well-profiled and such
       | optimizations provided? Anyway, if you want speed, don't use
       | Python.
       | 
       | It's obviously not relevant to Python per se, but you get
       | basically equivalent performance to MKL with OpenBLAS or,
       | perhaps, BLIS, possibly with libxsmm on x86. BLIS may do better
       | on operations other than {s,d}gemm, and/or threaded, than
       | OpenBLAS, but they're both generally competitive.
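Concretely, the GCC tweak mentioned above looks something like this when building CPython from source (a build-config sketch; `--enable-optimizations` and `--with-lto` are standard configure options, and `-fno-semantic-interposition` is what Fedora's Python builds use):

```shell
# Build CPython so intra-library calls bind directly instead of going
# through the PLT; semantic interposition is rarely wanted for Python.
./configure --enable-optimizations --with-lto \
            CFLAGS="-fno-semantic-interposition" \
            LDFLAGS="-fno-semantic-interposition"
make -j"$(nproc)"
```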
        
       | Rd6n6 wrote:
       | > the Intel CPU dispatcher does not only check which instruction
       | set is supported by the CPU, it also checks the vendor ID string.
       | If the vendor string says "GenuineIntel" then it uses the optimal
       | code path. If the CPU is not from Intel then, in most cases, it
       | will run the slowest possible version of the code, even if the
       | CPU is fully compatible with a better version.[1]
       | 
       | I've been a little shy about using intel software since reading
       | about this years ago
       | 
       | [1] https://www.agner.org/optimize/blog/read.php?i=49
        
       | ciupicri wrote:
        | Python 3.7.4, when 3.10 is just around the corner.
        
       | TOMDM wrote:
       | To me this just looks like Intel saw what Nvidia has accomplished
       | with CUDA, locking in large portions of the scientific computing
       | community with a hardware specific API and going "yeah me too
       | thanks"
       | 
       | Thankfully, accelerated math libraries already exist for Python
       | without the vendor lockin.
        
         | bostonsre wrote:
         | Intel has been releasing mkl/math kernel libraries for Java for
         | a really long time. Hopefully core python devs can learn a few
         | tricks and similar changes can make it upstream.
        
       | vitorsr wrote:
        | You can easily try it yourself [1]:
        | 
        |   conda create -n intel -c intel intel::intelpython3_core
        | 
        | Or [2]:
        | 
        |   docker pull intelpython/intelpython3_core
       | 
       | Note that it is quite bloated but includes many high-quality
       | libraries.
       | 
       | You can think of it as a recompilation in addition to a
       | collection of patches to make use of their proprietary libraries.
       | 
       | Other useful links to reduce the noise in this thread: [3], [4],
       | [5], [6].
       | 
       | [1]
       | https://software.intel.com/content/www/us/en/develop/article...
       | 
       | [2]
       | https://software.intel.com/content/www/us/en/develop/article...
       | 
       | [3] https://www.nersc.gov/assets/Uploads/IntelPython-NERSC.pdf
       | 
       | [4] https://hub.docker.com/u/intelpython
       | 
       | [5] https://anaconda.org/intel
       | 
       | [6] https://github.com/IntelPython
        
         | tkinom wrote:
          | Any benchmark comparison data? For example: ... benchmarks
          | with this Python are XXX% higher than ... (std Python, AMD,
          | ARM)
        
           | mumblemumble wrote:
           | I haven't done a comparison in a long time, and, even then,
           | it wasn't very thorough, so take this with a grain of salt.
           | 
           | But, 6 years ago, when I was in grad school, just swapping to
           | the Intel build of numpy was an instant ~10x speedup in the
           | machine learning pipeline I was working on at the time.
           | 
           | No idea if that's typical or specific to what I was doing at
           | the time. I don't use MKL anymore because ops doesn't want to
           | deal with it and the standard packages are already plenty
           | good enough for what I'm doing nowadays. If you forced me to
           | guess, I guess I'd have to guess that my experience was
           | atypical.
        
       ___________________________________________________________________
       (page generated 2021-07-21 23:00 UTC)