[HN Gopher] Stable Diffusion in C/C++
___________________________________________________________________
Stable Diffusion in C/C++
Author : kikalo00
Score : 252 points
Date : 2023-08-19 11:26 UTC (11 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| naillo wrote:
| Awesome that they implemented CLIP as well. That alone could be
| cool to extract and compile as a wasm implementation.
|
| Edit: Seems like someone already has
| https://github.com/monatis/clip.cpp :) Now to wasmify it
| KaoruAoiShiho wrote:
| Speaking of CLIP, I'm always troubled that the next CLIP might
| not get released as both OpenAI and Google are shifting into
| competition mode. Sad to think there might be a more advanced
| version of CLIP already but sitting in a secret vault
| somewhere.
|
| Edit: I'm not referring to a CLIP-2 but any advance on the same
| level of importance as CLIP.
| GaggiX wrote:
| The biggest CLIP models we know of are open source.
|
 | If a company has a bigger CLIP model, they haven't even
 | reported it.
|
 | Also, OpenAI already had, for a while, a proprietary CLIP model
 | that was bigger than any other model available: the CLIP-H
 | used by DALL-E 2.
| snordgren wrote:
| As someone who is out of the loop but could use high
| quality image embeddings right now, what's the best CLIP
| model right now?
| astrange wrote:
| SDXL uses OpenCLIP, and then OpenAI CLIP as a backup
| basically to allow it to spell words properly, but I
| think you could replace the second one.
| speedgoose wrote:
| Stable Diffusion switched to OpenCLIP for stable diffusion 2.
 | But it looks like they went back to CLIP for the XL version.
|
| People complained about openclip not being as good. Hopefully
| we can have a better and open clip model eventually.
| evolveyourmind wrote:
| Any benchmarks?
| nre wrote:
| Some people have timed it here, it looks like it's taking
| 15-20s/it (dependent on quant and hardware).
|
| https://github.com/leejet/stable-diffusion.cpp/issues/1
| Lerc wrote:
 | Looking at the examples at the different quantization levels,
 | I'm quite impressed. The change from f16 to q8_0 seems to be
 | more of a change in direction than a loss of quality. The q5_1
 | result seems indistinguishable from the q8_0.
|
 | So you're losing determinism relative to the higher-precision
 | models, but the results are potentially quite usable.
| jakearmitage wrote:
 | There's just something special about these C/C++ implementations of
| AI stuff. They feel so clean and straightforward and make the
| entire field of AI feel tangible and learnable.
|
| Is that because Python's ecosystem is so messy?
| snordgren wrote:
| Rewrites tend to improve code quality, and replacing
| dependencies with custom-tailored code that does just what you
| need also improves code quality.
|
| And while the Python version uses C and C++ code for speed,
| this is all just one language.
|
| A trifecta of factors enabling clean code.
| BrutalCoding wrote:
| Saw this repo today, fetched it, built a .dylib (Mac) and used
 | Dart's ffigen tooling to generate the bindings from the provided
| header file.
|
 | I'm just experimenting with it together with Flutter. FFI because
 | I'm trying to avoid spawning a subprocess.
|
 | Fast forward: ended up with a severe headache and a broken app.
 | Will continue my attempt tomorrow with a fresh mind, haha
|
| This repo is great though, had it up and running within 10 min on
| my M1 (using f16). Thanks for sharing!
| waynecochran wrote:
| Nice to see ML folks getting weaned off of Python and using a
| language that can optimally exploit the underlying hardware and
| not require setting up a specialized environment to build and
| run.
| xvilka wrote:
| Do you mean Julia language?
| A-Train wrote:
| Amen.
| aimor wrote:
| I really appreciate the people doing this work. It's the only
| way I've run these models without any headaches. The difference
| is so stark, even with CUDA and Linux it's bad, with AMD and
| Windows it's miserable. I'm pretty sure it's not just me..
| galangalalgol wrote:
| As long as we are language trolling, why would anyone start a
| greenfield project like this in C++ these days? The android,
| windows, firefox, and now chrome projects have all begun to
 | shift towards rust and, in the case of Android and Firefox,
 | now write significant amounts of their code in rust. Migrating
| an existing project like that is difficult. The chrome team in
| particular lamented the difficulty. But starting a new project?
| If you have a team familiar with performant c++ the speedbump
| of starting a greenfield project in rust is negligible, and the
| ergonomic improvements in the build system and the language
| itself will make up for that in any project that takes more
| than a few months. For that speed bump you get memory and
| thread race safety, far better than any stack of c++ analysis
| tools could ever provide with a tiny fraction of the unit tests
| you'd write in c++. And you lose no performance.
| FoodWThrow wrote:
 | Rust is _great_ when you know what you're building. That
 | qualifier encompasses quite a lot of software space, but not
| all of it, and I would argue not even the majority of it.
|
| If you don't know what you are doing, if you are exploring
| ideas, Rust will just get in the way. At some point you will
| end up realizing you need to adjust lifetimes, and that will
 | require you to touch a non-trivial amount of your code base. If
 | you need to do that multiple times, friction will overwhelm your
| desire to code.
|
| I have a pet theory that, the people that find Rust intuitive
| and fun, are the people that are working on well beaten
| paths; Rust is almost boring at doing that, which is a good
| thing. And the people that find Rust gets in their way are
| the people that like to experiment with their solutions,
| because there aren't any set, trusted solutions within their
| problem space, and even if there are, they like to approach
| the problem on their own, for better or worse.
|
| In any case:
|
| > why would anyone start a greenfield project like this in
| C++ these days?
|
| The video game industry can single-handedly carry C++ on
| their back, kicking and screaming, if need be. Rust is
| uniquely unfit to write gameplay code due to game
| development's iterative nature. Using scripting languages
| doesn't cut it either, because often, slower designer made
| scripts will need to be converted to C++ by a programmer, and
| pull in the crazy reference hell of the game state into the
| C++ land.
|
| I would say Rust is OK for _engine_ level features -- those
 | don't change that often, and requirements are usually well
| understood. But that introduces a cadence mismatch between
| different systems too, so there is a cost there as well. But
| for gameplay? There's a reason why many Rust based game
| engines use crazy amount of unsafe Rust to make their ECS.
| Just not a good fit.
|
 | And of course, there are the consoles, where Sony seems to have
 | a political reason for not supporting Rust for non-1st-party
| studios. I have no idea what they are thinking, honestly.
| mnrlt wrote:
| C++ has a standard, multiple competing implementations and a
| largely drama-free community.
|
| Does CUDA even have Rust bindings, and if so, are they on the
| same level as the C++ ones?
|
| What do you mean by "the windows projects" that shift towards
| Rust?
| galangalalgol wrote:
| MS has started implementing pieces of windows in rust. If
| you have windows 11 you are running rust. The cuda bindings
| are good for ml, but missing for cufft and similar. There
| are people working on better cuda support, but there are
| even more people working on vendor agnostic gpgpu using
| spirv and webgpu. It isn't there yet. Right now you are
| mostly left to your own bindings unless you are doing ml or
| blas.
|
| Edit: I can't argue about the drama part. The competing
 | compilers will get there. A couple of gcc frontends are in the
 | works, and Cranelift as a competing back end to llvm, plus full
 | self-hosting. There is also miri, I guess, to emit C? People
| use that to get rust on the C64 or other niche processors.
| pjmlp wrote:
| Yes they started, yet there is enough C++ to rewrite in
| the 30 years of Windows NT history.
|
| Meanwhile, Visual Studio team released better tooling for
| Unreal in Visual C++.
| Const-me wrote:
| > why would anyone start a greenfield project like this in
| C++ these days?
|
| TLDR: quite often, using C++ instead of Rust saves software
| development costs.
|
| Some software needs to consume many external APIs. Examples
| on Windows: Direct3D, Direct2D, DirectWrite, MediaFoundation.
| Examples on Linux: V4L2, ALSA, DRM/KMS, GLES. These things
 | are huge in terms of API surface. Choose Rust, and you are going
 | to need to write and support a non-trivial amount of boilerplate
 | code for the interop. Choose C++ (on Linux, C is good too)
 | and that code is gone; you only need the well-documented and well
| supported APIs supplied by the OS vendors.
|
| Similarly, some software needs to integrate with other
| systems or libraries written in C or C++. An example often
| relevant to HPC applications is Eigen. Another related thing,
| game console SDKs, and game engines, don't support Rust.
|
 | For the project being discussed here, GGML, the implementation
 | needs vector intrinsics for optimal performance. Technically
 | Rust has the support, but in practice Intel and ARM only
 | support them for C and C++. And it's not just CPU vendors:
 | when using C or C++ there are plenty of useful resources
 | (articles, blogs, and Stack Overflow), which help a lot in
 | practice. I don't program Rust, but I do program C# in
 | addition to C++; technically most vector intrinsics are
 | available in the current version of C#, but they are much
 | harder to use from C# for this reason.
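 |
 | For illustration only (a minimal sketch, not code from GGML): a
 | hand-vectorized dot product using AVX2/FMA intrinsics, the kind
 | of kernel Intel documents for C and C++ (compile with -mavx2
 | -mfma).
 |
 |       #include <immintrin.h>
 |       #include <cstddef>
 |
 |       // Dot product of two float arrays, 8 lanes at a time.
 |       float dot_avx2(const float *a, const float *b, size_t n) {
 |           __m256 acc = _mm256_setzero_ps();
 |           size_t i = 0;
 |           for (; i + 8 <= n; i += 8) {
 |               __m256 va = _mm256_loadu_ps(a + i);
 |               __m256 vb = _mm256_loadu_ps(b + i);
 |               acc = _mm256_fmadd_ps(va, vb, acc);  // acc += va * vb
 |           }
 |           float tmp[8];
 |           _mm256_storeu_ps(tmp, acc);
 |           float sum = 0.0f;
 |           for (int k = 0; k < 8; ++k) sum += tmp[k];
 |           for (; i < n; ++i) sum += a[i] * b[i];   // scalar tail
 |           return sum;
 |       }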
|
| All current C and C++ compilers support OpenMP for
| parallelism. While not a silver bullet, and not available on
| all platforms supported by C or C++, some software benefits
| tremendously from that thing.
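 |
 | A minimal sketch of what that buys you (illustrative, not from
 | any particular project): one pragma parallelises the loop.
 | Compile with -fopenmp (gcc/clang) or /openmp (MSVC).
 |
 |       #include <vector>
 |
 |       // Scale every element of v by s across available cores.
 |       void scale(std::vector<float> &v, float s) {
 |       #pragma omp parallel for
 |           for (long i = 0; i < (long)v.size(); ++i)
 |               v[i] *= s;
 |       }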
|
| Finally, it's easier to find good C++ developers, compared to
| good Rust developers.
| galangalalgol wrote:
| There are existing supported bindings for direct3d from MS,
| as they themselves are migrating. GLES and ggml also have
| supported bindings. I like nalgebra + rustfft better than
 | eigen now. Nalgebra still isn't quite as performant on small
 | matrices until const generic eval stabilizes, but it is
 | close enough for 6x6 stuff that it is in the noise. Rustfft
 | is faster than fftw even. Rust has intrinsic support on par
 | with clang and gcc, and the autovectorizer uses whatever
 | llvm knows about, so again equivalent to clang.
 |
 | On the last point, I will again assert that a good c++
 | developer is just a good rust developer minus a month of
 | ramp-up, which you'll get back from not having to fight
 | combinations of automake, cmake, conan, vcpkg, meson, bazel
 | and hunter.
| Const-me wrote:
 | I'll be very surprised if MS ever supports rust
 | bindings for Media Foundation. That thing is COM based,
| requires users to implement COM interfaces not just
| consume, and is heavily multithreaded.
|
| About SIMD, automatic vectorizers are very limited. I was
| talking about manually vectorized stuff with intrinsics.
|
| I've been programming C++ for living for decades now.
| Tried to learn rust but failed. I have an impression the
| language is extremely hard to use.
| galangalalgol wrote:
| Not sure how fully featured it is.
| https://lib.rs/crates/mmf
|
| Yes, rust directly supports modern intrinsics, that is
| what rustfft for instance uses. I try to stick with
| autovec myself, because my needs are simpler such that a
| couple tweaks usually gets me close to hand-rolled
| speedups on both avx 512 and aarch64. But for more
| complicated stuff yeah, rust seems to be keeping up. Some
| intrinsics are still only in nightly, but plenty of major
| projects use nightly for production, it is quite stable
| and with a good pipeline you'll be fine.
|
| I've written c++ since ~94, and mostly c++17 since it
| came out. About a quarter of a century of that getting
| paid for it. I never liked or used exceptions or rtti,
| and generally used functional style except for
| preallocation of memory for performance. I think those
| habits might have made the transition a little easier,
| but the people on my team who had used a more OOP style
| and full c++ don't seem to have adapted much more slowly
| if at all. I struggled for years to internalize rust at
| home until I just jumped in at work by declaring the
| project I lead would be in rust. I have had absolutely no
| regrets. It really isn't as bad a learning curve as c++.
| But we learned c++ one revision at a time. Also, much
| like c++ rust has bits you mostly only need to know for
| writing libraries. So getting started you can put those
| things to the side for a bit right at first.
| theLiminator wrote:
| Curious how long you tried to learn rust? I've found C++
 | much harder to learn (coming from a python/scala
 | background).
|
| Is it just a case of you forgetting how hard C++ was to
| learn?
| pjmlp wrote:
| No there aren't, unless you mean Rust/WinRT demos with
| community bindings.
|
| Agility SDK and XDK have zero Rust support. If it isn't
| on the Agility SDK and XDK, it isn't official.
|
| Hardly the same as the official Swift bindings to Metal,
| written in Objective-C and C++14.
| api wrote:
| It's interesting to me that my CPU can run some of these things
| in quantized form almost as fast as the GPU. Has the whole
| thing been all about memory bandwidth all along?
|
| In addition to compute the GPU architecture is one that
| somewhat colocates working memory alongside compute. Units have
| local memories that sync with global memory. Is that a big part
| of why GPUs are so good for this?
| brucethemoose2 wrote:
| > Has the whole thing been all about memory bandwidth all
| along
|
| Yeah, sort of.
|
| LLMs like llama at a batch size of 1 are hilariously
| bandwidth bound.
|
 | Stable Diffusion less so. It's still bandwidth heavy on GPUs,
| but compute is much more of a bottleneck.
| intelVISA wrote:
| Wasn't Python originally designed as a language to teach
| children how to code? Weird to see so many, otherwise
| intelligent, folks latch onto it.
|
| It really doesn't have any redeeming characteristics vs. Common
| Lisp, or Haskell, to warrant this bizarre popularity imo
| mdp2021 wrote:
| > _Wasn 't Python originally designed as a language to teach
| children how to code_
|
| I think it would be very confusing for a child to start with
| a language so far away from low-level logic.
|
| ...And some people said BASIC was evil. At least what it is
| doing looks plain and direct.
| tester756 wrote:
| >I think it would be very confusing for a child to start
| with a language so far away from low-level logic.
|
| Why?
|
| I started with C++ and when they showed me C# I instantly
 | fell in love cuz I didn't have to deal with unnecessary
| complexity and annoyances and could focus on pure
| programming, algorithms, etc.
| mdp2021 wrote:
| Yes Tester,
|
| but you are confirming my point :) ...You _started_ with
| C++, then went to C#...
| tester756 wrote:
 | I started with C++ and switched near the beginning, so
 | there wasn't any "low level knowledge" nor anything above
 | c++ beginner level concepts.
|
| Both: high-to-low and low-to-high have some advantages,
| but it's not like one is always better than the other.
|
| high-to-low allows you to write stuff earlier - like
| programs that do something useful, GUI, web, whatever.
|
| but at the cost of understanding internals / under the
| hood.
| segfaultbuserr wrote:
| > _I think it would be very confusing for a child to start
| with a language so far away from low-level logic._
|
 | It depends on the person.
|
| For some, it would be very frustrating to start with a
| language so close to the implementation detail, and so far
| away from what you want to do. It's very possible that
| someone might have long lost the motivation before one can
| do anything non-trivial.
|
| I started from Python, to C, to assembly, to 4-layer
 | circuit boards. Whenever I went a level deeper, it felt
 | like opening up the inner workings of a black box that I
 | normally only interact with via the pushbuttons on its front
 | panel, while otherwise being roughly aware of what they do.
|
| On the other hand, much of my childhood was spent on
| tinkering with PCs and servers, including hosting websites
| and compiling packages from source, so I was already well
| aware of the basic concepts in computing before I started
| programming. So, top-down and bottom-up are both absolutely
| workable, under the right circumstances.
| mnrlt wrote:
| ABC, the predecessor from which Python took many syntax
| features, was. I wonder if Python also took a lot of the ABC
| implementation, given that it is still copyright CWI.
|
| I agree that its popularity is very odd, but academics take
| what they are given when attending fully paid conferences
| (aka vacations).
| highspeedbus wrote:
 | What's the term to describe an ad hominem fallacy towards
| programming languages? Asked the almighty chat and got this
| new term:
|
| "Code Persona Attack"
|
| Python is fine.
| astrange wrote:
| This is a Ycombinator site, the traditional term is Blub.
|
| And they're right. Python is not a well designed
| programming language - it has exceptions and doesn't have
| value types so that's two strikes against it.
|
| Of course, C++ isn't either.
| gumby wrote:
| It's a bummer there's so little work on the training side in
| C++.
|
| Especially since the python training systems are mostly calls
| into libraries written in C++!
| pjmlp wrote:
| Yeah, and since C++17 the language is already quite
 | productive for scripting-like workflows; the missing piece of
 | the puzzle is that there are too few C++ REPLs around,
| ROOT/CINT being one of the few well known ones.
| danybittel wrote:
| Since when does C++ optimally exploit the underlying hardware?
| It has no vector instructions, does not run on the GPU and is
 | arguably too hard to make multithreaded. Which leaves you with
 | about 0.5% of the performance of a current PC.
| jcelerier wrote:
| > does not run on the GPU
|
| both Cuda and the Metal shader language are C++, so is OpenCL
| since 2.0 (https://www.khronos.org/opencl/), so is AMD ROCm's
| HIP (https://github.com/ROCm-Developer-Tools/HIP), so is SYCL
| (https://www.khronos.org/sycl/)? C++ is pretty much the
| language that runs _most_ on GPUs.
|
| > no vector instructions,
|
| There's a thousand different possibilities for SIMD in C++,
| from #pragma omp simd, to libs such as
| std::experimental::simd
| (https://en.cppreference.com/w/cpp/experimental/simd/simd),
| Eve (https://github.com/jfalcou/eve), Highway
| (https://github.com/google/highway), Vc
| (https://github.com/VcDevel/Vc)...
| pjmlp wrote:
| When compared against Python, more than enough.
|
| C++ is one of the supported CUDA languages, even standard
| C++17 does run just fine on the GPU.
|
| Metal uses C++14 alongside some extensions.
| waynecochran wrote:
| Vector types / instructions would be nice. The C++20 STL
| algorithms are very friendly to vectorization with the
| various parallel policies (e.g.
| std::execution::unsequenced_policy) that open up your code to
| be vectorized. Wonderful libs like Eigen handle a lot of my
| numeric needs for linear algebra. I think you are forgetting
 | that CUDA is C/C++.
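 |
 | A rough sketch of that style (assuming a standard library with
 | <execution> support; with gcc/clang the parallel policies may
 | need TBB at link time):
 |
 |       #include <algorithm>
 |       #include <execution>
 |       #include <vector>
 |
 |       // y = a*x + y; unseq (C++20) lets the library vectorize it.
 |       void saxpy(float a, const std::vector<float> &x,
 |                  std::vector<float> &y) {
 |           std::transform(std::execution::unseq,
 |                          x.begin(), x.end(), y.begin(), y.begin(),
 |                          [a](float xi, float yi) { return a * xi + yi; });
 |       }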
| Someone wrote:
| > Vector types / instructions would be nice
|
| It's technically not a C++ feature, but both gcc
| (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html)
 | and Clang
 | (https://releases.llvm.org/3.1/tools/clang/docs/LanguageExten...)
 | have vector types, and clang even supports the gcc way of
 | writing them, so it gets pretty close.
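 |
 | A tiny sketch of that syntax (works with both gcc and clang):
 |
 |       // 8 packed floats, e.g. one AVX register on x86.
 |       typedef float v8sf __attribute__((vector_size(32)));
 |
 |       // Element-wise multiply-add; the compiler picks the SIMD
 |       // instructions for the target (or falls back to scalar).
 |       v8sf madd(v8sf a, v8sf b, v8sf c) {
 |           return a * b + c;
 |       }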
| astrange wrote:
| Those are traditionally dangerous since they tend to
| compile poorly; not as bad as autovectorization but not
| as good as just writing in assembly. And since
| vectorization is platform-dependent anyway (because it's
| so different across platforms), assembly really isn't
| nearly as bad as it sounds.
|
| Though it's certainly gotten better, the reason people
| push those is that they're written by compiler authors,
| who don't want to hear that their compiler doesn't work.
|
| Some of the reason for this is that C doesn't let you
| specify memory aliasing as precisely as you want to.
| Fortran is better about this.
| codethief wrote:
| That's a rather odd comparison to make. First of all, OP, like
| llama.cpp, doesn't use the GPU - in contrast to most Python ML
| code. It's not hard to write Python code that "optimally
| exploits" the GPU. You might call the GPU a "specialized
| environment to build and run" but it's arguably much better
| suited to the problem.
|
| Second, OP, like llama.cpp, produced efficient and highly
| specialized code _after_ it was clear the model being
| specialized for (StableDiffusion / LLaMa / ...) works well.
| Where Python shines, though, is the prototyping phase when you
| have yet to find an appropriate model. We have yet to see this
| sort of easy & convenient prototyping in C++.
|
| Now, this is not to take away anything from the fantastic work
| that's being done by the llama.cpp people (to whom I also count
| OP) in the "ML on a CPU" space. But the problems being solved
| are entirely different.
| PcChip wrote:
| >You might call the GPU a "specialized environment to build
| and run" but it's arguably much better suited to the problem.
|
| I feel like the person you're replying to knows that the GPU
| is better suited than the CPU to do this task, and your
| argument doesn't really make sense. I think they were
| referring to the python venv environment with all the library
| dependencies as the "specialized environment"
| jebarker wrote:
| The point is that as awesome as this repo is it doesn't do
 | much to wean the "ML folks" off of Python since it doesn't
| provide the flexibility and GPU support that people
| designing and training DL systems rely on.
| waynecochran wrote:
| I'm just encouraged when I see ML libraries not using
 | Python with its environment kludges. Just a step in the
| right direction.
| jebarker wrote:
| I don't disagree that Python environments are a mess. I'm
| actually a developer on quite a prominent large scale
| neural network training library and a DL researcher that
| uses said library. With my developer hat on I like to
| have minimal dependencies and keep Python scripting as
| decoupled as possible from the CUDA C++ implementation.
| With my researcher hat on I don't want to be slowed down
| by C++ development every time I want to change my model
| or training pipeline. At least for me, C++ development is
| slower and more error prone than modifying Python.
|
| Obviously doing any heavy lifting in Python is a bad
| idea. But as a scripting language I think it's good,
| especially if you keep the environment simple. I don't
| think the answer for DL training is to dump Python
| entirely and start over in pure
| C/C++/Rust/Julia/whatever. Learning C/C++ is too big of
| an ask for everyone working on the model design and
| training side and it would slow down progress
| significantly - most of that work is actually data
| munging and targeted model tweaks. But I do think there's
| still a lot that can be done to decouple Python from the
| underlying engine and yield networks where inference can
| be run in a minimal dependency environment. There's lots
| of great people working on all these things.
| segfaultbuserr wrote:
| > _Where Python shines, though, is the prototyping phase when
| you have yet to find an appropriate model. We have yet to see
| this sort of easy & convenient prototyping in C++._
|
| +1.
|
 | Producing a highly-optimized C/C++ kernel that utilizes the
 | CPU to the fullest extent requires a tremendous amount of
 | talent and expertise. For example, not everyone can write a
| hand-vectorized kernel with AVX2 intrinsics (outside a few
| specialized applications like 3D graphics, media encoding,
| and the likes), and even fewer people can exploit the
| underlying feature of the algorithm for optimization, such as
| producing usable output at greatly reduced numerical
| precision. The power of LLM provides strong motivation to
| drive the brainpower of countless programmers all over the
| world to do just that. New techniques are proposed and
| implemented on a monthly basis, with people thinking and
| applying every possible trick on the LLM optimization
| problems. In this regard, moving from Python to C is totally
| reasonable.
|
| In comparison, right now I'm working on optimizing a niche
| open-source scientific simulation kernel with a naive C
| codebase. Before me, there were hardly any contributors in
| the last decade.
|
| Python has its place because not everyone has a level of
| resource and expertise comparable to ML. In particular, when
 | the bulk of the data processing of a Python script is done
 | in a function call to a C++ or Fortran kernel like SciPy, the
 | difference between naive C and naive Python code (or Julia
 | code, if you're following the trend) is not that big,
| especially when it's a one-off project for just publishing a
| single paper.
| hotstickyballs wrote:
| It's going to be a tf or PyTorch feature rather than going
| directly to writing things in C. No point solving this
| problem only once.
| waynecochran wrote:
 | Yeah, I make a living in the GPU space. I think my comment
 | comes from colleagues having to hold my hand to set up their
 | ML / Python environments with all of their peccadilloes. In
 | fact it's bad enough that I have to use Docker to create an
 | insular environment tailored to their specific setup. And
 | Python is like 1000 times slower when it's not using other
 | libs like numpy.
| pdntspa wrote:
| Are they not using venvs or something? It should be as
 | simple as python -m venv venv; source venv/bin/activate; pip
 | install -r requirements.txt
| Narew wrote:
 | Unfortunately it's not that simple, especially for the NVIDIA
 | driver and CUDA install. That's why we usually use conda,
 | which can handle the CUDA install, but even with that,
 | sometimes it works flawlessly and sometimes not.
| waynecochran wrote:
| Everyone has their own way to do this. Every step is
| broken by some unfamiliar dependency that requires
| special arcane knowledge to fix. Part of me is a grumpy
| old man that doesn't gravitate to the shiny new tools
| that come out every week that the younger devs keep up
| with :)
| pdntspa wrote:
 | pip and venv are neither shiny nor new; they've been the
 | standard way of doing things for a while. I am an outsider to
| python and am incredibly thankful for this
| standardization, because i agree getting python env set
| up correctly before venv was a huge pain
|
 | If your guys aren't on this I'd suggest you get them on
| it, it dramatically simplifies setup
| waynecochran wrote:
 | Here is a tiny excerpt trying to get dvc to work just so I
 | could get the training weights for deployment ...
 | remember I don't develop much w Python...
 |
 |       $ dvc pull
 |       Command 'dvc' not found, but can be installed with:
 |       sudo snap install dvc
 |
 |       $ sudo snap install dvc
 |       error: This revision of snap "dvc" was published using
 |       classic confinement and thus may perform arbitrary system
 |       changes outside of the security sandbox that snaps are
 |       usually confined to, which may put your system at risk.
 |       If you understand and want to proceed repeat the command
 |       including --classic.
 |
 | ok I get dvc installed somehow -- don't remember. Time to get
 | the weights...
 |
 |       $ python3 -m dvc pull
 |       ERROR: unexpected error - Forbidden: An error occurred
 |       (403) when calling the HeadObject operation: Forbidden
 |       Having any troubles? Hit us up at
 |       https://dvc.org/support, we are always happy to help!
 |
 | Finally I just had my colleague manually copy the weights.
 | This kind of thing went on for hours.
| pdntspa wrote:
| Researchers are notorious for writing bad code
|
| What even is dvc
|
| edit: also- i'd avoid snap and just use your regular
| package manager.
| waynecochran wrote:
| I think dvc is like git for large binary files. You need
 | some way to manage your NN weights -- what are other
| methods?
| pdntspa wrote:
| git lfs is what everyone is using, HF in particular
| efiop wrote:
| Hey, DVC maintainer here.
|
| Thanks for giving DVC a try!
|
| There are a few ways to install dvc, see
| https://dvc.org/doc/install/linux
|
| With snap, you need to use `--classic` flag, as noted in
| https://dvc.org/doc/install/linux#install-with-snap
| Unfortunately that's just how snap works for us there :(
|
| Regarding the pull error, it simply looks like you don't
| have some credentials set up. See
 | https://dvc.org/doc/user-guide/data-management/remote-storag...
 | Still, the error could be better, so that's on us.
|
| Feel free to ping us in discord (see invite link in
| https://dvc.org/support). I'm @ruslan there. We'll be
| happy to help.
| pjmlp wrote:
 | That Python ML code is calling C++ code running on the GPU,
| one more reason to use C++ across the whole stack.
|
| CERN already used prototyping in C++, with ROOT and CINT, 20
| years ago.
|
| https://root.cern/
|
 | Nowadays it is even usable from notebooks via Xeus.
|
| It is more a matter of lack of exposure to C++ interpreters
| than anything else.
| two_in_one wrote:
| Add to that it's only inference code, not training.
| kwant_kiddo wrote:
| not sure what "It's not hard to write Python code that
| "optimally exploits" the GPU", exactly means but Python is so
| far from exploiting the GPU resources even with C/C++
| bindings that it's not even funny. I am sure that HPC folks
 | would have migrated away from FORTRAN and C/C++ a long time ago
| if it was so easy.
| codethief wrote:
| I wasn't trying to claim that Python is great at fully
| exploiting GPU resources on generic GPU tasks. But in ML
| applications it often does, at least in my experience.
| geysersam wrote:
| It's not like any performance significant component of the ML
| stack is actually implemented in Python. Everything is and has
| always been cuda, c or c++ under the hood. Python is just the
| extremely effective glue binding it all together.
| brucethemoose2 wrote:
| Sometimes implementations will spend a little too much time
 | in Python interpretation, but yeah, it's largely lower-level
| code.
|
| The problem with PyTorch specifically is that (without Triton
| compilation) pretty much all projects run in eager mode.
| That's fine for experimentation and demonstrations in papers,
 | but it's _crazy_ that it's used so much for production without
| any compilation. It would be like using debug C binaries for
| production, and they only work with any kind of sane
| performance on a single CPU maker.
| fassssst wrote:
| Yup. I would much prefer if every ML model had a simple C
| inference API that could be called directly from pretty much
| any language on any platform, without a mess of dependencies
| and environment setup.
| naillo wrote:
| ML is such a beautiful and perfect setup for dependency free
| execution too. It should just be like downloading a
| mathematical function. I'm glad we're finally embracing that.
| brucethemoose2 wrote:
 | Llama.cpp/ggml is uniquely suited to LLMs. The memory
 | requirements are _huge_, quantization is effective, and token
| generation is surprisingly serial and bandwidth bound, making it
| good for CPUs, and an even better fit for ggml's unique pipelined
| CPU/GPU inference.
|
| ...But Stable Diffusion is not the same. It doesn't quantize as
 | well, the unet is very compute-intensive, and batched image
 | generation is effective and useful to single users. It's a better
| fit for GPUs/IGPs. Additionally, it massively benefits from the
| hackability of the Python implementations.
|
| I think ML compilation to executables is the way for SD.
| AITemplate is already blazing fast [1], and TVM Vulkan is very
| promising if anyone will actually flesh out the demo
| implementation [2]. And they preserve most of the hackability of
| the pure PyTorch implementations.
|
| 1: https://github.com/VoltaML/voltaML-fast-stable-diffusion
|
| 2: https://github.com/mlc-ai/web-stable-diffusion
| WinLychee wrote:
| The above project somewhat supports GPUs if you pass the
| correct GGML compile flags to it. `GGML_CUBLAS` for example is
| supported when compiling. You get a decent speedup relative to
| pure C/C++.
| brucethemoose2 wrote:
| Interesting. It still doesn't seem to be very quick:
| https://github.com/leejet/stable-diffusion.cpp/issues/6
|
| But don't get me wrong, I look forward to playing with ggml
| SD and its development.
| WinLychee wrote:
| Yeah for comparison, `tinygrad` takes a little over a
 | second per iteration on my machine.
 | https://github.com/tinygrad/tinygrad/blob/master/examples/st...
| brucethemoose2 wrote:
| Is that on GPU or CPU? 1 it/s would be very respectable
| on CPU.
|
| The fastest implementation on my 2060 laptop is
| AITemplate, being about 2x faster than pure optimized HF
| diffusers.
| WinLychee wrote:
| That was on GPU, and there are various CPU
| implementations (e.g. based on Tencent/ncnn) on github
| that have similar runtime (1-3s / iteration).
| skykooler wrote:
| On the other hand, this is nice for anyone who wants to play
 | with these networks locally and does not have an Nvidia GPU with
| 6+ gigabytes of VRAM. I can run this on an old laptop, even if
| it takes a while.
| brucethemoose2 wrote:
| I would also highly recommend https://tinybots.net/artbot
|
 | You can even run CLIP or (if it's fast enough) llama.cpp on
| the CPU to contribute to the network, if you wish.
| gsharma wrote:
| ComfyUI works pretty well on old computers using CPU. It
| takes over 30 seconds per sampling step on a 2015 MacBook Air
| (I7, 8GB RAM).
| voz_ wrote:
| Iirc we had good speedups on it with torch.compile, and I
| remember working on it. Let me see if I can find numbers...
| brucethemoose2 wrote:
 | It's about 20-40% depending on the GPU, from my tests.
|
| And only very recent builds of torch 2.1 (with dynamic input)
| work properly, and it still doesn't like certain input
| changes, or augmentations like controlnet.
|
| AIT is the most usable compiled implementation I have
| personally tested, but SHARK (running IREE/MLIR/Vulkan) and
| torch-mlir are said to be very good.
|
| Hidet is promising but doesn't really work yet. TVM doesn't
| have a complete implementation outside of the WebGPU demo.
| voz_ wrote:
| Try head of master. If there's any bugs or graph breaks you
| hit, lmk, I can take a look. My numbers say 71% with a few
| custom hacks.
|
| Glad the dynamic stuff is working out tho!
| brucethemoose2 wrote:
| I will, thanks!
|
| I have been away for a month, but I will start testing it
| again later and submit some issues I run into.
| voz_ wrote:
| My username without underscore, at meta. Email me any
| bugs, I can help file them on GH and lend a hand fixing.
| kpw94 wrote:
 | This is incredibly easy to set up, just tried it for the first time.
|
| How fast is it supposed to go?
|
 | Just tried on Linux with `cmake .. -DGGML_OPENBLAS=ON` on an AMD
 | Ryzen 7 5700G (no discrete GPU, only integrated graphics):
 |
 |       ./bin/sd -m ../models/sd-v1-4-ggml-model-f32.bin -p "a lovely cat"
 |       [INFO] stable-diffusion.cpp:2525 - loading model from '../models/sd-v1-4-ggml-model-f32.bin' ...
 |       [INFO] stable-diffusion.cpp:3375 - start sampling
 |       [INFO] stable-diffusion.cpp:3067 - step 1 sampling completed, taking 12.25s
 |       [INFO] stable-diffusion.cpp:3067 - step 2 sampling completed, taking 12.22s
 |       [INFO] stable-diffusion.cpp:3067 - step 3 sampling completed, taking 12.56s
 |       ...
 |       sampling completed, taking 246.40s
|
| Is that expected performance?
|
 | (EDIT: Don't have OpenBLAS installed, so that flag is a no-op)
| patrakov wrote:
| CPU-only, 8-bit quant, Intel Core i7 4770S, 16 GB DDR3 RAM, 10
| year old fanless PC: 32 seconds per sampling step, correct
| output.
| badsectoracula wrote:
| This is nice, it basically does what i asked a year ago[0] and
| at the time pretty much every solution wanted a litany of
| Python dependencies that i ended up failing to install because
| it took ages... and then i ran out of disk space.
|
| No, really, this replaces literal gigabytes of disk space with
| just a 799KB binary. And as a bonus using the Q8_0 format (the
| one that seems to be the fastest) it also saves ~2.3GB of data
| too.
|
| That said, it seems to be buggy with anything other than the
| default 512x512 image size. Some sizes (e.g. 544x544) tend to
| cause assert fails, sizes smaller than 512x512 (which i tried
| as 512x512 is quite slow on my PC) sometimes generate garbage
| (anything smaller than 384x384 seems to always do that).
|
| [0] https://news.ycombinator.com/item?id=32555608
| kpw94 wrote:
 | Also got a segfault (core dump) with different sizes. 512w x 768h
 | worked.
| brucethemoose2 wrote:
| You should quantize the model, but 12s/iter seems about right.
| kpw94 wrote:
| Nice. Tried the fp32, q8_0, and q4_0, and for some reason
| they all take ~12s/iter.
|
| Must have something wrong with my setup, but no big deal, for
| my minimal usage of it, and the amount of time spent,
| fp32@12s/iter is fine
| brucethemoose2 wrote:
 | Hmm, theoretically FP16 might be the fastest, if that's an
| option in the implementation now.
| Lockal wrote:
 | I did a quick run under a profiler, and on my AVX2 laptop
 | the slowest part (>50%) was matrix multiplication
 | (sgemm).
|
| In current version of GGML if OpenBLAS is enabled, they
| convert matrices to FP32 before running sgemm.
|
 | If OpenBLAS is disabled, on the AVX2 platform they convert
 | FP16 to FP32 on every FMA operation, which is even worse
 | (due to repetition). After that, both ggml_vec_dot_f16
| and ggml_vec_dot_f32 took first place in profiler.
|
 | Source:
 | https://github.com/ggerganov/ggml/blob/master/src/ggml.c#L10...
|
 | But I agree that, _in theory, and only with AVX512_, BF16
 | (not exactly FP16, but similar) will be fast with the
 | VDPBF16PS instruction. The implementation is not there yet.
| brucethemoose2 wrote:
| Interesting.
|
| I saw some discussion on llama.cpp that, theoretically,
| implementing matmul for each quantization should be much
| faster since it can skip the conversion. But practically,
 | it's actually quite difficult since the various BLAS
| libraries are so good.
| billfruit wrote:
| It appears to be in C++, why state it as C/C++?
| mmcwilliams wrote:
| From what I understand the underlying ggml dependency is
| written in C.
___________________________________________________________________
(page generated 2023-08-19 23:00 UTC)