[HN Gopher] Laion-5B: A new era of open large-scale multi-modal ...
___________________________________________________________________
Laion-5B: A new era of open large-scale multi-modal datasets
Author : tosh
Score : 129 points
Date : 2022-12-12 12:18 UTC (10 hours ago)
(HTM) web link (laion.ai)
(TXT) w3m dump (laion.ai)
| jerpint wrote:
| LAION is arguably as important as imagenet was in the early 2010s
| SubiculumCode wrote:
| What makes this multimodal, labels?
| ShamelessC wrote:
| One mode is natural language, the other is imagery. It is a
| combination because the model learns statistical
| associations between the modes, e.g. "text to image" or
| "voice to text".
|
| Within these respective modes are even more subgroups, e.g.
| language translation or audio diarization. For SD you could
| consider animation and photographs as separate modes the model
| has to learn, although the language is fuzzy and I'm not being
| statistically rigorous, as that is a weak point of mine.
| minimaxir wrote:
| For practical context, Stable Diffusion 2.X was trained on
| LAION-5B as opposed to LAION-400M for Stable Diffusion 1.X.
|
| At the least, Stable Diffusion 2.X is better at pain points of
| image generation such as text legibility and hands, potentially
| due to having more data points.
| in3d wrote:
| This is incorrect. Stable Diffusion 1.x was trained on "laion-
| improved-aesthetics" (a subset of laion2B-en).
| minimaxir wrote:
| Double checked, and both the initial comment and the
| correction are incorrect: the original v1.1 was trained on
| LAION-2B, then subsequent versions were finetuned on the
| aesthetics subset.
|
| Either way, the main point is the same: more training data
| gives better results.
|
| https://github.com/CompVis/stable-diffusion#weights
| in3d wrote:
| 1.1 wasn't public. Public releases were trained as I said.
| cma wrote:
| 1.1 is available here:
| https://huggingface.co/CompVis/stable-
| diffusion-v-1-1-origin...
| satvikpendem wrote:
| SD 2 also removed quite a lot of images of humans due to their
| fear of people generating CSAM, so the quality has actually
| gotten worse for anything resembling humans compared to SD 1.
| astrange wrote:
| 2.0 removed too many of them due to a bug in the NSFW filter.
| 2.1+ should be better again.
|
| But they're harder to control without negative prompting.
| Terretta wrote:
| Problem with hands is probability.
|
| It's more probable a finger has a finger on both sides of it
| than not. So the model diffuses lots of adjacent fingers.
| stavros wrote:
| But that's the same for everything that has structure. A
| small section of an arm is much more likely to have another
| small section of an arm next to it than to have a hand, yet
| SD's arms are usually well-proportioned.
| sdenton4 wrote:
| There's a lot of loooooong necks, though.
| alar44 wrote:
| No, the problem with fingers is that they resemble hotdogs, and
| the AI really likes hotdogs, so you get a lot of fingers.
|
| I can make things up too!
| lajamerr wrote:
| While it is an impressive number of images today, I believe it
| will be an underwhelming amount compared to what models are
| trained on in the future.
|
| This is an incomplete analogy, but from the time a baby is
| born, that baby will have seen 1,892,160,000 frames of data
| per eye, 3,784,320,000 frames across both eyes, in a year.
| That baby still knows practically nothing about the world.
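A back-of-envelope check of those figures. This is a sketch; it assumes a 60 fps "equivalent" for the eye, which is itself a contested, illustrative number:

```python
# Rough check of the frame counts quoted above, assuming the eye
# takes in a 60 fps "equivalent" (a debatable, illustrative figure).
FPS = 60
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # 31,536,000 s

frames_per_eye = FPS * SECONDS_PER_YEAR
frames_both_eyes = 2 * frames_per_eye

print(f"{frames_per_eye:,}")    # 1,892,160,000
print(f"{frames_both_eyes:,}")  # 3,784,320,000
```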
| rom1504 wrote:
| yes indeed. Video is the clear next step.
| minimaxir wrote:
| Most of those frames are redundant.
| bena wrote:
| And unclassified. And of poor quality.
|
| Babies have a much harder task. They have to construct a
| corpus of knowledge from absolutely nothing.
| coolspot wrote:
| Not absolutely nothing: the neural net is initialized with
| some weights encoding basic things (breathing, sucking,
| crying, etc.). A newborn horse walks and follows its mother
| within the first 5-10 minutes.
| trasz2 wrote:
| How do we know they start from nothing?
| CamperBob2 wrote:
| In fact, we're pretty sure that they don't "start from
| nothing." E.g.,
| https://en.wikipedia.org/wiki/The_Language_Instinct
| bena wrote:
| We're not pretty sure of anything e.g.
| https://en.wikipedia.org/wiki/Educating_Eve
| CamperBob2 wrote:
| On the surface, that sounds like a reasonable position to
| take. ("Cowley proposes an alternative: that language
| acquisition involves culturally determined language
| skills, apprehended by a biologically determined faculty
| that responds to them. In other words, he proposes that
| each extreme is right in what it affirms, but wrong in
| what it denies. Both cultural diversity of language, and
| a learning instinct, can be affirmed; neither need be
| denied.")
|
| GPT's ability to fool intelligent people into thinking
| that it is "intelligent" itself seems like a powerful
| argument that language, more than anything else, is what
| makes humans capable of higher thought. Language is all
| GPT has. (Well, that and a huge-ass cultural database.)
|
| Intelligence is one of those areas in which, once you
| fake it well enough, you've effectively made it. Another
| 10x will be enough to tie the game against an average
| human player.
| the8472 wrote:
| The upside is that babies get to interact with the
| environment they're training on. Image models can't move
| the camera a few cm to the right if they're interested in
| the perspective of a particular scene.
| lajamerr wrote:
| There's value in redundancy and in a continuous stream of
| images where one frame follows the other.
|
| It would be nice to have a dataset of a couple "raising" a
| Video recorder for 1 year as if they would a baby. A
| continuous stream of data.
|
| Could train a model to predict the next frames based on what
| it's seen so far.
| mindcrime wrote:
| _It would be nice to have a dataset of a couple "raising" a
| Video recorder for 1 year as if they would a baby. A
| continuous stream of data._
|
| The project I'm working on right now is to build a sort of
| "body" for a (non ambulatory, totally non anthropomorphic)
| "baby AI" that senses the world using cameras, microphones,
| accelerometer/magnetometer/gyroscope sensor, temperature
| sensors, gps, etc. The idea is exactly to carry it around
| with me and "raise" it for long periods of time (a year?
| Sure, absolutely, in principle. But see below) and explore
| some ideas about how learning works in that regime.
|
| The biggest (well, one of the biggest) challenge(s) is
| going to be data storage. Once I start storing audio and
| video the storage space required is going to ramp up
| quickly, and since I'm paying for this out of my own pocket
| I'm going to be limited in terms of how much data I can
| keep around. Will I be able to keep a whole year? Don't
| know yet.
|
| There's also some legal and ethical stuff to work out,
| around times when I take the thing out in public and am
| therefore recording audio and video of other people.
| lajamerr wrote:
| Glad to hear you are working on such a project. There will
| definitely be a lot of privacy concerns in any such
| project, so it may be difficult to open-source the data to
| the broad public.
|
| But it could still be useful to research institutes that
| follow privacy guidelines.
|
| It might be best to do a short stint of 1 week to test
| the feasibility. That should give you a good estimate on
| future projections of how much data it will consume after
| a month, 3 months, and a year.
|
| I imagine any intelligent system could work with reduced
| data quality/lossy data at least on the audio.
|
| As long as it's consistent in the type/amount of
| compression, that is. So instead of WAV/FLAC/RAW, you could
| encode it to something like Opus at 100 kbps, and that would
| give you 394.2 gigabytes of data for a single year of
| audio.
|
| As for video... it would definitely require a lot of
| tricks to store on a hobbyist level.
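The 394.2 GB audio figure above checks out; a quick sketch of the arithmetic, assuming a constant 100 kbps Opus stream recorded around the clock:

```python
# Storage needed for one year of continuous audio at a constant bitrate.
# Assumes Opus at 100 kbps, as suggested in the comment above.
BITRATE_BPS = 100_000                  # 100 kbps
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # 31,536,000 s

bytes_per_year = BITRATE_BPS // 8 * SECONDS_PER_YEAR
print(bytes_per_year / 1e9)  # 394.2 (GB)
```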
| mindcrime wrote:
| Yep. Your reply here encapsulates a lot of what I've been
| thinking about for the past few weeks. I'd love to open-
| source at least some of the data I collect, but the
| privacy/ethics issues have to be considered. And as far
| as that goes, there are legal/ethical issues around
| simply _collecting_ data even if I don't share it, that
| come into play where other people are involved.
|
| _It might be best to do a short stint of 1 week to test
| the feasibility. That should give you a good estimate on
| future projections of how much data it will consume after
| a month, 3 months, and a year._
|
| Yep. That's basically the approach I took with "phase 1"
| where the only data being ingested was gps /
| accelerometer data. I just let it run for a couple of
| weeks and then extrapolated out what the storage
| requirements would be for the future. Obviously audio and
| video are going to change the equation a lot, but the
| same principle is what I am planning to employ.
|
| _I imagine any intelligent system could work with
| reduced data quality/lossy data at least on the audio._
|
| Yep, that's another area I've been thinking a lot about.
| The "instinct" is to capture everything at the highest
| possible resolution / sampling rate / etc. and store in a
| totally lossless format. But that is also the most
| expensive scenario and if it's not strictly required,
| then why do it? We know human hearing at least can work
| with relatively crappy audio. Look at the POTS phone
| system and its 8 kHz sampling rate, for example. Does that
| analogy hold for video? Good question.
|
| _As long as it's consistent in the type/amount of
| compression, that is. So instead of WAV/FLAC/RAW, you could
| encode it to something like Opus at 100 kbps, and that would
| give you 394.2 gigabytes of data for a single year of
| audio._
|
| Agreed.
|
| _As for video... it would definitely require a lot of
| tricks to store on a hobbyist level._
|
| Definitely. One thing that may help with costs in the
| short-term is that I'm very explicitly not (for now
| anyway) using a cloud storage service. Data ingestion is
| to a server I own and physically have in my home. I can
| get away with this because while the aggregate total
| amount of data may wind up fairly big over longer periods
| of time, the rate at which I need to ingest data isn't
| all that high (there's only one of these devices sending
| to the server). And I can just keep adding 5TB or 10TB
| drives as needed. When one fills up, I can unplug it,
| replace it with another, label and store it, and move on.
| The big risks here are that I don't really have any
| redundancy in that scenario, especially if my home burns
| down or something. But in that case I have bigger
| problems to worry about anyway!
|
| There are other downsides to this approach, like dealing
| with the case of needing to access the entire year's
| worth of data "at once" for analysis or training, but I'm
| not sure that need will ever even arise.
| sharemywin wrote:
| There was an article on using latent embeddings for
| compression. Might be useful.
|
| https://pub.towardsai.net/stable-diffusion-based-image-
| compr...
| Hendrikto wrote:
| Pretty sure this is a troll.
|
| The assumption that human eyes can be measured in FPS is, in
| itself, very questionable. And if it were indeed the case, then
| it would surely be far in excess of 60fps...
| dr_dshiv wrote:
| Well, inhibitory alpha waves cycle across the visual field 10
| times a second. People with faster alpha waves can detect two
| flashes that people with slower alpha waves see as one flash.
| mindcrime wrote:
| _The assumption that human eyes can be measured in FPS is, in
| itself, very questionable._
|
| In the strictest sense, yes. But it seems quite reasonable to
| think that there is something like an "FPS equivalent" for
| the human eye. I mean, it's not magic, and physics comes into
| play at some level. There's a shortest unit of time / amount
| of change that the eye can resolve. From that you could work
| out something that is analogous to a frame-rate.
|
| _And if it were indeed the case, then it would surely be far
| in excess of 60fps_
|
| Not necessarily. Quite a few people believe that the human
| eye "FPS equivalent" is somewhere between 30-60 FPS. That's
| by no means universally accepted, and since it's just an
| analogy to begin with, the whole thing is admittedly a little
| bit dodgy. But by the same token, it's not immediately
| obvious that the human "FPS equivalent" would be "far in
| excess of 60 FPS" either.
| satvikpendem wrote:
| You are correct. DeepMind released a paper earlier this year
| showing that data is the primary constraint holding back these
| models, not their parameter count (i.e. a model with 5 billion
| parameters is not much better than one with 1 billion, but more
| data can make both much better) [0].
|
| I will copy paste the main findings from the article here:
|
| - Data, not size, is the currently active constraint on
| language modeling performance. Current returns to additional
| data are immense, and current returns to additional model size
| are miniscule; indeed, most recent landmark models are
| wastefully big.
|
| - If we can leverage enough data, there is no reason to train
| ~500B param models, much less 1T or larger models.
|
| - If we _have_ to train models at these large sizes, it will
| mean we have encountered a barrier to exploitation of data
| scaling, which would be a great loss relative to what would
| otherwise be possible.
|
| - The literature is extremely unclear on how much text data is
| actually available for training. We may be "running out" of
| general-domain data, but the literature is too vague to know
| one way or the other.
|
| - The entire available quantity of data in highly specialized
| domains like code is woefully tiny, compared to the gains that
| would be possible if much more such data were available.
|
| [0]
| https://www.alignmentforum.org/posts/6Fpvch8RR29qLEWNH/chinc...
| ma2rten wrote:
| This post is about image generation, not language models.
| satvikpendem wrote:
| I'd imagine the situation is the same for image generation
| models too.
| gillesjacobs wrote:
| Good to see open data and open models become a thing, I hope this
| trend will continue and open AI will triumph like open source
| software did.
| ritwikgupta wrote:
| This dataset is a massive failure when it comes to ethical
| research practices. LAION-5B openly indexed copyrighted data that
| it had no business collecting. They failed to go through an IRB
| when curating this data. The ethics review for this paper was a
| joke, where the ethics reviewer raises valid concerns and then
| discards their review because "if they don't publish it here,
| they'll publish it somewhere else anyways" [0].
|
| LAION-5B has enabled some really cool technologies and a lot of
| promising startups. This work should have been carried out
| responsibly.
|
| [0] https://openreview.net/forum?id=M3Y74vmsMcY
| O__________O wrote:
| What specifically are you claiming required a review board?
|
| Quick review of their site and the paper turns up nothing that
| commonly would be a topic that might merit such a review.
|
| Related FAQs:
|
| - https://laion.ai/faq/
| ritwikgupta wrote:
| LAION-5B includes images of humans without their explicit
| consent. Images of people generally involve IRB/HSR.
| Additionally, almost any IRB will mention that if you're
| using data _derived_ from humans, you must go through IRB.
|
| LAION can say all they want that they're not including images
| in their dataset. They include a script to download those
| URLs into images on disk. By being a company that's not bound
| to decades of university ethics regulations, they are
| seemingly allowed to skirt what you learn on your first day
| as a researcher in academia. It may be legal, but it sure is
| not ethical.
| O__________O wrote:
| Please provide link to another academic publication
| agreeing with your claim that linking to online content is
| unethical without the subject's explicit approval.
| nl wrote:
| That's a more specific claim that the OP didn't make.
| OctopusLupid wrote:
| > LAION-5B openly indexed copyrighted data that it had no
| business collecting.
|
| This seems to be legal in many countries (from what I know, the
| UK, EU, Japan and Singapore) due to the TDM (Text and Data
| Mining) exception, especially for researchers.
| Blackthorn wrote:
| > LAION-5B openly indexed copyrighted data that it had no
| business collecting.
|
| Seems like an open and shut fair use claim, web indexing (not
| even scraping, just indexing) is not uncommon...
| oth001 wrote:
| Terribly unethical to use unlicensed images. They could have
| crowdsourced image gathering and labeling instead of stealing
| images.
| astrange wrote:
| This is like saying Google Image Search stole your image.
|
| (In fact it's exactly the same; it's allowed under the same
| laws and it respects robots.txt.)
| oth001 wrote:
| Does Google.com allow anybody to instantly mimic an artist's
| style? Obviously AI laws haven't been put in place yet, but
| that doesn't mean it's not unethical.
| astrange wrote:
| It's always been possible to imitate an artstyle.
| Nevertheless, they've never gotten IP protection - they're
| more like trade secrets.
|
| What's notable is "AI users are trying to copy an artist"
| != "AI has learned from an artist" != "AI has seen the
| artist's images in the first place". The most popular
| supposedly stolen-from artist Greg Rutkowski is not in
| StableDiffusion's training images, even though users are
| actively trying to copy him, it's a coincidence that it
| appears to work. Is that unethical?
|
| Also, AI laws (text and data mining exemptions) /have/ been
| put in place - to make this explicitly legal!
| satvikpendem wrote:
| If you've ever actually looked into the LAION datasets, you'll
| notice that they are hot garbage, in that the captions often
| don't even correlate with what the image is about, and the
| images are often low quality, badly cropped, and so on.
|
| There are other datasets being developed that use high-quality
| images manually labeled by humans, such as by Unstable
| Diffusion, which is running a Kickstarter right now [0]. They
| say they will be able to get a much higher quality model due to
| such high-quality images and captioning, so we'll see. They
| also want to make the model and code entirely open source,
| unlike the license that Stable Diffusion has, which is not open
| source (it has many restrictions, enforceable or not, on the
| images generated).
|
| [0]
| https://www.kickstarter.com/projects/unstablediffusion/unsta...
| infinityio wrote:
| Obviously there would be limits as to how much could be
| manually reviewed by hand (if 1,000 people reviewed 1,000
| images each, only 0.02% of the images would be reviewed,
| assuming no overlap was required), but I wonder if there
| would be any benefit to attempting to crowdsource captions
| for the worst of the available images.
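The coverage arithmetic in that parenthetical, sketched out (the reviewer and per-reviewer counts are the hypothetical numbers from the comment):

```python
# Fraction of LAION-5B covered by a hypothetical crowdsourced review,
# assuming 1,000 reviewers handling 1,000 images each with no overlap.
reviewers = 1_000
images_each = 1_000
dataset_size = 5_000_000_000

coverage = reviewers * images_each / dataset_size
print(f"{coverage:.2%}")  # 0.02%
```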
| alsodumb wrote:
| If you ever actually look into Unstable diffusion Kickstarter,
| you'll notice that they're not actually claiming they'll
| manually label a dataset the size of Laion-5B - that's a much
| bigger task than what you seem to think it is.
|
| Even if a million people were labeling images, without any
| overlap, 5 billion images would mean each of them has to
| label 5,000 images.
|
| What Unstable diffusion folks seem to be doing is that they're
| using a few thousand labeled images to train a caption
| generation model and then use it to create a huge multimodal
| dataset with text and high quality images.
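The per-labeler workload above, as a one-liner (a million hypothetical labelers splitting the dataset evenly with no overlap):

```python
# Images each labeler would have to caption if a million people split
# LAION-5B evenly with no overlap -- the scenario described above.
labelers = 1_000_000
dataset_size = 5_000_000_000

images_per_labeler = dataset_size // labelers
print(images_per_labeler)  # 5000
```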
| satvikpendem wrote:
| > _If you ever actually look into Unstable diffusion
| Kickstarter, you'll notice that they're not actually claiming
| they'll manually label a dataset the size of Laion-5B -
| that's a much bigger task than what you seem to think it is._
|
| I never claimed this either.
| fpgaminer wrote:
| DALL-E, Stable Diffusion, GPT-3, Whisper, CLIP, etc are all
| trained on "hot garbage" and all of them are SOTA. Whisper is a
| great example, as it shows that this broader use of imperfect
| training data helps to make the models more robust and general
| than their "perfectly" trained counterparts. The trick behind
| all of these is to build mechanisms on smaller scale, human
| labelled data that can then be used to filter and label the
| broader dataset. Or use training methods that are more robust
| to imperfect data, like contrastive learning ala CLIP.
| GaggiX wrote:
| It is not possible to manually label hundreds of millions of
| images to train a model on them; CFG exists to deal with this
| problem. Also, Unstable Diffusion will just finetune a Stable
| Diffusion model, so you cannot simply change the licence to
| whatever you want.
| operator-name wrote:
| Boorus [0] contain millions of images, manually labeled to a
| pretty high quality. Notably, diffusion models trained on booru
| datasets have had good success.
|
| This is not the only example of well curated image-tag pairs,
| especially in artistic circles. It's just that most of them
| are not CC.
|
| [0]: https://en.wiktionary.org/wiki/booru
| GaggiX wrote:
| Boorus use tags instead of captions, so a model trained on
| them is really limited; moreover, Danbooru has only 5
| million images, while other boorus such as Gelbooru and
| Sankaku have lower quality.
| operator-name wrote:
| Tags are limited how, exactly? Prompt crafting becomes a
| case of selecting the relevant tags, and the embedding
| space will still capture the dataset.
|
| Danbooru is only one such example of well-curated tagging,
| and if we ignore copyright there are far more examples.
| These examples just serve as evidence that refining poor
| labeling is not outside the realm of possibility, as you
| suggested.
| GaggiX wrote:
| A tag-based system would completely lack any kind of
| contextual information, and it would not be possible to
| create any relationship between words; natural language
| is much more powerful.
|
| An example: an image is tagged kanna_kamui, kimono and
| torhu_(maiddragon). Who has the kimono? Kanna, Torhu or
| both? It cannot be known, but with natural language it is
| possible to describe who is wearing what.
| devmor wrote:
| >It is not possible to manually label hundreds of millions of
| images to train a model on them
|
| Citation, please?
|
| I think you mean "the developers of this technology do not
| want to pay to have hundreds of millions of images labeled".
| GaggiX wrote:
| It is not believable that someone would pay humans to label
| 400M or 5B images/samples to train a model on them, but
| I guess if your argument is "everything is possible" then
| gotcha.
| satvikpendem wrote:
| If it's done in a reCAPTCHA like way, it can be done
| fairly efficiently and for cheap. In fact Scale AI does
| just this, they do manual labor operations such as
| captioning images, as an API. Here's their product for
| image labeling: https://scale.com/rapid.
|
| Unstable Diffusion is also doing their captioning like
| how I mentioned, with groups of volunteers as well as
| hired individuals.
| GaggiX wrote:
| Scale seems to do, for example, image classification but
| not captioning, as it would be hard to compare the results
| with other people's to verify the quality (when you have a
| discrete number of classes it is really straightforward).
| Also, can you report where you read about the Unstable
| Diffusion plan for manually labeling image datasets? I
| want to dig deeper.
| satvikpendem wrote:
| From their Reddit post about this: https://old.reddit.com
| /r/StableDiffusion/comments/zhg18s/uns...
| GaggiX wrote:
| > We are releasing Unstable PhotoReal v0.5 trained on
| thousands of tirelessly hand-captioned images
|
| They seem to have created a much smaller dataset than
| LAION's, it would not work to train a generative model on
| such a small amount of images (obviously the images here
| do not have a single domain).
| devmor wrote:
| You seem to be confusing "possibility" with your personal
| opinion on what you think would be done by others.
| GaggiX wrote:
| As a human being, I know human limitations; explicitly
| labeling 400M/5B images for a particular task seems
| absurd to me, but if you think it is realistically
| possible, perhaps you can give an example.
| whiplash451 wrote:
| The LAION dataset was designed for the broader community in
| the first place, so clearly the premise is that they don't
| have millions to throw at the problem.
| rom1504 wrote:
| Looks like you missed the whole point of this dataset.
|
| The idea that we proved is that you can get a dataset with
| decent captions and images (that do match, yes; you can see
| for yourself at https://rom1504.github.io/clip-retrieval/ )
| that can be used to train well-performing models (e.g.
| OpenCLIP and Stable Diffusion) while using only automated
| filtering of a noisy source (Common Crawl).
|
| We further proved that idea by using aesthetic prediction, NSFW
| and watermark tags to select the best pictures.
|
| Is it possible to write captions manually? Sure, but that
| doesn't scale much and won't make it possible to train general
| models.
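The automated filtering described here can be sketched as thresholding an image-text similarity score. This toy uses made-up 3-d vectors and a hypothetical threshold value; the real LAION pipeline computes CLIP embeddings over Common Crawl candidates, which this does not reproduce:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy (image_embedding, caption_embedding, url) triples; in the real
# pipeline these would be CLIP embeddings of the image and its alt-text.
candidates = [
    ((1.0, 0.0, 0.1), (0.9, 0.1, 0.0), "https://example.com/cat.jpg"),
    ((0.0, 1.0, 0.0), (1.0, 0.0, 0.0), "https://example.com/mismatch.jpg"),
]

# Hypothetical cutoff; LAION used a CLIP-similarity cutoff in this spirit.
THRESHOLD = 0.3
kept = [url for img, txt, url in candidates if cosine(img, txt) >= THRESHOLD]
print(kept)  # only the matching pair survives the filter
```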
| satvikpendem wrote:
| > _Is it possible to write captions manually? Sure, but that
| doesn't scale much and won't make it possible to train
| general models._
|
| Maybe, though I don't think so, based on the above comments
| by Unstable Diffusion. It seems like people are
| underestimating the power of high-quality data and just
| throwing the kitchen sink at models. Perhaps a set of
| good-quality data can indeed outperform LAION-style datasets.
|
| It's like the YC saying about doing things that don't scale,
| perhaps with the high quality dataset, we can train better
| models than CLIP and in turn use those to caption the rest of
| the images, only now the caption model is much better than
| previous ones.
| GaggiX wrote:
| The new Unstable Diffusion model will be one of the several
| SD-finetuned models out there. These models usually have
| much higher quality (but smaller image diversity) because
| they take the coherency of SD and constrain the distribution
| to a small, high-quality portion. You could train a model
| from scratch on a smaller high-quality dataset, but you
| would not, for example, get the same level of coherency;
| that can only be obtained with an incredible amount of
| images, and they don't need to be "high quality": a man will
| almost always have 2 arms, 2 legs, etc., regardless of the
| quality of the images. After the model has fit the entire
| distribution, you can finetune it to produce high-quality
| and coherent images with a small dataset. That's why
| Unstable Diffusion will finetune an SD checkpoint, and also
| why researchers use big datasets like LAION-400M/5B.
| cma wrote:
| > and they don't need to be "high quality", a man will
| almost always have 2 arms, 2 legs etc...
|
| At the next generation it feels like the training set
| will be inbreeding on the flood of stable diffusion
| images with 7 mangled fingers, heads coming out of legs,
| etc.
| version_five wrote:
| I'd guess there is a bias-variance tradeoff. If you just
| want to make a certain kind of image, no doubt a manually
| labeled and curated dataset can be better. If you want a
| generic generative model that has learned a wide variety of
| stuff, scale wins.
|
| I can see LAION playing a similar role to ImageNet. The
| main application of ImageNet isn't directly training image
| recognition models; it's pretraining on diverse data so that
| a "big" (big in 2016) model can be fine-tuned easily on a
| small dataset, after learning to be a good feature
| extractor. From that perspective, the label quality (and
| concerns about bias and whatnot) are almost irrelevant.
| napier wrote:
| It's possible to do a much better job with automation, too:
| context-aware cropping, accurate aspect ratios, quality
| filtering by various metrics... all solved problems long
| ago, but absent from LAION-5B for some reason. Perhaps it
| would be a good idea to collaborate more closely with image
| experts for the next round.
| [deleted]
| abeppu wrote:
| So the core of the dataset is image _URLs_ and text captions.
|
| 1. From a reproducibility perspective, isn't this kinda brittle
| in that even without malicious intent, some of those images will
| no longer be available when other researchers attempt to download
| them?
|
| 2. From a resilience perspective, if your site has some of the
| images in the dataset, could you swap in another image with the
| correct dimensions. Could you poison or skew the model in any
| interesting ways?
| version_five wrote:
| Imagenet (arguably the most used image dataset of the last 10
| years) is the same, it's a list of URLs with full archives of
| the downloaded images available under some conditions.
| tbalsam wrote:
| Fair enough, but ImageNet is sort of a nightmare right now.
| I get that it's a crowdfunded and crowdsourced effort, but
| hopefully at some point some brave soul(s) will step up to
| archive the data as-is in a very reproducible kind of way.
| :D :))))
| alsodumb wrote:
| The key is the scale of the dataset. Both the points you
| mention become irrelevant for a large dataset because
|
| 1) The chance that a significant percentage of the images
| become unavailable is low. Also, training on such a big dataset
| means your model generalizes well and is usually robust.
|
| 2) Again, you would need to inject adversarial/malicious images
| to a significant number of those links in the dataset for it to
| have actual impact on trained model. Again, unlikely.
| [deleted]
| abeppu wrote:
| For point 1 ... it depends on the timescale. In the fullness
| of time, surely a significant portion of images will be
| unavailable. From the perspective of allowing other
| researchers to work from the "same" baseline "today", this is
| likely good enough. In a generation from now, if someone
| wants to reproduce results from some landmark model trained
| against this dataset, we'd have problems. In other fields
| where people publish or share their datasets, would this be
| considered sufficient?
|
| For point 2, I think it's possible that for some narrow
| topics, some domains have a significant share of images. I
| think these can affect the model, which is in part why they
| give special attention to watermarking. Suppose instead of
| merely watermarking images, for every image on my large
| collegiate track and field website I make sure someone is
| wearing a garment with a visible Nike swoosh. Can I skew the
| model towards associating Nike with the sport? I think this
| kind of thing may be achievable for niche areas.
| astrange wrote:
| Since artists already appear to believe LAION is "stolen
| content", actually downloading everything wouldn't help the
| case that it's fine.
| whiplash451 wrote:
| And from the storing perspective? The full image dataset weighs
| dozens of PB. How convenient is that to share?
___________________________________________________________________
(page generated 2022-12-12 23:00 UTC)