[HN Gopher] Problems with AI-based monitoring startups (2018)
___________________________________________________________________
Problems with AI-based monitoring startups (2018)
Author : zdw
Score : 83 points
Date : 2021-01-24 19:13 UTC (2 days ago)
(HTM) web link (www.yesthatblog.com)
(TXT) w3m dump (www.yesthatblog.com)
| phonebucket wrote:
| Somewhat of a tangent, but a pet peeve of mine is people misusing
| 'by definition' where there is no clear definition to be
| leveraged.
|
| > By definition, an AI researcher does not have operational
| experience.
|
| Why? What definition?
| thinkloop wrote:
| An AI researcher, by definition, is one who spends their time
| researching AI, so they would have no time remaining to get
| operational experience.
| TomasEkeli wrote:
| that's just silly. someone with operational experience can
| switch to ai research later in their career. probably not
| super common, but nothing definitionally wrong with it
| lwhi wrote:
| They probably have experience of using e-commerce though.
| [deleted]
| dijksterhuis wrote:
| I'm a researcher in ML security.
|
| I also have five years experience in music copyright and
| royalties. I still know stuff about weird esoteric
| distribution policies PRS for Music has, like the Educational
| Recording Agreement or how the pubs and clubs scheme
| analogous apportionment of £40 million worked.
|
| Then there's the digital signals synthesis techniques I've
| had to learn in the last two years as well. Literally
| building bare bones software based synthesisers.
|
| And then there's all the ML stuff and the security stuff...
|
| So I wholeheartedly disagree with your definition. Knowledge
| is often much fuzzier than _person is working on X, therefore
| they must only know about X_.
| jarym wrote:
| Your background is really interesting - ML and music
| copyright/royalties. It's an area that my wife is
| interested in. If you don't mind I'll email you directly.
| dijksterhuis wrote:
| Sure, happy to help if I can. The website in my profile's
| "about" section has a few e-mail addresses you can contact
| me on.
| sakarisson wrote:
| A stamp collector, by definition, is someone who spends all
| of their time collecting stamps, so they would have no time
| remaining to get operational experience.
| speedgoose wrote:
| There is applied science.
| sn41 wrote:
| This is just the usual "real world" snobbery that
| mathematicians and researchers have to deal with. Sure,
| researchers are inexperienced in some areas, but the "real
| world" also will not be able to make leaps in progress without
| specialized research.
| matsemann wrote:
| Lots of monitoring setups are either too simple or too advanced.
| We had a system accepting a certain kind of applications. We
| would normally get about 1500 a day, following a pattern with a
| steady amount during working hours, many in the evening, and
| almost nothing during the night. Any human could take a glance at
| the graph and quickly tell if something was wrong or everything
| ok.
|
| We wanted to catch if something went wrong and people couldn't
| submit applications. We set up an alarm that triggered "if less
| than X applications last hour", but that was hard to tweak so it
| didn't go off during the night and was unusable. The tool had no
| way to set that rule to only apply for certain time periods or
| any way of making more advanced rules. But even if it had, I
| think it would have been a game of whack-a-mole with false
| positives and obvious errors slipping through.
|
| Instead we could set up some kind of AI "anomaly detection". That
| was almost even worse. Firstly, because no one could tell us how
| it really worked, how does it know what we consider an anomaly? I
| mean, if there is a holiday in the middle of the week and we get
| fewer applications than normal that's an anomaly, but nothing to
| sound an alarm for.
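|
| A minimal sketch of the time-aware version of the "less than X
| applications last hour" rule described above, which the tool in
| question could not express. The per-hour floors and the names
| MIN_PER_HOUR and should_alert are made up for illustration, and
| the sketch still knows nothing about holidays, so the
| whack-a-mole concern stands.

```python
from datetime import datetime

# Hypothetical per-hour minimums: nothing expected at night, a steady
# flow during working hours, more in the evening. Numbers are invented.
MIN_PER_HOUR = {hour: 0 for hour in range(0, 7)}
MIN_PER_HOUR.update({hour: 30 for hour in range(7, 17)})
MIN_PER_HOUR.update({hour: 60 for hour in range(17, 24)})


def should_alert(count_last_hour: int, now: datetime) -> bool:
    """Return True if the count is below the floor for this hour of day."""
    return count_last_hour < MIN_PER_HOUR[now.hour]
```

| With a floor of 0 at night the rule can never fire between 00:00
| and 07:00, which is exactly the behaviour the single global
| threshold could not give.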
| amelius wrote:
| I'd like to see anomaly detection as a standard part of (image)
| classification libraries, as it's incredibly useful to know if
| a given image is "too far" outside of your trained dataset to
| be accurately classified.
| sarusso wrote:
| Actually, good anomaly detection can take into account what
| "normal" is for your use-case.
|
| For example, Facebook Prophet (a forecasting procedure, which
| can in turn be used for anomaly detection) has extensive
| support for seasonalities and holidays, and would probably work
| well for the use-case you mentioned.
| jacques_chester wrote:
| > _Firstly, because no one could tell us how it really worked,
| how does it know what we consider an anomaly?_
|
| Some old-fashioned techniques (eg. just triggering on high
| deviations with adjustment for seasonality) are at least
| explicable, given a little time. But they don't get much buzz.
|
| My own view is that the untreated problem for most attempts to
| apply time series analysis / anomaly detection / concept drift
| / process control is that the time series observed are highly
| variable to begin with. A highly variable series means (1) you
| need more data to build the predictions, (2) those predictions
| are less certain and (3) the false positives will be higher.
|
| Put another way: the need for such tools is a symptom, and a
| treatment, _but not a cure_.
|
| Making software more predictable is more value-adding than
| building ever more elaborate predictors.
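|
| One way to read "triggering on high deviations with adjustment
| for seasonality" is a z-score against past values from the same
| seasonal slot. This is a sketch under that assumption, not the
| commenter's own method; the function name and the threshold of 3
| standard deviations are illustrative choices.

```python
from statistics import mean, stdev


def seasonal_anomaly(history, current, threshold=3.0):
    """Flag `current` if it deviates strongly from the same seasonal slot.

    history: past values for the same slot (e.g. previous Mondays, 10:00-11:00).
    Returns a tuple (is_anomaly, z_score).
    """
    if len(history) < 2:
        raise ValueError("need at least two past observations")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past values were identical, so any difference counts as a deviation.
        if current == mu:
            return False, 0.0
        return True, float("inf")
    z = (current - mu) / sigma
    return abs(z) > threshold, z
```

| The point about variability shows up directly: against a steady
| history such as [100, 110, 90, 105, 95] a value of 20 is flagged,
| while against a noisy history such as [10, 200, 50, 300, 5] the
| same value is not.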
| social_quotient wrote:
| I have similar thoughts here. It's almost like we don't need the
| AI, we just need the I. What you have described is a fairly
| straightforward pattern that seems like it could be expressed
| by looking at the same period (ago) to see if it's acting
| roughly right.
|
| It reminds me of a chat I had with a customer in their slack.
| They wanted build and deploy notifications but only if a build
| didn't happen on a day at an expected time. They don't want 700
| notifications a day, it creates a deafness to the signal. The
| request was comically difficult.
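|
| The customer request above (notify only when an expected build
| did not happen) amounts to alerting on absence. A rough sketch,
| assuming build completion times are available as datetime
| values; the function name and the 30-minute grace period are
| hypothetical.

```python
from datetime import datetime, timedelta


def missing_build_alert(build_times, expected_at, grace=timedelta(minutes=30)):
    """Return True if no build finished within `grace` of `expected_at`."""
    window_start = expected_at - grace
    window_end = expected_at + grace
    # Alert only when the expected window contains no build at all.
    return not any(window_start <= t <= window_end for t in build_times)
```

| This produces at most one notification per expected build rather
| than one per build that did run.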
| a_imho wrote:
| _Firstly, because no one could tell us how it really worked,
| how does it know what we consider an anomaly_
|
| This hits the nail on the head for me. Unless the AI magic
| handwaving is 100% infallible a human in the loop needs to
| verify whether there is in fact any problem. Traceability is at
| least as important as accuracy and statistical methods have a
| very good track record there.
| craigching wrote:
| I don't think of anomaly detection as being the end, it's
| triage. It shouldn't do more than "I found something that
| doesn't follow the normal patterns, let me page someone to
| take a look." And you need to use it strategically, it's not
| for use everywhere. Like anything, it's a tool and it really
| depends on how you use it.
| linuxftw wrote:
| The pattern is simple. Collect metrics, filter, apply rules,
| alert.
|
| The problem is, people try to put all of those pieces into a
| single product. This sometimes works, other times not. "Filter"
| and "apply rules" is the real work, and it's rather
| intractable. Some products embed a DSL to attempt to empower
| users to create these bits themselves.
|
| Presumably the "AI" could write the filters and rules by
| "training" on some data. The AI can't automatically deduce the
| dimensions of your data like 'consider the time and day of this
| request' without instruction from the human. It can't
| conceptualize 'is a holiday/is not a holiday' without input
| from a human.
|
| At the end of the day, just collect the metrics and write a
| small application with whatever alerting rules you want. This
| might not scale, but it might not need to.
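|
| As an illustration of the "collect metrics, filter, apply rules,
| alert" pattern as a small application, here is a sketch in which
| every name (Sample, Rule, evaluate) is hypothetical and the
| collection step is reduced to a list of samples passed in by the
| caller.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    name: str
    value: float


@dataclass
class Rule:
    description: str
    applies_to: str                       # metric name this rule filters on
    is_violated: Callable[[float], bool]  # the hand-written rule itself


def evaluate(samples: Iterable[Sample], rules: List[Rule]) -> List[str]:
    """Filter samples per rule, apply the rule, and return alert messages."""
    alerts = []
    for sample in samples:
        for rule in rules:
            if sample.name == rule.applies_to and rule.is_violated(sample.value):
                alerts.append(f"{rule.description}: {sample.name}={sample.value}")
    return alerts
```

| The human still supplies the dimensions that matter through
| applies_to and is_violated; nothing here is learned from data.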
| eric_khun wrote:
| Datadog works really well for us. Their "watchdog" feature, which
| monitors every metric you send to them using anomaly detection,
| helped us uncover many issues, with a low percentage of false
| positives.
|
| Kudos to their AI team (but yeah, Datadog is hella expensive)
| that_guy_iain wrote:
| I'm actually working on an AI-based monitoring product[1] that I
| plan on launching next month. Mine is based on anomaly detection
| for ecommerce systems with the emphasis on anomalies with the
| orders. My take is, normally IT just look to see if they're still
| processing payments and still taking orders. They'll often have a
| dashboard within their office where they can see the orders and
| go check that dashboard manually throughout the day. Any process
| that has manual actions is prone to failure. They don't look for
| 6 hours or maybe forget for a few days. But even then, if you're
| looking to ensure that you're still accepting orders, you may be
| accepting orders, but a random feature may be reducing the sales.
| An example that I've been told happens a lot, the dropdown to
| select which variant you want may break. There are also other
| things other than technical errors that could affect sales, such
| as changing the layout, it may all work but conversion rates
| could drop. Noticing a 25% drop in sales is a hard thing to
| notice by eye and often something that will only get picked up
| weeks later and cause a whole bunch of painful meetings.
|
| Overall, I think anomaly detection is a major thing we need in
| monitoring. I generally want to have it with my logging which is
| where I got the idea for my project, and I will be extending it
| to work with my logging system even if it's just for me. There
| are so many issues in legacy production systems where if it
| happens once that's ok, but if it happens 100 times, it's not ok.
| Writing rules for all of these seems near impossible if you have
| a large startup system you need to make reliable.
|
| [1] https://www.ootliers.com
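|
| The "once is ok, 100 times is not" case can at least be written
| as a rule per event type. A sketch, assuming timestamps arrive in
| non-decreasing order as seconds; the class name and parameters
| are hypothetical and this is not how the product above works.

```python
from collections import deque


class RepeatCounter:
    """Alert when an event occurs at least `limit` times within `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self._timestamps = deque()

    def record(self, timestamp: float) -> bool:
        """Record one occurrence; return True if the limit is now reached."""
        self._timestamps.append(timestamp)
        # Drop occurrences that have fallen out of the sliding window.
        while self._timestamps and self._timestamps[0] <= timestamp - self.window:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.limit
```

| The difficulty the comment points at is not this code but
| choosing limit and window for every distinct error in a large
| legacy system.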
| NicoJuicy wrote:
| Where did you get your dataset from?
|
| What could ootliers detect in a B2B ecommerce platform? (I'm
| building one, that's why.)
| Bombthecat wrote:
| Probably past data, and I think calling that AI is a bit of a
| stretch if you are just looking for outliers / deviation from
| standard.
| NicoJuicy wrote:
| Wasn't calling it AI though, machine learning has anomaly
| detection ;)
| that_guy_iain wrote:
| Yea it's from past data. I also think it's a stretch calling
| it AI but everyone else does so I just go with it and the
| fact it's a buzzword.
| jacques_chester wrote:
| I'm aware of another company with what seems like a comparable
| value proposition: https://outlier.ai/
| that_guy_iain wrote:
| Nice! I actually think it's better that other people also
| think this idea is worth building. Especially if they're
| targeting a different niche.
| jacques_chester wrote:
| I agree. It's useful to have competitive validation. I hope
| you succeed.
| teekert wrote:
| Well, now you know why it will fail ;)
|
| Kidding of course. In fact I tend to skip titles like this
| because actually they are not true and that is often even
| admitted in the text. It annoys me to no end.
|
| Good luck with your AI-based monitoring startup!
| that_guy_iain wrote:
| > Well, now you know why it will fail ;)
|
| I'm preparing my CV already, got to think of the best
| sounding job title to impress for my next gig. :)
| qayxc wrote:
| The article doesn't target _all_ monitoring products.
|
| The author is talking specifically about _IT infrastructure_
| monitoring. Anomaly detection in business processes (like in
| your case orders) is something else entirely.
| bassdigit wrote:
| Look, three paragraphs beginning with "Look," give the article a
| rather condescending tone.
| exporectomy wrote:
| That grates on me too. I translate it as "Shut up and listen."
| which causes me to try to disengage from the speaker. Though,
| here it's aimed at a 3rd party, so it's not quite as offensive.
| kalal wrote:
| Does the tone bother anybody? After reading the first couple of
| sentences I got the impression that the author knows it all and
| everybody else is just stupid... This may be cultural difference,
| or too much sensitivity on my side. I don't know.
| tachyonbeam wrote:
| It's not just you. The article starts out with a very
| inflammatory title. If the author had any regard for effective
| communication, they would begin by explaining what exactly is a
| "monitoring startup", but instead they assume that there can
| really be only one kind of "monitoring". About 3/4 through the
| article, it becomes clear they probably mean monitoring
| servers, and not, say, monitoring a physical location with
| cameras, or monitoring an assembly line or baby monitors.
|
| Then, the meat of the article is basically saying "everyone
| else failed, so you will too. Also, Google can do it better
| than you." It just sounds snarky and arrogant. IMO much of the
| point of startups is to be allowed to take risks. If everyone
| has a conservative attitude that risky ideas shouldn't even be
| tried, progress never happens. What do you think this guy would
| have said when Elon Musk announced his plans to launch SpaceX?
| Good thing some people don't listen to cynical asshats.
|
| Also, two-sentence paragraphs. They convey a fundamental
| misunderstanding of how paragraphs are used.
| htrp wrote:
| You aren't wrong, the author definitely conveys the impression
| of the stereotypical toxic engineer.
| xwdv wrote:
| If it was a comment on HN, it'd be downvoted to hell. Instead
| it's written as an article and soared to the front page. Maybe
| I should write my comments as blogs instead, since I have a
| large amount of downvotes.
| ram_rar wrote:
| > Each team should set up their own monitoring and alerting
| rules.
|
| I agree with the general sentiment of it. But in practice this
| leads to a lot more chaos. There are a lot of benefits to
| templatizing standard metrics for any service that is launched
| with the right SLIs/SLOs. Associating them with high and slow burn
| rates separates issues that need immediate attention from
| something that's slowly eating into your service but doesn't
| need to be acted on right away.
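|
| For readers unfamiliar with burn rates: the burn rate is the
| observed error ratio divided by the error budget (1 minus the SLO
| target). A sketch of the fast/slow split follows; the function
| names, the 99.9% target and both thresholds are assumptions.
| 14.4 is a commonly cited fast-burn value, corresponding to
| spending 2% of a 30-day budget in one hour.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def classify(short_window_errors: float, long_window_errors: float,
             slo_target: float = 0.999,
             fast_threshold: float = 14.4,
             slow_threshold: float = 1.0) -> str:
    """Return "page" for a fast burn, "ticket" for a slow burn, else "ok"."""
    if burn_rate(short_window_errors, slo_target) >= fast_threshold:
        return "page"
    if burn_rate(long_window_errors, slo_target) >= slow_threshold:
        return "ticket"
    return "ok"
```

| A template like this is the kind of standard that could be
| shipped with every new service, with teams overriding only the
| target.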
| helsinkiandrew wrote:
| > "We can predict failures. You don't have to write alert rules
| any more."
|
| If AI monitoring tools can predict potential failures before they
| happen based on existing rules and events - then they have value
| but writing the alert rules defines what you're interested in -
| what is and is not an acceptable error.
| btbuildem wrote:
| A pretty narrow conception of "AI monitoring" -- I'm guessing the
| author is talking about monitoring in context of IT?
|
| Bit annoying tone tbh, like a tadpole in a puddle bragging it's
| got the ocean all figured out.
|
| Failure prediction, process modelling and other applications of
| AI monitoring in an industrial context are more and more
| mainstream these days. It's not just startups trying to pitch,
| it's established players deploying real-world solutions, and
| infrastructure giants providing the building blocks.
| BorisTheBrave wrote:
| > If I have one NOC for all of my services and they're chasing
| their tail trying to figure out root causes and who is
| responsible for each service... you have a management problem.
|
| Is it inconceivable to the author that buying software may be
| easier than fixing a management problem, at least in the short
| term?
| sarusso wrote:
| Curious. I had to read until mid-article to figure out what kind
| of monitoring we were talking about. Because for a lot of other
| monitoring use-cases (e.g. infrastructural, environmental,
| biological etc.) it's another story.
|
| Feels like the article has a strong sysadmin/devops bias. Which
| is fine, but maybe adding "IT" in the title would have been
| better :)
| dicroce wrote:
| I think the problem is that these AI solutions should integrate
| at a lower level. Instead of a giant brain that does everything
| (and screws it up), just give people ML based tools... and let
| people hook them up and decide what they need.
| [deleted]
| kristiandupont wrote:
| Creating solutions looking for problems is a common issue for
| would-be founders and it certainly has been for me for years..
| It's a bit weird to hear the dismissal of an idea that I could
| totally have seen myself run with (and agreeing with said
| dismissal!)
| laichzeit0 wrote:
| Weird. Moogsoft [1] has been around for years and they're an AI-
| based monitoring startup. A lot of other monitoring tools
| (AppDynamics, DynaTrace) incorporate some ML to help with root-
| cause and alarm deduplication. It's not the core of the product
| though, it's more like a helpful feature. I personally never
| found any of that shit to work in practice.
|
| [1] https://www.moogsoft.com/
| phenkdo wrote:
| I think the author is painting with too broad a brush here,
| companies like Pagerduty, Datadog et al have done very well.
|
| Yeah there are many me-too businesses in this space - just like
| any other, and they are probably doomed. I feel the author is
| being too harsh here.
| KaiserPro wrote:
| > Pagerduty, Datadog et al have done very well.
|
| They aren't an AI monitoring company. Pagerduty is a rules
| engine based alerting system. Datadog is nagios 2.0 on the web,
| with really aggressive sales people, and really expensive.
|
| None of them are "feed me your raw data and I'll make sane
| alerts and root cause analysis"
|
| Which is the core argument of the post, if you can't do alert
| routing, or root cause pinpointing then AI isn't going to help
| you. It's like saying that AI is going to make your UX, or
| backend app.
| phenkdo wrote:
| tbf "feed me your raw data and I will..." in _any_ space is
| a bunch of hooey. Nothing special about the monitoring space.
| phenkdo wrote:
| > Pagerduty is a rules engine based alerting system
|
| erm that's AI too.
| kristiandupont wrote:
| Does a user have to write the rules?
| tauwauwau wrote:
| https://en.wikipedia.org/wiki/Expert_system
|
| Manual rule writing and AI are not mutually exclusive.
| NateEag wrote:
| If I have to write the rules myself, I'll do it in a
| language and toolset that are widely-known and well-
| defined, not in incantations that will only work on a
| proprietary system.
| jabl wrote:
| > Datadog is nagios 2.0 on the web, with really aggressive
| sales people
|
| Oh man. At a previous job, I talked with some datadog people
| at a conference. Told them that while their product looked
| interesting, it wasn't really a good fit for our usecase. Few
| weeks later a sales person called me "Hi this is XXX from
| Datadog, remember we talked at conference YYY" (no, I don't
| remember you personally, and I'm absolutely certain you don't
| remember me either, but alas). I told him the same thing,
| that no, I'm sure it's a good product but doesn't fit our
| usecase. Ok, thx, bye. Well, next week he calls me again. No,
| I still haven't re-evaluated datadog, and I still think it's
| not a fit for our usecase. The following week he emails me,
| asking whether I'd like a more in-depth look at their
| product, or if there's somebody else at my employer who could
| be interested. Since it seems like a rehash of our previous
| discussions, I don't bother replying. Week after that, he
| emails me again, with some passive aggressive "still waiting
| for a reply here". Finally, some days later he emails me a
| long whine accusing me of breaking the trust between us by
| not responding to his emails. Seriously? FU.
|
| Maybe Datadog needs some AI in their sales pipeline to figure
| out when they are pissing off potential customers to the point
| they vow to never have anything to do with the company.
| [deleted]
| [deleted]
| trabant00 wrote:
| The article should be taken in the context of system
| administration. And in that context it's on point. When he writes
| monitoring he means alerts that mean something important is
| surely broken and that somebody should wake up right now and log
| on to fight the fire. I myself call that "state monitoring". And
| I find the Nagios model still the best for this.
|
| The other kind I call "trend monitoring". And here you can feed
| data to some program that might detect anomalies, find
| correlations between different data deltas and so on. This can be
| very valuable and obviously a computer can crunch a lot more data
| a lot faster than humans.
|
| For state monitoring I find black box tests to work very well.
| You don't need to understand how the system works, you don't need
| any inside data, just try to use it. Have a bot try to buy the
| product from the web store. It's a lot more reliable than trying
| to deduce the state of the system from how internal components
| act and interact. Again, the scope here is to alert somebody that
| action is surely needed ASAP.
|
| Ofc you also need trend monitoring, you don't want to wait until
| you're out of space when you can project from the current growth
| trend. But in keeping with that example a data import will always
| trick the AI into thinking we're heading for disaster so it's not
| very reliable as a fire alarm.
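|
| The disk-space example can be sketched as a straight-line
| projection fitted by least squares. The function name and the
| choice of a linear fit are assumptions for illustration; samples
| are (hour, used) pairs in any consistent unit.

```python
def hours_until_full(samples, capacity):
    """Project when usage reaches capacity using a least-squares line.

    samples: list of (hour, used) pairs; capacity in the same unit as used.
    Returns hours counted from the last sample, or None if usage is flat
    or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / denom
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    last_t = samples[-1][0]
    return (capacity - intercept) / slope - last_t
```

| With steady growth of 2 units per hour from 10 towards a capacity
| of 100, the projection after four samples is 42 hours. Replace the
| last sample with a one-off jump to 60 (a data import) and the
| projection collapses to about 3.5 hours, which is the false fire
| alarm described above.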
| coding123 wrote:
| I think one AI that would be nice to have is something to sell to
| AWS: A tool that examines data that is publicly on S3 and tries
| to analyze if it's sensitive data or just media assets, etc... S3
| seems to be our largest source of data breaches sadly.
|
| There's a whole branch of "security" AI that needs more
| exploring: shell analysis (hacker or not), IP connection
| analysis, change analysis. lots of crap.
| michaelbuckbee wrote:
| This is part of what Polyrize (now Varonis) does -
| https://www.polyrize.com/
| panpanna wrote:
| You don't need ai for that.
|
| Can't find the link, but there are a few projects doing just
| that.
| melomal wrote:
| It's not far off to be fair:
| https://www.ft.com/content/21b19010-3e9f-11e9-b896-fe36ec32a...
|
| Again, buzzwords and whatever gets money in the bank is what
| leads tech development. Which is probably why we have stagnated
| on new, radical ideas or at least they are not nearly enough
| getting awareness.
| aaron695 wrote:
| > Which is probably why we have stagnated on new, radical ideas
| or at least they are not nearly enough getting awareness.
|
| We don't need new radical ideas. Almost nothing in IT has been
| new for the past 30 years (Exception - Blockchain)
|
| We do prioritise badly. We don't acknowledge real problems. We
| do consistently pour dev money into scam technologies. We do
| run around in circles pretending we haven't been here before
| (like the article mentions)
|
| It's the same as any industry. Like quack medicine, and real
| medicine which is mostly wrong. But IT should have the ability
| to fix itself at a faster rate.
| sgt101 wrote:
| >Almost nothing in IT has been new for the past 30 years
|
| eermm !
|
| - iPhones came out in 2007 and it took maybe 5 years for mass
| adoption of mobile devices to do work tasks (see iPads in
| bank branches as an example). This is new.
|
| - This year the workforce in much of the world has gone from
| 90% in the office to 99% at home. This is new.
|
| - Databases have gone from 10s of MB to 1000s of GB, and from
| scores of tables to tens of thousands of tables.
|
| - Cloud computing; when was it that your company outsourced
| its entire IT infrastructure? I bet it wasn't 30 years ago!
|
| - Outsourcing itself, 30 years ago very few people had lumps
| of their IT done by SI's, this is a huge change that everyone
| has just got used to.
|
| - Security and tfa and firewalls and all that jazz: 30 years
| ago this was all terribly naive and often implemented with
| airgaps. I remember one project doing billing automation at a
| power utility where we installed networking, workstations (in
| branches) and built a completely independent network out from
| the mainframes. It did not touch the internet, sign on was
| via user name and password only with no changing policy. I
| don't even remember seeing a security policy!
|
| - End user consumption; this is big. 30 years ago the c-suite
| never saw emails. Their secretaries printed them out. Execs
| didn't type, or look at screens. There was a weekly report
| and it was given out on friday, reviewed at the weekend and
| discussed / actioned monday morning. Now every exec has a
| notebook, and a phone (see above) they want dashboards, they
| want to explore the data themselves - and they do do that
| (often badly). Their level of interaction and requirement
| from IT is orders of magnitude different from 30 years ago.
|
| - Lack of cash. Every company outside China with the
| exception of FAANGS has had the cash squeezed out of it by
| investors in the last 30 years. 30 years ago utilities and
| investment companies sat on what (in hindsight) were rivers
| of cash. Funding was so heavy that there were special offices
| set up to oversee its dispersal (who else remembers central
| program offices?) Now projects are managed in line and are
| part of 0 build budgets (start every year with everyone on 0
| budget and make them make the case to add $1 at a time). It
| is incredible to think of how we used to act in projects and
| how empowering and facilitating of progress and quality it
| was - but also how expensive. People would think nothing of
| spending 10 or 20 times their yearly salary on machines and
| software - there was no motivation to imagine what else could
| be done with the cash! Of course all that spending floated
| great chunks of the economy along with the spend going into
| other people's pockets. It was another world.
| melomal wrote:
| But tablets and touch screens were out way before Apple
| released anything. Apple makes things pretty therefore it
| creates mass adoption. Aesthetics is not a tech
| advancement.
|
| A remote workforce uses Skype, a laptop and the internet.
| As far as I can tell these have been around for years.
|
| Cloud computing is marketing lingo for shared servers OR if
| you have the budget your own dedicated server. Nothing new
| by any means.
|
| It sounds like you agree with my realization which is that
| we need to get the most out of the existing tech rather
| than moving onto something new and shiny.
| pjc50 wrote:
| This is really a fight over the definition of "new",
| isn't it? The transistor was something you can point to.
| "Decades of incremental process change" isn't, but makes
| just as big a difference.
| sgt101 wrote:
| In my defence the new bit was "in IT" not technology
| generally. IT is the operation of information
| infrastructures by major enterprises. And the timeline
| was 30 years. Laptops were very, very, very uncommon 30
| years ago!
|
| Cloud computing isn't just shared servers - it's a shared
| application infrastructure. It's extremely different from
| the old mainframe model.
|
| I have to say I do think that operations could be better
| without new silver bullets, but the scale and pressures
| on IT and business operations in general are dragging
| them to a precipice. Things are now so complex and
| demanding that at some point we may well see major
| organisations fray and disintegrate because they just
| can't run themselves anymore. Something close happened
| with the cyber attacks on Maersk and the NHS a few years
| ago - but from what I've seen it's quite possible that
| someone big will get themselves into such a tangle that
| they suffer a complete operational breakdown without an
| attack.
| melomal wrote:
| I agree on the Cloud app infrastructure but again it's
| more or less the 'aesthetics/experience' of it that has
| made it so widespread.
|
| From the way the UK has been handling basic aspects of IT
| such as the spreadsheet that ran out of rows and missed a
| lot of Covid tracking data, the basics are yet to be
| figured out.
| framecowbird wrote:
| I know this isn't the point of your post, but is blockchain
| all that new? Distributed ledgers already existed in the 90s,
| pretty close to your 30 year cutoff.
| aaron695 wrote:
| I picked 30 years over 40 years because then you get a few
| more things like spreadsheets around the 80's (79/80)
| hasa wrote:
| If a fundamentally non-scaling signed linked list is worth
| bringing up here, there are perhaps plenty of other things
| that could be mentioned as new.
| exporectomy wrote:
| I think cryptocurrency is a fundamentally new invention
| because despite many people's efforts, nobody was able to
| achieve decentralized digital currency before Bitcoin.
|
| Just the fact that a blockchain doesn't scale in an
| obviously good way is probably part of what confused
| people into not inventing it sooner. Surely if total
| storage requirement increases without bound, it can't
| possibly run forever! Well, humanity won't run forever
| either. It just has to last long enough to be useful
| while it exists.
| melomal wrote:
| Actually you are right there, we do not need new radical
| ideas. We need creative solutions to the existing
| infrastructures we have put in place.
|
| Come to think of it maybe we haven't stagnated and in fact
| are a little overzealous with the creation of a new
| "language/tech/platform" every week that is going to beat
| [insert perfectly fine dev language].
| yamrzou wrote:
| Reminds me of this comment, which gave me a good chuckle:
| https://news.ycombinator.com/item?id=23534048
| melomal wrote:
| That is gold!
___________________________________________________________________
(page generated 2021-01-26 23:02 UTC)