[HN Gopher] Stochastic gradient descent written in SQL
___________________________________________________________________
Stochastic gradient descent written in SQL
Author : Lemaxoxo
Score : 230 points
Date : 2023-03-07 12:57 UTC (10 hours ago)
(HTM) web link (maxhalford.github.io)
(TXT) w3m dump (maxhalford.github.io)
| AndrewKemendo wrote:
| This is genius, I love it.
|
| On-stream inference over something like "continuous" data in the
| same structure is, I think, the final material form for "AI", so
| this is a great step towards that.
|
| Thanks for the writeup
| yza wrote:
| Next step would be to implement autodiff. If only PostgreSQL had
| CUDA support.
| Rickasaurus wrote:
| Please don't do this in prod
| dventimi wrote:
| Many ML models undergo offline training, and some online
| training is just fine-tuning a model that was pretrained offline.
| Hence, I see no particular reason to worry about doing this "in
| prod."
| agnosticmantis wrote:
| Just to disambiguate, here machine learning = linear regression.
| So no auto-grad needed as gradients are explicit and simple.
| Still interesting though.
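|
| (For squared loss on one row, with prediction \hat{y} = w^\top x,
| that explicit gradient is just
|
|     \nabla_w \, (\hat{y} - y)^2 = 2\,(\hat{y} - y)\,x
|
| which is all the update rule needs.)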
| saydaark wrote:
| This is interesting, but it's not using the GPU.
| kilgnad wrote:
| An even MORE useful idea is to do this in CSS.
| hubja wrote:
| Super interesting! Love the scrappy mindset ;)
| Lemaxoxo wrote:
| Cheers, I learnt with the best :)
| simonw wrote:
| I tried to replicate this in SQLite. The first few steps worked
| OK, e.g.
|
| https://lite.datasette.io/?json=https://gist.github.com/simo...
|
| (I replaced "figures" with "raw" due to the way Datasette Lite
| assigns a default table name to the imported JSON)
|
| But the more complex recursive queries gave me this error and I'm
| not sure how to work around it:
|
|     recursive reference in a subquery: state
|
| E.g.
| https://lite.datasette.io/?json=https://gist.github.com/simo...
| Lemaxoxo wrote:
| That's too bad, I would have expected it to work out of the
| box. Other than rewriting the query in a different way, I'm not
| sure I see an easy workaround. Are you still working on this?
| andrenotgiant wrote:
| This is really interesting, but a basic part I don't understand:
| What would it actually look like to run this on a live dataset?
|
| If I understand correctly: you'd run the recursive query, it
| produces results for every step, effectively showing you the
| progression of output over time, and then once it hits "present
| day", it completes and stops?
|
| How would you generate results going forward? I.e., a minute
| elapses after the results return: do you have to re-run the whole
| query for all time?
| [deleted]
| Lemaxoxo wrote:
| Good question. I touched upon this in the conclusion.
| Basically, if you run this in a streaming SQL database, such as
| Materialize, then you would get a true online system which
| doesn't restart from scratch.
| noloblo wrote:
| Does this work with Postgres?
| Lemaxoxo wrote:
| Postgres has excellent support for WITH RECURSIVE, so I see
| no reason why it wouldn't. However, as I answered
| elsewhere, you would need to set some stateful stuff up if
| you don't want the query to start from scratch when you re-
| run it.
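|
| (To illustrate the recursive-CTE pattern, here's a toy one-
| dimensional sketch that should run on Postgres as-is; the
| constants and the loss f(w) = (w - 3)^2 are made up, not the
| post's actual query:)
|
|     WITH RECURSIVE steps (step, w) AS (
|         SELECT 0, 0.0::double precision      -- initial weight
|         UNION ALL
|         SELECT step + 1,
|                w - 0.1 * 2 * (w - 3)         -- w := w - lr * f'(w)
|         FROM steps
|         WHERE step < 50
|     )
|     SELECT step, w FROM steps ORDER BY step DESC LIMIT 1;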
| sergius wrote:
| Did you try running it using DuckDB?
| Lemaxoxo wrote:
| DuckDB is what I used in the blog post. Re-running this
| query simply recomputes everything from the start. I didn't
| store intermediary state that would allow starting off from where
| the query stopped. But it's possible!
| hobs wrote:
| In a non-streaming DB, what would prevent you from storing the
| result set and just using the last iteration to calculate the
| next?
| swasheck wrote:
| i'm trying to learn here so please pardon my ignorance.
| wouldn't a pre-aggregated set affect the new aggregate
| result? i suppose you could store avg, sum, count and then
| add the new value(s) to sum, new count to count, and
| recalculate average from that. or even just avg and count
| and then re-aggregate as ((avg*count)+new values)/(count +
| new values count) but i didn't know if there's a better way
| to process new values into a set of data that's already
| been aggregated
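|
| (something like this merge step is what i had in mind; table
| and column names are made up:)
|
|     -- fold a new batch into stored (sum, count) instead of
|     -- re-aggregating the whole history
|     UPDATE running_stats r
|     SET sum_x = r.sum_x + b.sum_x,
|         n     = r.n + b.n
|     FROM (SELECT sum(x) AS sum_x, count(*) AS n
|           FROM new_batch) b;
|
|     -- avg is then just sum_x / n, recomputed on read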
| hobs wrote:
| Yep that would be what I would do - effectively a
| functional approach where no memory is required besides
| the current set and iteration.
|
| A big part of materializing datasets for performance is
| finding the right grain that is both easy enough to
| calculate and also can do nice things like be resumable
| for snapshots.
| Lemaxoxo wrote:
| This is definitely a possibility. What I meant to say is
| that the implementation in the blog post doesn't support
| that.
| hobs wrote:
| Makes sense, just an insert instead of a select. I used a
| similar approach with Newton-Raphson to implement XIRR in
| SQL and it worked well.
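|
| (Not my actual XIRR query, but the iteration pattern looks
| roughly like this; here it solves x^2 = 2 instead:)
|
|     WITH RECURSIVE newton (i, x) AS (
|         SELECT 0, 1.0::double precision
|         UNION ALL
|         SELECT i + 1,
|                x - (x * x - 2) / (2 * x)   -- x := x - f(x)/f'(x)
|         FROM newton
|         WHERE i < 10
|     )
|     SELECT x FROM newton ORDER BY i DESC LIMIT 1;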
| episode0x01 wrote:
| From a comment by the author:
|
| > Also, I can't justify why, but my gut feeling is that the
| database should be an order of magnitude faster than Python,
| provided the implementation were to be improved.
|
| Would be curious how that could end up being the case. Perhaps if
| NumPy wasn't used at all? That would mean no vectorization or
| optimized storage.
|
| Would be interesting to see how it scales with length and
| dimensionality.
| jackschultz wrote:
| >A machine learning algorithm which can be trained using SQL
| opens a world of possibilities. The model and the data live in
| the same space. This is as simple as it gets in terms of
| architecture. Basically, you only need a database which runs SQL.
|
| First paragraph of the conclusion, and this very much fits with
| the mindset that's been growing in me in the data world over the
| past few years. Databases are much more powerful than we think,
| they're not going to go away, only get better, and having the
| data and the logic in the same space really removes tons of
| headaches. ML models, transformation of data, generating JSON for
| an API can all be done within the database rather than in an
| outside scripting language.
|
| Are others seeing this? Is the current tradeoff just that more
| people know Python than SQL or database-specific languages, to
| the point where moving logic to Postgres or Snowflake is looked
| down on?
| noloblo wrote:
| +1, SQL is extremely elegant, composable, and underrated.
|
| Postgres is very powerful. While I took a short detour into
| NoSQL MongoDB land, I'm now back in MySQL/PostgreSQL territory
| and glad for it.
|
| Being able to generate views and stored procedures is useful as
| well. Having SQL take over more, like ML and gradient descent,
| does open up good possibilities.
|
| Also, since SQL is declarative, it's rather easier than
| imperative scripting languages.
| naasking wrote:
| SQL has some positives but it is not composable. At all. This
| is because relations are not first-class values in SQL.
| dventimi wrote:
| Is a query not a relation?
| naasking wrote:
| Basically, but queries are not first class in SQL. You
| can't assign a query to a variable, or pass it as a
| parameter to a stored procedure, for example. This would
| make SQL composable:
|
|     declare @x = (select * from Person)
|     select Name from @x where Birthdate < '2000-01-01'
| noloblo wrote:
| Exactly. Since it's declarative, the style lends itself to
| using stored procedure calls to become composable.
| naasking wrote:
| You cannot abstract over stored procedures either, so
| that's still not composable.
| dventimi wrote:
| No idea what this means
| naasking wrote:
| Think of first-class functions. Can't do that with stored
| procedures, just like you can't do that with queries or
| tables in SQL.
| alphanumeric0 wrote:
| Furthermore, stored procedures/functions are not queries.
| dventimi wrote:
| Stored procedures can be relations. Queries are
| relations. Ergo, stored procedures can be queries.
| Lemaxoxo wrote:
| Isn't that the point of common table expressions (CTEs)?
| naasking wrote:
| See: https://news.ycombinator.com/item?id=35058927
| Foobar8568 wrote:
| You can do composition through CTE or Table-Valued
| Function.
| naasking wrote:
| CTE and TVF still treat tables as second class citizens,
| so while they enable some forms of composition, they're
| still very restricted. This has been the consistent story
| with SQL, 15 ways to use queries and tables, all to work
| around the limitation that they are not first class
| values.
| datan3rd wrote:
|     with persons as (select * from Person)
|     select Name from persons where Birthdate < '2000-01-01'
| naasking wrote:
| Where is the assignment to a variable? Where can you
| construct a query using a variable in table/query
| position? That's the whole point of being first class and
| composable, a query becomes like any other value so you
| should be able to parameterize any query by another query
| assigned to a variable that may have been set inside an
| if-statement, or accepted as a parameter to a stored
| procedure. You know, the same kinds of composition we see
| in ordinary programming languages.
| chrstr wrote:
|     create table x as (select * from person);
|     select name from x where ...;
|
| there you go, just configure your editor to display
| "create table x" as "declare x = " ;)
|
| or even a version with lazy evaluation:
|     create view x as (select * from person);
|     select name from x where ...;
| naasking wrote:
| You're still not getting it. First-class status means
| that anywhere a value or variable can be used, a query or
| table should also be able to appear, and vice versa. This
| means a table or query can appear as a return type, a
| parameter to a stored procedure or query, a variable, and
| so on.
|
| SQL just does not have this, it instead has 15 different
| second class ways to handle tables and queries that try
| to make up for the fact that they are not first-class
| values. These include CTEs, table valued functions,
| views, etc.
| alphanumeric0 wrote:
| What if I wrote a very long, complicated query that I'd
| like to test against different tables (like test tables),
| and let's say I can't use stored functions or procedures.
| How could I pass different tables to my query?
| chrstr wrote:
|     CREATE TABLE data_a AS (SELECT 'a' AS test_case, 1 AS value);
|     CREATE TABLE data_b AS (SELECT 'b' AS test_case, 2 AS value);
|     CREATE VIEW data AS
|         (SELECT * FROM data_a UNION ALL SELECT * FROM data_b);
|     CREATE VIEW complicated_query AS
|         (SELECT test_case, value+1 FROM data);
|     SELECT * FROM complicated_query WHERE test_case = 'a';
|     SELECT * FROM complicated_query WHERE test_case = 'b';
| alphanumeric0 wrote:
| Nice, that is what I was looking for. Of course, it'd
| need to point to production data as well, so maybe
| test_case is null, in that case:
|
|     CREATE TABLE data_a AS (SELECT 'a' AS test_case, 1 AS value);
|     CREATE TABLE data_b AS (SELECT 'b' AS test_case, 2 AS value);
|     CREATE TABLE data_prod AS
|         (SELECT NULL AS test_case, prod_table.value FROM prod_table);
|     CREATE VIEW data AS
|         (SELECT * FROM data_a UNION ALL
|          SELECT * FROM data_b UNION ALL
|          SELECT * FROM data_prod);
|     CREATE VIEW complicated_query AS
|         (SELECT test_case, value+1 FROM data);
|
|     -- when testing
|     SELECT * FROM complicated_query WHERE test_case = 'a';
|     SELECT * FROM complicated_query WHERE test_case = 'b';
|
|     -- when in 'production'
|     SELECT * FROM complicated_query WHERE test_case IS NULL;
| naasking wrote:
| You just reinvented defunctionalization, which is a
| transformation from a domain that has first-class values
| to a domain where support is only second-class.
| Defunctionalization is typically used in programming
| languages to simulate first-class functions in languages
| where functions are only second-class citizens, like C
| and Pascal.
|
| This perfectly illustrates my point. You had to manually
| defunctionalize your data model and queries to support
| what I'm saying should be inherently part of SQL.
| dventimi wrote:
| If you can't use stored procedures, which are good for
| this very case, many databases offer dynamic SQL. That
| might work in some cases.
| alphanumeric0 wrote:
| That's a good point. This would rule out SQLite for me.
| Foobar8568 wrote:
| TVF (Table-Valued Function) with CROSS APPLY.
| adeelk93 wrote:
| Check out dbt - it's a great tool for organizing queries
| and solving such patterns
| simonw wrote:
| CTEs go a long way to making SQL more composable.
| noloblo wrote:
| Postgres has read/write parallelism that can scale across
| millions of reads and writes; if the ML model is inherent in the
| Postgres DB it is indeed very elegant. Reminds me of the glory
| days of tsearch2, doing text search in Postgres for searching
| our internal document repository using nothing but Postgres,
| and boy was it faster than Java-based search systems.
| somat wrote:
| A DBMS is really its own operating system; usually it is hosted
| on another operating system, one that understands the hardware.
|
| I remember one place I worked where we had several old
| graybeard programmers who considered the dbms[1] the operating
| system. As a unix sysadmin I had some interesting discussions
| with them, as I was always confused and confined by the dbms,
| and they felt the same about unix.
|
| 1. Unidata, if curious: a weird multi-value (not relational)
| database, very vertically integrated compared to most databases
| today.
| CreRecombinase wrote:
| I spend most of my time in the parallel universe that is
| scientific computing/HPC. In this alternate reality SQL (not to
| mention databases) never really took off. Instead of scalable,
| performant databases, we have only the parallel filesystem. I'm
| convinced the reason contemporary scientific computing doesn't
| involve much SQL is sociological/path-dependency, but there are
| also very good technical reasons. Optimizing software in
| scientific computing involves two steps: 1) think hard about
| your problem until you can frame it as one or more matrix
| multiplications; 2) plug that into a numerical linear algebra
| library. The SQL abstraction (in my experience) takes you very
| much in the opposite direction.
| dchftcs wrote:
| There is a recent trend in database research: ML in databases.
| Not sure how much of an impact it can make, though; the sweet
| spot is doing relatively linear stuff, arguably just a step up
| from analytical functions in queries, while cutting-edge ML
| needs dedicated GPUs for the compute load and often uses
| unstructured and even binary data.
| Scubabear68 wrote:
| With "traditional" RDBMS systems, putting a lot of code in the
| DB lead to a lot of scaling issues where you'd need gigantic
| machines with a lot of RAM and CPU: because the DB was doing so
| much work. It was expensive and clunky to get HA right.
|
| In more modern DBs being distributed horizontally, this
| approach may see a rebound. The big "but" is still costs, in my
| experience in AWS as an example, managed Postgres Aurora was
| surprisingly expensive in terms of monthly cost.
| dventimi wrote:
| Why would running code in a database process be intrinsically
| more expensive than running it in some other process?
| yunwal wrote:
| The issue is that many relational databases are not
| horizontally scalable, so you want to be frugal with their
| resources.
| dventimi wrote:
| The resources I'm familiar with are I/O, memory, and CPU.
| The only one I believe can be spared in the database by
| using that resource outside of the database, is CPU. When
| the database is far from saturated on CPU and latency and
| throughput are determined by I/O and memory, using CPU on
| some other machine that isn't the database can't possibly
| have any impact on latency and throughput.
| yunwal wrote:
| > When the database is far from saturated on CPU
|
| The issue here is that if you scale enough to saturate the
| database, you'll have to rewrite essentially all your
| code if you're a typical CRUD webapp. Basically all of
| your business logic is about data retrieval. There's
| probably some companies that can get away with this, but
| it would be way too expensive for most.
| dventimi wrote:
| If I scale up to saturate the database CPU...by doing
| data retrieval? Setting aside my skepticism about
| saturating the CPU with mere data retrieval, how is that
| solved by moving the data to another host's CPU, when
| moving the data involves the very data retrieval that's
| saturating the database's CPU?
| riku_iki wrote:
| you can easily scale your app layer linearly across multiple
| machines, while it is harder with previous-gen databases.
| dventimi wrote:
| What good is that going to do if the data always resides
| in the database?
| bob1029 wrote:
| > Databases are much more powerful than we think
|
| The older I get the more I agree with this.
|
| There is nothing you cannot build by combining SQL primitives.
| Side effects can even be introduced - on purpose - by way of
| UDFs that talk to the outside world.
|
| I've seen more than one system where the database itself was
| directly responsible for things like rendering final HTML for
| use by the end clients. You might think this is horrible, and
| indeed many manifestations are. But there lurks an opportunity
| for incredible elegance down this path, assuming you have the
| patience to engage in it.
| giraffe_lady wrote:
| > I've seen more than one system where the database itself
| was directly responsible for things like rendering final HTML
| for use by the end clients.
|
| I did this for a side project a few months ago and even used
| postgrest to serve the page with correct headers for html. It
| felt simultaneously really cursed and obvious. Shit you could
| even use plv8 to run mustache or whatever in the db if you
| really wanted to piss people off.
| dventimi wrote:
| I'm doing this right now with a DO droplet built with
| PostgreSQL, postgrest, nginx, and not much else. Do you
| have any tips, tricks, or blog posts you can share based on
| your experience? You should post it to HN. Strike while the
| iron's hot. With a little luck you'll hit the front page.
| giraffe_lady wrote:
| Nah nothing that refined. Pretty predictably supabase is
| doing some weird stuff along these lines and I found some
| abandoned and semi-active repos associated with them and
| people working for them that were useful examples of some
| things.
|
| As for posting to HN absolutely no thanks. These people
| are so fucking hostile there is no accomplishment too
| small to tear apart for entertainment here. I have no
| interest in it.
| dventimi wrote:
| > As for posting to HN absolutely no thanks. These people
| are so fucking hostile there is no accomplishment too
| small to tear apart for entertainment here. I have no
| interest in it.
|
| You make a really good point.
| nerdponx wrote:
| In the data warehouse / OLAP space, I think we are heading
| towards a world where the underlying data storage consists of
| partitioned Arrow arrays, and you can run arbitrary UDFs in
| arbitrary languages on the (read-only) raw data with zero copy
| and essentially no overhead other than the language runtime
| itself and marshalling of the data that is emitted from the
| UDF.
|
| Something like the Snowflake data storage model + DuckDB as an
| execution engine + a Pandas/Polars-like API.
|
| There is no reason why we have to be stuck with "database ==
| SQL" all the time. SQL is extremely powerful, but sometimes you
| need a bit more, and in that case we shouldn't be so
| constrained.
|
| But in general yes, the world is gradually waking up to the
| idea that performance matters, and that data locality can be
| extremely important for performance, and that doing your
| computations and data processing on your warehouse data in-
| place is going to be a huge time and money saver in the longer
| term.
| roflyear wrote:
| > Are the current tradeoffs just that more people know python
| vs sql or database specific languages to where moving logic to
| postgres or snowflake is looked down on?
|
| Yes, mostly development, deployment, etc. concerns. I haven't
| ever seen an org that versions their SQL queries unless they
| are in a codebase. The environment is just unfriendly towards
| that type of management. Never mind testing! These things have
| solutions, but we haven't matured enough here, because that
| type of development has been happening in application code.
|
| Also, SQL is generally more complex than application logic,
| because they are designed to do different things. What is a
| simple exercise in iteration or recursion can easily become
| something more of a headache.
|
| Problems that can be resolved, but they are problems.
| _a_a_a_ wrote:
| Self-proclaimed database expert here. What a database is good
| at depends on what you're trying to get that database to do, at
| least in part.
|
| Break it into pieces: elegance and efficiency. These will
| correspond to a logical statement of what you're trying to do,
| and how quickly the database will actually do it in practice.
|
| SQL can do some nice things in areas, making it elegant in
| those areas. Elsewhere it can be pretty wordy and ugly.
|
| In efficiency, it comes down largely to how the database is
| implemented and that also includes the capability of the
| optimiser. Both of these are out of your control. In my
| experience trying to turn a database into a number cruncher is
| just not going to work.
|
| I guess that's long way round of me saying that I don't think I
| agree with you!
| maCDzP wrote:
| I agree. I took a course in databases and SQL and was blown
| away by its power. With CTEs and PL/SQL you can do a lot of
| stuff inside the database.
|
| I played with SQLite and its JSON columns. Once you get the
| hang of the syntax for walking a JSON structure you can do all
| sorts of neat things. Doing the same thing in Python would be
| tedious.
|
| And I also believe it ended up being way faster than what I did
| in Python.
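|
| (For a flavor of it, a minimal SQLite sketch; the table and
| column names are made up:)
|
|     -- expand each key/value pair of a JSON column into rows
|     SELECT j.key, j.value
|     FROM docs, json_each(docs.body) AS j
|     WHERE docs.id = 1;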
| ISL wrote:
| The challenge I, with limited knowledge, see with developing
| detailed algorithms in SQL is a lack of good testing,
| abstraction, and review tooling. Similarly for a lot of the
| user-defined-functions for the larger data warehouses
| (redshift, bigQuery, etc.)
|
| dbt solves a lot of it, but I'd love to learn more about good
| resources for building reliable and readily-readable SQL
| algorithms for complex logic.
| AndrewKemendo wrote:
| I mean, this is an opportunity, right? Build those things.
| Lemaxoxo wrote:
| I agree. Databases are going to be here for a long time,
| and we're barely scratching the surface of making people
| productive with them. dbt is just the beginning.
| adeelk93 wrote:
| I've supported 3 different models over the years with
| inference implemented in SQL. First one I inherited, loved it
| so much that I implemented it twice again. Amazingly fast for
| TBs of data and no waiting on other tech teams.
|
| That tooling you're describing is definitely not there.
| Bigquery has BQML but it's very much in its infancy. I tend
| to do all the modeling in Python/R on sampled data and then
| deploy to SQL.
| fithisux wrote:
| BQML should become standard.
| actionfromafar wrote:
| I have yet to see a decent IDE or system which allows great
| version control, unit testing and collaboration with SQL source
| code.
|
| So I think a lot of the reluctance is from practical concerns.
| noloblo wrote:
| We store individual SQL files in GitHub and keep them in
| separate folders.
|
| This is very simple and scales well for our purposes.
| jhd3 wrote:
| > Databases are much more powerful than we think
|
| and data has mass. One example of bringing the work to the data
| is https://madlib.apache.org/ (works on Postgres and Greenplum)
|
| [Disclaimer - former employee of Pivotal]
| tomrod wrote:
| Absolutely.
|
| Consider too that PostgreSQL databases support different
| languages, like Python.
|
| Loads of for-profit companies have tried to cash in on this.
| SAP HANA is one of the ones I've had recent experience with. It
| is unfortunately a poor implementation. The right architecture
| tends to be: put your model behind an API interface, not
| internal on the system. Train your model separately from
| production systems, and so on.
|
| You might also be interested in checking out MLOps platforms
| like Kubeflow, Flyte, and others.
| 7thaccount wrote:
| Database first designs make a lot of sense in a lot of ways.
| I've worked for a company with an Oracle database that has SQL
| scripts that do all the validation and create text files for
| downstream usage. I think it makes more sense than a ton of
| Java, but there are pros and cons. One con is that SQL is
| relational and the advanced stuff can be extra hard to
| troubleshoot if you don't have enough experience. Even those
| who can't code can usually understand a for loop and can think
| imperatively.
|
| Unfortunately it's an expensive commercial product or I'd
| recommend you look at kdb+ if you work with time series data.
| The big banks use it and essentially put all their latest RT
| data into kdb+, and then can write extremely succinct queries
| with a SQL-like syntax, but with the ability to approach it
| far more programmatically than what is typically doable with
| something like PL/SQL. You can even write your neural network
| or whatever
| code in less than a page of code as the language of kdb+ is
| extremely powerful, although also basically incomprehensible
| until someone puts some time into learning it. It's extremely
| lightweight though, so very easy to deal with in an interactive
| fashion.
|
| All that to say I agree with you that it's nice to just have
| everything you want all in one spot rather than to deal with 4
| different tools and pipelines and shared drives and so on.
| daveguy wrote:
| I would like for this to be the case. I was DBA of a Postgres
| database for LIMS almost 2 decades ago. Back then you could
| code functions for the database to execute on data and it was
| very powerful, but also very clunky. The tools for doing software
| development in the database were not mature at all. A lot has
| changed in the past 20 years and SQL has evolved. Do you think
| SQL will expand that much or there will be APIs built into the
| database? Near-data functions are powerful and useful, but I
| would want my development environment to be more like version
| controlled code than "built-in".
|
| I wonder if near-data functions on small databases are the
| solution to the limits of statelessness that you have with
| functions as a service.
| nonethewiser wrote:
| > Databases are much more powerful than we think
|
| And what people think is a function of attitudes towards
| working at the DB level. I see this often with ORMs in the web
| dev sphere (rather than Data Science). Yes, ORMs are great, but
| many people rely on them to completely abstract away the
| database and are terrified by raw SQL or even query building.
| You also see it with services that abstract away the backend
| like PocketBase, Fireship, etc. Writing a window function or
| even a sub-select looks like black magic to many.
|
| I say this after several experiences with codebases where joins
| and filtering were often done at the application layer and raw
| SQL was seen as the devil.
| boringg wrote:
| Opposite here - don't like ORMs. Too much overhead - though I
| get their value.
| FpUser wrote:
| It is tempting to combine web server, database and some
| imperative language with built in data oriented / SQL features
| in a single executable and call it an application server that
| would communicate with the outside world using for example JSON
| based RPC. I think there were / are some products in the area
| even with the built in IDE (like Wakanda).
| spprashant wrote:
| The problem with throwing everything in a database is you end
| up with brittle stored procedures all over the place, which are
| painful to debug. There is no good support for version control
| or testing, which means you end up creating a dozen copies of
| each function named (sp_v1, sp_v2, ..., etc.). It is much
| harder to practice iterative development, which the rest of
| software development seems to implement effectively.
|
| Also traditional relational databases have a way to go before
| they can support parallelized machine learning workloads. You
| do not have control or the ability to spin up threads or
| processes to boost your performance. You rely on the query
| processor to make those decisions, and depending on your
| platform the results will be mixed.
| 62951413 wrote:
| Junior developers like me were uncomfortable with SQL twenty
| years ago. Java ORM frameworks became popular because of the
| object-relational impedance mismatch. I kind of see the same
| sentiment nowadays among newer generations, but in Python&Co.
|
| The success of the Apache Spark engine can at least partially
| be attributed to
|
| * being able to have the same expressive power as SQL but with
| a real Scala API (including having reusable libraries based on
| it)
|
| * being able to embed it into unit tests at a low price of
| additional ~20 seconds latency to spin up a local Spark master
| justsomeuser wrote:
| I think general programming languages are better for general
| programs than SQL.
|
| Specifically they have: Type systems, compilers, debuggers,
| text editors, package managers, C FFI etc.
|
| But I agree that having the data and the program in the same
| process has benefits.
|
| Writing programs in SQL is one way.
|
| Another way is to move your data to your general program with
| SQLite.
|
| I like using SQL for ACID, and queries as a first filter to get
| the data into my program where I may do further processing with
| the general language.
| Serow225 wrote:
| Another is MS SQL Server, which lets you run .NET on the
| database server :D "you can author stored procedures,
| triggers, user-defined functions, user-defined types, and
| user-defined aggregates in managed code"
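|
| (The wiring looks roughly like this; the assembly and method
| names here are made up:)
|
|     CREATE ASSEMBLY MyLib
|     FROM 'C:\libs\MyLib.dll'
|     WITH PERMISSION_SET = SAFE;
|
|     CREATE FUNCTION dbo.MyFunc (@x INT) RETURNS INT
|     AS EXTERNAL NAME MyLib.[MyNamespace.MyClass].MyMethod;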
| malnourish wrote:
| I have had nothing but bad experiences trying to run .NET
| in SSIS packages -- is there another way?
| Serow225 wrote:
| I've never had the pleasure(ha) of using SSIS, but this
| is the stuff that I was talking about:
| https://learn.microsoft.com/en-
| us/dotnet/framework/data/adon...
| fifilura wrote:
| > Type systems
|
| SQL has types
|
| > compilers
|
| For what specifically do you need a compiler?
|
| > debuggers
|
| Some tasks - like the concurrency SQL enables - are just very
| difficult to debug with debuggers. It would be the same with
| any other language. What SQL does here, though, is allow you
| to focus on the logic, not the actual concurrency.
|
| > text editors, package managers
|
| I feel like these two are just for filling up the space.
|
| > C FFI
|
| Many SQL engines have UDFs
| swyx wrote:
| had a very good chat with https://postgresml.org/ last week
| which is focusing on bringing ML to postgres:
| https://youtu.be/j8hE8-jZJGU
| Lemaxoxo wrote:
| I'm watching it, it's really good. Montana makes a great
| point: you can move data to the models, or move the models to
| the data. Data is typically larger than models, so it makes
| sense to go with the latter.
| fzeindl wrote:
| Fully agree.
|
| And using PostgREST [0] you can serve your PostgreSQL database
| as a REST API. And if you throw foreign data wrappers /
| multicorn into the mix, you can map any other data source into
| your PostgreSQL DB as a table.
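|
| For example, with the stock postgres_fdw (the host and names
| below are placeholders):
|
|     CREATE EXTENSION postgres_fdw;
|     CREATE SERVER remote_srv
|         FOREIGN DATA WRAPPER postgres_fdw
|         OPTIONS (host 'other-host', dbname 'other_db');
|     CREATE USER MAPPING FOR CURRENT_USER
|         SERVER remote_srv OPTIONS (user 'me', password 'secret');
|     IMPORT FOREIGN SCHEMA public
|         FROM SERVER remote_srv INTO local_schema;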
|
| [0] https://postgrest.org/en/stable/
| JUNGLEISMASSIVE wrote:
| [dead]
| [deleted]
| airstrike wrote:
| This is great. The only thing I dislike about this is using
| these variables to try to predict Adj Close when they are not
| at all correlated.
|
| There are countless meaningful correlations in financial data
| that would have been just as easy to play around with. One truly
| valuable example would be to look at trading multiples of
| comparable companies. Sticking to P/E would be easier as P is
| easily observable and forward-looking EPS estimates are generally
| always available. This would limit the exercise to more mature
| companies than the ones commonly discussed on HN but would make
| it actually meaningful.
| Lemaxoxo wrote:
| Do you have some data/resources on this? I'm a total snowflake
| at this, but I'm willing to learn.
| airstrike wrote:
| _In hindsight this is harder than it seems if you don't
| already have access to data, so I'll try to be as informative
| as possible in my response_
|
| It's hard to find this information out there, so here's ~all
| you need to know.
|
| Data is usually behind paywalls, unfortunately. Industry
| standards are Bloomberg terminal (ridiculously expensive, 5
| digits $), FactSet (very expensive, 4 digits), Capital IQ
| (expensive, not sure)... but there are a number of up-and-
| coming startups trying to disrupt the space so you may be
| able to grab data from them. I think https://atom.finance has
| a 7-day free trial you could use to play around with.
|
| P/E simply means the company's _P_rice per share divided by
| _E_arnings per share. Cancel out the "per share" terms and
| you get total market capitalization (which is the value of
| the total equity) divided by net income (since "earnings per
| share" really means "net income per share")
|
| So the "P" is easy to get. It's your Adj Close.
|
| The "E" is trickier as it can mean a lot of things. Diluted
| EPS from financial statements? Last year's EPS? Management's
| guidance for EPS? None of those are actually correct even if
| they are all "EPS"
|
| Importantly--and contrary to 99% of the info you will find
| online--the most relevant EPS numbers are _forward_ estimates
| of EPS, usually for the next twelve months ("NTM"). That is
| based on an average or median of analyst estimates which is
| called "consensus". These are analysts from financial
| institutions who build their own little models based on their
| own views of where the business is going to go, informed by
| recent earnings, management's color in earnings calls and
| filings, etc.
|
| Believe it or not, as hairy as that sounds, EPS is fairly
| easy to get as it's a metric that has less room for
| interpretation than, say, EBITDA.
|
| So you're not going to go out there, read all these (paid)
| analyst reports, find their EPS, calculate the median, etc.
| Bloomberg, Capital IQ, FactSet do this for you and it's
| easily observable for the end user (that's their business).
|
| The thing is, as you may have guessed, "next twelve months"
| is a moving target across time. Analysts usually provide
| estimates for the current fiscal year (i.e. FY 2023, ending
| 12/31/2023 for most companies) and the following year, ending
| 12/31/2024. Let's call these FY0_EPS and FY1_EPS, for
| simplicity
|
| You might be tempted to just take a moving average of these
| two estimates, so that on 1/1/2023 it is 100% of FY0_EPS + 0%
| of FY1_EPS, on 1/2/2023 it is 99.9% + 0.1% and gradually
| "move forward in time" as the days pass. That sort of works
| (and definitely checks the box for a proof-of-concept like in
| your post) but for the sake of completeness, I'll just say
| that the right-er approach is to only "move forward in time"
| when new earnings are released. So it doesn't matter if we're
| in 1/1/2023 or 2/1/2023--what matter is what is the latest
| reported quarter. Take Coca-Cola for instance
| (https://www.bamsec.com/companies/21344/coca-cola-co). Let's
| roll the tape backward one year. They reported FY 2021
| earnings on 2/22/2022, at which point analysts published new
| estimates in revised models, so from that day forward
| until the next quarterly earnings we take 100% FY0_EPS + 0%
| FY1_EPS, in which these correspond to estimates for FY 2022
| and FY 2023, respectively.
|
| On 4/1/2022, Coca-Cola reported Q1 2022 results, analysts
| published new estimates, and we now take 75% FY0_EPS + 25%
| FY1_EPS. On 7/1/2022, we move forward another quarter so 50%
| + 50%, then 25% + 75% starting on 10/26 and then back to
| square one with 100% + 0% except FY0_EPS now means FY 2023 vs
| FY 2022 previously, and FY1_EPS means FY 2024
|
| So your table is something like (I'm making up numbers)
| +------------+--------+-----------+---------+---------+------
| -------+-------------+-------------+ | Date |
| Ticker | Adj_Close | FY0_EPS | FY1_EPS | Period |
| NTM0_Weight | NTM1_Weight | | 01/01/2022 | KO |
| 90.10 | 20.00 | 24.00 | Q1 2022 | 1.00 |
| 0.00 | | 01/02/2022 | KO | 91.14 | 20.00
| | 24.00 | Q1 2022 | 1.00 | 0.00 |
| | 01/03/2022 | KO | 89.30 | 20.00 | 24.00 | Q1
| 2022 | 1.00 | 0.00 | | 01/04/2022
| | KO | 91.09 | 20.00 | 24.00 | Q1 2022 |
| 1.00 | 0.00 | | 01/05/2022 | KO |
| 92.01 | 20.00 | 24.00 | Q1 2022 | 1.00 |
| 0.00 | | 01/06/2022 | KO | 89.05 | 20.00
| | 24.00 | Q1 2022 | 1.00 | 0.00 |
| ... | 07/02/2022 | KO | 89.05 | 19.50 |
| 23.20 | Q2 2022 | 0.75 | 0.25 |
|
| With that you can take NTM0_Weight and NTM1_Weight to
| calculate NTM_EPS by multiplying those weights by FY0_EPS and
| FY1_EPS. And then you can take Adj_Close / NTM_EPS to
| calculate P/E.
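|
| (As a query over a table shaped like the one above, roughly;
| the table name is made up:)
|
|     SELECT Date, Ticker,
|            NTM0_Weight * FY0_EPS
|              + NTM1_Weight * FY1_EPS AS NTM_EPS,
|            Adj_Close /
|              (NTM0_Weight * FY0_EPS
|                 + NTM1_Weight * FY1_EPS) AS PE_NTM
|     FROM eps_estimates;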
|
| Why is this useful? Because in theory you can take the
| average P/E of companies X, Y and Z in one industry and
| compare it to a fourth company W. Is W's P/E multiple above
| or below the industry average? You now know if they are over
| or undervalued, respectively, which means you know if you
| should buy or sell that stock (if you believe you picked the
| right "comparable" companies in that industry)
|
| This is just one example... there are all sorts of similar
| analyses done daily in the financial services industry. I'm
| not saying it's easy to extract alpha from trading on these,
| but that's the framework
| Lemaxoxo wrote:
| Thank you so much for this! It's very generous of you to
| have taken the time.
| airstrike wrote:
| My pleasure! I've spent the better part of the last
| decade doing this stuff and I appreciate how hard it is
| to find resources on it, so thought I'd share since you
| mentioned you were interested in learning
|
| At the risk of exhausting said interest, here's one real
| life example of a similar analysis: https://wsp-blog-
| images.s3.amazonaws.com/uploads/2011/09/171...
|
| See PDF page 14. Note the lines called "Composite P / NTM
| EPS" which they built as a blend of American Eagle's,
| Tilly's and Zumiez's P/E multiple, which are companies X,
| Y and Z in my comment above (for some reason they gave AE
| double the weight which is unusual) and compared it to
| Heat's P/E multiple (Heat was the codename for retailer
| Rue21, or hypothetical company W in my example above)
|
| I got this deck from
| https://www.wallstreetprep.com/knowledge/investment-
| banking-... which has a few other examples and a variety
| of other resources on the topic FYI
| college_physics wrote:
| Is there a risk that postgres becomes sentient in our lifetime?
| zackmorris wrote:
| This is great! Moving away from the proprietary nature of GPUs
| and complex math gatekeeping should help democratize AI.
|
| Has anyone converted stuff like gradient descent to set theory?
|
| https://support.unicomsi.com/manuals/soliddb/7/SQL_Guide/2_G...
|
| https://www.sqlshack.com/mathematics-sql-server-fast-introdu...
|
| https://www.sqlshack.com/learn-sql-set-theory/
|
| Right now AI algorithms kind of look imperative and stateful to
| me, like state machines. But there should be a functional or
| relational representation, where pattern matching would be used
| to derive the current state instead.
|
| It's trivial to go from functional to imperative representation,
| but often nearly impossible to go the other way. That's why
| monadic (sorry if I'm mincing terms) logic is so troublesome.
| Mutable variables, futures/promises, async, etc can't be
| statically analyzed, which is why most code today is difficult to
| parallelize and stuff like C++ optimizers don't even increase
| speed to the next order of magnitude. But functional languages
| have nearly infinite optimization potential through divide and
| conquer approaches like sharding and scatter-gather arrays that
| can run internally without side effects. In other words, we can
| throw hardware at SQL for linear speedup since it's
| embarrassingly parallel, but might have limited success
| optimizing something like Python.
| dogcomplex wrote:
| Replying to this to remind myself to research later, as this is
| a very important question
| krick wrote:
| From the start I assumed this is a nice playful "Hexing the
| technical interview" kinda joke. But given the tone of the
| article, and some of the comments here... Uh, this cannot be
| serious, right?
| guhidalg wrote:
| Why couldn't it be serious? Your database already has your data
| so you don't have to stream it to another machine for training.
| If you have a developed model that needs retraining on new
| data, save yourself the network bandwidth (and time!) and just
| make the DB retrain. This doesn't replace the exploration phase
| of model-building but if you already know gradient descent
| works for you then what's wrong with this approach?
| sgu999 wrote:
| I'm very confused as well. Are we facing a wall of GPT-
| generated comments?
| Terretta wrote:
| Title here is wrong. Title in article and headings in article are
| right: ONLINE gradient descent
|
| It's specifically _not_ stochastic. From the article:
|
| _Online gradient descent_
|
| _Finally, we have enough experience to implement online gradient
| descent. To keep things simple, we will use a very vanilla
| version:_
|
| _- Constant learning rate, as opposed to a schedule._
|
| _- Single epoch, we only do one pass on the data._
|
| _- Not stochastic: the rows are not shuffled._
|
| _- Squared loss, which is the standard loss for regression._
|
| _- No gradient clipping._
|
| _- No weight regularisation._
|
| _- No intercept term._
| [deleted]
| Lemaxoxo wrote:
| Hehe I was wondering if someone would catch that. Rest assured,
| I know the difference between online and stochastic gradient
| descent. I admit I used stochastic on Hacker News because I
| thought it would generate more engagement.
| airstrike wrote:
| Then just call it Non-stochastic Gradient Descent? You can't
| editorialize titles per HN guidelines
|
| https://news.ycombinator.com/newsguidelines.html
| Lemaxoxo wrote:
| Thanks, I wasn't aware.
| airstrike wrote:
| My pleasure. You can still edit the title, by the way ;-)
| rlewkov wrote:
| Uhhhh, ya know that Python has a library for that :-)
| glomgril wrote:
| Incredible post. I laughed when I saw the title, snickered at the
| first paragraph, and then proceeded to be blown away by the rest
| of it. Thought I was in for a joke and instead I'm thinking about
| the nature of ML Ops and what it's become.
| civilized wrote:
| In the comments here so far, we see a pattern we've seen before.
| When someone suggests doing something in SQL, there's a lot of
| concern about SQL being a very limited programming language where
| it's hard to do proper engineering.
|
| Here's what I would really love to know: why is it that SQL is, to
| first order, the only language used to interact with databases,
| and SQL has about the same features as it did in the 70s? It
| seems analogous to if the general-purpose programming world
| stopped with C.
| bob1029 wrote:
| > why is it that SQL is, to first order, the only language used
| to interact with databases, and SQL has about the same features
| as it did in the 70s?
|
| Because SQL is effectively a domain-specific language. If you
| added 100+ additional keywords/functions/etc., do you think it
| would be easier or more difficult for the average developer to
| build something clean with it?
|
| I look at SQL like a pile of bricks. You don't want
| complicated, unicorn bricks or you won't be able to fit
| anything meaningful together. Experienced masons almost
| certainly prefer their materials to be as dumb and consistent
| as possible.
| dventimi wrote:
| I believe the PC revolution gave a few generations of
| programmers the mistaken belief that imperative languages with
| familiar features (Basic, Pascal, C, Perl, Python, Java, etc.)
| are the only or the best way to interact with computers. Forms
| of interaction developed earlier (Forth, Prolog, SQL, etc.)
| were learned later, grudgingly, if at all. SQL was viewed with
| disdain, a necessary evil to be held at arm's length. That
| stalled the evolution of SQL, the success of other query languages,
| or really anything that didn't fit the Algol-style "general
| purpose programming language" blinders.
|
| Mercifully, finally, the cold war against SQL in particular and
| against relational databases in general seems to be thawing.
| Articles like this one exemplify that trend
| maxfurman wrote:
| Plenty of SQL features have been added since the 70s, notably
| window functions (which TFA relies on heavily). Most of the
| major databases are from the 80s, and even new kid on the block
| MySQL has been around since the mid 90s.
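|
| (For flavor, the kind of window expression TFA leans on, with
| made-up names: a running mean over training steps:)
|
|     SELECT step,
|            AVG(loss) OVER (ORDER BY step
|                            ROWS BETWEEN 99 PRECEDING
|                                     AND CURRENT ROW) AS rolling_loss
|     FROM training_log;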
|
| SQL draws a very hard line between the expression of the query
| and the AST that is used in the actual implementation. Database
| vendors like this aspect because they are free to implement
| whichever optimizations they want, but application developers
| want to build queries programmatically and optimize them
| themselves before passing them to the db engine, hence the
| tension in threads like these.
| snookerdooker wrote:
| I challenge you to add more emojis to your bio ;)
| Lemaxoxo wrote:
| Yes I know some people look down on that. I hope it doesn't
| take away the merits of the article for you hehe.
| Foobar8568 wrote:
| R has been natively supported in SQL Server for a long while (at
| least since 2017), same for Java and Python, e.g. writing SPs in
| these languages.
| kunalgupta wrote:
| god tier
| yttribium wrote:
| At least one F500 company had iteratively reweighted least
| squares implemented in MS SQL as of ~20y ago. When all you have
| is a hammer...
| gigatexal wrote:
| I thought I was decent at SQL until I saw this. Nice article!
| PaulHoule wrote:
| Reminds me of the time I coded up an optimized nearest neighbor
| search in SQL.
| syats wrote:
| Just don't.
|
| SQL:
|
| - does not allow for easy and clean importing of
| modules/libraries
|
| - is not easy to write tests for
|
| - has limited support for a debugger
|
| - lacks a consistent style for such large queries (plus most
| textbooks cover fairly simple stuff), which means it's hard for a
| developer to start reading someone else's code (more than in
| other languages)
|
| - clearly indicates in its name that it is a Query language.
|
| Save yourself the trouble and all your collaborators the pain of
| working with this code in the future, of trying to add new
| features, of trying to reuse it in another project.
|
| If you want to operate near the data, use PL/Python for
| PostgreSQL.
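|
| A minimal sketch of that suggestion (the function and names are
| made up, and it assumes the plpython3u extension is available):
|
|     CREATE EXTENSION IF NOT EXISTS plpython3u;
|
|     CREATE FUNCTION sgd_step(w float8, x float8, y float8,
|                              lr float8)
|     RETURNS float8 AS $$
|         # one squared-loss gradient step, in Python, next to the data
|         return w - lr * 2 * (w * x - y) * x
|     $$ LANGUAGE plpython3u;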
|
| EDIT: Fixed formatting.
| dventimi wrote:
| -PostgreSQL extensions are easy to include and use.
|
| -pgTAP exists for testing.
|
| -A large query in SQL is not made smaller by translating it
| into an ORM DSL.
|
| -If "Query" in "SQL" means it's for querying data, then
| evidently "Query" not being in say Java or Python means those
| languages are NOT meant for querying data. If that's true, then
| why would you use them for querying data?
| nfw2 wrote:
| > If "Query" in "SQL" means it's for querying data, then
| evidently "Query" not being in say Java or Python means those
| languages are NOT meant for querying data
|
| If X then Y does not imply if not X then not Y. Java and
| Python do not indicate a purpose in their name because they
| are general-purpose.
| dventimi wrote:
| Are they meant for querying data?
| syats wrote:
| Re modules/libraries: I meant it is not easy to write a piece
| of SQL code, and then import it into several queries to reuse
| it, or lend it to someone else for use on their own schema. It
| is possible, yes, but seldom done, because it is hell.
| PostgreSQL extensions could be used for this purpose, but
| developing an extension requires a different set of SQL
| statements (or luckily, Python or C) than those used by the
| user of the extension, which makes compounding them a bit
| hard. Not impossible, just hard to maintain.
|
| About your last point, I don't think that was my line of
| reasoning, but, yes, for the love of what is precious, don't
| open SQL files as python/java file objects and then parse and
| rummage through them to find the data you are looking for.
| Not impossible, just hard to maintain.
|
| Thanks for pointing out pgTAP, didn't know about this.
|
| For some reason, data-science folks haven't yet caught up
| with ORMs. I don't know if this is good or bad, but (as the
| OP shows) they are more used to rows and columns (or graphs)
| than objects. Maybe that will change one day.
| dventimi wrote:
| > maybe that will change one day
|
| I pray that it never does.
|
| https://blog.codinghorror.com/object-relational-mapping-
| is-t...
| dventimi wrote:
| As for sharing SQL, that's easy to do within a database
| using views. Across databases with possibly different data
| models, that's not something I personally ever want to do.
| Narew wrote:
| There was also this neural network module in Redis, if you want
| to do training and inference in a Redis DB (quite old, btw):
| https://github.com/antirez/neural-redis
| pradeepchhetri wrote:
| You might want to consider checking out ClickHouse which supports
| many ML functions natively:
|
| - stochasticLinearRegression (https://clickhouse.com/docs/en/sql-
| reference/aggregate-funct...)
|
| - stochasticLogisticRegression
| (https://clickhouse.com/docs/en/sql-reference/aggregate-funct...)
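|
| (Usage follows ClickHouse's aggregate-state pattern, roughly as
| the docs show it; the table and column names are placeholders:)
|
|     CREATE TABLE your_model ENGINE = Memory AS
|     SELECT stochasticLinearRegressionState(0.1, 0.0, 5, 'SGD')
|            (target, param1, param2) AS state
|     FROM train_data;
|
|     WITH (SELECT state FROM your_model) AS model
|     SELECT evalMLMethod(model, param1, param2) AS prediction
|     FROM test_data;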
| Lemaxoxo wrote:
| Thanks for the links, I wasn't aware of them. The Russians
| often seem to be a step ahead in the ML world.
___________________________________________________________________
(page generated 2023-03-07 23:00 UTC)