https://github.com/dcmoura/spyql

dcmoura / spyql: Query data on the command line with SQL-like SELECTs powered by Python expressions. MIT License.
# SpyQL

SQL with Python in the middle

https://pypi.python.org/pypi/spyql · https://travis-ci.com/dcmoura/spyql · https://spyql.readthedocs.io/en/latest/?version=latest

## Concept

SpyQL
is a query language that combines:

* the simplicity and structure of SQL
* with the power and readability of Python

```sql
SELECT
    date.fromtimestamp(purchase_ts) AS purchase_date,
    price * quantity AS total
FROM csv
WHERE department.upper() == 'IT'
TO json
```

SQL provides the structure of the query, while Python is used to define expressions, bringing along a vast ecosystem of packages.

## SpyQL command-line tool

With the SpyQL command-line tool you can make SQL-like SELECTs powered by Python on top of text data (e.g. CSV and JSON). Data can come from files, but also from data streams, such as Kafka, or from databases, such as PostgreSQL. Basically, data can come from any command that outputs text :-). Moreover, data can be generated by a Python iterator! Take a look at the examples section to see how to query Parquet, process API calls, traverse directories of zipped JSONs, among many other things.

SpyQL also allows you to easily convert between text data formats:

* FROM: CSV, JSON, TEXT and Python iterators (yes, you can use a list comprehension as the data source)
* TO: CSV, JSON, SQL (INSERT statements), pretty terminal printing, and terminal plotting

The JSON format is JSON lines, where each line holds a valid JSON object or array. Piping through jq allows SpyQL to handle any JSON input (more in the examples section). You can leverage command-line tools to process other file types like Parquet and XML (more in the examples section).

## Installation

To install SpyQL, run this command in your terminal:

```sh
pip install spyql
```

## Hello world

To test your installation, run in the terminal:

```sh
spyql "SELECT 'Hello world' as Message TO pretty"
```

Output:

```
Message
-----------
Hello world
```

Try replacing the output format with json or csv, and try adding more columns, e.g.
run in the terminal:

```sh
spyql "SELECT 'Hello world' as message, 1+2 as three TO json"
```

Output:

```json
{"message": "Hello world", "three": 3}
```

## Principles

Right now, the focus is on building a command-line tool that follows these core principles:

* Simple: simple to use with a straightforward implementation
* Familiar: you should feel at home if you are acquainted with SQL and Python
* Light: small memory footprint that allows you to process large data that fits into your machine
* Useful: it should make your life easier, filling a gap in the ecosystem

## Syntax

```sql
[ IMPORT python_module [ AS identifier ] [, ...] ]
SELECT [ DISTINCT | PARTIALS ]
    [ * | python_expression [ AS output_column_name ] [, ...] ]
    [ FROM csv | spy | text | python_expression | json [ EXPLODE path ] ]
    [ WHERE python_expression ]
    [ GROUP BY output_column_number | python_expression  [, ...] ]
    [ ORDER BY output_column_number | python_expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] ]
    [ LIMIT row_count ]
    [ OFFSET num_rows_to_skip ]
    [ TO csv | json | spy | sql | pretty | plot ]
```

## Notable differences to SQL

In SpyQL:

* there is a guarantee that the order of the output rows is the same as in the input (if no reordering is done)
* the AS keyword must precede a column alias definition (it is not optional as in SQL)
* you can always access the nth input column by using the default column names colN (e.g.
col1 for the first column)
* currently only a small subset of SQL is supported, namely SELECT statements without sub-queries, joins, set operations, etc. (check the Syntax section)
* sub-queries are achieved by piping (see the Command line examples section)
* aggregation functions have the suffix _agg to avoid conflicts with Python's built-in functions:

| Operation                  | PostgreSQL                                                   | SpyQL                      |
|----------------------------|--------------------------------------------------------------|----------------------------|
| Sum all values of a column | `SELECT sum(col_name)`                                       | `SELECT sum_agg(col_name)` |
| Sum an array               | `SELECT sum(a) FROM (SELECT unnest(array[1,2,3]) AS a) AS t` | `SELECT sum_agg([1,2,3])`  |

* expressions are pure Python:

| SQL                                  | SpyQL                |
|--------------------------------------|----------------------|
| `x = y`                              | `x == y`             |
| `x BETWEEN a AND b`                  | `a <= x <= b`        |
| `CAST(x AS INTEGER)`                 | `int(x)`             |
| `CASE WHEN x > 0 THEN 1 ELSE -1 END` | `1 if x > 0 else -1` |
| `upper('hello')`                     | `'hello'.upper()`    |

## Notable differences to Python

### Additional syntax

We added additional syntax for making querying easier:

| Python                          | SpyQL shortcut                | Purpose                                        |
|---------------------------------|-------------------------------|------------------------------------------------|
| `json['hello']['planet earth']` | `json->hello->'planet earth'` | Easy access of elements in dicts (e.g. JSONs)  |

### NULL datatype

Python's None generates exceptions when making operations on missing data, breaking query execution (e.g. None + 1 throws a TypeError). To overcome this, we created a NULL type that has the same behavior as in SQL (e.g. NULL + 1 returns NULL), allowing queries to continue processing data.

| Operation                  | Native Python throws | SpyQL returns | SpyQL warning |
|----------------------------|----------------------|---------------|---------------|
| `NULL + 1`                 | NameError            | NULL          |               |
| `a_dict['inexistent_key']` | KeyError             | NULL          | yes           |
| `int('')`                  | ValueError           | NULL          | yes           |
| `int('abc')`               | ValueError           | NULL          | yes           |

The above dictionary key access only returns NULL if the dict is an instance of NullSafeDict. SpyQL adds NullSafeDict, which extends Python's native dict. JSONs are automatically loaded as NullSafeDict. Unless you are creating dictionaries on the fly, you do not need to worry about this.
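As an illustration of these semantics, here is a minimal plain-Python sketch of a NULL sentinel and a null-safe dict. This is only a model of the behavior described above, not spyql's actual implementation:

```python
# Minimal sketch of SQL-like NULL semantics and a null-safe dict.
# Illustrative only -- NOT spyql's actual NullSafeDict implementation.

class _Null:
    """A NULL sentinel: arithmetic on it yields NULL instead of raising."""
    def __add__(self, other):
        return self
    __radd__ = __add__  # also handle 1 + NULL
    def __repr__(self):
        return "NULL"

NULL = _Null()

class NullSafeDict(dict):
    """A dict that returns NULL for missing keys instead of raising KeyError."""
    def __missing__(self, key):
        return NULL

row = NullSafeDict({"name": "Jane"})
print(row["name"])            # Jane
print(row["inexistent_key"])  # NULL (no KeyError)
print(NULL + 1)               # NULL (no TypeError)
```

A full implementation would also override the remaining operators (subtraction, comparison, etc.) the same way, so that NULL propagates through any expression.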
## Importing python modules and user-defined functions

By default, spyql does some commonly used imports:

* everything from the math module
* datetime, date and timezone from the datetime module
* the re module

SpyQL queries support a single import statement at the beginning of the query, where several modules can be imported (e.g. `IMPORT numpy AS np, sys SELECT ...`). Note that the Python syntax `from module import identifier` is not supported in queries.

In addition, you can create a Python file that is loaded before executing queries. Here you can define imports, functions, variables, etc. using regular Python code. Everything defined in this file is available to all your spyql queries. The file should be located at `XDG_CONFIG_HOME/spyql/init.py`. If the environment variable XDG_CONFIG_HOME is not defined, it defaults to `HOME/.config` (e.g. `/Users/janedoe/.config/spyql/init.py`).

## Example queries

You can run the following example queries in the terminal:

```sh
spyql "the_query" < a_data_file
```

Example data files are not provided in most cases.
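For example, a hypothetical init.py could look like the following. The helper names (`md5_hex`, `clean`) are illustrative examples defined here, not functions shipped with spyql:

```python
# Hypothetical ~/.config/spyql/init.py -- the helpers below are
# examples for illustration, not part of spyql itself.
import hashlib

def md5_hex(s):
    """Hex MD5 digest of a string, usable directly in query expressions."""
    return hashlib.md5(s.encode("utf-8")).hexdigest()

def clean(s):
    """Normalize whitespace and case, e.g. for building GROUP BY keys."""
    return " ".join(s.split()).lower()
```

With such a file in place, a query could call the helpers directly, e.g. `spyql "SELECT md5_hex(col1) FROM text"`.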
### Query a CSV (and print a pretty table)

```sql
SELECT a_col_name, 'positive' if col2 >= 0 else 'negative' AS sign
FROM csv
TO pretty
```

### Convert CSV to a flat JSON

```sql
SELECT * FROM csv TO json
```

### Convert from CSV to a hierarchical JSON

```sql
SELECT {'client': {'id': col1, 'name': col2}, 'price': 120.40}
FROM csv TO json
```

or

```sql
SELECT {'id': col1, 'name': col2} AS client, 120.40 AS price
FROM csv TO json
```

### JSON to CSV, filtering out NULLs

```sql
SELECT json->client->id AS id, json->client->name AS name, json->price AS price
FROM json
WHERE json->client->name is not NULL
TO csv
```

### Explode JSON to CSV

```sql
SELECT json->invoice_num AS id, json->items->name AS name, json->items->price AS price
FROM json
EXPLODE json->items
TO csv
```

Sample input:

```json
{"invoice_num" : 1028, "items": [{"name": "tomatoes", "price": 1.5}, {"name": "bananas", "price": 2.0}]}
{"invoice_num" : 1029, "items": [{"name": "peaches", "price": 3.12}]}
```

Output:

```
id, name, price
1028, tomatoes, 1.5
1028, bananas, 2.0
1029, peaches, 3.12
```

### Python iterator/list/comprehension to JSON

```sql
SELECT 10 * cos(col1 * ((pi * 4) / 90))
FROM range(80)
TO json
```

or

```sql
SELECT col1
FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
TO json
```

### Importing python modules

Here we import hashlib to calculate an MD5 hash for each input line. Note that hashlib is part of Python's standard library, so no extra installation is needed.

```sql
IMPORT hashlib as hl
SELECT hl.md5(col1.encode('utf-8')).hexdigest()
FROM text
```

### Getting the top 5 records

```sql
SELECT int(score) AS score, player_name
FROM csv
ORDER BY 1 DESC NULLS LAST, score_date
LIMIT 5
```

### Aggregations

Totals by player, alphabetically ordered.

```sql
SELECT json->player_name, sum_agg(json->score) AS total_score
FROM json
GROUP BY 1
ORDER BY 1
```

### Partial aggregations

Calculating the cumulative sum of a variable using the PARTIALS modifier. Also demoing the lag aggregator.
```sql
SELECT PARTIALS
    json->new_entries,
    sum_agg(json->new_entries) AS cum_new_entries,
    lag(json->new_entries) AS prev_entries
FROM json
TO json
```

Sample input:

```json
{"new_entries" : 10}
{"new_entries" : 5}
{"new_entries" : 25}
{"new_entries" : null}
{}
{"new_entries" : 100}
```

Output:

```json
{"new_entries" : 10, "cum_new_entries" : 10, "prev_entries": null}
{"new_entries" : 5, "cum_new_entries" : 15, "prev_entries": 10}
{"new_entries" : 25, "cum_new_entries" : 40, "prev_entries": 5}
{"new_entries" : null, "cum_new_entries" : 40, "prev_entries": 25}
{"new_entries" : null, "cum_new_entries" : 40, "prev_entries": null}
{"new_entries" : 100, "cum_new_entries" : 140, "prev_entries": null}
```

If PARTIALS was omitted, the result would be equivalent to the last output row.

### Distinct rows

```sql
SELECT DISTINCT * FROM csv
```

## Command line examples

To run the following examples, type Ctrl-x Ctrl-e in your terminal. This will open your default editor (emacs/vim). Paste the code of one of the examples, save and exit.

### Queries on Parquet with directories

Here, find traverses a directory and executes parquet-tools for each Parquet file, dumping each file to JSON format. jq -c makes sure that the output has one JSON object per line before handing over to spyql. This is far from an efficient way to query Parquet files, but it might be a handy option if you need a quick inspection.

```sh
find /the/directory -name "*.parquet" -exec parquet-tools cat --json {} \; |
jq -c |
spyql "
    SELECT json->a_field, json->a_num_field * 2 + 1
    FROM json
"
```

### Querying multiple json.gz files

```sh
gzcat *.json.gz |
jq -c |
spyql "
    SELECT json->a_field, json->a_num_field * 2 + 1
    FROM json
"
```

### Querying YAML / XML / TOML files

yq converts YAML, XML and TOML files to JSON, allowing you to easily query any of these with spyql.
```sh
cat file.yaml | yq -c | spyql "SELECT json->a_field FROM json"
cat file.xml | xq -c | spyql "SELECT json->a_field FROM json"
cat file.toml | tomlq -c | spyql "SELECT json->a_field FROM json"
```

### Kafka to PostgreSQL pipeline

Read data from a Kafka topic and write to a Postgres table named customer.

```sh
kafkacat -b the.broker.com -t the.topic |
spyql -Otable=customer -Ochunk_size=1 --unbuffered "
    SELECT json->customer->id AS id, json->customer->name AS name
    FROM json
    TO sql
" |
psql -U an_user_name -h a.host.com a_database_name
```

### Monitoring statistics in Kafka

Read data from a Kafka topic, continuously calculating statistics.

```sh
kafkacat -b the.broker.com -t the.topic |
spyql --unbuffered "
    SELECT PARTIALS
        count_agg(*) AS running_count,
        sum_agg(value) AS running_sum,
        min_agg(value) AS min_so_far,
        value AS current_value
    FROM json
    TO csv
"
```

### Sub-queries (piping)

A special file format (spy) is used to efficiently pipe data between queries.

```sh
cat a_file.json |
spyql "
    SELECT ' '.join([json->first_name, json->middle_name, json->last_name]) AS full_name
    FROM json
    TO spy" |
spyql "SELECT full_name, full_name.upper() FROM spy"
```

### Queries over APIs

```sh
curl https://reqres.in/api/users?page=2 |
spyql "
    SELECT json->data->email AS email,
        'Dear {}, thank you for being a great customer!'.format(json->data->first_name) AS msg
    FROM json
    EXPLODE json->data
    TO json
"
```

### Plotting to the terminal

```sh
spyql "
    SELECT col1
    FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
    TO plot
"
```

### Plotting with gnuplot

To the terminal:

```sh
spyql "
    SELECT col1
    FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
    TO csv
" | sed 1d | feedgnuplot --terminal 'dumb 80,30' --exit --lines
```

To a GUI:

```sh
spyql "
    SELECT col1
    FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
    TO csv
" | sed 1d | feedgnuplot --lines --points --exit
```

---

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
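Revisiting the Partial aggregations section above, the combined behavior of PARTIALS, sum_agg and lag can be modeled in plain Python. This is an illustrative sketch of the semantics (NULL modeled as None), not spyql's implementation:

```python
# Plain-Python model of the "Partial aggregations" example:
# emit running aggregates (sum_agg, lag) for every input row.
# Illustrative only -- not spyql's implementation.

def partials_cumsum_lag(rows):
    total = 0
    prev = None  # lag() of the first row is NULL
    out = []
    for row in rows:
        value = row.get("new_entries")  # a missing key behaves like NULL
        if value is not None:
            total += value  # sum_agg skips NULLs
        out.append({"new_entries": value,
                    "cum_new_entries": total,
                    "prev_entries": prev})
        prev = value
    return out

sample = [{"new_entries": 10}, {"new_entries": 5}, {"new_entries": 25},
          {"new_entries": None}, {}, {"new_entries": 100}]
for line in partials_cumsum_lag(sample):
    print(line)
```

Run on the sample input from that section, this reproduces the six output rows shown there, including the NULL handling of the last three rows.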