https://github.com/dcmoura/spyql

dcmoura / spyql: Query data on the command line with SQL-like SELECTs powered by Python expressions. MIT License.
# SpyQL

SQL with Python in the middle

https://pypi.python.org/pypi/spyql · https://travis-ci.com/dcmoura/spyql · https://spyql.readthedocs.io/en/latest/?version=latest

## Concept

SpyQL
is a query language that combines:

* the simplicity and structure of SQL
* with the power and readability of Python

```sql
SELECT
    date.fromtimestamp(purchase_ts) AS purchase_date,
    price * quantity AS total
FROM csv
WHERE department.upper() == 'IT'
TO json
```

SQL provides the structure of the query, while Python is used to define expressions, bringing along a vast ecosystem of packages.

## SpyQL command-line tool

With the SpyQL command-line tool you can make SQL-like SELECTs powered by Python on top of text data (e.g. CSV and JSON). Data can come from files, but also from data streams, such as Kafka, or from databases, such as PostgreSQL. Basically, data can come from any command that outputs text :-). Moreover, data can be generated by a Python iterator! Take a look at the examples section to see how to query Parquet, process API calls, traverse directories of zipped JSONs, among many other things.

SpyQL also allows you to easily convert between text data formats:

* FROM: CSV, JSON, TEXT and Python iterators (yes, you can use a list comprehension as the data source)
* TO: CSV, JSON, SQL (INSERT statements), pretty terminal printing, and terminal plotting

The JSON format is JSON lines, where each line holds a valid JSON object or array. Piping through jq allows SpyQL to handle any JSON input (more in the examples section). You can leverage command-line tools to process other file types like Parquet and XML (more in the examples section).

## Installation

To install SpyQL, run this command in your terminal:

```sh
pip install spyql
```

## Hello world

To test your installation, run in the terminal:

```sh
spyql "SELECT 'Hello world' as Message TO pretty"
```

Output:

```
Message
-----------
Hello world
```

Try replacing the output format with json or csv, and try adding more columns, e.g.
run in the terminal:

```sh
spyql "SELECT 'Hello world' as message, 1+2 as three TO json"
```

Output:

```json
{"message": "Hello world", "three": 3}
```

## Principles

Right now, the focus is on building a command-line tool that follows these core principles:

* Simple: simple to use with a straightforward implementation
* Familiar: you should feel at home if you are acquainted with SQL and Python
* Light: small memory footprint that allows you to process large data that fits into your machine
* Useful: it should make your life easier, filling a gap in the ecosystem

## Syntax

```sql
[ IMPORT python_module [ AS identifier ] [, ...] ]
SELECT [ DISTINCT | PARTIALS ]
    [ * | python_expression [ AS output_column_name ] [, ...] ]
    [ FROM csv | spy | text | python_expression | json [ EXPLODE path ] ]
    [ WHERE python_expression ]
    [ GROUP BY output_column_number | python_expression  [, ...] ]
    [ ORDER BY output_column_number | python_expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] ]
    [ LIMIT row_count ]
    [ OFFSET num_rows_to_skip ]
    [ TO csv | json | spy | sql | pretty | plot ]
```

## Notable differences to SQL

In SpyQL:

* there is a guarantee that the order of the output rows is the same as in the input (if no reordering is done)
* the AS keyword must precede a column alias definition (it is not optional as in SQL)
* you can always access the nth input column by using the default column names colN (e.g.
col1 for the first column)
* currently only a small subset of SQL is supported, namely SELECT statements without sub-queries, joins, set operations, etc. (check the Syntax section)
* sub-queries are achieved by piping (see the Command line examples section)
* aggregation functions have the suffix _agg to avoid conflicts with Python's built-in functions:

| Operation                  | PostgreSQL                                                   | SpyQL                      |
|----------------------------|--------------------------------------------------------------|----------------------------|
| Sum all values of a column | `SELECT sum(col_name)`                                       | `SELECT sum_agg(col_name)` |
| Sum an array               | `SELECT sum(a) FROM (SELECT unnest(array[1,2,3]) AS a) AS t` | `SELECT sum_agg([1,2,3])`  |

* expressions are pure Python:

| SQL                                  | SpyQL                |
|--------------------------------------|----------------------|
| `x = y`                              | `x == y`             |
| `x BETWEEN a AND b`                  | `a <= x <= b`        |
| `CAST(x AS INTEGER)`                 | `int(x)`             |
| `CASE WHEN x > 0 THEN 1 ELSE -1 END` | `1 if x > 0 else -1` |
| `upper('hello')`                     | `'hello'.upper()`    |

## Notable differences to Python

### Additional syntax

We added additional syntax for making querying easier:

| Python                          | SpyQL shortcut                | Purpose                                        |
|---------------------------------|-------------------------------|------------------------------------------------|
| `json['hello']['planet earth']` | `json->hello->'planet earth'` | Easy access of elements in dicts (e.g. JSONs)  |

### NULL datatype

Python's None generates exceptions when making operations on missing data, breaking query execution (e.g. None + 1 throws a TypeError). To overcome this, we created a NULL type that has the same behavior as in SQL (e.g. NULL + 1 returns NULL), allowing queries to continue processing data.

| Operation                  | Native Python throws | SpyQL returns | SpyQL warning |
|----------------------------|----------------------|---------------|---------------|
| `NULL + 1`                 | NameError            | NULL          |               |
| `a_dict['inexistent_key']` | KeyError             | NULL          | yes           |
| `int('')`                  | ValueError           | NULL          | yes           |
| `int('abc')`               | ValueError           | NULL          | yes           |

The above dictionary key access only returns NULL if the dict is an instance of NullSafeDict. SpyQL adds NullSafeDict, which extends Python's native dict. JSONs are automatically loaded as NullSafeDict. Unless you are creating dictionaries on the fly, you do not need to worry about this.
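As an illustration of these semantics, here is a minimal plain-Python sketch of a NULL sentinel and a null-safe dict. This is only a model of the behavior described above, not spyql's actual implementation:

```python
# Minimal sketch of SQL-like NULL semantics and a null-safe dict.
# Illustrative only -- NOT spyql's actual NullSafeDict implementation.

class _Null:
    """A NULL sentinel: arithmetic on it yields NULL instead of raising."""
    def __add__(self, other):
        return self
    __radd__ = __add__  # also handle 1 + NULL
    def __repr__(self):
        return "NULL"

NULL = _Null()

class NullSafeDict(dict):
    """A dict that returns NULL for missing keys instead of raising KeyError."""
    def __missing__(self, key):
        return NULL

row = NullSafeDict({"name": "Jane"})
print(row["name"])            # Jane
print(row["inexistent_key"])  # NULL (no KeyError)
print(NULL + 1)               # NULL (no TypeError)
```

A full implementation would also override the remaining operators (subtraction, comparison, etc.) the same way, so that NULL propagates through any expression.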
## Importing python modules and user-defined functions

By default, spyql does some commonly used imports:

* everything from the math module
* datetime, date and timezone from the datetime module
* the re module

SpyQL queries support a single import statement at the beginning of the query, where several modules can be imported (e.g. `IMPORT numpy AS np, sys SELECT ...`). Note that the Python syntax `from module import identifier` is not supported in queries.

In addition, you can create a Python file that is loaded before executing queries. Here you can define imports, functions, variables, etc. using regular Python code. Everything defined in this file is available to all your spyql queries. The file should be located at `XDG_CONFIG_HOME/spyql/init.py`. If the environment variable XDG_CONFIG_HOME is not defined, it defaults to `HOME/.config` (e.g. `/Users/janedoe/.config/spyql/init.py`).

## Example queries

You can run the following example queries in the terminal:

```sh
spyql "the_query" < a_data_file
```

Example data files are not provided in most cases.
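For example, a hypothetical init.py could look like the following. The helper names (`md5_hex`, `clean`) are illustrative examples defined here, not functions shipped with spyql:

```python
# Hypothetical ~/.config/spyql/init.py -- the helpers below are
# examples for illustration, not part of spyql itself.
import hashlib

def md5_hex(s):
    """Hex MD5 digest of a string, usable directly in query expressions."""
    return hashlib.md5(s.encode("utf-8")).hexdigest()

def clean(s):
    """Normalize whitespace and case, e.g. for building GROUP BY keys."""
    return " ".join(s.split()).lower()
```

With such a file in place, a query could call the helpers directly, e.g. `spyql "SELECT md5_hex(col1) FROM text"`.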
### Query a CSV (and print a pretty table)

```sql
SELECT a_col_name, 'positive' if col2 >= 0 else 'negative' AS sign
FROM csv
TO pretty
```

### Convert CSV to a flat JSON

```sql
SELECT * FROM csv TO json
```

### Convert from CSV to a hierarchical JSON

```sql
SELECT {'client': {'id': col1, 'name': col2}, 'price': 120.40}
FROM csv TO json
```

or

```sql
SELECT {'id': col1, 'name': col2} AS client, 120.40 AS price
FROM csv TO json
```

### JSON to CSV, filtering out NULLs

```sql
SELECT json->client->id AS id, json->client->name AS name, json->price AS price
FROM json
WHERE json->client->name is not NULL
TO csv
```

### Explode JSON to CSV

```sql
SELECT json->invoice_num AS id, json->items->name AS name, json->items->price AS price
FROM json
EXPLODE json->items
TO csv
```

Sample input:

```json
{"invoice_num" : 1028, "items": [{"name": "tomatoes", "price": 1.5}, {"name": "bananas", "price": 2.0}]}
{"invoice_num" : 1029, "items": [{"name": "peaches", "price": 3.12}]}
```

Output:

```
id, name, price
1028, tomatoes, 1.5
1028, bananas, 2.0
1029, peaches, 3.12
```

### Python iterator/list/comprehension to JSON

```sql
SELECT 10 * cos(col1 * ((pi * 4) / 90))
FROM range(80)
TO json
```

or

```sql
SELECT col1
FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
TO json
```

### Importing python modules

Here we import hashlib to calculate an MD5 hash for each input line. Note that hashlib is part of Python's standard library, so no extra installation is needed.

```sql
IMPORT hashlib as hl
SELECT hl.md5(col1.encode('utf-8')).hexdigest()
FROM text
```

### Getting the top 5 records

```sql
SELECT int(score) AS score, player_name
FROM csv
ORDER BY 1 DESC NULLS LAST, score_date
LIMIT 5
```

### Aggregations

Totals by player, alphabetically ordered.

```sql
SELECT json->player_name, sum_agg(json->score) AS total_score
FROM json
GROUP BY 1
ORDER BY 1
```

### Partial aggregations

Calculating the cumulative sum of a variable using the PARTIALS modifier. Also demoing the lag aggregator.
```sql
SELECT PARTIALS
    json->new_entries,
    sum_agg(json->new_entries) AS cum_new_entries,
    lag(json->new_entries) AS prev_entries
FROM json
TO json
```

Sample input:

```json
{"new_entries" : 10}
{"new_entries" : 5}
{"new_entries" : 25}
{"new_entries" : null}
{}
{"new_entries" : 100}
```

Output:

```json
{"new_entries" : 10, "cum_new_entries" : 10, "prev_entries": null}
{"new_entries" : 5, "cum_new_entries" : 15, "prev_entries": 10}
{"new_entries" : 25, "cum_new_entries" : 40, "prev_entries": 5}
{"new_entries" : null, "cum_new_entries" : 40, "prev_entries": 25}
{"new_entries" : null, "cum_new_entries" : 40, "prev_entries": null}
{"new_entries" : 100, "cum_new_entries" : 140, "prev_entries": null}
```

If PARTIALS was omitted, the result would be equivalent to the last output row.

### Distinct rows

```sql
SELECT DISTINCT * FROM csv
```

## Command line examples

To run the following examples, type Ctrl-x Ctrl-e in your terminal. This will open your default editor (emacs/vim). Paste the code of one of the examples, save and exit.

### Queries on Parquet with directories

Here, find traverses a directory and executes parquet-tools for each Parquet file, dumping each file to JSON format. jq -c makes sure that the output has one JSON object per line before handing over to spyql. This is far from an efficient way to query Parquet files, but it might be a handy option if you need a quick inspection.

```sh
find /the/directory -name "*.parquet" -exec parquet-tools cat --json {} \; |
jq -c |
spyql "
    SELECT json->a_field, json->a_num_field * 2 + 1
    FROM json
"
```

### Querying multiple json.gz files

```sh
gzcat *.json.gz |
jq -c |
spyql "
    SELECT json->a_field, json->a_num_field * 2 + 1
    FROM json
"
```

### Querying YAML / XML / TOML files

yq converts YAML, XML and TOML files to JSON, allowing you to easily query any of these with spyql.
```sh
cat file.yaml | yq -c | spyql "SELECT json->a_field FROM json"
cat file.xml | xq -c | spyql "SELECT json->a_field FROM json"
cat file.toml | tomlq -c | spyql "SELECT json->a_field FROM json"
```

### Kafka to PostgreSQL pipeline

Read data from a Kafka topic and write to a Postgres table named customer.

```sh
kafkacat -b the.broker.com -t the.topic |
spyql -Otable=customer -Ochunk_size=1 --unbuffered "
    SELECT json->customer->id AS id, json->customer->name AS name
    FROM json
    TO sql
" |
psql -U an_user_name -h a.host.com a_database_name
```

### Monitoring statistics in Kafka

Read data from a Kafka topic, continuously calculating statistics.

```sh
kafkacat -b the.broker.com -t the.topic |
spyql --unbuffered "
    SELECT PARTIALS
        count_agg(*) AS running_count,
        sum_agg(value) AS running_sum,
        min_agg(value) AS min_so_far,
        value AS current_value
    FROM json
    TO csv
"
```

### Sub-queries (piping)

A special file format (spy) is used to efficiently pipe data between queries.

```sh
cat a_file.json |
spyql "
    SELECT ' '.join([json->first_name, json->middle_name, json->last_name]) AS full_name
    FROM json
    TO spy" |
spyql "SELECT full_name, full_name.upper() FROM spy"
```

### Queries over APIs

```sh
curl https://reqres.in/api/users?page=2 |
spyql "
    SELECT json->data->email AS email,
        'Dear {}, thank you for being a great customer!'.format(json->data->first_name) AS msg
    FROM json
    EXPLODE json->data
    TO json
"
```

### Plotting to the terminal

```sh
spyql "
    SELECT col1
    FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
    TO plot
"
```

### Plotting with gnuplot

To the terminal:

```sh
spyql "
    SELECT col1
    FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
    TO csv
" | sed 1d | feedgnuplot --terminal 'dumb 80,30' --exit --lines
```

To a GUI:

```sh
spyql "
    SELECT col1
    FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
    TO csv
" | sed 1d | feedgnuplot --lines --points --exit
```

---

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
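Revisiting the Partial aggregations section above, the combined behavior of PARTIALS, sum_agg and lag can be modeled in plain Python. This is an illustrative sketch of the semantics (NULL modeled as None), not spyql's implementation:

```python
# Plain-Python model of the "Partial aggregations" example:
# emit running aggregates (sum_agg, lag) for every input row.
# Illustrative only -- not spyql's implementation.

def partials_cumsum_lag(rows):
    total = 0
    prev = None  # lag() of the first row is NULL
    out = []
    for row in rows:
        value = row.get("new_entries")  # a missing key behaves like NULL
        if value is not None:
            total += value  # sum_agg skips NULLs
        out.append({"new_entries": value,
                    "cum_new_entries": total,
                    "prev_entries": prev})
        prev = value
    return out

sample = [{"new_entries": 10}, {"new_entries": 5}, {"new_entries": 25},
          {"new_entries": None}, {}, {"new_entries": 100}]
for line in partials_cumsum_lag(sample):
    print(line)
```

Run on the sample input from that section, this reproduces the six output rows shown there, including the NULL handling of the last three rows.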