Extract¶
Functions¶
ETL Helper provides four functions for extracting data from a SQL query.
The fetch* functions return results once they have finished with the database.

fetchone()
: returns the first result.

fetchall()
: returns all results as a list. This function returns once all rows have been fetched into memory.
"""ETL Helper script to demonstrate using fetch functions."""
import sqlite3
import etlhelper as etl
db_file = "igneous_rocks.db"
with sqlite3.connect(db_file) as conn:
first_row = etl.fetchone("SELECT * FROM igneous_rock", conn)
all_rows = etl.fetchall("SELECT * FROM igneous_rock", conn)
print(first_row)
print(all_rows)
returns
{'id': 1, 'name': 'basalt', 'grain_size': 'fine'}
[{'id': 1, 'name': 'basalt', 'grain_size': 'fine'},
{'id': 2, 'name': 'granite', 'grain_size': 'coarse'}]
The iter_* functions yield data, either one or many rows at a time.
Results are fetched in chunks, and only one chunk of data is held in memory at any time.
Within a data processing pipeline, the next step can begin as soon as the first chunk has
been fetched.
The database connection must remain open until all results have been processed.
iter_rows()
: returns an iterator that yields all results, one at a time. This function can be used in place of fetchall within processing pipelines and when looping over large datasets.

iter_chunks()
: returns an iterator that yields chunks of multiple results. This provides similar functionality to the fetchmany method specified in the DB API 2.0.

The following is an example of iter_rows():
"""ETL Helper script to demonstrate iter_rows."""
import sqlite3
import etlhelper as etl
db_file = "igneous_rocks.db"
with sqlite3.connect(db_file) as conn:
for row in etl.iter_rows("SELECT * FROM igneous_rock", conn):
print(row)
returns
{'id': 1, 'name': 'basalt', 'grain_size': 'fine'}
{'id': 2, 'name': 'granite', 'grain_size': 'coarse'}
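iter_chunks() follows the same pattern, except that each item yielded is a list of rows rather than a single row. A minimal sketch, assuming the same igneous_rocks.db database as above:

"""ETL Helper script to demonstrate iter_chunks."""
import sqlite3

import etlhelper as etl

db_file = "igneous_rocks.db"

with sqlite3.connect(db_file) as conn:
    # each chunk is a list of row dictionaries
    for chunk in etl.iter_chunks("SELECT * FROM igneous_rock", conn):
        print(chunk)

With the two-row example table, this prints a single chunk containing both rows.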
Keyword arguments¶
All extract functions are derived from iter_chunks() and take the same keyword arguments, which are passed through.
parameters¶
Variables can be inserted into queries by passing them as parameters. These “bind variables” are sanitised by the underlying drivers to prevent SQL injection attacks.
It is necessary to use the correct paramstyle for the database type as a placeholder (e.g. ?, :1). The paramstyle for a DbParams object can be checked with the paramstyle attribute.
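A minimal sketch of checking this, using hypothetical PostgreSQL connection details:

from etlhelper import DbParams

# hypothetical connection details, for illustration only
db_params = DbParams(
    dbtype="PG",
    host="localhost",
    port=5432,
    dbname="rocks",
    user="rocks_user",
)

print(db_params.paramstyle)  # e.g. 'pyformat' for PostgreSQL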
A dictionary is used for named placeholders,
select_sql = "SELECT * FROM src WHERE id = :id" # SQLite style
with sqlite3.connect("rocks.db") as conn:
etl.fetchall(sql, conn, parameters={'id': 1})
or a tuple for positional placeholders.
select_sql = "SELECT * FROM src WHERE id = ?" # SQLite style
with sqlite3.connect("rocks.db") as conn:
etl.fetchall(sql, conn, parameters=(1,))
Named parameters result in more readable code.
row_factory¶
Row factories control the output format of returned rows.
The default row factory for ETL Helper returns each row as a dictionary, but this can be changed with the row_factory argument.
"""ETL Helper script to demonstrate using fetch functions with a given row factory."""
import sqlite3
import etlhelper as etl
from etlhelper.row_factories import namedtuple_row_factory
db_file = "igneous_rocks.db"
with sqlite3.connect(db_file) as conn:
row = etl.fetchone(
"SELECT * FROM igneous_rock",
conn,
row_factory=namedtuple_row_factory,
)
print(row)
print(row.name)
The output is:
Row(id=1, name='basalt', grain_size='fine')
basalt
Four different row factories are included, based on built-in Python types:
Row Factory | Attribute access | Mutable | Parameter placeholder
---|---|---|---
dict_row_factory (default) | No | Yes | Named
tuple_row_factory | No | No | Positional
list_row_factory | No | Yes | Positional
namedtuple_row_factory | Yes | No | Positional (or Named with load)
The choice of row factory depends on the use case. In general, named tuples and dictionaries are best for readable code, while using tuples or lists can give a slight increase in performance. Mutable rows are convenient when used with transform functions because they can be modified without the need to create a whole new output row.
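For example, a minimal sketch of the earlier query using tuple_row_factory instead, assuming the same igneous_rocks.db database:

import sqlite3

import etlhelper as etl
from etlhelper.row_factories import tuple_row_factory

with sqlite3.connect("igneous_rocks.db") as conn:
    row = etl.fetchone(
        "SELECT * FROM igneous_rock",
        conn,
        row_factory=tuple_row_factory,
    )

print(row)  # (1, 'basalt', 'fine')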
When loading or copying data, it is necessary to use the appropriate parameter placeholder style for the chosen row factory in the INSERT query. Using the tuple_row_factory requires a switch from named to positional parameter placeholders (e.g. %s instead of %(id)s for PostgreSQL, :1 instead of :id for Oracle). The pyodbc driver for MSSQL only supports positional placeholders.
When using the load function in conjunction with iter_chunks, the data must be either named tuples or dictionaries.
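As an illustration, a minimal sketch of copying rows between two SQLite databases; dest.db is a hypothetical destination that must already contain an igneous_rock table with matching columns:

import sqlite3

import etlhelper as etl

with sqlite3.connect("igneous_rocks.db") as src_conn:
    with sqlite3.connect("dest.db") as dest_conn:
        # rows are dictionaries (the default row factory), so load()
        # can bind them to named INSERT parameters
        rows = etl.iter_rows("SELECT * FROM igneous_rock", src_conn)
        etl.load("igneous_rock", dest_conn, rows)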
transform¶
The transform parameter takes a callable (e.g. a function) that transforms the data before returning it.
See the Transform section for details.
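As a brief illustration ahead of that section, a hedged sketch assuming the callable receives and returns a list of rows, with a hypothetical function that fills in missing grain sizes:

import sqlite3

import etlhelper as etl

def fill_missing_grain_size(chunk):
    """Hypothetical transform: replace null grain sizes in dict rows."""
    return [
        {**row, "grain_size": row["grain_size"] or "unknown"}
        for row in chunk
    ]

with sqlite3.connect("igneous_rocks.db") as conn:
    rows = etl.fetchall(
        "SELECT * FROM igneous_rock",
        conn,
        transform=fill_missing_grain_size,
    )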
chunk_size¶
All data extraction functions use iter_chunks behind the scenes.
This reads rows from the database in chunks to prevent them all being loaded into memory at once.
The chunk_size argument sets the number of rows in each chunk.
The default chunk_size is 5000.
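For example, a minimal sketch that reduces memory use by fetching smaller chunks, assuming the igneous_rocks.db database from above:

import sqlite3

import etlhelper as etl

with sqlite3.connect("igneous_rocks.db") as conn:
    # fetch 1000 rows per chunk instead of the default 5000
    for chunk in etl.iter_chunks(
        "SELECT * FROM igneous_rock", conn, chunk_size=1000
    ):
        print(len(chunk))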