Extract
^^^^^^^

Functions
---------

ETL Helper provides four functions for extracting data from a SQL query.

The ``fetch*`` functions *return* results once they have finished with the
database.

- :func:`fetchone() <etlhelper.fetchone>`: returns the first result.
- :func:`fetchall() <etlhelper.fetchall>`: returns all results as a list.
  This function returns once all rows have been fetched into memory.

.. literalinclude:: ../code_demos/extract/demo_fetch.py
   :language: python

returns

.. code:: python

    {'id': 1, 'name': 'basalt', 'grain_size': 'fine'}
    [{'id': 1, 'name': 'basalt', 'grain_size': 'fine'}, {'id': 2, 'name': 'granite', 'grain_size': 'coarse'}]

The ``iter_*`` functions *yield* data, either one or many rows at a time.
Results are fetched in *chunks*, and only one chunk of data is held in memory
at any time. Within a data processing pipeline, the next step can begin as
soon as the first chunk has been fetched. The database connection must remain
open until all results have been processed.

- :func:`iter_rows() <etlhelper.iter_rows>`: returns an iterator that yields
  all results, one at a time. This function can be used in place of
  ``fetchall`` within processing pipelines and when looping over large
  datasets.
- :func:`iter_chunks() <etlhelper.iter_chunks>`: returns an iterator that
  yields chunks of multiple results. This provides similar functionality to
  the ``fetchmany`` method specified in the DB API 2.0.

The following is an example of :func:`iter_rows() <etlhelper.iter_rows>`:

.. literalinclude:: ../code_demos/extract/demo_iter.py
   :language: python

returns

.. code:: python

    {'id': 1, 'name': 'basalt', 'grain_size': 'fine'}
    {'id': 2, 'name': 'granite', 'grain_size': 'coarse'}
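:func:`iter_chunks() <etlhelper.iter_chunks>` is called in the same way. The
following is a minimal sketch, assuming a local SQLite database ``rocks.db``
containing a ``src`` table as in the examples below, that processes the
results one chunk at a time:

.. code:: python

    import sqlite3

    import etlhelper as etl

    select_sql = "SELECT * FROM src"

    with sqlite3.connect("rocks.db") as conn:
        # Each chunk is a list of rows; only one chunk is held in memory
        # at a time
        for chunk in etl.iter_chunks(select_sql, conn):
            print(len(chunk))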
Keyword arguments
-----------------

All extract functions are derived from :func:`iter_chunks() <etlhelper.iter_chunks>`
and take the same keyword arguments, which are passed through.

.. _parameters:

parameters
""""""""""

Variables can be inserted into queries by passing them as parameters. These
"bind variables" are sanitised by the underlying drivers to prevent
`SQL injection attacks <https://en.wikipedia.org/wiki/SQL_injection>`__.

It is necessary to use the correct
`paramstyle <https://peps.python.org/pep-0249/#paramstyle>`__ for the database
type as a placeholder (e.g. ``?``, ``:1``). The paramstyle for a ``DbParams``
object can be checked via its ``paramstyle`` attribute.

A dictionary is used for named placeholders,

.. code:: python

    import sqlite3

    import etlhelper as etl

    select_sql = "SELECT * FROM src WHERE id = :id"  # SQLite style

    with sqlite3.connect("rocks.db") as conn:
        etl.fetchall(select_sql, conn, parameters={'id': 1})

or a tuple for positional placeholders.

.. code:: python

    select_sql = "SELECT * FROM src WHERE id = ?"  # SQLite style

    with sqlite3.connect("rocks.db") as conn:
        etl.fetchall(select_sql, conn, parameters=(1,))

Named parameters result in more readable code.

row_factory
"""""""""""

Row factories control the output format of returned rows. The default row
factory for ETL Helper returns each row as a dictionary, but this can be
changed with the ``row_factory`` argument.

.. literalinclude:: ../code_demos/extract/demo_row_factory.py
   :language: python

The output is:

.. code:: bash

    Row(id=1, name='basalt', grain_size='fine')
    basalt

Four different row factories are included, based on built-in Python types:

+-----------------------+------------------+---------+------------------+
| Row Factory           | Attribute access | Mutable | Parameter        |
|                       |                  |         | placeholder      |
+=======================+==================+=========+==================+
| dict_row_factory      | ``row["id"]``    | Yes     | Named            |
| (default)             |                  |         |                  |
+-----------------------+------------------+---------+------------------+
| tuple_row_factory     | ``row[0]``       | No      | Positional       |
+-----------------------+------------------+---------+------------------+
| list_row_factory      | ``row[0]``       | Yes     | Positional       |
+-----------------------+------------------+---------+------------------+
| namedtuple_row_factory| ``row.id`` or    | No      | Positional (or   |
|                       | ``row[0]``       |         | Named with       |
|                       |                  |         | *load*)          |
+-----------------------+------------------+---------+------------------+

The choice of row factory depends on the use case. In general, named tuples
and dictionaries are best for readable code, while tuples or lists can give a
slight increase in performance. Mutable rows are convenient when used with
transform functions because they can be modified without the need to create a
whole new output row.

When loading or copying data, it is necessary to use the appropriate parameter
placeholder style for the chosen row factory in the INSERT query. Using the
``tuple_row_factory`` requires a switch from named to positional parameter
placeholders (e.g. ``%s`` instead of ``%(id)s`` for PostgreSQL, ``:1`` instead
of ``:id`` for Oracle). The ``pyodbc`` driver for MSSQL only supports
positional placeholders.

When using the ``load`` function in conjunction with ``iter_chunks``, the data
must be either named tuples or dictionaries.

transform
"""""""""

The ``transform`` parameter takes a callable (e.g. a function) that transforms
the data before returning it. See the :ref:`Transform <transform>` section for
details.

chunk_size
""""""""""

All data extraction functions use ``iter_chunks`` behind the scenes. This
reads rows from the database in *chunks* to prevent them all being loaded into
memory at once. The ``chunk_size`` argument sets the number of rows in each
chunk. The default ``chunk_size`` is 5000.
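The keyword arguments described above can be combined in a single call. The
following is a minimal sketch, assuming the ``rocks.db`` SQLite database used
above and that ``namedtuple_row_factory`` is importable from
``etlhelper.row_factories``, which filters with a named parameter, returns
named tuples and fetches 1000 rows per chunk:

.. code:: python

    import sqlite3

    import etlhelper as etl
    from etlhelper.row_factories import namedtuple_row_factory

    select_sql = "SELECT * FROM src WHERE grain_size = :grain_size"

    with sqlite3.connect("rocks.db") as conn:
        # parameters, row_factory and chunk_size are all passed through
        # to iter_chunks behind the scenes
        for row in etl.iter_rows(
            select_sql,
            conn,
            parameters={"grain_size": "fine"},
            row_factory=namedtuple_row_factory,
            chunk_size=1000,
        ):
            print(row.name)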