Skip to content

Watsonx.data Introduction

Watsonx.data is based on open source PrestoDB, a distributed query engine that enables querying data stored in open file formats using open table formats for optimization or performance. Some of the characteristics which you will learn and see in action include:

  • Compute processing is performed in memory and in parallel.
  • Data is pipelined between query stages and over the network reducing latency overhead that one would have if disk I/O were involved.

All the below tasks will be done using the Developer edition of watsonx.data.

Using watsonx.data

Connectivity to watsonx.data can be done using the following methods:

  • Command line interface (CLI)
  • JDBC drivers
  • watsonx.data UI

Connecting to watsonx.data and executing queries using CLI

Open the watsonx.data CLI using the development directory. Make sure you are the root user.

whoami

If not, switch to the root user.

sudo su -

Change to the development directory.

cd /root/ibm-lh-dev/bin

Start the Presto CLI.

./presto-cli

We are going to inspect the available catalogs in the watsonx.data system. A watsonx.data catalog contains schemas and references a data source via a connector. A connector is like a driver for a database. Watsonx.data connectors are an implementation of Presto’s SPI which allows Presto to interact with a resource. There are several built-in connectors for JMX, Hive, TPCH etc., some of which you will use as part of the labs.

Display the catalogs.

show catalogs;
    Catalog    
---------------
 hive_data     
 iceberg_data 
 jmx           
 system        
 tpcds         
 tpch          
(6 rows)

Let's look up what schemas are available with any given catalog. We will use the TPCH catalog which is an internal PrestoDB auto-generated catalog and look at the available schemas.

show schemas in tpch;
       Schema       
--------------------
 information_schema 
 sf1                
 sf100              
 sf1000             
 sf10000            
 sf100000           
 sf300              
 sf3000             
 sf30000            
 tiny               
(10 rows)

Quit the presto-cli interface by executing the “quit;” command.

quit;

You can connect to a specific catalog and schema and look at the tables etc.

./presto-cli --catalog tpch --schema tiny
presto:tiny>

You will notice that the Presto prompt includes the name of the schema we are currently connected to.

Look at the available tables in the TPCH catalog under the tiny schema.

show tables;
  Table   
----------
 customer 
 lineitem 
 nation   
 orders   
 part     
 partsupp 
 region   
 supplier 
(8 rows)

Inspect schema of the customer table.

describe customer;
   Column   |     Type     | Extra | Comment 
------------+--------------+-------+---------
 custkey    | bigint       |       |         
 name       | varchar(25)  |       |         
 address    | varchar(40)  |       |         
 nationkey  | bigint       |       |         
 phone      | varchar(15)  |       |         
 acctbal    | double       |       |         
 mktsegment | varchar(10)  |       |         
 comment    | varchar(117) |       |         
(8 rows)

You could also use the syntax below to achieve the same result.

show columns from customer;
Column     |     Type     | Extra | Comment
-----------+--------------+-------+---------
custkey    | bigint       |       |
name       | varchar(25)  |       |
address    | varchar(40)  |       |
nationkey  | bigint       |       |
phone      | varchar(15)  |       |
acctbal    | double       |       |
mktsegment | varchar(10)  |       |
comment    | varchar(117) |       |
(8 rows)

Inspect available functions.

show functions like 'date%';
  Function   |       Return Type        |                         Argument Types                         | Function Type | Deterministic |                         Description                         | Variable Arity | Built In | Temporary | Language 
-------------+--------------------------+----------------------------------------------------------------+---------------+---------------+-------------------------------------------------------------+----------------+----------+-----------+----------
 date        | date                     | timestamp                                                      | scalar        | true          |                                                             | false          | true     | false     |          
 date        | date                     | timestamp with time zone                                       | scalar        | true          |                                                             | false          | true     | false     |          
 date        | date                     | varchar(x)                                                     | scalar        | true          |                                                             | false          | true     | false     |          
 date_add    | date                     | varchar(x), bigint, date                                       | scalar        | true          | add the specified amount of date to the given date          | false          | true     | false     |          
 date_add    | time                     | varchar(x), bigint, time                                       | scalar        | true          | add the specified amount of time to the given time          | false          | true     | false     |          
 date_add    | time with time zone      | varchar(x), bigint, time with time zone                        | scalar        | true          | add the specified amount of time to the given time          | false          | true     | false     |          
 date_add    | timestamp                | varchar(x), bigint, timestamp                                  | scalar        | true          | add the specified amount of time to the given timestamp     | false          | true     | false     |          
 date_add    | timestamp with time zone | varchar(x), bigint, timestamp with time zone                   | scalar        | true          | add the specified amount of time to the given timestamp     | false          | true     | false     |          
 date_diff   | bigint                   | varchar(x), date, date                                         | scalar        | true          | difference of the given dates in the given unit             | false          | true     | false     |          
 date_diff   | bigint                   | varchar(x), time with time zone, time with time zone           | scalar        | true          | difference of the given times in the given unit             | false          | true     | false     |          
 date_diff   | bigint                   | varchar(x), time, time                                         | scalar        | true          | difference of the given times in the given unit             | false          | true     | false     |          
 date_diff   | bigint                   | varchar(x), timestamp with time zone, timestamp with time zone | scalar        | true          | difference of the given times in the given unit             | false          | true     | false     |          
 date_diff   | bigint                   | varchar(x), timestamp, timestamp                               | scalar        | true          | difference of the given times in the given unit             | false          | true     | false     |          
 date_format | varchar                  | timestamp with time zone, varchar(x)                           | scalar        | true          |                                                             | false          | true     | false     |          
 date_format | varchar                  | timestamp, varchar(x)                                          | scalar        | true          |                                                             | false          | true     | false     |          
 date_parse  | timestamp                | varchar(x), varchar(y)                                         | scalar        | true          |                                                             | false          | true     | false     |          
 date_trunc  | date                     | varchar(x), date                                               | scalar        | true          | truncate to the specified precision in the session timezone | false          | true     | false     |          
 date_trunc  | time                     | varchar(x), time                                               | scalar        | true          | truncate to the specified precision in the session timezone | false          | true     | false     |          
 date_trunc  | time with time zone      | varchar(x), time with time zone                                | scalar        | true          | truncate to the specified precision                         | false          | true     | false     |          
 date_trunc  | timestamp                | varchar(x), timestamp                                          | scalar        | true          | truncate to the specified precision in the session timezone | false          | true     | false     |          
 date_trunc  | timestamp with time zone | varchar(x), timestamp with time zone                           | scalar        | true          | truncate to the specified precision                         | false          | true     | false     |          
(21 rows)

Switch to a different schema.

use sf1;

Display the Tables in the schema.

show tables;
  Table   
----------
 customer 
 lineitem 
 nation   
 orders   
 part     
 partsupp 
 region   
 supplier 
(8 rows)

Query data from customer table.

select * from customer limit 5;
 custkey |        name        |                 address                  | nationkey |      phone      | acctbal | mktsegment |                                                comment                                                
---------+--------------------+------------------------------------------+-----------+-----------------+---------+------------+-------------------------------------------------------------------------------------------------------
   37501 | Customer#000037501 | Ftb6T5ImHuJ                              |         2 | 12-397-688-6719 | -324.85 | HOUSEHOLD  | pending ideas use carefully. express, ironic platelets use among the furiously regular instructions.  
   37502 | Customer#000037502 | ppCVXCFV,4JJ97IibbcMB5,aPByjYL07vmOLO 3m |        18 | 28-515-931-4624 |  5179.2 | BUILDING   | express deposits. pending, regular deposits wake furiously bold deposits. regular                     
   37503 | Customer#000037503 | Cg60cN3LGIUpLpXn0vRffQl8                 |        13 | 23-977-571-7365 | 1862.32 | BUILDING   | ular deposits. furiously ironic deposits integrate carefully among the iron                           
   37504 | Customer#000037504 | E1 IiMlCfW7I4 1b9wfDZR                   |        21 | 31-460-590-3623 | 2955.33 | HOUSEHOLD  | s believe slyly final foxes. furiously e                                                              
   37505 | Customer#000037505 | Ad,XVdA6XAa0h aukZHUo5Mxh,ZRwVR3k7b7     |         3 | 13-521-760-7263 | 3243.15 | FURNITURE  | ites according to the quickly bold instru                                                             
(5 rows)

Gather statistics on a given table.

show stats for customer;
 column_name |  data_size  | distinct_values_count | nulls_fraction | row_count | low_value | high_value 
-------------+-------------+-----------------------+----------------+-----------+-----------+------------
 custkey     | NULL        |              150039.0 |            0.0 | NULL      | 1         | 150000     
 name        |   2700000.0 |              149980.0 |            0.0 | NULL      | NULL      | NULL       
 address     |   3758056.0 |              150043.0 |            0.0 | NULL      | NULL      | NULL       
 nationkey   | NULL        |                  25.0 |            0.0 | NULL      | 0         | 24         
 phone       |   2250000.0 |              150018.0 |            0.0 | NULL      | NULL      | NULL       
 acctbal     | NULL        |              140166.0 |            0.0 | NULL      | -999.99   | 9999.99    
 mktsegment  |   1349610.0 |                   5.0 |            0.0 | NULL      | NULL      | NULL       
 comment     | 1.0876099E7 |              149987.0 |            0.0 | NULL      | NULL      | NULL       
 NULL        | NULL        | NULL                  | NULL           |  150000.0 | NULL      | NULL       
(9 rows)

Quit Presto.

quit;