
DataSet

oceandlr edited this page Mar 9, 2019 · 20 revisions

The DataSource class is used to access various storage backends, and the DataSet class encapsulates the data that a DataSource returns. A DataSet instance is a collection of Series. The DataSet also defines computations on its Series and allows a single Series to be accessed directly as a numpy ndarray.

This is illustrated below, using the build, environment variables, and CSV data load as described in the SOS QuickStart:

In [1]: from sosdb import Sos

In [2]: from numsos.DataSource import SosDataSource

In [3]: src = SosDataSource()

In [4]: src.config(path='/dir/my-container')

In [5]: src.show_schemas()
Name            Id           Attr Count
--------------- ------------ ------------
meminfo_E5-2698          129           50

In [6]: src.show_schema('meminfo_E5-2698')
Name                             Id       Type         Indexed  Info
-------------------------------- -------- ------------ -------- --------------------------------
timestamp                               0 TIMESTAMP    True     
component_id                            1 UINT64       True     
job_id                                  2 UINT64       True     
app_id                                  3 UINT64       False    
...
DirectMap1G                            46 UINT64       False    
comp_time                              47 JOIN         True     component_id, timestamp
job_comp_time                          48 JOIN         True     job_id, component_id, timestamp
job_time_comp                          49 JOIN         True     job_id, timestamp, component_id

In [7]: src.select(['timestamp', 'component_id', 'Active'], from_ = ['meminfo_E5-2698'], where = [['component_id', Sos.COND_GE, 175]], order_by = 'comp_time')


In [8]: src.show()
meminfo_E5-2698                                 
timestamp       component_id    Active          
--------------- --------------- --------------- 
(1518803953, 2825)             175          216756 
(1518803954, 2780)             175          216756 
(1518803955, 3614)             175          216756 
(1518803956, 3451)             175          216756 
(1518803957, 3245)             175          216756 
(1518803958, 3056)             175          216756 
(1518803959, 1220)             175          216756 
...
(1518803961, 2806)             179          209712 
(1518803962, 2662)             179          209712 
--------------- --------------- --------------- 
50 record(s)


In [9]: dst = src.get_results()

In [10]: len(dst)
Out[10]: 50

In [11]: dst.get_series_size()
Out[11]: 50

# Indexing by Series name returns another DataSet containing only that Series, so the result can be combined with other DataSets in expressions

In [12]: dst['timestamp']
Out[12]: <sosdb.DataSet.DataSet at 0x7f5df329d290>

# Multiple variables are supported in the DataSet
In [13]: dst['timestamp','component_id','Active'].show(limit=10)
   timestamp     component_id           Active 
---------------- ---------------- ---------------- 
2018-02-16T17:59:13.002825            175.0         216756.0 
2018-02-16T17:59:14.002780            175.0         216756.0 
2018-02-16T17:59:15.003614            175.0         216756.0 
2018-02-16T17:59:16.003451            175.0         216756.0 
2018-02-16T17:59:17.003245            175.0         216756.0 
2018-02-16T17:59:18.003056            175.0         216756.0 
2018-02-16T17:59:19.001220            175.0         216756.0 
2018-02-16T17:59:20.002107            175.0         216756.0 
2018-02-16T17:59:21.002038            175.0         216756.0 
2018-02-16T17:59:22.003040            175.0         216756.0 
---------------- ---------------- ---------------- 
10 results
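
The raw src.show() output earlier prints each timestamp as a (seconds, microseconds) pair, while the DataSet prints an ISO 8601 string. The relationship between the two can be checked with plain numpy. This is only a sketch of the arithmetic, not numSOS's own conversion code, and the helper name to_datetime64 is hypothetical.

```python
import numpy as np

def to_datetime64(secs, usecs):
    """Combine epoch seconds and microseconds into a datetime64[us] value."""
    return np.datetime64(secs, "s") + np.timedelta64(usecs, "us")

# First two (seconds, microseconds) pairs from the src.show() output.
raw = [(1518803953, 2825), (1518803954, 2780)]
stamps = np.array([to_datetime64(s, us) for s, us in raw], dtype="datetime64[us]")

print(stamps[0])     # 2018-02-16T17:59:13.002825
print(stamps.dtype)  # datetime64[us]
```

The values are interpreted as UTC, which matches the strings shown in the DataSet output.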


# array() returns the Series as a numpy ndarray with _zero-copy_, so it can be passed directly to numpy calls
# Note that this works for a _single variable_ only
In [14]: dst.array('timestamp')
Out[14]: 
array(['2018-02-16T17:59:13.002825', '2018-02-16T17:59:14.002780',
       '2018-02-16T17:59:15.003614', '2018-02-16T17:59:16.003451',
       '2018-02-16T17:59:17.003245', '2018-02-16T17:59:18.003056',
       '2018-02-16T17:59:19.001220', '2018-02-16T17:59:20.002107',
       ...
       '2018-02-16T17:59:19.001032', '2018-02-16T17:59:20.001899',
       '2018-02-16T17:59:21.002806', '2018-02-16T17:59:22.002662'], dtype='datetime64[us]')
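
Zero-copy means the returned ndarray shares its buffer with the data the DataSet holds rather than duplicating it. The numpy-only sketch below shows what that sharing looks like. The dict named store is a hypothetical stand-in for a DataSet's internal storage, not the numSOS implementation.

```python
import numpy as np

# Hypothetical stand-in for a DataSet's internal storage: one ndarray per Series.
store = {"component_id": np.array([175.0, 175.0, 176.0])}

view = store["component_id"]          # zero-copy: same buffer, no data moved
copy = store["component_id"].copy()   # independent buffer

print(np.shares_memory(view, store["component_id"]))  # True
print(np.shares_memory(copy, store["component_id"]))  # False
```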

Using functions on a DataSet:

# Use the call that returns the data Series as a numpy ndarray to see the current values
In [17]: dst.array('component_id')
Out[17]: 
array([ 175.,  175.,  175.,  175.,  175.,  175.,  175.,  175.,  175.,
    175.,  176.,  ...  178.,
    178.,  178.,  178.,  178.,  179.,  179.,  179.,  179.,  179.,
    179.,  179.,  179.,  179.,  179.])

# Index the DataSet by name to get the Series as a DataSet, then operate on it in place
In [19]: dst['component_id']
Out[19]: <sosdb.DataSet.DataSet at 0x7f5df826f3d0>

In [20]: dst['component_id']+=1

# Use the call that returns the data Series as an ndarray _for a single variable only_ to see the updated values:
In [21]: dst.array('component_id')
Out[21]: 
array([ 176.,  176.,  176.,  176.,  176.,  176.,  176.,  176.,  176.,
    176.,  177.,  177.,  177.,  177.,  177.,  177.,  177.,  177.,
    177.,  177.,  178.,  178.,  178.,  178.,  178.,  178.,  178.,
    178.,  178.,  178.,  179.,  179.,  179.,  179.,  179.,  179.,
    179.,  179.,  179.,  179.,  180.,  180.,  180.,  180.,  180.,
    180.,  180.,  180.,  180.,  180.])

You can also operate on the numpy array directly with ordinary Python operators:

# Use the call that returns the data Series as a numpy array and operate on it with a function
In [23]: foo = dst.array('component_id')

In [24]: foo
Out[24]: 
array([ 176.,  176.,  176.,  176.,  176.,  176.,  176.,  176.,  176.,
    176.,  177.,  ... 179.,  180.,  180.,  180.,  180.,  180.,
    180.,  180.,  180.,  180.,  180.])

In [25]: foo*=2

In [26]: foo
Out[26]: 
array([ 352.,  352.,  352.,  352.,  352.,  352.,  352.,  352.,  352.,
    352.,  354.,  ... 358.,  360.,  360.,  360.,  360.,  360.,
    360.,  360.,  360.,  360.,  360.])
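
Note that foo *= 2 is an in-place numpy operation. If array() is zero-copy as described earlier, such an operation would be expected to change the values held by the DataSet too, although this page does not show that directly. The numpy-only sketch below contrasts in-place and out-of-place arithmetic on a shared buffer; base is a hypothetical stand-in for the Series data.

```python
import numpy as np

base = np.array([176.0, 177.0, 178.0])  # hypothetical stand-in for Series data
foo = base                               # shared buffer, as with a zero-copy array

doubled = foo * 2    # out-of-place: allocates a new array, base is untouched
print(base)          # [176. 177. 178.]

foo *= 2             # in-place: writes through to the shared buffer
print(base)          # [352. 354. 356.]
```

When the original values must be preserved, calling .copy() on the returned array before modifying it is the usual numpy approach.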

show() can display the result of an expression without changing the base data:

In [27]: dst['Active'].show(limit=10)
      Active 
---------------- 
    216756.0 
    216756.0 
    216756.0 
    216756.0 
    216756.0 
    216756.0 
    216756.0 
    216756.0 
    216756.0 
    216756.0 
---------------- 
10 results


In [28]: (dst['Active'] + 1 ).show(limit=10)
  (Active+1) 
---------------- 
    216757.0 
    216757.0 
    216757.0 
    216757.0 
    216757.0 
    216757.0 
    216757.0 
    216757.0 
    216757.0 
    216757.0 
---------------- 
10 results

In [29]: dst['Active'].show(limit=10)
      Active 
---------------- 
    216756.0 
    216756.0 
    216756.0 
    216756.0 
    216756.0 
    216756.0 
    216756.0 
    216756.0 
    216756.0 
    216756.0 
---------------- 
10 results

DataSets also support multivariate and statistical functions over their Series:

# Use the call that returns a DataSet with multiple Series
In [41]: src.select(['timestamp', 'component_id', 'Active', 'MemTotal'], from_ = ['meminfo_E5-2698'], where = [['component_id', Sos.COND_GE, 175]], order_by = 'comp_time')

In [42]: dst = src.get_results()

In [43]: dst.series
Out[43]: ['timestamp', 'component_id', 'Active', 'MemTotal']

In [45]: dst['Active']
Out[45]: <sosdb.DataSet.DataSet at 0x7f669c159650>

# Operate on the Series in the DataSet with a multivariate function. The output is a DataSet, with a Series whose name reflects the function performed.
In [49]: foo = dst['Active']/dst['MemTotal']

In [50]: foo
Out[50]: <sosdb.DataSet.DataSet at 0x7f669c1a80d0>

In [51]: foo.series
Out[51]: ['(Active/MemTotal)']

# Use the call that returns the Series as a numpy array to see the values:
In [52]: ratio = foo.array('(Active/MemTotal)')

In [53]: ratio
Out[53]: 
array([ 0.00164334,  0.00164334,  0.00164334,  0.00164334,  0.00164334,
    0.00164334,  0.00164334,  0.00164334,  0.00164334,  0.00164334,
    0.00163813,  ...  0.00517634,
    0.00158994,  0.00158994,  0.00158994,  0.00158994,  0.00158994,
    0.00158994,  0.00158994,  0.00158994,  0.00158994,  0.00158994])
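
The naming behaviour above, where dividing Active by MemTotal yields a Series called (Active/MemTotal), can be mimicked with Python operator overloading. The class below is a hypothetical toy called NamedSeries, written only to illustrate the idea. It is not the numSOS DataSet, and its sample values are made up.

```python
import numpy as np

class NamedSeries:
    """Toy, hypothetical stand-in for a one-Series DataSet (not numSOS code)."""

    def __init__(self, name, values):
        self.name = name
        self.values = np.asarray(values, dtype=float)

    def _combine(self, other, symbol, func):
        # Accept either another NamedSeries or a plain scalar on the right.
        if isinstance(other, NamedSeries):
            other_name, other_values = other.name, other.values
        else:
            other_name, other_values = str(other), other
        new_name = "({0}{1}{2})".format(self.name, symbol, other_name)
        return NamedSeries(new_name, func(self.values, other_values))

    def __add__(self, other):
        return self._combine(other, "+", np.add)

    def __truediv__(self, other):
        return self._combine(other, "/", np.divide)

# Made-up sample values, for illustration only.
active = NamedSeries("Active", [200.0, 300.0])
mem_total = NamedSeries("MemTotal", [1000.0, 1000.0])

ratio = active / mem_total
print(ratio.name)          # (Active/MemTotal)
print((active + 1).name)   # (Active+1)
print(active.values)       # [200. 300.]  (operands are left unchanged)
```

Returning a new object from each operator is what lets an expression be displayed or stored without altering its operands.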

# The new DataSet Series can be appended to the original DataSet
In [54]: dst.append_series(foo)
Out[54]: <sosdb.DataSet.DataSet at 0x7f669c194f90>

In [55]: dst.series
Out[55]: ['timestamp', 'component_id', 'Active', 'MemTotal', '(Active/MemTotal)']

# Some statistical functions are also defined; here max() is called on the numpy array:
In [56]: foo.array('(Active/MemTotal)').max()
Out[56]: 0.0051763412917759863
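
Because array() returns an ordinary ndarray, the usual numpy reductions apply to it as well. The sketch below uses three of the ratio values printed above. The full 50-value array is elided on this page, so this sample is only a subset and its statistics are not those of the real data.

```python
import numpy as np

# Three of the ratio values printed above; the rest of the array is elided.
sample = np.array([0.00164334, 0.00163813, 0.00158994])

print(sample.max())     # 0.00164334
print(sample.min())     # 0.00158994
print(sample.argmax())  # 0 (index of the largest value)
mean = sample.mean()    # arithmetic mean of the sample
```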

Series can be renamed, which is particularly useful when the name is unwieldy, as it is after the operation shown above:

In [148]: dst.series
Out[148]: ['timestamp', 'component_id', 'Active', 'MemTotal', '(Active/MemTotal)']

In [149]: dst.rename('(Active/MemTotal)','ActiveRatio')

In [150]: dst.series
Out[150]: ['timestamp', 'component_id', 'Active', 'MemTotal', 'ActiveRatio']

The Transform class in numSOS provides an interface meant to simplify complex operations on DataSets.
