DataSet
The DataSource class is used to access various storage backends. The DataSet class encapsulates the data returned by a DataSource. A DataSet instance is a collection of Series. The DataSet also defines arithmetic operations on its Series and allows a single Series to be accessed directly as a NumPy ndarray.
This is illustrated below, using the build, environment variables, and CSV data load as described in the SOS QuickStart:
In [1]: from sosdb import Sos
In [2]: from numsos.DataSource import SosDataSource
In [3]: src = SosDataSource()
In [4]: src.config(path='/dir/my-container')
In [5]: src.show_schemas()
Name Id Attr Count
--------------- ------------ ------------
meminfo_E5-2698 129 50
In [6]: src.show_schema('meminfo_E5-2698')
Name Id Type Indexed Info
-------------------------------- -------- ------------ -------- --------------------------------
timestamp 0 TIMESTAMP True
component_id 1 UINT64 True
job_id 2 UINT64 True
app_id 3 UINT64 False
...
DirectMap1G 46 UINT64 False
comp_time 47 JOIN True component_id, timestamp
job_comp_time 48 JOIN True job_id, component_id, timestamp
job_time_comp 49 JOIN True job_id, timestamp, component_id
In [7]: src.select(['timestamp','component_id','Active'], from_ = ['meminfo_E5-2698'], where = [['component_id', Sos.COND_GE, 175]], order_by = 'comp_time')
In [8]: src.show()
meminfo_E5-2698
timestamp component_id Active
--------------- --------------- ---------------
(1518803953, 2825) 175 216756
(1518803954, 2780) 175 216756
(1518803955, 3614) 175 216756
(1518803956, 3451) 175 216756
(1518803957, 3245) 175 216756
(1518803958, 3056) 175 216756
(1518803959, 1220) 175 216756
...
(1518803961, 2806) 179 209712
(1518803962, 2662) 179 209712
--------------- --------------- ---------------
50 record(s)
In [9]: dst = src.get_results()
In [10]: len(dst)
Out[10]: 50
In [11]: dst.get_series_size()
Out[11]: 50
# The next call returns another DataSet holding only the selected Series, so it can be combined with other DataSets in expressions
In [12]: dst['timestamp']
Out[12]: <sosdb.DataSet.DataSet at 0x7f5df329d290>
# Multiple variables are supported in the DataSet
In [13]: dst['timestamp','component_id','Active'].show(limit=10)
timestamp component_id Active
---------------- ---------------- ----------------
2018-02-16T17:59:13.002825 175.0 216756.0
2018-02-16T17:59:14.002780 175.0 216756.0
2018-02-16T17:59:15.003614 175.0 216756.0
2018-02-16T17:59:16.003451 175.0 216756.0
2018-02-16T17:59:17.003245 175.0 216756.0
2018-02-16T17:59:18.003056 175.0 216756.0
2018-02-16T17:59:19.001220 175.0 216756.0
2018-02-16T17:59:20.002107 175.0 216756.0
2018-02-16T17:59:21.002038 175.0 216756.0
2018-02-16T17:59:22.003040 175.0 216756.0
---------------- ---------------- ----------------
10 results
# The next call returns the Series as a numpy ndarray without copying the data, so it can be passed directly to numpy calls
# Note that this works only for a single variable
In [14]: dst.array('timestamp')
Out[14]:
array(['2018-02-16T17:59:13.002825', '2018-02-16T17:59:14.002780',
'2018-02-16T17:59:15.003614', '2018-02-16T17:59:16.003451',
'2018-02-16T17:59:17.003245', '2018-02-16T17:59:18.003056',
'2018-02-16T17:59:19.001220', '2018-02-16T17:59:20.002107',
...
'2018-02-16T17:59:19.001032', '2018-02-16T17:59:20.001899',
'2018-02-16T17:59:21.002806', '2018-02-16T17:59:22.002662'], dtype='datetime64[us]')
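The show() output earlier prints each timestamp as a pair such as (1518803953, 2825), while the array holds datetime64[us] values. The outputs are consistent with reading the pair as seconds since the Unix epoch plus a microsecond offset. That reading is an inference from the values shown here, not a documented guarantee. A minimal NumPy-only sketch of the conversion, with variable names chosen for the illustration:

```python
import numpy as np

# Two records copied from the src.show() output: (seconds, microseconds).
raw = [(1518803953, 2825), (1518803954, 2780)]

secs = np.array([s for s, _ in raw], dtype="int64")
usecs = np.array([u for _, u in raw], dtype="int64")

# Epoch seconds plus a microsecond offset; NumPy promotes the sum to the finer unit.
stamps = secs.astype("datetime64[s]") + usecs.astype("timedelta64[us]")

print(stamps.dtype)  # datetime64[us]
print(stamps[0])     # 2018-02-16T17:59:13.002825
```

The two printed timestamps match the first two rows of the dst[...].show() listing above.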
Using functions on a DataSet:
# Use the call that returns the data Series as a numpy ndarray to see the current values
In [17]: dst.array('component_id')
Out[17]:
array([ 175., 175., 175., 175., 175., 175., 175., 175., 175.,
175., 176., ... 178.,
178., 178., 178., 178., 179., 179., 179., 179., 179.,
179., 179., 179., 179., 179.])
# Index by name to get the Series as a single-Series DataSet, then update it in place with an operator
In [19]: dst['component_id']
Out[19]: <sosdb.DataSet.DataSet at 0x7f5df826f3d0>
In [20]: dst['component_id']+=1
# Fetch the Series as an ndarray again (single variable only) to see the updated values:
In [21]: dst.array('component_id')
Out[21]:
array([ 176., 176., 176., 176., 176., 176., 176., 176., 176.,
176., 177., 177., 177., 177., 177., 177., 177., 177.,
177., 177., 178., 178., 178., 178., 178., 178., 178.,
178., 178., 178., 179., 179., 179., 179., 179., 179.,
179., 179., 179., 179., 180., 180., 180., 180., 180.,
180., 180., 180., 180., 180.])
You can also operate directly on the NumPy array:
# Fetch the Series as a numpy array and modify the array in place
In [23]: foo = dst.array('component_id')
In [24]: foo
Out[24]:
array([ 176., 176., 176., 176., 176., 176., 176., 176., 176.,
176., 177., ... 179., 180., 180., 180., 180., 180.,
180., 180., 180., 180., 180.])
In [25]: foo*=2
In [26]: foo
Out[26]:
array([ 352., 352., 352., 352., 352., 352., 352., 352., 352.,
352., 354., ... 358., 360., 360., 360., 360., 360.,
360., 360., 360., 360., 360.])
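Because array() is described above as zero-copy, an in-place operation such as foo*=2 acts on shared storage rather than on a private copy. The transcript does not display dst.array('component_id') again after the multiplication, so the following is only a NumPy-level sketch of what zero-copy sharing implies, not a statement about numSOS internals. The names backing, view, and snapshot are made up for the illustration.

```python
import numpy as np

# "backing" stands in for the storage behind a Series (hypothetical name).
backing = np.array([176.0, 177.0, 178.0])

view = backing.view()      # zero-copy: a second array object over the same buffer
snapshot = backing.copy()  # independent copy with its own buffer

view *= 2                  # in-place multiply, so the shared buffer changes

print(np.shares_memory(view, backing))      # the view and the backing array share storage
print(backing)                              # doubled, because the buffer is shared
print(snapshot)                             # unchanged, because it was copied first
```

If the original values must be preserved, taking a copy() of the returned array before modifying it is the conservative choice.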
show() can display the result of an expression without changing the base data:
In [27]: dst['Active'].show(limit=10)
Active
----------------
216756.0
216756.0
216756.0
216756.0
216756.0
216756.0
216756.0
216756.0
216756.0
216756.0
----------------
10 results
In [28]: (dst['Active'] + 1 ).show(limit=10)
(Active+1)
----------------
216757.0
216757.0
216757.0
216757.0
216757.0
216757.0
216757.0
216757.0
216757.0
216757.0
----------------
10 results
In [29]: dst['Active'].show(limit=10)
Active
----------------
216756.0
216756.0
216756.0
216756.0
216756.0
216756.0
216756.0
216756.0
216756.0
216756.0
----------------
10 results
DataSets also support multivariate and statistical functions of Series in the DataSet:
# Use the call that returns a DataSet with multiple Series
In [41]: src.select(['timestamp','component_id','Active','MemTotal'], from_ = ['meminfo_E5-2698'], where = [['component_id', Sos.COND_GE, 175]], order_by = 'comp_time')
In [42]: dst = src.get_results()
In [43]: dst.series
Out[43]: ['timestamp', 'component_id', 'Active', 'MemTotal']
In [45]: dst['Active']
Out[45]: <sosdb.DataSet.DataSet at 0x7f669c159650>
# Operate on the Series in the DataSet with a multivariate function. The output is a DataSet, with a Series whose name reflects the function performed.
In [49]: foo = dst['Active']/dst['MemTotal']
In [50]: foo
Out[50]: <sosdb.DataSet.DataSet at 0x7f669c1a80d0>
In [51]: foo.series
Out[51]: ['(Active/MemTotal)']
# Use the call that returns the Series as a numpy array to see the values:
In [52]: ratio = foo.array('(Active/MemTotal)')
In [53]: ratio
Out[53]:
array([ 0.00164334, 0.00164334, 0.00164334, 0.00164334, 0.00164334,
0.00164334, 0.00164334, 0.00164334, 0.00164334, 0.00164334,
0.00163813, ... 0.00517634,
0.00158994, 0.00158994, 0.00158994, 0.00158994, 0.00158994,
0.00158994, 0.00158994, 0.00158994, 0.00158994, 0.00158994])
# The new DataSet Series can be appended to the original DataSet
In [54]: dst.append_series(foo)
Out[54]: <sosdb.DataSet.DataSet at 0x7f669c194f90>
In [55]: dst.series
Out[55]: ['timestamp', 'component_id', 'Active', 'MemTotal', '(Active/MemTotal)']
# Statistical functions are also available, here by calling max() on the numpy array:
In [56]: foo.array('(Active/MemTotal)').max()
Out[56]: 0.0051763412917759863
Series can be renamed, which is particularly useful when the name is unwieldy, as it is after the operation shown above:
In [148]: dst.series
Out[148]: ['timestamp', 'component_id', 'Active', 'MemTotal', '(Active/MemTotal)']
In [149]: dst.rename('(Active/MemTotal)','ActiveRatio')
In [150]: dst.series
Out[150]: ['timestamp', 'component_id', 'Active', 'MemTotal', 'ActiveRatio']
The Transform class in numSOS provides an interface meant to simplify complex operations on DataSets.
- SOS QuickStart - includes creating SOS from CSV
- Building
- Viewing Class Documentation
- numSOS overview - python queries to numSOS data objects.