developer.qdataset

From autoplot.org

Purpose: write new page documenting QDataSet, after discovering a lack of resources for describing Bundles. This will replace http://autoplot.org/QDataSet, and will be completed as a part of the migration of QDataSet into the das2 core.

Audience: das2 developers and interested scientists.

Contents

  1. Introduction
  2. Simple Arrays
    1. Rank 1 Example
    2. Rank 2 Example - Scalar Time Series
    3. Rank 2 Example - Vector Time Series
    4. Rank 3 Example
  3. More Metadata
  4. Example Datasets
  5. Bundles
  6. Comparing IDL array operations to QDataSets

1. Introduction

QDataSet is the mature interface used for representing, or modelling, data in Java. When one interface is used for all data, a rich set of operators can be developed, and software is written more quickly and correctly. Often QDataSet is referred to as a "Data Model."

QDataSet was originally developed in 2005-2007 for PaPCO, an IDL application, to provide a standard way of representing data between plug-in modules for different instruments and different spacecraft. Autoplot, a Java application, was begun in 2007 and completed this model. Note that one of the design goals of QDataSet is to do most of the work in semantics instead of syntax, so that the model can be implemented in any software framework. Autoplot is in its eighth year, and QDataSet is quite mature, providing not just the interface, but also hundreds (TODO: count) of operators and sources for creating QDataSets from data files.

Because QDataSets carry rich metadata that allow the numbers to be interpreted, rich operations that need little documentation can be developed as well. For example, a waveform dataset can be passed to the fftPower function, and a power spectrum is returned, with all the labels and frequency channels correctly calculated.

QDataSet reuses a number of conventions found in CDF and NetCDF files.

2. Simple Arrays

A QDataSet is basically an array with metadata attached to it. An array is a set of N numbers accessed by an index from 0 to N-1. Some languages support 2-D and 3-D arrays, so to access any number, you need two or three indices, respectively.

With QDataSet, we call the number of indices the "rank" of the dataset. At its core are three accessor methods:

ds.rank()         # The number of indices this QDataSet uses.
ds.length()       # The number of values in the 0th (first) index of the QDataSet.
ds.value(i)       # A value within a rank 1 QDataSet.
# or
ds.value(i,j)     # A value within a rank 2 QDataSet.
# or
ds.value(i,j,k)   # A value within a rank 3 QDataSet.

A rank 1 dataset is similar to a simple 1-D array. A number of elements are available (ds.length()), and each element is accessed with ds.value(i). The index i can take on values of 0,1,2,...,ds.length()-1.
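
To make the three accessors concrete without requiring Autoplot, here is a minimal sketch in plain Python. The class name ToyRank1 is hypothetical and is only a stand-in for the idea; it is not the Autoplot implementation.

```python
# Toy stand-in for a rank 1 QDataSet. ToyRank1 is a hypothetical name,
# not part of Autoplot; it only mimics rank(), length() and value(i).
class ToyRank1(object):
    def __init__(self, values):
        # values are stored as floats, since value(i) returns a double
        self._values = [float(v) for v in values]

    def rank(self):
        return 1

    def length(self):
        return len(self._values)

    def value(self, i):
        return self._values[i]


ds = ToyRank1([2, 3, 2, 4, 5, 4, 2, 1])

# The index i takes on the values 0,1,2,...,ds.length()-1
total = 0.0
for i in range(ds.length()):
    total += ds.value(i)

print(total)  # 23.0
```

Any code written only against rank(), length(), and value(i) works regardless of how the numbers are stored, which is the point of using one interface for all data.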

In Autoplot, Jython can be used to create and modify QDataSets.

2.1. Rank 1 Example

# Create a Jython array
array= [ 2,3,2,4,5,4,2,1 ]
print 'Jython Array: ', array # Jython Array:  [2, 3, 2, 4, 5, 4, 2, 1]
 
# Convert Jython array into a QDataSet
ds= dataset( array )
print ds.rank() # 1
print ds.length() # 8
 
# Add metadata to QDataSet
ds= putProperty( ds, QDataSet.LABEL, "frequency" )
ds= putProperty( ds, QDataSet.UNITS, Units.hertz )
 
# Extract metadata and store them as Python strings
print str(ds.property( QDataSet.LABEL ) ) # frequency
 
# Extract data
array= [ ds.value(0), ds.value(1), ds.value(2) ]
print 'Jython Array of floats: ', array # [2.0, 3.0, 2.0]
# or
array= [ ds[0], ds[1], ds[2] ] 
print 'Jython Array of QDataSets: ', array # [2.0 Hz, 3.0 Hz, 2.0 Hz]
# rank() and length() return integers, so convert them before concatenating
print 'First element has rank ' + str(array[0].rank()) + ' and length ' + str(array[0].length())

2.2. Rank 2 Example - Scalar Time Series

A rank 2 dataset is similar to a 2-D array. Like Java and unlike Fortran, this is an array of arrays, so a rank 2 QDataSet can be thought of as a set of rank 1 datasets, and each rank 1 dataset can have a different length. Therefore, to get the length of the second index, length(i) is called, and value(i,j) is used to access the values. Note most rank 2 QDataSets are actually repeated measurements of the same thing, so length(0) will equal length(1) and all other lengths. Datasets in this special case are called "Qubes," and are more like Fortran arrays.

x = [1,4,9,16]
t = [0,1,2,3]
...
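
The array-of-arrays model described in this section can be sketched in plain Python. ToyRank2 and is_qube are hypothetical names used only for illustration; in the real interface a dataset asserts that it is a qube through a property rather than by being scanned like this.

```python
# Toy stand-in for a rank 2 QDataSet stored as a list of records.
# ToyRank2 and is_qube are hypothetical names, not Autoplot functions.
class ToyRank2(object):
    def __init__(self, rows):
        self._rows = [[float(v) for v in row] for row in rows]

    def rank(self):
        return 2

    def length(self, *index):
        # length() is the number of records,
        # length(i) is the number of values in record i.
        if len(index) == 0:
            return len(self._rows)
        return len(self._rows[index[0]])

    def value(self, i, j):
        return self._rows[i][j]


def is_qube(ds):
    # A qube has the same length for every record.
    return all(ds.length(i) == ds.length(0) for i in range(ds.length()))


ragged = ToyRank2([[1, 2, 3], [4, 5], [6]])
qube = ToyRank2([[1, 4], [9, 16]])

print(is_qube(ragged))  # False
print(is_qube(qube))    # True
```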

2.3. Rank 2 Example - Vector Time Series

Vx = [1,4,9,16]
Vy = [1,1,1,1]
t = [0,1,2,3]
...
Show how functions like fftPower(), autoHistogram(), mean(), smooth(), etc. can operate on the dataset.

2.4. Rank 3 Example

Higher rank datasets exist as well, namely rank 3 and rank 4 datasets. These have value methods value(i,j,k) and value(i,j,k,l), and might be Qubes as well.

The "slice" method extracts the dataset at a given position of the first index, returning a dataset of one lower rank. For example:

ds.slice(0) 

extracts the 0th dataset.
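
As a rough illustration of how slicing lowers the rank, the following plain-Python sketch uses nested lists in place of datasets. The helper names toy_rank and toy_slice are hypothetical and stand in for ds.rank() and ds.slice(i).

```python
# Nested lists stand in for datasets here; toy_rank and toy_slice are
# hypothetical helpers, not Autoplot functions.
def toy_rank(data):
    # Count how many indices are needed to reach a number.
    rank = 0
    while isinstance(data, list):
        rank += 1
        if len(data) == 0:
            break
        data = data[0]
    return rank


def toy_slice(data, i):
    # Fix the first index at i, leaving a dataset of one lower rank.
    return data[i]


rank3 = [[[1, 2], [3, 4]],
         [[5, 6], [7, 8]]]

first = toy_slice(rank3, 0)
print(toy_rank(rank3))  # 3
print(toy_rank(first))  # 2

# Slicing three times reaches a single number, the analog of value(1,0,1)
print(toy_slice(toy_slice(toy_slice(rank3, 1), 0), 1))  # 6
```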

The properties of the dataset, which describe the units and labels but also declare relationships between datasets, are accessed with

ds.property(name) # the value of the named property of this dataset.

Properties:

  • DEPEND_0 The independent parameter "causing" the first index, which is also a QDataSet.
  • DEPEND_1 The independent parameter "causing" the second index.

Note there is no method to query which properties are available, and no property is required.
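
The idea that DEPEND_0 is itself a dataset attached as a property can be sketched in plain Python. ToyDataSet and put_property are hypothetical names, the strings 'DEPEND_0' and 'LABEL' stand in for the QDataSet constants, and returning None for an unset property is an assumption of this toy.

```python
# Toy dataset with a property map. ToyDataSet and put_property are
# hypothetical; plain strings stand in for the QDataSet.* constants.
class ToyDataSet(object):
    def __init__(self, values):
        self._values = list(values)
        self._props = {}

    def length(self):
        return len(self._values)

    def value(self, i):
        return self._values[i]

    def put_property(self, name, value):
        self._props[name] = value

    def property(self, name):
        # No property is required, so an unset one yields None (toy assumption).
        return self._props.get(name)


t = ToyDataSet([0, 1, 2, 3])
x = ToyDataSet([1, 4, 9, 16])
x.put_property('LABEL', 'x')
x.put_property('DEPEND_0', t)   # t is the independent parameter for the first index

dep0 = x.property('DEPEND_0')
pairs = [(dep0.value(i), x.value(i)) for i in range(x.length())]
print(pairs)  # [(0, 1), (1, 4), (2, 9), (3, 16)]
```

Because the independent parameter travels with the dependent data, a plotting routine or operator can find the time tags from x alone.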

3. More Metadata

Properties:

  • UNITS The units. In Autoplot this is a Java class of enumerated values plus additional ones encountered.
  • LABEL Human-readable label for plot axes
  • TITLE Longer, human-readable title
  • NAME A Java-identifier-style name for the dataset, such as "flux", "energyFlux", or "spacecraft_time"
  • VALID_MAX The maximum valid value (inclusive), a double
  • VALID_MIN The minimum valid value (inclusive), a double
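
Since both limits are inclusive, a client that honors VALID_MIN and VALID_MAX keeps a value v when VALID_MIN <= v <= VALID_MAX. The plain-Python sketch below shows only that comparison; the function name valid_values and the sample numbers are made up for illustration, and how Autoplot itself treats invalid values is not covered here.

```python
# valid_values is a hypothetical helper showing the inclusive range test.
def valid_values(values, valid_min, valid_max):
    return [v for v in values if valid_min <= v <= valid_max]


# Made-up sample data: -1e31 plays the role of an out-of-range marker.
samples = [-1e31, 0.0, 2.5, 10.0, 99.0]
kept = valid_values(samples, 0.0, 10.0)
print(kept)  # [0.0, 2.5, 10.0]
```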

4. Example Datasets

Rank 1 QDataSet: Image:qdataset-rank1-v2.jpg

Rank 2 QDataSet: Image:qdataset-rank2.jpg

5. Bundles

We can also model tuples of data, by which we mean a set of data that are coincident. For example at a given time we collect density, velocity, and flux. We then have an array with elements time, density, vx, vy, vz, f1, f2, f3, f4. TODO: create this dataset in org.virbo.dataset.examples.Schemes.

These values can be represented in a rank 1 QDataSet of nine elements, of course, but how do you store the different labels and units? For this, QDataSet uses the property BUNDLE_0, which is itself a QDataSet, but a very odd one called a BundleDataSetDescriptor. It is always rank 2, and carries the properties for each of the bundled elements. For example:

bds= ds.property( QDataSet.BUNDLE_0 )
print bds.property( 1, QDataSet.TITLE )   #  'Density'  
print bds.property( 2, QDataSet.LABEL )   #  'Vx'

Having the BundleDataSetDescriptor allows us to model a commonly seen rank 2 table of numbers:

2015-02-26T08:36   1 1 2 3 1 2 3 4
2015-02-26T08:37   1 1 2 3 1 2 3 4
2015-02-26T08:38   1 1 2 3 1 2 3 4

which can be modeled as:

print ds.rank()                           #   '2'
bds= ds.property( QDataSet.BUNDLE_1 )  
print bds.property( 1, QDataSet.TITLE )   #  'Density'  
print bds.property( 2, QDataSet.LABEL )   #  'Vx'

and a slice results in the tuple described above:

ds.slice(0)                               #   results in the tuple of individual measurements for time, density, vx, vy, vz, f1, f2, f3, f4.

To access individual datasets from the bundle, the unbundle function can be used:

print unbundle( ds, 2 )     # results in the vx column
print unbundle( ds, 'vx' )  # also results in the vx column, looking for the dataset by the name property.
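
The lookup by index or by name can be sketched in plain Python, with a list of names standing in for the bundle descriptor. The function toy_unbundle is hypothetical and is not Autoplot's unbundle; the time column uses made-up offsets in seconds in place of the timestamps in the table above.

```python
# A list of names stands in for the bundle descriptor, and nested lists
# stand in for the rank 2 bundle. toy_unbundle is a hypothetical helper.
names = ['time', 'density', 'vx', 'vy', 'vz', 'f1', 'f2', 'f3', 'f4']

# time is given as seconds from the first record (an assumption for brevity)
records = [
    [0,   1, 1, 2, 3, 1, 2, 3, 4],
    [60,  1, 1, 2, 3, 1, 2, 3, 4],
    [120, 1, 1, 2, 3, 1, 2, 3, 4],
]


def toy_unbundle(records, names, key):
    # key is either a column index or the name of the bundled dataset
    if isinstance(key, str):
        j = names.index(key)
    else:
        j = key
    return [record[j] for record in records]


print(toy_unbundle(records, names, 2))     # [1, 1, 1]
print(toy_unbundle(records, names, 'vx'))  # [1, 1, 1]
print(toy_unbundle(records, names, 'vz'))  # [3, 3, 3]
```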

A special "high-rank" mode is also supported to unbundle rank 2 (or greater) datasets:

print unbundle( ds, 'v' )   # results in v[n,3]

It can do this because there is metadata within the bundle properties describing this dataset:

bds.property( 2, QDataSet.START_INDEX )     # 2
bds.length( 2 )                             # 1 because each record contains a rank 1 dataset.  
bds.value( 2, 0 )                           # 3 (3 elements: vx,vy,vz); bds is rank 2, so two indices are needed
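
Given a start index and an element count like those above, the high-rank unbundle amounts to taking a run of adjacent columns from each record. A plain-Python sketch, where toy_unbundle_high_rank is a hypothetical name and nested lists stand in for the datasets:

```python
# toy_unbundle_high_rank is a hypothetical helper, not Autoplot's unbundle.
# start_index plays the role of START_INDEX and n_elements the role of the
# size stored in the descriptor (3 for vx, vy, vz).
def toy_unbundle_high_rank(records, start_index, n_elements):
    return [record[start_index:start_index + n_elements] for record in records]


# Columns: time, density, vx, vy, vz, f1, f2, f3, f4 (time as made-up seconds)
records = [
    [0,   1, 1, 2, 3, 1, 2, 3, 4],
    [60,  1, 1, 2, 3, 1, 2, 3, 4],
    [120, 1, 1, 2, 3, 1, 2, 3, 4],
]

v = toy_unbundle_high_rank(records, 2, 3)
print(v)  # [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
```

The result has one three-element vector per record, which corresponds to the v[n,3] shape mentioned above.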

TODO:

bds.property( 1, QDataSet.DEPENDNAME_0 )  which returns the name of the DEPEND_0 dataset, which is attached automatically when unbundling.

6. Comparing IDL array operations to QDataSets

There are a number of differences between IDL's array operations and QDataSet operations.

Note in Jython ds[2,3] is the same as ds.slice(2).slice(3) and similar to ds.value(2,3). (Similar because value(2,3) returns a double, while ds.slice(2).slice(3) returns a rank 0 dataset.)

In the following table, "ds" may change from row to row, but is the same for the two columns. TODO: these should probably be equivalent codes.

IDL | QDataSets in Autoplot's Jython
ds[1,*] returns 1 by n array, reform used to remove the 1 | ds[1,:] returns rank 1 n-element dataset.
ds[1] is a scalar | ds[1] is a rank 0 dataset
arrays can have 1 to 8 indices | datasets can have 0 to 4 indices
single index aliases to multi-index, ds[19] is the same as ds[3,4] of 4 by 5 array | single index slices to dataset of rank N-1, DataSetIterator should be used to access all elements
where command returns aliased indices for 2-D arrays | where returns r[n,2] of indices for each dimension.
where command returns -1 to indicate no matches | where returns zero-length dataset for no matches.
zero-length arrays supported in IDL 8.1 and up | zero-length arrays have always been supported
size command shows number of dimensions and size of each dimension (its geometry) | ds.rank() and ds.length() are used to discover dataset geometry.
arrays are qubes, having the same number of columns in each row | each record could have a different length, QDataSet.QUBE property asserts that the dataset is a qube.
print, ds prints all the elements | for d in ds: print d prints all the elements
help, ds shows information about the array | print ds shows information about the array
!dtor converts to radians | toRadians(ds) converts to radians
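
The index aliasing in the table can be checked with a little arithmetic. In IDL the first index varies fastest, so for an array whose first dimension has n0 elements, the single index is i + j*n0. The plain-Python sketch below uses hypothetical helper names to convert in both directions.

```python
# Hypothetical helpers showing how an IDL single (aliased) index maps to a
# pair of indices; n0 is the size of the first dimension.
def indices_to_alias(i, j, n0):
    return i + j * n0


def alias_to_indices(alias, n0):
    return (alias % n0, alias // n0)


# The 4 by 5 array from the table: ds[19] is the same element as ds[3,4]
print(indices_to_alias(3, 4, 4))  # 19
print(alias_to_indices(19, 4))    # (3, 4)
```

A where result in the r[n,2] form already holds each (i, j) pair in a record, so no such conversion is needed on the QDataSet side.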