
Components of a Data Analysis System

Scientific Drivers in the Design of an Analysis System

Data Import
Format
Either widely used/accepted, or
Can be converted easily from something widely used
User need not know the details of the format
Well documented (e.g., which flavor of latitude).

Fast Access
Disk I/O speeds do not follow Moore's law
Read speed is more important than write speed
Caching
File size is only important to keep access times low

Content must represent the details of the data
E2E - Full intent of the observer must be embedded

Data Export
Format
Either widely used/accepted, or
Can be converted easily into something widely used
User need not know the details of the format
Well documented (e.g., which flavor of latitude).

You can read what you write


Import format == Export format

Fast Access
Disk I/O speeds do not follow Moore's law
Read speed is more important than write speed

Content must represent the details of the data
E2E - Full intent of the observer must be embedded.

Includes user annotation/comments

Data Base System


Ability to work with more than one data set
Data base for both export and import files
Large data volumes
Access using scan numbers is no longer sufficient
Require the ability to select subsets of data via sophisticated data-base queries
Moderate number of columns in data base index
Index to data kept in memory to speed data access
File summaries at various levels of detail

Various levels of granularity
Calibrated and raw data
E2E - User can add annotation/comments
Security
Only the observer can access data
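
The idea of an in-memory index queried by column values rather than by scan number can be sketched as follows. This is only an illustration: the column names, the sample rows, and the `select` helper are hypothetical and do not come from any existing package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexRow:
    """One row of a hypothetical in-memory index (a moderate number of columns)."""
    scan: int
    source: str
    obs_type: str
    freq_mhz: float
    calibrated: bool
    file_offset: int  # where the data for this row live in the file


def select(index, **criteria):
    """Return the rows whose columns satisfy every criterion.

    A criterion is either a plain value (tested for equality)
    or a callable predicate applied to the column value.
    """
    def matches(row):
        for column, wanted in criteria.items():
            value = getattr(row, column)
            ok = wanted(value) if callable(wanted) else value == wanted
            if not ok:
                return False
        return True

    return [row for row in index if matches(row)]


# Made-up sample rows, for illustration only.
index = [
    IndexRow(101, "W3OH", "spectral-line", 1665.4, False, 0),
    IndexRow(102, "W3OH", "spectral-line", 1667.4, True, 4096),
    IndexRow(103, "3C286", "continuum", 1400.0, True, 8192),
]

rows = select(index, source="W3OH", freq_mhz=lambda f: f > 1666.0)
```

Only the index is scanned here; the data themselves would be read afterwards from `file_offset` for the selected rows.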

Data Archive
Write speed more important than read speed.
File size is very important
Cannot anticipate types of user queries
Large number of columns in data base index
Very sophisticated/fast RDBMS

Storage need not be a widely used data format


Format can be very different from that used by analysis system.

Export format should be a widely used data format

Interactive On-Line Data Analysis


The ability to access data ASAP
Import file updates automatically as observations proceed (real-time filler).
Index to file updates automatically
Updates happen per integration (spectral-line) or per N seconds (continuum)
Minimum integration time ~ few times the minimum time of real-time filler
Analysis system automatically is aware of updated index.
Read-protect online/filled data?

User should be able to see the data within an integration of when it was taken (or N seconds).

User Interface
Command line
Familiar syntax better than a good syntax
Procedural with byte-wise compiling (performance)
History, min-match or command completion
Useful error messages
Interruptible
Error trapping and exception handling
Ability to Undo
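
Min-match, combined with useful error messages, could look roughly like the sketch below. The command names and the `min_match` function are hypothetical, chosen only to show the behavior; command names are assumed to be stored in lower case.

```python
def min_match(typed, commands):
    """Resolve an abbreviated command to its full name.

    An exact match always wins; otherwise the abbreviation must be a
    prefix of exactly one command. Anything else raises a ValueError
    whose message tells the user what went wrong.
    """
    typed = typed.lower()
    if typed in commands:
        return typed
    candidates = sorted(c for c in commands if c.startswith(typed))
    if len(candidates) == 1:
        return candidates[0]
    if not candidates:
        raise ValueError(f"unknown command {typed!r}")
    raise ValueError(
        f"ambiguous command {typed!r}: could be {', '.join(candidates)}"
    )


# Hypothetical command set, for illustration only.
COMMANDS = {"baseline", "bshape", "show", "smooth", "get"}

full_name = min_match("ba", COMMANDS)  # resolves to "baseline"
```

Typing `s` alone is rejected as ambiguous, and the message lists both `show` and `smooth` so the user knows how many more characters are needed.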

User Interface
GUIs best for:
Interacting with data visualizations
Filling in forms
data base queries
options for data pipelines

Browsing for data files
Defining E2E data flow (à la LabVIEW)

Imaging Tools
Visualization
Shouldn't try to recreate what is already available in another package -- export instead.

Data Flagging
Pick a system that works

Graphics
Traditional capabilities (zoom in/out, scroll, print, save, ...)
Data volume requires great performance, smart libraries (screen resolution << # data pts)
Interactive feedback (e.g., defining baseline regions).

Publishable plots or export into something else?


Default plot style
Ability to tweak everything (label formats; char sizes; add, remove, move annotation; tick mark size; major/minor ticks, full box; grid; multiple X and Y axes, ...)
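
When the number of data points far exceeds the screen resolution, one common approach is to plot only the minimum and maximum of the points that fall on each pixel column. The sketch below illustrates that approach; it is an assumption about what a "smart library" might do, not a description of any particular plotting package, and `decimate_minmax` is a hypothetical name.

```python
def decimate_minmax(values, n_pixels):
    """Reduce values to at most 2 * n_pixels points for display.

    The data are split into n_pixels consecutive bins; each bin
    contributes its minimum and maximum, in order of occurrence,
    so narrow spikes survive the reduction.
    """
    if n_pixels <= 0:
        raise ValueError("n_pixels must be positive")
    n = len(values)
    if n <= 2 * n_pixels:
        return list(values)  # already small enough to draw directly
    out = []
    for p in range(n_pixels):
        start = p * n // n_pixels
        stop = (p + 1) * n // n_pixels
        chunk = list(values[start:stop])
        lo, hi = min(chunk), max(chunk)
        # Keep the two extremes in the order they occur in the data.
        if chunk.index(lo) <= chunk.index(hi):
            out.extend((lo, hi))
        else:
            out.extend((hi, lo))
    return out


# A flat 10000-point spectrum with one narrow spike, drawn on 100 pixels.
spectrum = [0.0] * 10000
spectrum[1234] = 5.0
display = decimate_minmax(spectrum, 100)
```

A plain every-Nth-point decimation would usually miss the single-channel spike; the min/max version keeps it while drawing 200 points instead of 10000.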

Analysis Algorithms
Algorithms well documented
Study what exists in other packages.
Robustness very important but so is speed
Provide less robust but faster alternatives

Developers should not force an algorithm on users
Developers should provide defaults only
Building blocks better than a do-all algorithm.
Ability to use and modify header information as well as data.
E2E do-alls are built out of the same building blocks.
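
The building-block idea can be sketched as small steps that are chained into a default "do-all", which users are free to rebuild with different pieces. All step names here (`subtract_offset`, `scale`, `boxcar`) and their parameters are hypothetical placeholders, not algorithms taken from any package.

```python
from functools import reduce


def compose(*steps):
    """Chain processing steps, applied left to right, into one callable."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


def subtract_offset(offset):
    """Building block: remove a constant offset from every channel."""
    return lambda spectrum: [v - offset for v in spectrum]


def scale(factor):
    """Building block: multiply every channel by a constant."""
    return lambda spectrum: [v * factor for v in spectrum]


def boxcar(width):
    """Building block: running mean over `width` channels (shorter at the edges)."""
    def smooth(spectrum):
        half = width // 2
        out = []
        for i in range(len(spectrum)):
            window = spectrum[max(0, i - half): i + half + 1]
            out.append(sum(window) / len(window))
        return out
    return smooth


# The developers' default "do-all" is just one arrangement of the blocks...
default_reduce = compose(subtract_offset(1.0), scale(2.0), boxcar(3))
# ...and a user who does not want smoothing assembles a different one.
custom_reduce = compose(subtract_offset(1.0), scale(2.0))
```

Because the default is built from the same public pieces, a user can drop, reorder, or replace any step without touching the developers' code.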

Documentation
On-line and hardcopy
Tutorials/Quick Guides
Cookbook
Based on observing types

Reference Manuals
Full, gory details
Data Formats
Algorithms

Searchable by keywords

Quick, interactive command help from within the system.
Never release until these are in place

User Support/Feedback
A familiar system minimizes staff support
Easily accessed, on-line help desk and Suggestion box
Automatic generation of bug reports
Observers of observers

Marketing
A familiar system already has a market
Don't be another cereal on the supermarket shelf
Workshops are better than papers
Create a User Community
Responsive feedback from developers
Independent Beta testers
Reputation & first experiences are everything

User Community
User Forums
Newsletters
Accept User Contributions/Additions
SourceForge-like system
NRAO-seal-of-approval

NRAO Moderator

Real-Time Data Display


To guarantee data quality
Product is not stored (except for hardcopy)
Sequential processing -- different from E2E/Data pipeline
Fast is more important than accurate
Few bells and whistles -- must avoid the RTD black hole
A simple display for all observation types more important than sophisticated displays for a few data types

Display happens within an integration of when data were taken -- tied to real-time filler
GUI based -- underlying language is unimportant
Output understandable by an operator

Real Time Data Analysis


Pointing/Focus/Tipping/... are different from RTD
Results should be stored (Data Base)
Results are used by the control system (pointing/focus) or by subsequent analysis (tipping)
Accuracy is as important as speed
More bells, whistles, user-options
Sequential processing (non E2E/data pipeline)
Only a few observation types are handled

Analysis happens within an integration of when data were taken
GUI based -- underlying language is unimportant
Output understandable by an operator

IDL Work Package


SDFITS
Interim solution for data import/export
Class/IDL specific; soon AIPS++/AIPS/UniPOPS?
MD/BDFITS next generation (keywords, incompleteness of contents, versatility, ...)

IDL -- Tom Bania


Uses UniPOPS as a model -- familiar to many
Very good reproduction
Bania-centric -- needs to be generalized

IDL Work Package


Glen Langston
Assess whether IDL will meet performance, extensibility, usability, ... goals.
Generalization to other observing types.
Real-time data access and display
Developed on top of and in parallel with Tom's work (so, implementations have diverged)
Works well for Glen's own experiments

IDL Work Package


Institutionalize what Tom and Glen have done
Code management
Code review
Combine Tom's and Glen's branches
Generalize code
Provide ways for Tom and Glen to contribute within the same revision-control branch.

Develop Institutionalized code


Improve performance, usability, maintenance
Add/Replace I/O components with better CS methods.

Calibration Work Package


User-tunable algorithms
Options for the real-time filler -- sequential
Options for E2E pipeline -- non-sequential
Options for interactive data reduction

Default algorithms for all observing cases
Extensible as new algorithms are developed
User-defined/tweaked algorithms
Robust and not-so-robust algorithms

Calibration Work Package


Opacity/atmosphere model
Output units
Efficiencies
Source size
Telescope model

Tsys(f) estimates
Differencing schemes
Non-linearities/template fitting/...
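
As one illustration of a differencing scheme with a frequency-dependent Tsys, a commonly used textbook form for signal/reference observations is T(f) = Tsys(f) * (sig(f) - ref(f)) / ref(f). The sketch below applies it per channel. The slides do not specify which schemes will be offered, so this form, the function name, and the sample numbers are all assumptions.

```python
def difference_calibrate(sig, ref, tsys):
    """Per-channel (sig - ref) / ref differencing scaled by Tsys(f).

    sig, ref : raw counts per channel for the signal and reference scans
    tsys     : system temperature per channel, in kelvin
    Returns the calibrated spectrum in kelvin.
    A zero in `ref` raises ZeroDivisionError; flagging is left to the caller.
    """
    if not (len(sig) == len(ref) == len(tsys)):
        raise ValueError("sig, ref and tsys must have the same number of channels")
    return [t * (s - r) / r for s, r, t in zip(sig, ref, tsys)]


# Made-up two-channel example.
calibrated = difference_calibrate(
    sig=[110.0, 100.0],
    ref=[100.0, 100.0],
    tsys=[20.0, 25.0],
)
```

Keeping this as a small function with explicit inputs lets the same routine serve the real-time filler, the pipeline, and interactive reduction, with users free to substitute their own scheme.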
