SDSS Data Distribution using Sector and UDT

Overview

Using this web site, you can download the Sloan Digital Sky Survey (SDSS) data if you have access to a high speed wide area network. For example, if your organization is attached to the National Lambda Rail or Internet2's Abilene Network, then you should be able to download the entire SDSS BESTDR5 catalog data set in less than five hours.

In general, it can be quite challenging to make effective use of the available bandwidth over a wide area, high performance network. This project uses the UDP-based Data Transfer Protocol (UDT), which was developed by the National Center for Data Mining (NCDM) at the University of Illinois at Chicago to make effective use of the bandwidth available on high performance wide area networks.
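
One well-known reason for this difficulty is the bandwidth-delay product: the amount of data that must be in flight to keep a long, fast path full. The Python sketch below works through the arithmetic. The 10 Gb/s link speed, 100 ms round-trip time, and 64 KiB window are illustrative assumptions, not measurements from this project, and the function names are hypothetical. The sketch does not describe how UDT itself works.

```python
def bandwidth_delay_product_bytes(link_gbps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to keep a link of this speed full."""
    bits_per_second = link_gbps * 1e9
    rtt_seconds = rtt_ms / 1e3
    return bits_per_second * rtt_seconds / 8


def window_limited_throughput_mbps(window_bytes: float, rtt_ms: float) -> float:
    """Upper bound on throughput when at most window_bytes are sent per round trip."""
    rtt_seconds = rtt_ms / 1e3
    return window_bytes * 8 / rtt_seconds / 1e6


if __name__ == "__main__":
    # Assumed figures for illustration only.
    link_gbps = 10.0
    rtt_ms = 100.0
    window_bytes = 64 * 1024

    bdp = bandwidth_delay_product_bytes(link_gbps, rtt_ms)
    limit = window_limited_throughput_mbps(window_bytes, rtt_ms)
    print(f"Data in flight needed to fill the link: {bdp / 1e6:.0f} MB")
    print(f"Throughput with a {window_bytes} byte window: {limit:.2f} Mb/s")
```

Under these assumptions, about 125 MB must be in flight to fill the link, while a sender limited to a 64 KiB window could reach only about 5 Mb/s.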

The project is supported by the National Science Foundation through the grant SCI II: The TeraFlow Project: High Performance Flows for Mining Large Distributed Data Archives, Award SCI-0430781.

Sloan Digital Sky Survey (SDSS)

The SDSS is systematically mapping a quarter of the entire sky, producing a detailed image of it, and determining the positions and absolute brightness of more than 100 million celestial objects. It is also measuring the distances to a million of the nearest galaxies, giving us a three-dimensional picture of the universe through a volume one hundred times larger than that explored to date. SDSS is also recording the distances to 100,000 quasars — the most distant objects known — giving us unprecedented knowledge of the distribution of matter to the edge of the visible universe.

The SDSS completed its first phase of operations — SDSS-I — in June, 2005. Over the course of five years, SDSS-I imaged more than 8,000 square degrees of the sky in five band passes, detecting nearly 200 million celestial objects, and it measured spectra of more than 675,000 galaxies, 90,000 quasars, and 185,000 stars. These data have supported studies ranging from asteroids and nearby stars to the large scale structure of the Universe.

The most recent data product is DR6, which was released in June 2007.

For more information about the project, see the SDSS web site, www.sdss.org.

Sector

In the spring of 2005, we began developing a new P2P distributed storage system, based upon UDT, called Sector. Sector is currently running on the Teraflow Network, a wide area, 10G, high performance, NSF-supported testbed that we operate. The SDSS DR6 catalog is stored on various nodes of the testbed, each of which holds either the entire catalog or a portion of it.

The P2P Sector software that you can download automatically selects the node or nodes nearest to you to provide the requested files.

We chose to use a P2P distributed file system for the following reasons. First, data sets these days are often so large that it is difficult to store them on a single node, and therefore it is convenient to distribute them across several nodes. Second, you can usually achieve higher performance by retrieving files from nodes that are closer to you. Sector automatically retrieves the requested data from the required node or nodes. Finally, P2P distributed file systems are more robust than traditional file systems in the sense that nodes can easily be added or dropped without affecting the availability of the data.
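
The ideas above can be sketched in a few lines of Python. This is a minimal illustration, not the Sector implementation or its API: the `Node` class, the `plan_downloads` function, the node names, the file names, and the use of round-trip time as the measure of "nearness" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class Node:
    """A hypothetical storage node holding some subset of the files."""
    name: str
    rtt_ms: float            # assumed measure of distance from the client
    files: FrozenSet[str]    # files stored on this node


def plan_downloads(requested: Iterable[str], nodes: List[Node]) -> Dict[str, str]:
    """Map each requested file to the nearest node that stores it."""
    plan = {}
    for filename in requested:
        holders = [node for node in nodes if filename in node.files]
        if not holders:
            raise LookupError(f"no node stores {filename!r}")
        plan[filename] = min(holders, key=lambda node: node.rtt_ms).name
    return plan


if __name__ == "__main__":
    # Hypothetical nodes and files, for illustration only.
    nodes = [
        Node("node-a", 5.0, frozenset({"part1.dat", "part2.dat"})),
        Node("node-b", 150.0, frozenset({"part1.dat", "part3.dat"})),
    ]
    print(plan_downloads(["part1.dat", "part3.dat"], nodes))

    # If node-a is dropped, part1.dat is still available from node-b.
    print(plan_downloads(["part1.dat"], nodes[1:]))
```

In this sketch a file stored on more than one node is fetched from the closest one, and it remains available when one of its holders leaves, which mirrors the second and third reasons given above.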

The core of Sector is a distributed file system built on top of a P2P routing infrastructure. The client for downloading the SDSS data is a specific application that uses the Sector API. You can download other data stored on the Teraflow Network by simply providing the appropriate list of files.

©2006 National Center for Data Mining. Last updated on Wednesday August 15, 2007 12:01 AM.