SDSS Data Distribution using Sector and UDT

Software

Over the past several years, we have developed a set of related software tools for bulk data transfer over high-speed wide area networks.

Sector

In the spring of 2005, we began developing Sector, a new P2P distributed storage system based on UDT. Sector currently runs on the Teraflow Testbed, a high-performance, 10 Gb/s wide area testbed that we operate with NSF support. The SDSS DR6, either in its entirety or in part, is stored on various nodes of the testbed.

The P2P Sector software that you can download automatically selects the node or nodes nearest to you to provide the requested files.

We chose to use a P2P distributed file system for the following reasons. First, data sets these days are often so large that it is difficult to store them on a single node, so it is convenient to distribute them across several nodes. Second, you can usually achieve higher performance by retrieving files from nodes that are closer to you. Sector automatically retrieves the requested data from the required node or nodes. Finally, P2P distributed file systems are more robust than traditional file systems in the sense that nodes can easily be added or dropped without affecting the availability of the data.
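The idea of retrieving each file from the nearest node that holds it can be illustrated with a short Python sketch. This is not Sector's actual selection logic, which is not described here. It assumes that "nearest" means lowest measured round-trip time, and the function names, node names, and file names are all hypothetical.

```python
import socket
import time


def measure_rtt(host, port, timeout=1.0):
    """Return the TCP connect time to (host, port) in seconds, or None on failure.

    Assumption: connect latency is used as a rough proxy for network distance.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def rank_nodes(rtts):
    """Order reachable nodes from lowest to highest latency.

    `rtts` maps a node name to its round-trip time in seconds,
    or to None if the node did not respond.
    """
    reachable = {node: rtt for node, rtt in rtts.items() if rtt is not None}
    return sorted(reachable, key=reachable.get)


def choose_replicas(file_locations, rtts):
    """For each file, pick the lowest-latency reachable node that holds a copy.

    `file_locations` maps a file name to the set of nodes storing it.
    A file with no reachable holder maps to None.
    """
    order = rank_nodes(rtts)
    plan = {}
    for name, holders in file_locations.items():
        plan[name] = next((node for node in order if node in holders), None)
    return plan
```

In this sketch an unreachable node is simply skipped, which mirrors the robustness argument above: as long as some other node holds a copy of the file, the download can still proceed.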

The core of Sector is a distributed file system built on top of a P2P routing infrastructure. The client for downloading the SDSS data is a specific application that uses the Sector API. You can download other data stored on the Teraflow Testbed by simply providing the appropriate list of files.

UDT

UDT is an application-level data transport protocol designed for emerging applications that require the transfer of large amounts of data over high-speed wide area networks (e.g., 1 Gb/s or above). UDT uses UDP to transfer data, but unlike plain UDP it has its own reliability control and congestion control mechanisms. UDT is designed not only for private or QoS-enabled links but also for shared networks. Furthermore, the current version of UDT (version 3.0) is built on a composable framework that supports multiple congestion control algorithms.
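To illustrate what a framework supporting multiple congestion control algorithms can look like, here is a minimal Python sketch. It does not reproduce the UDT API or UDT's own control algorithm. The class names, the event hooks, and the window-based AIMD rule are assumptions chosen only to show how a sender can delegate its decisions to an interchangeable controller.

```python
from abc import ABC, abstractmethod


class CongestionControl(ABC):
    """Hypothetical hook interface; not the real UDT API."""

    def __init__(self, initial_window=16.0):
        self.window = initial_window  # packets allowed in flight

    @abstractmethod
    def on_ack(self):
        """Called when an acknowledgement arrives."""

    @abstractmethod
    def on_loss(self):
        """Called when packet loss is detected."""


class Aimd(CongestionControl):
    """Additive increase, multiplicative decrease (illustrative only)."""

    def on_ack(self):
        self.window += 1.0 / self.window

    def on_loss(self):
        self.window = max(1.0, self.window / 2.0)


class FixedWindow(CongestionControl):
    """Ignores feedback, e.g. for a dedicated link with reserved capacity."""

    def on_ack(self):
        pass

    def on_loss(self):
        pass


class Sender:
    """Forwards network events to whichever controller it was given."""

    def __init__(self, controller):
        self.controller = controller

    def handle_event(self, event):
        if event == "ack":
            self.controller.on_ack()
        elif event == "loss":
            self.controller.on_loss()
        else:
            raise ValueError(f"unknown event: {event}")
        return self.controller.window
```

Because the sender only calls the two hooks, swapping `Aimd` for `FixedWindow` (or any other subclass) changes the control behavior without touching the rest of the transport code.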

For more information about UDT, please visit udt.sf.net.

UDT-Gateway

For many end users, it is easier to use a file transfer utility employing TCP, or a web application employing HTTP and TCP, than to use UDT directly. To support this requirement, we developed the UDT-Gateway utility. To the user, it appears that the data is being accessed through a TCP-based application on the gateway machine; in fact, the data resides on a data server that is connected to the gateway machine over a high-performance network using UDT. The data server can serve multiple gateway machines.

Specifically, the UDT gateway behaves exactly like an HTTP file server and serves files to clients over ordinary HTTP/TCP channels. However, the gateway server does not store the files it serves. When a request arrives for a file, the file is streamed from a central repository to the gateway via UDT, then streamed to the end consumer via TCP. In other words, the gateway machine allows the user to access large data sets using UDT and high-performance networks for all except the “last mile,” which is handled using more standard networks and TCP.
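The relay pattern described above can be sketched in Python with the standard library. This is not the UDT-Gateway implementation. The UDT leg is replaced by a placeholder function, `fetch_from_repository`, that yields chunks from an in-memory dictionary, and all names, paths, and sizes are hypothetical. Only the HTTP/TCP side facing the client is real.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Stand-in for the central repository. In the system described above, this
# data would live on a remote data server reached over UDT.
REPOSITORY = {"/sample.dat": b"0123456789" * 1000}
CHUNK_SIZE = 4096


def fetch_from_repository(path):
    """Yield the file in chunks (placeholder for a UDT stream)."""
    data = REPOSITORY[path]
    for offset in range(0, len(data), CHUNK_SIZE):
        yield data[offset:offset + CHUNK_SIZE]


class GatewayHandler(BaseHTTPRequestHandler):
    """Serves files over HTTP/TCP without storing them on the gateway."""

    def do_GET(self):
        if self.path not in REPOSITORY:
            self.send_error(404, "File not found")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        # Assumption: the file size is known up front. A real gateway would
        # have to obtain it from the repository before streaming.
        self.send_header("Content-Length", str(len(REPOSITORY[self.path])))
        self.end_headers()
        # Relay each chunk to the client as soon as it arrives upstream.
        for chunk in fetch_from_repository(self.path):
            self.wfile.write(chunk)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_gateway(host="127.0.0.1", port=0):
    """Create a gateway server; port 0 lets the OS choose a free port."""
    return ThreadingHTTPServer((host, port), GatewayHandler)
```

Calling `make_gateway().serve_forever()` starts the server. Any ordinary HTTP client can then fetch `/sample.dat` without knowing that the bytes originate elsewhere, which is the point of the gateway design.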

©2006 National Center for Data Mining. Last updated on Friday June 23, 2006 11:35 PM.