Thursday, May 31, 2007

Status Report for 05/30/2007

We summarize the common GIS problems in a general context (Option 2) as
- geospatial (or geographic) data access and integration
- coupling the data grid with the service grid
- unified queries for integrated data
- integrated data display
- interactive smart visualization tools
- performance
and discuss and propose our approaches to these problems.

These issues are undeniably crucial points in numerous research and development efforts. In particular, the problems related to data and storage heterogeneities are being addressed by a number of groups and organizations, some of which also offer solutions to application-level interoperability issues.

We generally focus on these issues in terms of GIS and geographic data, but our findings and recommendations are relevant to other science domains and data types.

Below, I summarize these issues as my thesis prospectus:

*****************************************************
Summary of my thesis prospectus:


We investigate the issues pertaining to traditional Geographic Information Systems (GIS) approaches and propose solutions to these problems based on modern Service Oriented Grid approaches. As in science domains generally, GIS requires decision making and situation assessment based on integrated data displays. We generally focus on these issues in terms of GIS and geographic data, but our findings and recommendations are relevant to other science domains and data types.

GIS is a system of computer software, hardware, and data used to manipulate, analyze, and graphically present a potentially wide array of information associated with geographic locations. GIS's powerful ability to integrate different kinds of information about a physical location can lead to better informed decisions about public investments in infrastructure and services (including national security, law enforcement, health care, and the environment), as well as a more effective and timely response in emergency situations. However, long-standing challenges to data sharing and integration need to be addressed before the benefits of geographic information systems can be fully realized. Our focus regarding data integration differs from that of the database and digital library communities: we deal with integration at a higher level than they do, and we utilize their approaches at the bottom level by proposing generic mediator services.

Our work is about developing a Web Services architecture that couples scientific geophysical applications with archival data through innovative, interactive, smart decision-making tools. This work can be detailed in several sub-research areas: accessing and querying heterogeneous data provided by heterogeneous storages through unified query structures, developing GIS data services, addressing the performance issues of transferring, parsing, and rendering large geographic data, and composing GIS services.

In light of the explanations above, we categorize our work as below:

1. Heterogeneous and distributed data integration (Data Grid) through mediator services
   o Different data types
   o Different storage types

2. Coupling Geo-Science Grid with Data Grid
   o Integrating Web Map Services with the Geo-Science Grid
   o Enabling decision making through integrated data display (3-layered display structure)
   o Creating a view-level integration structure; a view is abstracted as layers in the GIS domain
   o Creating generic plotting Web Services (Sci-Plot) in order to couple Geo-Science Grid outputs with its inputs (from the data grid) at the view level

3. Interactive and smart decision making tools
   o Coupling interface for browser-based remote access
   o Data/information display
   o Interactive querying and mining of the data
   o Visualization and analysis of the data and Science Grid simulation outputs
   o Movie and animation tools over time-series data

4. Performance
   o Accessing large remote data sets provided by geographically distributed data vendors
   o Transferring, integrating, processing, and interpreting data
   o Proposed: high-performance streaming data services through messaging middleware
   o Proposed: advanced pre-fetching, caching, and load-balancing techniques

*****************************************************
Special note to Professor Geoffrey: As you realized, I was confused about options 1 and 2. My previous document followed the idea you named Option 2. The statements above are a summary of my upcoming Option 2 document. Once I am done with it, I will send it to you.

Saturday, May 12, 2007

Pre-fetching Concept in Brief



1. Pre-fetching Concept in Brief



The proposed system (referred to in previous documents as grid-oriented map services) enables the Aggregator WMS to run in any combination of streaming (or non-streaming) and pre-fetching (or on-demand) data transfer modes.

To run it in pre-fetched mode, the Aggregator WMS's "DATACCESS" entry in the properties file should be set as below:


#--------------------------------
# DATA ACCESSING MODE #
#--------------------------------
DATACCESS = pre-fetched
WMSTYPE = streaming
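
For illustration, here is a minimal sketch of how the Aggregator WMS could read these entries at startup; the property names are from the listing above, while the file name is hypothetical:

import java.io.FileInputStream;
import java.util.Properties;

Properties props = new Properties();
props.load(new FileInputStream("aggregatorwms.properties")); //hypothetical file name
boolean preFetched = "pre-fetched".equals(props.getProperty("DATACCESS"));
boolean streaming = "streaming".equals(props.getProperty("WMSTYPE"));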

In order for the pre-fetching algorithm to work properly, the data should be fetched as a whole; no constraint should be defined in the query.

The Aggregator WMS queries and transfers the data by using the Web Feature Service's (we call it DSFS in general) getFeature Web Service interface. WFS provides data in the GML common data model (we call it DSL in general).

In our motivating domain (GIS), one of the criteria that defines the constraint in the query is the bounding box, and it is defined in the getFeature request as below:

<gml:coordinates>minx,miny maxx,maxy</gml:coordinates>

In order to get the whole data set, we set this criterion to its widest range, as shown below:

<gml:coordinates>-180,-90 180,90</gml:coordinates>

Alternatively, this constraint tag can be removed from the getFeature request entirely.
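
Putting these pieces together, a whole-layer getFeature request could look roughly like the sketch below. This is a WFS 1.0.0-style request; the typeName comes from the layer names used later in this post, while the geometry property name ("coordinates") is an assumption:

<wfs:GetFeature service="WFS" version="1.0.0"
    xmlns:wfs="http://www.opengis.net/wfs"
    xmlns:ogc="http://www.opengis.net/ogc"
    xmlns:gml="http://www.opengis.net/gml">
  <wfs:Query typeName="California_Faults">
    <ogc:Filter>
      <ogc:BBOX>
        <ogc:PropertyName>coordinates</ogc:PropertyName>
        <gml:Box>
          <gml:coordinates>-180,-90 180,90</gml:coordinates>
        </gml:Box>
      </ogc:BBOX>
    </ogc:Filter>
  </wfs:Query>
</wfs:GetFeature>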


Why do we do pre-fetching?
We do pre-fetching to mitigate the poor performance of transferring XML-based feature data. According to our proposed framework, pre-fetching is done over Layer-set 2 data (see the previous documents for the three-layered structure). Layer-set 2 data is represented in a Common Data Model (CDM); as the CDM we use GML.

The names of the data to be pre-fetched are obtained from the capabilities file. Here are example Layer-set 2 data names listed in a capabilities file:
California_Faults
States_Boundaries
World_Seismic

These data are pre-fetched into separate files whose names match the corresponding data names:
California_Faults.xml
States_Boundaries.xml
World_Seismic.xml

There are two separate locations for the pre-fetched data. One is temporary and is active while the data is being pre-fetched. The other is stable and is used for serving clients' requests. When the data transfer to the temporary location is done, all the data at that location is moved to the stable location. Reading and writing the data files at the stable location are synchronized to keep the data files consistent. This cycle repeats.
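
A minimal sketch of this swap is below; the directory names follow the example later in this post, while the lock object and class name are assumptions, not the actual implementation:

import java.io.File;

public class PrefetchStore {
    //Lock guarding the stable location; readers and the mover share it.
    private final Object stableLock = new Object();

    //Move every freshly fetched file from the temporary location to the
    //stable location in one synchronized step, so that readers never see
    //a half-written data set.
    public void moveData(String tempDatastore, String usedDatastore) {
        File[] fetched = new File(tempDatastore).listFiles();
        if (fetched == null) return; //nothing was fetched this cycle
        synchronized (stableLock) {
            for (File f : fetched) {
                File target = new File(usedDatastore, f.getName());
                target.delete();    //replace the previous copy
                f.renameTo(target); //cheap move on the same filesystem
            }
        }
    }
}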

Requests from clients contain some constraints, so querying the pre-fetched data is handled on the Aggregator server side, where pre-fetching was initialized. The query is basically done by using pull-parsing techniques and XPath queries over the pre-fetched data.
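
As an illustration, here is a minimal sketch of such a query over a pre-fetched GML file. For brevity it parses with DOM rather than a pull parser, and the XPath expression and file path are assumptions:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class PrefetchedQuery {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                          .parse("prefetchedDataUsed/California_Faults.xml");

        //Select all feature members; a real query would also apply the
        //client's bounding-box constraint to each feature's coordinates.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList features = (NodeList) xpath.evaluate(
                "//*[local-name()='featureMember']", doc, XPathConstants.NODESET);
        System.out.println(features.getLength() + " features found");
    }
}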


2. Coding Level
- How do we define the task and its periodicity for fetching the data?

The period should not be less than the time it takes to transfer one full set of Layer-set 2 data. From our experience with the data listed above, it should be more than 2 hours. Since these data sets are not updated often, we plan to set the periodicity to once every day.

- What technology do we use to pre-fetch the data?
The java.util.Timer and java.util.TimerTask class libraries.


//Needed imports: java.util.Timer, java.util.TimerTask, java.util.Vector

//TimerTask definition:
TimerTask task = new TimerTask() {
    public void run() {
        //Layer-set 2 (GML) data names, obtained from the capabilities file
        Vector CDMdataList = getListOfGMLDataInCapability();

        String tempDatastore = applpath + "/prefetchedData";
        String usedDatastore = applpath + "/prefetchedDataUsed";

        //Fetching all the data in CDM format (GML) - with NB
        fd.FetchDataWithStreamingNB(NBip, NBport, NBtopic,
                                    wfs_address, tempDatastore, CDMdataList);

        //After pre-fetching is done, move the data to stable storage
        fd.moveData(tempDatastore, usedDatastore);
    }
};

//Running the task defined above:
Timer timer = new Timer();
timer.schedule(task, 0, 24 * 60 * 60 * 1000);
//starts immediately, then repeats once every day (the period is in milliseconds)

The Timer class schedules the specified task for repeated fixed-delay execution, beginning after the specified delay. Subsequent executions take place at approximately regular intervals, separated by the specified period.



3. Possible Challenges:

One minor challenge with storing the data in the file system is that some systems allow only a limited size for user-created files, for example:

LINUX (RedHat): 2 GB {512 B block size}, 8192 GB {8 KB block size}
SOLARIS:        2 GB {version <= 2.5.1}, 1 TB {later versions}

Even if the operating system does not impose a file size limit, the file size will still be constrained by the hard disk size.
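
One way to guard against this is sketched below, under the assumption of Java 6's File.getUsableSpace(); the threshold is hypothetical, and applpath is the same variable as in the TimerTask above:

import java.io.File;

File store = new File(applpath + "/prefetchedData"); //temporary location from the TimerTask above
long freeBytes = store.getUsableSpace();             //Java 6+
long expectedBytes = 2L * 1024 * 1024 * 1024;        //hypothetical: ~2 GB per pre-fetch cycle
if (freeBytes < expectedBytes) {
    System.err.println("Not enough disk space; skipping this pre-fetch cycle");
}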

Another challenge is the synchronization of the two storage locations (the temporary one for fetching the data and the stable one for answering client requests while fetching is in progress). This challenge is easier to solve.

Wednesday, May 02, 2007

Status Report for 05/02/2007

1. I have implemented "pre-fetching" according to the architecture proposed in my previous posts. You can test the performance by going to the demo page:
http://toro.ucs.indiana.edu:8083/aaa/maptools/newmap.jsp

Some simple instructions for testing:

__ Testing fault and state data:
__ At the top left drop-down list, select "California Faults",
      then select the layers named "California:faults" and "State:Boundaries",
      then use the map tools (zooming, moving, etc.) at the bottom of the map image.

__ Testing earthquake seismic data:
__ At the top left drop-down list, select "Pattern Informatics",
      then select the layers named "World:Seismic" and "State:Boundaries",
      then use the map tools (zooming, moving, etc.) at the bottom of the map image.

For the stress test, please zoom out to the whole world and, in the section "time interval for seismic data", set the minimum magnitude to 0, "from time" to 01/01/1900, and "to time" to 12/31/2006.

Please do not use the buttons "Create Map Movie" and "Plot Pi Output" (FTB).


2. I have answered one of the user questions about "Google Maps" versus our "Grid-oriented mapping services" and posted it to my blog. Please check my previous post.

Our Grid-oriented Map Servers vs. Google Map Servers


Question: How are you different from Google Maps?
Can you use only Google Maps (without any mediators) in Geo-Science applications?



Google Map Services enable navigating an Earth framework, displaying maps and satellite imagery as images, and applying dynamic, browser-based geo-processing services.

Instead of competing, Google Maps provides complementary services and layers to our integration framework. We use its earth map/satellite images as Layer-set 1 (see my previous posts) to create more comprehensible map images in our integration framework.

We mostly deal with the performance issues at Layer-set 2. The Google Maps and NASA OnEarth WMS map servers provide Layer-set 1 at a tolerable performance level.

Our main focus is integrating Mapping Services with Geo-Science Grids. The outcome of this integration is three-layer structured map images. Geo-Science applications use scientific data (geo-data, spatial data) provided by different vendors in different formats. Because of this, we mainly deal with the interoperability of services and data. In order to solve the data interoperability problem, we use the OGC-defined standard GML specifications. GML is an XML encoding of scientific vector data. Using XML causes performance problems (transferring, parsing, rendering, and displaying XML-based data that is large, GBs in size) that are not related to the problems the Google people deal with.

Google Maps actually enables overlaying vector data (Layer-set 2) over its satellite images or maps, but when you try to use the Google Maps API with any XML-encoded science data, the system gets stuck. This is because the Google Maps API uses DOM parsers, and DOM can only parse XML documents of limited size.




How the Google Map Server works:

Here, I summarize their approach. You can find a more detailed architecture description on the web.
----------------------------------------------------------------------------------
Google cuts the whole world's satellite image into pieces called tiles. The vital question is how to formulate the accepted requests and the regions composed of tiles. Google Maps provides two different map types: (1) Google maps and (2) Google satellite images.


1. Google Maps:

They use three parameters (xcoord, ycoord, and a zoom factor) to formulate requests and regions.

A sample request to the mt1 server at zoom level 9:
http://mt1.google.com/mt?n=404&v=w2.12&x=130&y=93&zoom=9

The zoom factor takes values from 17 (fully zoomed out) to 0 (maximum definition). The x and y values are tile indices: x is derived from the longitude (shifted to the range 0 to 360) and y from the latitude (shifted to the range 0 to 180), as the code below shows.

At factor 17, the whole earth is in one tile, where x=0 and y=0. At factor 16, the earth is divided into 2x2 parts, where 0<=x<=1 and 0<=y<=1, and at each zoom step each tile is divided into 4 parts. So at a zoom factor z, the number of horizontal and vertical tiles is 2^(17-z); at zoom level 9, for example, that is 2^8 = 256 tiles in each direction.

Google uses 4 servers to balance the load: mt0, mt1, mt2, and mt3. Each tile is a 256x256 PNG image.


To find a tile among the cached tiles (latitude, longitude, and zoom are the inputs):

//shift the coordinates to positive ranges (0 to 180 and 0 to 360)
latitude = 90 - latitude;
longitude = 180 + longitude;

//size of one tile, in degrees, at this zoom level
double latTileSize = 180 / Math.pow(2, 17 - zoom);
double longTileSize = 360 / Math.pow(2, 17 - zoom);

//Tile coordinates:
int tilex = (int) (longitude / longTileSize);
int tiley = (int) (latitude / latTileSize);
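
As a quick sanity check against the sample request above (x=130, y=93, zoom 9), a point near 24.0 N, 3.0 E should land in that tile; the coordinates here are just an illustrative assumption:

double latitude = 90 - 24.0;                    //24.0 N, shifted
double longitude = 180 + 3.0;                   //3.0 E, shifted
double latTileSize = 180 / Math.pow(2, 17 - 9); //0.703125 degrees
double longTileSize = 360 / Math.pow(2, 17 - 9);//1.40625 degrees
int tilex = (int) (longitude / longTileSize);   //130
int tiley = (int) (latitude / latTileSize);     //93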


2. Google Satellite Images:

A sample request to the kh0 server at zoom level 6:
http://kh0.google.com/kh?n=404&v=8&t=trtqtt

The length of the parameter t defines the zoom level.

To see the whole globe, just use 't=t'. This gives a single tile representing the earth. For the next zoom level, this tile is divided into 4 quadrants called, clockwise from the top left: 'q', 'r', 's', and 't'. To see a quadrant, just append the letter of that quadrant to the image you are viewing. For example, 't=tq' gives the upper-left quadrant of the 't' image. And so on at each zoom level.

Google uses 4 servers to balance the load: kh0, kh1, kh2, and kh3. Each tile is a 256x256 PNG image.

To find a tile among the cached tiles (wrapped here as a method that returns the t parameter):

static String findTileKey(double latitude, double longitude, int zoom) {
    double xmin = -180, xmax = 180; //longitude range of the current tile
    double ymin = -90, ymax = 90;   //latitude range of the current tile

    String location = "t"; //the single tile representing the whole earth

    //Google uses the latitude divided by 2
    double halflat = latitude / 2;

    for (int i = 0; i < zoom; i++) {
        double xmoy = (xmax + xmin) / 2; //midpoint longitude
        double ymoy = (ymax + ymin) / 2; //midpoint latitude
        if (halflat > ymoy) { //upper part (q or r)
            ymin = ymoy;
            if (longitude < xmoy) { //q: upper left
                location += "q";
                xmax = xmoy;
            } else { //r: upper right
                location += "r";
                xmin = xmoy;
            }
        } else { //lower part (t or s)
            ymax = ymoy;
            if (longitude < xmoy) { //t: lower left
                location += "t";
                xmax = xmoy;
            } else { //s: lower right
                location += "s";
                xmin = xmoy;
            }
        }
    }
    return location;
}
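
As a quick check against the sample request above (t=trtqtt), a point near Paris reproduces that key after 5 subdivisions; the coordinates are an illustrative assumption:

String key = findTileKey(48.8, 2.3, 5); //about 48.8 N, 2.3 E
System.out.println(key);                //prints "trtqtt"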