Saturday, May 12, 2007

Pre-fetching Concept in Brief



1. Pre-fetching Concept in Brief



The proposed system (referred to in previous documents as grid-oriented map services) enables the Aggregator WMS to run in a combination of streaming (or non-streaming) and pre-fetching (or on-demand) data transfer modes.

To run it in pre-fetched mode, the "DATACCESS" entry in the Aggregator WMS's properties file should be set as below:


#--------------------------------
# DATA ACCESSING MODE #
#--------------------------------
DATACCESS = pre-fetched
WMSTYPE = streaming
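
As a minimal sketch of how such an entry could be read, the snippet below loads the properties text with java.util.Properties and checks the DATACCESS value. The class and method names are hypothetical and not part of the Aggregator WMS code; only the DATACCESS and WMSTYPE keys and the "pre-fetched" value come from the settings above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not the Aggregator WMS's own configuration code.
final class AccessModeConfig {

    // Returns true when the properties text asks for pre-fetched data access.
    static boolean isPrefetched(String propertiesText) {
        Properties props = new Properties();
        try {
            // Lines starting with '#' are treated as comments by Properties.load.
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String mode = props.getProperty("DATACCESS");
        // Assumption of this sketch: a missing entry means "not pre-fetched".
        return mode != null && mode.trim().equalsIgnoreCase("pre-fetched");
    }

    public static void main(String[] args) {
        String text = "# DATA ACCESSING MODE #\nDATACCESS = pre-fetched\nWMSTYPE = streaming\n";
        System.out.println(isPrefetched(text));
    }
}
```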

For the pre-fetching algorithm to work properly, the data should be fetched as a whole. No constraint should be defined in the query.

The Aggregator WMS queries and transfers the data by using the getFeature Web Service interface of the Web Feature Service (WFS; we call it DSFS in general). The WFS provides data in the GML common data model (we call it DSL in general).

In our motivating domain (GIS), one of the criteria that define the constraint in the query is the bounding box, which is given in the getFeature request as below:

<gml:coordinates>minx,miny maxx,maxy</gml:coordinates>

To get the whole data, we set this criterion to its widest range, as shown below:

<gml:coordinates>-180,-90 180,90</gml:coordinates>

Alternatively, remove this constraint tag from the getFeature request.
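
The bounding-box element above can be produced by a small helper. This is only an illustrative sketch: the GmlBoundingBox class and its methods are hypothetical names, and the rest of the getFeature request is not shown.

```java
// Hypothetical helper that renders a bounding box as a gml:coordinates element.
final class GmlBoundingBox {
    final double minX, minY, maxX, maxY;

    GmlBoundingBox(double minX, double minY, double maxX, double maxY) {
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }

    // Widest range in longitude/latitude, used when the whole data set is wanted.
    static GmlBoundingBox whole() {
        return new GmlBoundingBox(-180, -90, 180, 90);
    }

    // Renders "minx,miny maxx,maxy" inside the gml:coordinates tag.
    String toGmlCoordinates() {
        return "<gml:coordinates>" + fmt(minX) + "," + fmt(minY) + " "
                + fmt(maxX) + "," + fmt(maxY) + "</gml:coordinates>";
    }

    // Prints whole numbers without a trailing ".0" so -180.0 becomes "-180".
    private static String fmt(double v) {
        return v == Math.rint(v) ? String.valueOf((long) v) : String.valueOf(v);
    }

    public static void main(String[] args) {
        System.out.println(whole().toGmlCoordinates());
    }
}
```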


Why do you do pre-fetching?
We do pre-fetching to avoid the poor performance of transferring XML-based feature data. In our proposed framework, pre-fetching is done over layer-set 2 data (see the previous documents for the three-layered structure). Layer-set 2 data is represented in a Common Data Model (CDM); we use GML as the CDM.

The names of the data to be pre-fetched are obtained from the capabilities file. Here are example layer-set 2 data names listed in the capabilities file:
California_Faults
States_Boundaries
World_Seismic

These data are pre-fetched into separate files. Each file is named after the data it holds:
California_Faults.xml
States_Boundaries.xml
World_Seismic.xml

There will be two separate locations for the pre-fetched data. One is temporary and is active while the data is being pre-fetched. The other is stable and is used for serving the clients' requests. When the data transfer to the temporary location is done, all the data at that location is moved to the stable location. Reading and writing the data files at the stable location is synchronized to keep the data files consistent. This cycle repeats.
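
One way to sketch the temporary/stable cycle is with a read-write lock around the stable location. The PrefetchStore class and its method names below are hypothetical, and the use of ReentrantReadWriteLock and Files.move is an assumption of this sketch rather than a description of the actual Aggregator WMS code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the temporary/stable storage pair.
final class PrefetchStore {
    private final Path tempDir;    // written during a fetch cycle
    private final Path stableDir;  // read while serving client requests
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    PrefetchStore(Path tempDir, Path stableDir) {
        this.tempDir = tempDir;
        this.stableDir = stableDir;
    }

    // Called after a fetch cycle has finished writing into tempDir.
    void publish() {
        lock.writeLock().lock(); // blocks readers only while files are moved
        try {
            List<Path> fetched = new ArrayList<>();
            try (DirectoryStream<Path> dir = Files.newDirectoryStream(tempDir, "*.xml")) {
                for (Path p : dir) {
                    fetched.add(p);
                }
            }
            for (Path file : fetched) {
                Files.move(file, stableDir.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Reads one data file, e.g. read("World_Seismic"). Loading the whole file
    // into memory keeps the sketch short; large files would be streamed instead.
    String read(String dataName) {
        lock.readLock().lock();
        try {
            byte[] bytes = Files.readAllBytes(stableDir.resolve(dataName + ".xml"));
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Many readers can hold the read lock at once, so client requests only wait during the short window in which the files are moved. When both directories are on the same file system, the move is typically a rename and that window stays small.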

Requests from clients contain some constraints, so the pre-fetched data is queried on the Aggregator server side, where the pre-fetching was initiated. The query is done by using pull parsing techniques and XPath queries over the pre-fetched data.
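
A rough sketch of the pull-parsing side of such a query is given below, using the StAX API (javax.xml.stream) from the Java class libraries. It assumes a simplified GML layout in which each feature sits in a featureMember element and carries its geometry in coordinates elements; the class name, the method name and the "any point inside the box" test are illustrative assumptions. XPath is not used here, and a feature that crosses the box without having a vertex inside it would be missed by this simplification.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical bounding-box filter over pre-fetched GML, using pull parsing.
final class BBoxPullQuery {

    // Returns the 0-based positions of the featureMember elements that have
    // at least one coordinate tuple inside the given box.
    static List<Integer> featuresInBox(String gml, double minX, double minY,
                                       double maxX, double maxY) {
        List<Integer> hits = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(gml));
            int index = -1;
            boolean hit = false;
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = r.getLocalName();
                    if (name.equals("featureMember")) {
                        index++;
                        hit = false;
                    } else if (name.equals("coordinates") && !hit) {
                        hit = anyPointInside(r.getElementText(), minX, minY, maxX, maxY);
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && r.getLocalName().equals("featureMember") && hit) {
                    hits.add(index);
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed GML", e);
        }
        return hits;
    }

    // Parses "x,y x,y ..." and checks each tuple against the box.
    private static boolean anyPointInside(String text, double minX, double minY,
                                          double maxX, double maxY) {
        for (String tuple : text.trim().split("\\s+")) {
            String[] xy = tuple.split(",");
            if (xy.length < 2) {
                continue; // skip empty or incomplete tuples
            }
            double x = Double.parseDouble(xy[0]);
            double y = Double.parseDouble(xy[1]);
            if (x >= minX && x <= maxX && y >= minY && y <= maxY) {
                return true;
            }
        }
        return false;
    }
}
```

Because the reader walks the file event by event, only one coordinates string is held in memory at a time, which suits large pre-fetched files better than building a full document tree.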


2. Coding Level
- How do you define the task, and its periodicity, to fetch the data?

The period should not be less than the time needed to transfer one set of layer-set 2 data.
From our experience with the data listed above, it should be more than 2 hours. Since these data sets are not updated often, we plan to set the periodicity to once every day.

What technology do you use to pre-fetch the data?
The Timer and TimerTask classes of the Java class libraries (java.util).


import java.util.Timer;
import java.util.TimerTask;
import java.util.Vector;

//TimerTask definition:
TimerTask task = new TimerTask() {
    public void run() {
        //names of the layer-set 2 (GML) data listed in the capabilities file
        Vector CDMdataList = getListOfGMLDataInCapability();

        //temporary location used while fetching, stable location used for serving clients
        String tempDatastore = applpath + "/prefetchedData";
        String usedDatastore = applpath + "/prefetchedDataUsed";

        //Fetching all the data in CDM format (GML) - with NB
        fd.FetchDataWithStreaming(NBip, NBport, NBtopic,
                                  wfs_address, tempDatastore, CDMdataList);

        //after pre-fetching is done move the data to stable storage
        fd.moveData(tempDatastore, usedDatastore);
    }
};

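//Running the Task defined above:
Timer timer = new Timer();
//the period is given in milliseconds: 40000 runs the task every 40 seconds;
//a once-a-day period, as planned above, would be 24 * 60 * 60 * 1000 = 86400000
timer.schedule(task, 0, 40000);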

The Timer class's schedule(task, delay, period) method schedules the specified task for repeated fixed-delay execution, beginning after the specified delay. Subsequent executions take place at approximately regular intervals separated by the specified period.
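
For the planned once-a-day periodicity, the period can be derived with TimeUnit rather than typed in by hand. The sketch below is illustrative only: DailyPrefetchSchedule and its members are hypothetical names, and the Runnable stands in for the fetch-and-move cycle.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around java.util.Timer for the pre-fetch cycle.
final class DailyPrefetchSchedule {

    // One day expressed in milliseconds, the unit Timer.schedule expects.
    static final long ONE_DAY_MS = TimeUnit.DAYS.toMillis(1);

    // Starts the cycle immediately and repeats it with the given period.
    static Timer start(final Runnable fetchCycle, long periodMs) {
        Timer timer = new Timer("prefetch", true); // daemon thread: does not keep the JVM alive
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                try {
                    fetchCycle.run();
                } catch (RuntimeException e) {
                    // An exception escaping run() would terminate the Timer thread
                    // and cancel all later cycles, so it is caught and reported here.
                    System.err.println("Pre-fetch cycle failed: " + e);
                }
            }
        }, 0, periodMs);
        return timer;
    }
}
```

A call such as DailyPrefetchSchedule.start(cycle, DailyPrefetchSchedule.ONE_DAY_MS) would then run the cycle once a day. A Timer runs its tasks on a single thread, so a cycle that takes longer than the period delays the next one instead of overlapping with it.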



3. Possible Challenges:

One minor challenge with storing the data in the file system is that some systems limit the size of user-created files, for example:

Linux (Red Hat)                  Solaris
2 GB (512 B block size)          1 TB
8192 GB (8 KB block size)        2 GB (versions <= 2.5.1)

Even if the operating system does not limit the file size, the file size will still be constrained by the size of the hard disk.
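
The disk-size side of this constraint can be checked before a fetch cycle starts. The sketch below uses java.nio.file.FileStore for that; the class name, the method name and the idea of passing in an expected size are assumptions, and the check does not cover the per-file limits listed above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical pre-flight check run before a fetch cycle.
final class DiskSpaceCheck {

    // True when the file store holding dir reports at least expectedBytes usable.
    static boolean hasRoomFor(Path dir, long expectedBytes) {
        try {
            return Files.getFileStore(dir).getUsableSpace() >= expectedBytes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```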

Another challenge is synchronizing the two storage locations (the temporary one for fetching the data and the stable one for answering client requests while fetching is in progress). This challenge is easier to solve.
