
ISBN: 978 - 15084725 - 51

Proceedings of International Conference on Developments in Engineering Research

Date: 15.2.2015

CONSTRUCTING VOLUNTEERED GAZETTEERS USING GEO-DATA


Krithika R, Mansoor Hussain D
Sri Krishna College of Engineering and Technology
Coimbatore, India.

Abstract - A gazetteer provides information about the geographical makeup, social statistics and physical features of a country, region, or continent. Traditional gazetteers are built by mapping agencies; however, they may not contain all of the relevant geographical information. In the existing work, a scalable distributed platform is used to build gazetteers efficiently, and a novel spatial-computing infrastructure is used to improve the quality of the built gazetteers. The semi-structured information about geographical space is handled effectively on the Hadoop ecosystem using spatial analysis functions. The spatial analysis functions used in that work support the different GIS formats in the Hadoop ecosystem; however, they cannot assign coordinates to geographical locations. In the proposed work, geometric transformations are therefore also used for spatial analysis, so that ground coordinates can be assigned to the data or map.
Keywords - Gazetteers, volunteered geographic information, Hadoop, scalable geoprocessing workflow, big geo-data, CyberGIS

I. INTRODUCTION
BIG DATA
Big data is an all-encompassing term for any collection of datasets so large and complex that they become difficult to process using traditional data processing applications. The challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and privacy violations. Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers". Big data is also a moving target: what is considered "big" today will not be so years ahead. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options.
Characteristics
Big data can be described by the following characteristics:
Volume - The quantity of data generated is very important in this context. It is the size of the data that determines its value and potential, and whether it can actually be considered big data at all; the name "Big Data" itself refers to size.
Variety - The category to which the data belongs is also an essential fact for data analysts to know. It helps the people who closely analyse the data, and are associated with it, to use the data effectively to their advantage.
Velocity - Velocity refers to the speed at which data is generated and processed to meet the demands and challenges that lie ahead on the path of growth and development.
Variability - The inconsistency that data can show at times is a problem for those who analyse it, hampering the process of handling and managing the data effectively.
Veracity - The quality of the data being captured can vary greatly, and the accuracy of any analysis depends on the veracity of the source data.
Complexity - Data management can become a very complex process, especially when large volumes of data come from multiple sources. These data need to be linked, connected and correlated before the information they carry can be grasped. This situation is termed the complexity of big data.
Tool-Hadoop

IAETSD 2015: ALL RIGHTS RESERVED

www.iaetsd.in

78


Hadoop is an open-source software framework for distributed storage and distributed processing of big data on clusters of commodity hardware. The Hadoop Distributed File System (HDFS) splits files into large blocks (64 MB or 128 MB by default) and distributes the blocks among the nodes in the cluster. To process the data, Hadoop MapReduce ships code (specifically JAR files) to the nodes that hold the required data, and the nodes then process the data in parallel.
The base Apache Hadoop framework is composed of the following modules:

Hadoop Common - libraries and utilities needed by the other Hadoop modules.

Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling users' applications.

Hadoop MapReduce - a programming model for large-scale data processing.
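The MapReduce model can be illustrated with a minimal in-memory simulation of the map, shuffle and reduce phases (a sketch only; real Hadoop jobs ship JAR files to the data nodes as described above):

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate MapReduce: map each record, group values by key (shuffle),
    then reduce each group to a single result."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase: emit (key, value) pairs
            groups[key].append(value)       # shuffle phase: group by key
    return {key: reducer(key, values) for key, values in groups.items()}  # reduce

# Classic word count: the mapper emits (word, 1), the reducer sums the counts.
def wc_mapper(line):
    return [(word, 1) for word in line.split()]

def wc_reducer(word, counts):
    return sum(counts)

lines = ["big data on hadoop", "hadoop stores big data"]
print(run_mapreduce(lines, wc_mapper, wc_reducer))
```

Running the word-count mapper and reducer over the two sample lines produces a count per distinct word, mirroring the parallel flow of a real cluster on a single machine.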


Architecture

Fig 1 MapReduce Architecture


1.2 Gazetteers
A gazetteer is a geographical dictionary or directory used in conjunction with a map or atlas. It typically contains information concerning the geographical makeup, social statistics and physical features of a country, region, or continent. The content of a gazetteer can include a subject's location, the dimensions of peaks and waterways, population, GDP and literacy rate. This information is generally divided into topics, with entries listed in alphabetical order.
Gazetteers are needed to make sense of this maze of complex geographies for three main reasons. Firstly, many geographic names have a number of variant forms; secondly, there are many incompatibilities between different geographies, which means that boundaries do not align; and thirdly, geographic names, units and hierarchies have changed in the past and will continue to change. These problems are greatest with historical data, which are often associated with geographic names that have changed, with geographical units that no longer exist, or with geographical units whose boundaries have changed significantly. It hardly needs saying that the disparity between modern and historical geographies increases with time.
Gazetteers improve information retrieval and data browsing by standardizing geographic names and providing a controlled
vocabulary of current and historical names within a system of preferred and non-preferred names. By linking disparate and
changing geographies, gazetteers can help to integrate geographically referenced data collections, and deal with some of the


incompatibilities when boundaries do not align. For example, gazetteers can make it easier to construct time series and other
comparative data series by helping to identify those geographical units which, to a greater or lesser extent, correspond in different
geographies.
If gazetteers are to be used to improve information retrieval and data browsing, it is essential that we understand the needs and requirements of users. The History Data Service has an active and ongoing policy of consulting with actual and potential users, and we have established that many users from the historical community require web-based catalogues and data delivery systems that will allow them to perform sophisticated geographical searches in a fairly automated manner. Users would like to be able to search for data that cover a given place at a sufficient level of detail.
For example, a user searching for the county of Essex would like to recover not only data that are indexed by the geographic name Essex, but also data collections that contain Essex county-level data but are indexed by a higher-level geographical unit such as England. They might also wish to extend the search to include data that are indexed by geographical units within Essex. Users would also like to be able to search for any data that can be analysed at the level of a specified geographical unit. It is self-evident that a reasonably complex gazetteer, which holds information about geographical units and hierarchies, would be required if these types of geographical searches are to be supported.
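The Essex search described above amounts to walking a containment hierarchy of geographical units in both directions; a minimal sketch (the place names and parent table are hypothetical illustrations, not the History Data Service's actual data):

```python
# Hypothetical containment hierarchy: each unit maps to the unit that contains it.
parents = {"Colchester": "Essex", "Chelmsford": "Essex",
           "Essex": "England", "England": "United Kingdom"}

def ancestors(unit):
    """Higher-level units containing `unit`: a search for Essex should also
    match data indexed under England."""
    chain = []
    while unit in parents:
        unit = parents[unit]
        chain.append(unit)
    return chain

def descendants(unit):
    """Units contained within `unit`: extending the search below county level."""
    found = []
    for child, parent in parents.items():
        if parent == unit:
            found.append(child)
            found.extend(descendants(child))
    return found

print(ancestors("Essex"))    # containing units, nearest first
print(descendants("Essex"))  # contained units
```

A real gazetteer would also need variant name forms and time-stamped boundaries, but the upward and downward walks above are the core of the hierarchical search.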

Fig 2 Architecture
Proposed Work
In the proposed work, a new spatial analysis function that can be executed on Hadoop is explored to improve the quality of the gazetteer. Geometric transformations are used to assign ground coordinates to a map or data layer within the GIS, or to adjust one data layer so it can be correctly overlaid on another of the same area. The procedure used to accomplish this correction is termed registration.
Two approaches are used in registration: the adjustment of absolute positions and the adjustment of relative positions. Relative position refers to the location of features in relation to a geographic coordinate system. Rubber sheeting (registration by relative position) is the procedure of using "slave" and "master" mathematical transformations to adjust coverage features in a non-uniform manner. Links representing from- and to-locations are used to define the adjustment; it needs easily identifiable, accurate, well-distributed control points. Absolute position is the location in relation to the ground. This registration is done on individual layers; its advantage is that it does not propagate errors.
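Assigning ground coordinates is commonly done with an affine transformation; the sketch below uses hypothetical coefficients (in practice the coefficients would be estimated from the control points mentioned above):

```python
def make_affine(a, b, c, d, e, f):
    """Return an affine transform: x' = a*x + b*y + c, y' = d*x + e*y + f.
    Affine transforms cover scaling, rotation, shear and translation,
    which is the usual model for map-to-ground registration."""
    def transform(x, y):
        return (a * x + b * y + c, d * x + e * y + f)
    return transform

# Hypothetical registration: scale map units by 2 and shift the origin to (100, 200).
to_ground = make_affine(2.0, 0.0, 100.0, 0.0, 2.0, 200.0)
print(to_ground(10, 5))  # map point (10, 5) -> ground point (120.0, 210.0)
```

Rubber sheeting differs in that the adjustment is non-uniform: instead of one global transform, local transforms are blended between the control-point links.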
Module description

Network Setup
Mapping data to hadoop cluster
Spatial Analysis
o Format transformation
o Geometric transformation
Geo-processing workflow
Performance Evaluation


Network Setup:
In this module, the Hadoop setup is done. Location data, with their descriptions, are gathered from various environments, and the Hadoop environment to process this big data is set up. Hadoop cannot support geospatial data directly, so uploading and storing geospatial data must be done differently: the Hadoop setup is based on supporting geospatial data that contain various symbols and notations. In this module, geospatial data that are not in a text format are converted into a Hadoop-supported format; in particular, the target polygon features are converted from the standard ArcGIS format (shapefile) into the GeoJSON format.
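The shapefile-to-GeoJSON conversion itself would be carried out with ArcGIS or a dedicated conversion tool; the sketch below only illustrates the structure of the resulting GeoJSON Feature for one polygon (the place name and coordinates are hypothetical):

```python
import json

def polygon_to_geojson_feature(name, ring):
    """Wrap one polygon ring (a list of (x, y) pairs) as a GeoJSON Feature.
    GeoJSON polygons require the ring to be closed (first vertex == last)."""
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]  # close the ring if needed
    return {
        "type": "Feature",
        "properties": {"name": name},
        "geometry": {"type": "Polygon",
                     "coordinates": [[list(p) for p in ring]]},
    }

feature = polygon_to_geojson_feature("Coimbatore", [(6, 1), (7, 4), (6, 6), (5, 3)])
print(json.dumps(feature))
```

The resulting text-based file is what gets uploaded to HDFS in the next module, since plain text is what Hadoop handles natively.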
Mapping data to Hadoop cluster
In this module, the polygons' GeoJSON file is transmitted via the WebHDFS mechanism, which uses the standard HyperText Transfer Protocol (HTTP) to support all HDFS user operations, including reading files, writing data to files, creating directories, and so on. The user needs permission to access the Hadoop name node host server and to operate on HDFS. The Hadoop cluster is the corpus of all server nodes within a group (their physical locations can differ) on Hadoop. Two Hadoop components, the Hadoop Distributed File System (HDFS) and the MapReduce programming model, are implemented on our platform. HDFS is a distributed storage system for reliably storing and streaming petabytes of both unstructured and structured data on clusters. HDFS has three classes of nodes in each cluster:
Name node
Secondary name node
Data node
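The WebHDFS transfer described above is driven entirely over HTTP; the sketch below assembles the REST URL for a file-creation request (the host, port and user are hypothetical; `op=CREATE` under the `/webhdfs/v1` path is WebHDFS's documented create operation):

```python
def webhdfs_create_url(host, port, hdfs_path, user):
    """Build the WebHDFS REST URL for creating a file on HDFS.
    The real upload is a two-step HTTP PUT: the name node answers this
    request with a redirect to a data node, which receives the file body."""
    return (f"http://{host}:{port}/webhdfs/v1{hdfs_path}"
            f"?op=CREATE&user.name={user}")

# Hypothetical cluster endpoint and target path for the polygons file.
url = webhdfs_create_url("namenode.example.org", 50070,
                         "/data/polygons.geojson", "hadoop")
print(url)
```

Because the protocol is plain HTTP, any client that can issue PUT requests can load the GeoJSON file into the cluster without Hadoop client libraries.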
In this module, MapReduce programming is applied to the data entering the network. The mapper divides the data into partitions, which are mapped onto different nodes, where the spatial analysis is performed. After the spatial analysis, the data are collected and reduced into a single data set by the reducer.
The MapReduce algorithm works as follows:
Input: a place name of interest P and the textual documents containing descriptions of the places
Output: a list of word co-occurrence counts
Mapper (String P, String fileName)
    List<String> T = Tokenize(fileName);
    for all wordToken in T do
        for all rows in fileName do
            if (both wordToken and P in row) then
                Emit(String wordToken, Integer 1);
            end
        end
    end
Reducer (String wordToken, List<Integer> values)
    Integer frequency = 0;
    for all value in values do
        frequency = frequency + value;
    end
    Emit(String wordToken, Integer frequency);
Return <P, wordToken, frequency>
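The co-occurrence pseudocode above can be sketched as a plain-Python simulation (a runnable illustration without Hadoop, not the paper's actual implementation; the sample documents are hypothetical):

```python
from collections import Counter

def cooccurrence_counts(place, documents):
    """Count the words that co-occur on the same line as the place name.
    The inner loops play the mapper's role (emitting (word, 1) for each
    co-occurrence); the Counter plays the reducer's role (summing counts)."""
    counts = Counter()
    for document in documents:
        for line in document.splitlines():
            words = line.split()
            if place in words:
                for word in words:
                    if word != place:   # do not count the place name itself
                        counts[word] += 1
    return counts

docs = ["Essex is a county\nthe county town of Essex is Chelmsford"]
print(cooccurrence_counts("Essex", docs))
```

On a cluster, each mapper would process one partition of the documents in parallel and the shuffle phase would route all counts for the same word to one reducer.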

Spatial Analysis
HDFS cannot directly support standard GIS data; such data need to be converted into a format that HDFS can handle. The information is analysed and converted into an understandable form using the following approaches.
Format transformation
In this method, the coordinates of the shapes are used to convert the spatial features into a human-understandable form. The transformation makes use of the vector data model, which uses points and their x, y coordinates to construct spatial features such as points, lines, areas and regions. A point may represent a well, a benchmark or a


gravel pit, whereas a line may represent a road, a stream or an administrative boundary. The basic vector data types comprise POINTS, LINES, POLYLINES, POLYGONS and REGIONS.
POINT DATA
The point data type is the simplest. It is stored as a pair of x, y coordinates and represents an entity whose size is negligible compared to the scale of the map: it may represent a city or a town on the map of a country, or a shopping complex on a map scaled to represent the city.
LINES AND POLYLINES:
A LINE is formed by joining two POINTS end to end and is represented as a pair of POINT data types. A POLYLINE comprises a series of lines connected end to end; in spatial databases, a POLYLINE is represented as a sequence of lines such that the end point of one line is the starting point of the next line in the series. POLYLINES are used to represent spatial features such as roads, rivers, streets, highways, routes, or any one-dimensional spatial feature.
POLYGONS:
POLYGON is one of the most widely used spatial data types. Polygons capture two-dimensional spatial features such as cities, states and countries. In spatial databases, polygons are represented by the ordered sequence of the coordinates of their vertices, with the first and last coordinates being the same. In the same figure, one POLYGON is also shown; in a spatial database it would be represented as POLYGON((6 1, 7 4, 6 6, 5 3, 6 1)). Note that the first and last coordinates have to be the same.
REGIONS:
The region is another significant spatial data type. A region is a collection of overlapping, non-overlapping or disjoint polygons. Regions are used to represent spatial features such as the area burned in a 1917 forest fire together with the area burned in a 1930 fire, or the State of Hawaii, which includes several islands (polygons). A region can also have a void area contained within it, called a hole.
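The vector types above can be sketched directly as coordinate lists; the closure check and the shoelace area formula below illustrate how polygon rings are validated and measured (a sketch of the data model, not a spatial-database implementation; the coordinates echo the POLYGON((6 1, 7 4, 6 6, 5 3, 6 1)) example):

```python
# Vector data model sketch: spatial features as coordinate tuples.
point = (6.0, 1.0)                                   # a well, a benchmark, a city...
polyline = [(0, 0), (2, 1), (4, 1), (6, 3)]          # a road or river: connected segments
polygon = [(6, 1), (7, 4), (6, 6), (5, 3), (6, 1)]   # closed ring: first vertex == last
region = [polygon]                                    # a region: a collection of polygons

def is_closed(ring):
    """A valid polygon ring repeats its first vertex at the end."""
    return len(ring) >= 4 and ring[0] == ring[-1]

def shoelace_area(ring):
    """Planar area of a closed ring via the shoelace formula."""
    return abs(sum(x1 * y2 - x2 * y1
                   for (x1, y1), (x2, y2) in zip(ring, ring[1:]))) / 2.0

print(is_closed(polygon), shoelace_area(polygon))
```

Checks like ring closure are exactly the kind of per-feature work that can run independently on each node during the spatial analysis phase.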
Geometric transformation
This module concentrates on geometric shapes to construct the spatial databases. The method works using the raster data model. The raster data model uses a regular grid to cover the space, with the value in each grid cell corresponding to the characteristics of a spatial phenomenon at the cell location. Conceptually, the variation of the spatial phenomenon is reflected by changes in the cell values. A wide variety of data used in GIS are encoded in raster format, including digital elevation data, satellite images, scanned maps and graphics files. For example, a grid can be used to represent the height of every point on the land surface above sea level. This type of information, which varies smoothly from point to point across a spatial surface, is difficult to model as vector data; in this example, each grid cell stores the height above sea level of the point it represents on the surface.
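The elevation-grid example can be sketched as a small raster with a ground-coordinate lookup (the cell size, origin and elevations are hypothetical, and rows are indexed upward from the origin for simplicity; real rasters usually index from the top-left corner):

```python
# Raster data model sketch: a regular grid of cell values (elevation in metres).
elevation = [
    [12.0, 14.5, 15.0],
    [11.0, 13.0, 16.5],
    [10.0, 12.5, 18.0],
]
CELL_SIZE = 30.0                 # hypothetical ground resolution: 30 m per cell
ORIGIN = (500000.0, 4420000.0)   # hypothetical ground coordinates of the grid origin

def cell_value(x, y):
    """Look up the grid cell covering ground point (x, y):
    integer division by the cell size maps ground coordinates to indices."""
    col = int((x - ORIGIN[0]) // CELL_SIZE)
    row = int((y - ORIGIN[1]) // CELL_SIZE)
    return elevation[row][col]

print(cell_value(500070.0, 4420040.0))
```

The origin and cell size are precisely the ground-coordinate assignment that the geometric transformation provides: without them, the grid indices have no geographic meaning.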
Geo-processing workflow
The geoprocessing workflow of spatial join for Hadoop facilitates fast processing and statistics of gazetteer entries. Enabled by this new distributed geoprocessing framework, other computationally intensive spatial analysis tasks can be substantially speeded up after being decomposed into sub-processes according to the MapReduce paradigm. In this module, a GIS Join function is integrated to append the MapReduce processing results to the target features by matching the key field (e.g., the name of each polygon). As the output of this geoprocessing workflow, the aggregated features are automatically added to the display in the ArcGIS environment.
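The Join step amounts to matching the aggregated MapReduce results to the target features by a shared key field; a minimal sketch with plain Python dictionaries (not the ArcGIS Join tool itself; the field names and values are hypothetical):

```python
def join_by_key(target_features, mapreduce_results, key_field="name"):
    """Append MapReduce counts to target features by matching the key field.
    Features with no matching result get a count of 0."""
    joined = []
    for feature in target_features:
        row = dict(feature)  # copy so the input features are left untouched
        row["count"] = mapreduce_results.get(feature[key_field], 0)
        joined.append(row)
    return joined

features = [{"name": "Essex"}, {"name": "Kent"}]
counts = {"Essex": 42}
print(join_by_key(features, counts))
```

In the actual workflow the left side would be the polygon attribute table and the right side the reducer output, but the key-matching logic is the same.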
Performance Evaluation
In this module, the performance evaluation is done by comparing the existing method with the proposed methodologies, using the performance metrics of accuracy and user reputation.
CONCLUSION
In this work, gazetteers are built efficiently from geographical information on a scalable distributed platform, using different spatial data analysis techniques that support GIS formats. Thus the MapReduce step simplifies the large data and


forms an efficient gazetteer on a scalable distributed platform. The development of these gazetteers leads to a performance evaluation between the different techniques.

