Welcome to Scribd!

Memex GeoParser 2016

Uploaded by

0% found this document useful (0 votes)

65 views9 pages

GeoParser is an open source project that extracts geospatial information from documents and visualizes it on a map. It uses Apache Tika to extract text and metadata from files, Apache OpenNLP for natural language processing, a Lucene gazetteer for geocoding locations, and Apache Solr for indexing and searching over millions of documents. The project aims to advance online search capabilities for DARPA's Memex program beyond the current state-of-the-art.

Original Description:

memex as memory index

Copyright

Available Formats

PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

0% found this document useful (0 votes)

65 views9 pages

Memex GeoParser 2016

Uploaded by

Ahyar Ajah

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Jump to Page

You are on page 1of 9

Search inside document

Memex GeoParser

Mazi Boustani, Madhav Sharan, Chris Mattmann, Lauren Wong

NASA/JPL - USC

About Memex

Memex is a 3 years DARPA funded project which seeks to develop software

that advances online search capabilities far beyond the current state of the art.

Three technical areas - data gathering, tools development and technology

deployment in the field.

Revolutionize discovery, organization and presentation of search results.

http://memex.jpl.nasa.gov/

What is GeoParser

One of Memex sub projects, it is open source

Extract geospatial information from any type of file as well as indexed data

Visualize extracted information on map

Search capabilities over textual data

Example: (http://www.marriott.com/hotels/travel/rdumc-raleigh-marriott-city-center)
1.
Madrid
2.
Taiwan
3.
NorthCarolina
4.
Raleigh
5.
HongKongSpecialAdministrativeRegion

https://github.com/MBoustani/GeoParser

Technologies
-

Apache Tika
-

Khooshe
-

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most
common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity recognition, chunking,
parsing, and coreference resolution.

Apache Lucene (Geo Gazetteer)

Big GeoSpatial Data Points Visualization Tool by using vector tiles [https://github.com/MBoustani/Khooshe]

Apache OpenNLP
-

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and
PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content
analysis, translation, and much more.

A command line gazetteer built around the Geonames.org dataset, that uses the Apache Lucene library to create a searchable
gazetteer.

Apache Solr
-

Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated
failover and recovery, centralized configuration and more.

GeoParser flow

Apache Tika

Apache OpenNLP

Lucene Geo Gazetteer

Major Challenges we addressed

Indexing a solr core of 7 million documents.

Representing 20 million points on a map.

Speeding up the process to 700 docs per minute.

Server side clustering using Khooshe

Finding latitude longitude of a diverse geo locations.

Submitted a paper in IRI 2016

An Automatic Approach for Discovering and Geocoding Locations in Domain-Specific Web Data

Plotting Khooshe layers to OpenLayers 3.

Acknowledgements

This work was supported by the DARPA XDATA/Memex program.

NSF Polar Cyberinfrastructure award numbers PLR-1348450 and PLR144562 funded a portion of the work.

Effort supported in part by JPL, managed by the California Institute of

Technology on behalf of NASA.

References

GeoParser: https://github.com/MBoustani/GeoParser
Memex: http://memex.jpl.nasa.gov/
Apache Tika: https://tika.apache.org/
Khooshe: https://github.com/MBoustani/Khooshe
Lucene Geo Gazetteer: https://github.com/chrismattmann/lucene-geogazetteer
Apache OpenNLP: https://opennlp.apache.org/
Apache Solr: http://lucene.apache.org/solr/

Web Crawler & Scraper Design and Implementation
Document9 pages
Web Crawler & Scraper Design and Implementation
kassila
100% (1)
Cloud Enabled Robots
Document28 pages
Cloud Enabled Robots
Erico Guizzo
100% (1)
FIFA Manager 13
Document51 pages
FIFA Manager 13
Kenan Kahrić
0% (2)
The Design and Implementation of Erachnid: An Extensible, Scalable Web Crawler in Erlang
Document10 pages
The Design and Implementation of Erachnid: An Extensible, Scalable Web Crawler in Erlang
somesh22122
No ratings yet
Build Geodatabase in Oracle PDF
Document142 pages
Build Geodatabase in Oracle PDF
Bimo Fachrizal Arvianto
100% (1)
Work Experiences-Farzaneh Gerami
Document7 pages
Work Experiences-Farzaneh Gerami
Farzane Gerami
No ratings yet
Paper 1
Document8 pages
Paper 1
Rakeshconclave
No ratings yet
Resume NiteshKedia
Document1 page
Resume NiteshKedia
Nitesh Kedia
No ratings yet
Department of Informatics University of Leicester CO7201 Individual Project
Document8 pages
Department of Informatics University of Leicester CO7201 Individual Project
qwerty
No ratings yet
Resume Yan - 20090604
Document3 pages
Resume Yan - 20090604
yan qi
No ratings yet
Fninf 03 022 PDF
Document10 pages
Fninf 03 022 PDF
tonylee24
No ratings yet
49 T100 PDF
Document5 pages
49 T100 PDF
kam_anw
No ratings yet
A Comparison of Big Remote Sensing Data Processing With Hadoop MspReduce and Spark
Document4 pages
A Comparison of Big Remote Sensing Data Processing With Hadoop MspReduce and Spark
Sara sherif
No ratings yet
Resume Nilesh
Document3 pages
Resume Nilesh
Shyam Singh
No ratings yet
1 s2.0 S1570826810000612 Main
Document9 pages
1 s2.0 S1570826810000612 Main
Ahmed Amamou
No ratings yet
Summer Training AT S.R.S.A.C.: Project Name:-Super 6 Webserver
Document24 pages
Summer Training AT S.R.S.A.C.: Project Name:-Super 6 Webserver
Gaurav Goyal
No ratings yet
Research Paper
Document4 pages
Research Paper
premacute
No ratings yet
M2 Bigdata&Hadoop
Document27 pages
M2 Bigdata&Hadoop
Shreeveni
No ratings yet
Hidden Web Crawler Research Paper
Document5 pages
Hidden Web Crawler Research Paper
afnkcjxisddxil
100% (1)
p181 - Mars Exploration Rover
Document13 pages
p181 - Mars Exploration Rover
jnanesh582
No ratings yet
Google Earth Engine Planetary-Scale Geospatial Ana
Document10 pages
Google Earth Engine Planetary-Scale Geospatial Ana
jhunior mayco tan vásquez
No ratings yet
Web-Based Mineral Information System of Thailand U
Document9 pages
Web-Based Mineral Information System of Thailand U
Cường Nguyễn
No ratings yet
Compare Data Mining Tools
Document11 pages
Compare Data Mining Tools
npnbkck
No ratings yet
SPIDER ASystemforScalableParallelDistributedEvaluationoflarge ScaleRDFData
Document5 pages
SPIDER ASystemforScalableParallelDistributedEvaluationoflarge ScaleRDFData
Kishore Kumar RaviChandran
No ratings yet
Effective Query Processing For Web-Scale RDF Data Using Hadoop Components
Document7 pages
Effective Query Processing For Web-Scale RDF Data Using Hadoop Components
Lakshmi Srinivasulu
100% (1)
Hyperjournal, PHP Scripting and Semantic Web Technologies For The Open Access
Document6 pages
Hyperjournal, PHP Scripting and Semantic Web Technologies For The Open Access
Ganang Ka
No ratings yet
Hadoop GIS A High Performance Spatial Data
Document12 pages
Hadoop GIS A High Performance Spatial Data
Hendra_aj
No ratings yet
Semantic Search Engine
Document12 pages
Semantic Search Engine
Sajid Sheikh
No ratings yet
Semantic Crawling: An Approach Based On Named Entity Recognition
Document5 pages
Semantic Crawling: An Approach Based On Named Entity Recognition
Kanishka Silva
No ratings yet
Activity: NAME: Chogle Saif Ali ROLLNO.: 12CO27 Class: Be-Co Summary: Components of Hadoop Ecosystem
Document5 pages
Activity: NAME: Chogle Saif Ali ROLLNO.: 12CO27 Class: Be-Co Summary: Components of Hadoop Ecosystem
Saif Chogle
No ratings yet
Search Engine Case Study: Searching The Web Using Genetic Programming and MPI
Document19 pages
Search Engine Case Study: Searching The Web Using Genetic Programming and MPI
Satyam Saurabh
No ratings yet
Beginning Database Design
Document2 pages
Beginning Database Design
I Made Putrama
No ratings yet
Artigo PingER SLAC
Document7 pages
Artigo PingER SLAC
Leandro Justino
No ratings yet
Apache Hadoop Next Generation Compute Platform: © Hortonworks Inc. 2013
Document22 pages
Apache Hadoop Next Generation Compute Platform: © Hortonworks Inc. 2013
DSunte Wilson
No ratings yet
Fast Spatio-Temporal Data Mining of Large Geophysical Datasets
Document6 pages
Fast Spatio-Temporal Data Mining of Large Geophysical Datasets
ashalizajohn
No ratings yet
Sesi 2 Manajemen Big Data
Document46 pages
Sesi 2 Manajemen Big Data
alpaomega
No ratings yet
Efficient Job Execution For Map Reduce Using Phase-Level Scheduling Algorithm
Document5 pages
Efficient Job Execution For Map Reduce Using Phase-Level Scheduling Algorithm
Editor IJAERD
No ratings yet
Low Latency Apache Spark
Document5 pages
Low Latency Apache Spark
Andres Benavides
No ratings yet
Ashu Singh's Resume
Document2 pages
Ashu Singh's Resume
asamra1986
No ratings yet
DMBI Presentations Unit-8
Document28 pages
DMBI Presentations Unit-8
Nayan Patel
No ratings yet
TMP - 11927-Information Retrieval From Big Data For Sensor Data Collection-520372139741037689682152
Document3 pages
TMP - 11927-Information Retrieval From Big Data For Sensor Data Collection-520372139741037689682152
Boy Nata
No ratings yet
Grafa: Scalable Faceted Browsing For RDF Graphs
Document16 pages
Grafa: Scalable Faceted Browsing For RDF Graphs
Sherman Monroe
No ratings yet
Spatial Big Data: Joe Niemi
Document47 pages
Spatial Big Data: Joe Niemi
Syahru Ramdhoni
No ratings yet
Fregrapad: Frequent RDF Graph Patterns Detection For Semantic Data Streams
Document9 pages
Fregrapad: Frequent RDF Graph Patterns Detection For Semantic Data Streams
Nguyễn Nga Pisces
No ratings yet
EXP2
Document3 pages
EXP2
Visual Ads
No ratings yet
BY:-Farooq Ahmad Dept. of Computer Science
Document6 pages
BY:-Farooq Ahmad Dept. of Computer Science
Farooqkashmir Bhat
No ratings yet
Apache Hadoop Next Generation Compute Platform: Bikas Saha @bikassaha
Document22 pages
Apache Hadoop Next Generation Compute Platform: Bikas Saha @bikassaha
Rohit
No ratings yet
Project Muse 206445
Document11 pages
Project Muse 206445
tuhinscribd007
No ratings yet
Synopsis "Time Series Geospatial Big Data Analysis Using Array Database"
Document5 pages
Synopsis "Time Series Geospatial Big Data Analysis Using Array Database"
saurabh
No ratings yet
NASA: InteroperabilityPrototype
Document9 pages
NASA: InteroperabilityPrototype
NASAdocuments
No ratings yet
Architecture of A Grid-Enabled Web Search Engine
Document15 pages
Architecture of A Grid-Enabled Web Search Engine
Bama Raja Segaran
No ratings yet
Data Engineering - PPP Rank
Document15 pages
Data Engineering - PPP Rank
saisuchandan
No ratings yet
SPREE: Object Prefetching For Mobile Computers
Document18 pages
SPREE: Object Prefetching For Mobile Computers
mohitsaxena019
No ratings yet
Compusoft, 2 (6), 171-175 PDF
Document5 pages
Compusoft, 2 (6), 171-175 PDF
Ijact Editor
No ratings yet
Apache Hadoop YARN - Enabling Next Generation Data Applications
Document64 pages
Apache Hadoop YARN - Enabling Next Generation Data Applications
Engin Sözer
No ratings yet
A Scalable Geospatial Web Service For Near Real-Time, High-Resolution Land Cover Mapping
Document10 pages
A Scalable Geospatial Web Service For Near Real-Time, High-Resolution Land Cover Mapping
Phani N
No ratings yet
Unit-3 Lecture 31
Document12 pages
Unit-3 Lecture 31
Harsh Babu
No ratings yet
Fidoop
Document5 pages
Fidoop
JournalNX - a Multidisciplinary Peer Reviewed Journal
No ratings yet
File000000 1302989981
Document20 pages
File000000 1302989981
黄伟
No ratings yet
An Investigation into the Use of a Neural Tree Classifier for Knowledge Discovery in OLAP Databases
From Everand
An Investigation into the Use of a Neural Tree Classifier for Knowledge Discovery in OLAP Databases
David R Swinburne
No ratings yet
A Librarian's Guide to Graphs, Data and the Semantic Web
From Everand
A Librarian's Guide to Graphs, Data and the Semantic Web
James Powell
No ratings yet
Mail Server Zimbra
Document19 pages
Mail Server Zimbra
Ahyar Ajah
No ratings yet
Microsoft Word - CDCP
Document3 pages
Microsoft Word - CDCP
Ahyar Ajah
No ratings yet
OpenBSD FAQ
Document245 pages
OpenBSD FAQ
Ahyar Ajah
No ratings yet
Owncloud Developer Manual: Release 8.1
Document178 pages
Owncloud Developer Manual: Release 8.1
Ahyar Ajah
No ratings yet
Rituals Taken From Filipino Sorcerer - For Success To Your Plans and Personal Undertakings and For Those Who Has Hatre - Wattpad
Document3 pages
Rituals Taken From Filipino Sorcerer - For Success To Your Plans and Personal Undertakings and For Those Who Has Hatre - Wattpad
Gene Ganutan
50% (2)
Quiz 2. Lesson 3
Document3 pages
Quiz 2. Lesson 3
Louis Gabriel Adaya
No ratings yet
13th Sunday in Ordinary Time
Document3 pages
13th Sunday in Ordinary Time
Henry Kaweesa
No ratings yet
5% Flag
Document3 pages
5% Flag
Michael Muhammad
100% (2)
Politik Identitas
Document10 pages
Politik Identitas
popochan popochan1
No ratings yet
Unit 6 Spec
Document10 pages
Unit 6 Spec
ayoemmanuel36
No ratings yet
A Corpus Based Comparative Ideational Meta Functional Analysis of Pakistani English and UK English Newspaper Editorials On COVID 19
Document19 pages
A Corpus Based Comparative Ideational Meta Functional Analysis of Pakistani English and UK English Newspaper Editorials On COVID 19
Awais Malik
No ratings yet
It's Modal Madness
Document1 page
It's Modal Madness
Xander Thrumble
No ratings yet
Platonic Mysticism Contemplative Science, Philosophy, Literature, and Art (Arthur Versluis (Versluis, Arthur) )
Document174 pages
Platonic Mysticism Contemplative Science, Philosophy, Literature, and Art (Arthur Versluis (Versluis, Arthur) )
Juliano Queiroz Aliberti
100% (2)
Lady of Shalott
Document16 pages
Lady of Shalott
Kevin Lumowa
0% (1)
Webcenter Portal Installation PDF
Document72 pages
Webcenter Portal Installation PDF
ahmed_sft
No ratings yet
B. Wordsworth Text PDF
Document4 pages
B. Wordsworth Text PDF
Alejandro Caceres
No ratings yet
Lecture 5
Document5 pages
Lecture 5
Екатерина Митюкова
No ratings yet
Cool Kids 1 SB Sample
Document12 pages
Cool Kids 1 SB Sample
Barbara Arenas Gutierrez
No ratings yet
Pengembangan Bahan Ajar Berbasis Modul Interaktif Bagi Pemelajar Bipa Tingkat A1
Document18 pages
Pengembangan Bahan Ajar Berbasis Modul Interaktif Bagi Pemelajar Bipa Tingkat A1
Zakia Salma
No ratings yet
List of Recognized Boards/Councils-2022: SL No Board / Council
Document2 pages
List of Recognized Boards/Councils-2022: SL No Board / Council
Parvez
No ratings yet
Browser's XSS Filter Bypass Cheat Sheet Masatokinugawa - Filterbypass Wiki GitHub
Document20 pages
Browser's XSS Filter Bypass Cheat Sheet Masatokinugawa - Filterbypass Wiki GitHub
Abhisar Kam
No ratings yet
Allin All Book 2
Document82 pages
Allin All Book 2
Lyon Woo
No ratings yet
Varga Charts
Document39 pages
Varga Charts
alnewman
100% (2)
Tet Holiday Grade 5
Document7 pages
Tet Holiday Grade 5
Dương Tú Quyên
No ratings yet
Tulukkar
Document1 page
Tulukkar
Yams Sp
No ratings yet
Chapter1 - Database Concepts
Document23 pages
Chapter1 - Database Concepts
tinishdharan
100% (1)
Rachana Devara: @bvrit - Ac.in
Document2 pages
Rachana Devara: @bvrit - Ac.in
Devara Rachana
No ratings yet
Making Sense of Words
Document10 pages
Making Sense of Words
gnobe
No ratings yet
BIOS Level Programming
Document39 pages
BIOS Level Programming
Mehmet Demir
No ratings yet
Chapter 3: Fixed Asset Reclassifications
Document6 pages
Chapter 3: Fixed Asset Reclassifications
Alice Wairimu
No ratings yet
Ramadan Dua List
Document12 pages
Ramadan Dua List
test
No ratings yet
LotR TCG - 3 - Realms of The Elf Lords Rulebook
Document22 pages
LotR TCG - 3 - Realms of The Elf Lords Rulebook
mrtibbles
100% (1)