Search Pubmed With R Part1

Search Pubmed with R
R Project

R is a free software environment for statistical computing, data manipulation, calculation and graphical display (1,2). For those interested, the associated Bioconductor project provides many additional R packages for statistical data analysis in different life science areas, such as tools for microarray, next generation sequence and genome analysis. The R software runs on all common operating systems (2-4). This environment facilitates the inclusion of biological metadata from literature sources such as PubMed and provides access to powerful statistical and graphical methods.
References:
1- The R Project for Statistical Computing: http://www.r-project.org/
2- W. N. Venables, D. M. Smith and the R Development Core Team. An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics. Version 2.14.2 (2012-02-29).
3- R & Bioconductor Manual. Author: Thomas Girke, UC Riverside. http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-R-Basics
4- Bioconductor: http://www.bioconductor.org/
Install R
1- Install the latest release of R according to the instructions provided by The R Project for Statistical Computing: http://www.r-project.org/
2- Once installed, open the R command window (R console).
3- In the R console, the > prompt (shown in red) is where you type commands.
4- Any text in R that follows the hash symbol # is treated as a comment and ignored.
References
1- The R Project for Statistical Computing: http://www.r-project.org/
2- Bioconductor: http://www.bioconductor.org/
3- R Tutorials. W.B. King. 2010. http://ww2.coastal.edu/kingw/statistics/R-tutorials/preliminaries.html
Install packages in R
1- In the R console, type the following to connect to Bioconductor: source("http://bioconductor.org/biocLite.R")
2- To request installation of the packages, type: biocLite()
3- Install the packages "RISmed" and "tm" by typing: biocLite(c("RISmed", "tm"))
4- Install the package "ggplot2" by typing: biocLite("ggplot2")
Package RISmed downloads content from NCBI databases.
Package tm provides text mining functionality.
Package ggplot2 is for data visualization.
References
1- Bioconductor: http://www.bioconductor.org/
RISmed package: Stephanie Kovalchik (2013). RISmed: Download content from NCBI databases. R package version 2.1.0. http://CRAN.R-project.org/package=RISmed
tm package: Ingo Feinerer and Kurt Hornik (2013). tm: Text Mining Package. R package version 0.5-8.3. http://CRAN.R-project.org/package=tm
ggplot2 package: H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009. http://had.co.nz/ggplot2/book also http://cran.r-project.org/web/packages/ggplot2/index.html
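The biocLite script used in the installation steps may no longer be available on recent R installations. Since the references list all three packages on CRAN, a minimal sketch of an alternative is to install only the packages that are not yet present with base R's install.packages(). The helper names missing_packages and install_missing are hypothetical and not part of the original slides.

```r
# Hypothetical helper: which of the requested packages are not installed yet?
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Hypothetical helper: install only the missing packages from CRAN.
install_missing <- function(pkgs) {
  to.install <- missing_packages(pkgs)
  if (length(to.install) > 0) install.packages(to.install)
  invisible(to.install)
}

# Run once; needs internet access:
# install_missing(c("RISmed", "tm", "ggplot2"))
```

Checking first avoids re-downloading packages that are already in the library.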
The R Console
Use of Packages RISmed and tm
References RISmed package: Stephanie Kovalchik (2013). RISmed: Download content from NCBI databases. R package version 2.1.0. http://CRAN.R-project.org/package=RISmed
Query PubMed titles for oncolytic virus using RISmed

Type the following in the R console:
library(RISmed)
onc <- EUtilsSummary("oncolytic virus[Majr]")
onc
# [1] "\"oncolytic viruses\"[MeSH Major Topic]"
fetch.onc <- EUtilsGet(onc)
fetch.onc
# PubMed query: "oncolytic viruses"[MeSH Major Topic] Records: 713
onc.tit <- ArticleTitle(fetch.onc)
onc.tit <- unlist(onc.tit)
# export title results as text file
write(onc.tit, file="title_oncolytic_virus.txt")
Query PubMed MeSH topics for oncolytic virus using RISmed

# Continue by typing the following in the R console:
mh <- Mesh(fetch.onc)
# for each record, join its MeSH headings into one string separated by ";"
mh.per.row <- lapply(1:length(mh), function(i) {
  mh.df.rbind <- as.data.frame(do.call(rbind, mh[i]))
  paste(mh.df.rbind$Heading, collapse = ";")
})
mh.list <- unlist(mh.per.row)
# export MeSH results as text file
write(mh.list, file="mesh_oncolytic_virus.txt")
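To make the collapsing step easier to follow without querying PubMed, here is a small base R sketch on mock data. It assumes that each element of the list is either a data frame with a Heading column or NA for a record without MeSH annotations; the mock list and the helper collapse_headings are hypothetical illustrations, not output of RISmed.

```r
# Mock stand-in for the per-record MeSH list (assumed structure, not real data)
mh <- list(
  data.frame(Heading = c("Oncolytic Viruses", "Neoplasms"),
             Type = c("Descriptor", "Descriptor"),
             stringsAsFactors = FALSE),
  NA  # a record without MeSH annotations
)

# Hypothetical helper: join the headings of one record with ";"
collapse_headings <- function(x) {
  if (is.data.frame(x)) paste(x$Heading, collapse = ";") else ""
}

# One string per record, in the same order as the titles
mh.list <- vapply(mh, collapse_headings, character(1))
mh.list
```

Using vapply() returns a character vector directly, so no unlist() step is needed, and records without headings yield an empty string so the vector stays aligned with the titles.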
View results in Excel

# export both title and MeSH results as a text file to view as a table in Excel
tit.mh <- cbind(onc.tit, mh.list)
tit.mh[1:10,] # view first 10 results
write.table(tit.mh, file="tit_mesh_oncolytic_virus.txt", row.names=FALSE, sep="\t")
# open the file in Excel
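Before opening the exported file in Excel, it can be read back into R to confirm that the two columns line up. The sketch below uses mock titles and MeSH strings and a temporary file; the column names Title and Mesh and the file name are assumptions chosen for illustration.

```r
# Mock data standing in for onc.tit and mh.list (not real PubMed results)
onc.tit <- c("Example title one", "Example title two")
mh.list <- c("Term A;Term B", "Term C")

# A data frame gives the exported table explicit column names
tit.mh <- data.frame(Title = onc.tit, Mesh = mh.list, stringsAsFactors = FALSE)

# Write a tab-separated file to a temporary location
out.file <- file.path(tempdir(), "tit_mesh_example.txt")
write.table(tit.mh, file = out.file, row.names = FALSE, sep = "\t")

# Read it back to check the layout before opening it in Excel
check <- read.delim(out.file, stringsAsFactors = FALSE)
check
```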
Column containing titles
Column containing corresponding Mesh terms
Preparing for Text Mining Analysis
Type getwd() in the R console to display the R working directory. In my case: [1] "C:/Documents and Settings/PMarqui/My Documents"
Now create a new folder in the R working directory and give it a name (for example, OncolyticVirus).
Place two of the recently created text files in the new folder: title_oncolytic_virus.txt and mesh_oncolytic_virus.txt
Start the Text Mining Analysis.
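The folder preparation described above can also be done from the R console rather than by hand. This is a minimal sketch using only base R; the function name prepare_corpus_dir and its arguments are hypothetical.

```r
# Hypothetical helper: create the corpus folder and copy the text files into it.
prepare_corpus_dir <- function(files, folder = "OncolyticVirus", base = getwd()) {
  target <- file.path(base, folder)
  dir.create(target, showWarnings = FALSE)
  # copy only the files that actually exist in the base directory
  found <- files[file.exists(file.path(base, files))]
  file.copy(file.path(base, found), target, overwrite = TRUE)
  list.files(target)  # report what the folder now contains
}

# Usage, from the R working directory:
# prepare_corpus_dir(c("title_oncolytic_virus.txt", "mesh_oncolytic_virus.txt"))
```

The files are copied rather than moved, so the originals stay in the working directory.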
Text Mining Analysis

# Type the following in the R console
library(tm) # loads the text mining package
my.corpus <- Corpus(DirSource("OncolyticVirus"), readerControl=list(reader=readPlain))
# Note that "OncolyticVirus" is the name of the newly created folder. In DirSource() you must use the name given to the folder containing the 2 text files.
# With tm 0.6 or later, base R functions such as gsub and tolower must be wrapped in content_transformer(), e.g. tm_map(my.corpus, content_transformer(tolower))
my.corpus <- tm_map(my.corpus, stripWhitespace) # removes extra whitespace
my.corpus <- tm_map(my.corpus, gsub, pattern="[^[:alnum:][:space:]]", replacement=" ") # replaces every character that is not a letter, digit or whitespace (including the dash "-") with a space
# my.corpus <- tm_map(my.corpus, removeNumbers) # removes numbers (optional)

# Continue and type the following code in the R console:
my.corpus <- tm_map(my.corpus, tolower) # conversion to lower case letters
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english")) # removes stopwords
my.corpus <- tm_map(my.corpus, stemDocument) # removes suffixes from words to get their common origin
my.corpus.matrix <- TermDocumentMatrix(my.corpus) # creates a term-document matrix
mat.my.corpus <- as.matrix(my.corpus.matrix) # creates a matrix
my.corpus.df <- as.data.frame(mat.my.corpus) # creates a data frame from the matrix, displaying all the terms found in either of the 2 documents
my.corpus.df[200:250,1:2] # view some of the terms
copy.my.corpus.df <- my.corpus.df # make a copy of my.corpus.df to keep the original data frame for later
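To see what the resulting data frame holds, the following base R sketch builds a tiny term-by-document count table by hand. It is not the tm implementation: it only splits on whitespace and counts, and the two mock documents are invented for illustration.

```r
# Two mock documents (invented text, not PubMed output)
docs <- c(title = "oncolytic virus therapy virus",
          mesh  = "oncolytic viruses neoplasms")

# Split each document into lower-case tokens
tokens <- strsplit(tolower(docs), "[[:space:]]+")
names(tokens) <- names(docs)

# All distinct terms across both documents
terms <- sort(unique(unlist(tokens)))

# Rows = terms, columns = documents, cells = how often the term occurs
counts <- vapply(tokens,
                 function(tok) as.integer(table(factor(tok, levels = terms))),
                 integer(length(terms)))
rownames(counts) <- terms

counts.df <- as.data.frame(counts)
counts.df
```

Each row is a term and each column a document, which is the same layout that my.corpus.df has after the conversion from the term-document matrix.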

# sort the data frame by the most frequent MeSH terms
my.corpus.df <- my.corpus.df[order(my.corpus.df$mesh_oncolytic_virus.txt, decreasing = TRUE),]
# assign the 50 most frequent MeSH terms to xx
xx <- my.corpus.df[1:50,]
# view the top 5 most frequent MeSH terms; head(xx, 5) is equivalent
xx[1:5,]
# sort the 50 most frequent MeSH terms in increasing order (for plot visualization)
xx <- xx[order(xx$mesh_oncolytic_virus.txt, decreasing = FALSE),]
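The sort-and-select step can be wrapped in a small reusable function. The sketch below runs on a mock data frame with invented counts; the helper name top_terms is hypothetical.

```r
# Mock term counts (invented numbers), shaped like my.corpus.df
my.corpus.df <- data.frame(
  mesh_oncolytic_virus.txt  = c(3, 10, 1, 7),
  title_oncolytic_virus.txt = c(2, 4, 0, 5),
  row.names = c("gene", "virus", "mice", "oncolyt")
)

# Hypothetical helper: the n most frequent terms in one column, highest first
top_terms <- function(df, column, n = 50) {
  ranked <- df[order(df[[column]], decreasing = TRUE), , drop = FALSE]
  head(ranked, n)  # head() also copes with fewer than n rows
}

xx <- top_terms(my.corpus.df, "mesh_oncolytic_virus.txt", n = 2)
xx
```

Unlike indexing with [1:50,], head() does not produce NA rows when the data frame has fewer rows than requested.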

# Plot the 50 most frequent MeSH terms using the ggplot2 library
library(ggplot2)
Terms <- rownames(xx)
Mesh.count <- xx$mesh_oncolytic_virus.txt
ggplot(xx) + geom_point(aes(Terms, Mesh.count), stat = "identity", colour = "darkblue") + coord_flip() + theme_bw()
# keep the terms in their sorted order on the axis
p1 <- last_plot() + scale_x_discrete(limits = Terms)
p1

View the 50 most frequent MeSH terms
End of Part 1
