Pollutantmean Assignment Help


This is my first time trying to import multiple CSV files into R; part of the assignment requires using some of those CSV files to calculate the mean of sulfate and nitrate. I searched for answers here on Stack Overflow and on other sites, but I wasn't able to fix the issue based on the existing questions about this topic. I'm also new to R programming.

If it's useful: R version 3.2.1, Mac OS X version 10.7.5.

I have a Coursera assignment with 332 CSV files, and I have to calculate the mean of the pollutants they contain.

Link to download the file: https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip

Assignment Part 1:

Write a function named 'pollutantmean' that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function 'pollutantmean' takes three arguments: 'directory', 'pollutant', and 'id'. Given a vector monitor ID numbers, 'pollutantmean' reads that monitors' particulate matter data from the directory specified in the 'directory' argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.

Prototype of the function:

My outcome should be that:

I already created my working directory, and this is where I got stuck.

Whenever I try F1 <- read.csv("name of the file", header=TRUE), the error that appears is:

    Error in file(file, "rt") : cannot open the connection
    In addition: Warning message:
    In file(file, "rt") : cannot open file 'nameoffile.csv': No such file or directory

When I use read.table(file.choose(), header=TRUE), it works for all the files except the first one (001.csv), which gives:

    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 1 did not have 7 elements

When I try sapply(filelist, read.csv), the same error appears. And when I use read.csv, sapply, or lapply on the "specdata" directory itself, the error is:

    Error in read.table(file = file, header = header, sep = sep, quote = quote, : no lines available in input

even though all 332 CSV files are in the "specdata" folder.
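As a sanity check, here is a minimal sketch of the pattern that avoids the "cannot open the connection" error: build the full path with file.path() instead of relying on the working directory. The scratch folder and the one-row file below are invented for illustration; they only mimic the shape of the assignment files.

```r
# create a throwaway folder standing in for "specdata"
dir <- file.path(tempdir(), "specdemo")
dir.create(dir, showWarnings = FALSE)

# write a one-row CSV shaped like the assignment files (values invented)
write.csv(data.frame(Date = "2003-01-01", sulfate = 5.1, nitrate = 0.65, ID = 1),
          file.path(dir, "001.csv"), row.names = FALSE)

# read.csv fails with "cannot open the connection" when the path is wrong,
# so construct the path explicitly with file.path()
f1 <- read.csv(file.path(dir, "001.csv"), header = TRUE)
```

If file.exists() on the constructed path returns FALSE, the problem is the path (or the working directory), not read.csv itself.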

I hope I posted everything needed for a reproducible example. If anything more is needed, just let me know.

Thanks!


I’m currently going through the Johns Hopkins Data Science specialization. So far the courses are okay. They are pretty tough, so if you are a complete beginner you can complement them with a DataCamp course if you need more practice. The only annoying part about this class is that some of the functions you need to complete the assignments are never mentioned in the lectures. Luckily there are TA hints to supplement that gap.

I am a researcher. When I run into a problem, I go online and try to ask the right question, talk to people with expert knowledge of the subject, or find the answer in a book at my local library. But being a programmer or data scientist means breaking problems down, and the toughest part for me is getting comfortable with that kind of problem solving.

Background about the data

Air pollution: see the report I wrote about it earlier.

The first thing I do whenever I get data is explore the file in Excel to get a better understanding of its structure. Each file in the specdata folder contains data for one monitor.

  • Date: the date of the observation
  • sulfate: the level of sulfate particulate matter in the air on that date
  • nitrate: the level of nitrate particulate matter in the air on that date

For the pollutantmean function, the prompt states:

Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector monitor ID numbers, ‘pollutantmean’ reads that monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.

At this point I don’t really know what I’m doing. It’s just me staring at the computer for an hour…

With every coding problem I try to think about things at a higher level. If I understand the problem, I can muddle my way through the syntax of the code.

What are they asking for here?

They just want the mean of the pollutants by ID.

Now that the higher-level thinking is done, we break the problem into steps.

  1. read in all the data files
  2. merge the data files into one data frame
  3. ignore the NAs
  4. subset the data frame by pollutant
  5. calculate the mean

Let's find out what the mean function requires.

?mean

    Usage

    mean(x, ...)

    ## Default S3 method:
    mean(x, trim = 0, na.rm = FALSE, ...)

    Arguments

    x      An R object. Currently there are methods for numeric/logical vectors
           and date, date-time and time interval objects. Complex vectors are
           allowed for trim = 0, only.
    trim   The fraction (0 to 0.5) of observations to be trimmed from each end
           of x before the mean is computed. Values of trim outside that range
           are taken as the nearest endpoint.
    na.rm  A logical value indicating whether NA values should be stripped
           before the computation proceeds.

According to the documentation, mean operates on a single R object. But there is a problem.


The information for each monitor is in an individual CSV file, and the monitor ID matches the CSV file's name. To find the mean of a pollutant across monitors with different IDs, we have to put each monitor's information into one data frame.
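A minimal illustration of that merging step, with two invented stand-ins for monitor files:

```r
# two tiny data frames standing in for individual monitor files
m1 <- data.frame(sulfate = c(1.2, 3.4), nitrate = c(0.5, 0.7))
m2 <- data.frame(sulfate = 2.8, nitrate = 0.9)

# rbind stacks data frames that share the same columns into one
combined <- rbind(m1, m2)
```

Because every monitor file has the same columns, rbind works without any extra bookkeeping.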


At this point I did not know how to walk through a directory of files. In Python there is os.walk; in R there is the list.files function.
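A quick sketch of list.files on a scratch directory (the folder and the empty files are created just for this demo):

```r
# make a scratch directory with two empty .csv files
dir <- file.path(tempdir(), "listdemo")
dir.create(dir, showWarnings = FALSE)
file.create(file.path(dir, c("001.csv", "002.csv")))

# full.names = TRUE returns full paths you can hand straight to read.csv
filesD <- list.files(dir, full.names = TRUE)
```

Without full.names = TRUE you get only the bare file names, which is exactly what causes the "No such file or directory" error when the working directory is elsewhere.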

To perform the same actions on multiple files, I looped through the list.

pollutantmean <- function(directory, pollutant, id = 1:332) {
  # create a list of files
  filesD <- list.files(directory, full.names = TRUE)
  # create an empty data frame
  dat <- data.frame()
  # loop through the files for each requested id
  for (i in id) {
    # read in the file
    temp <- read.csv(filesD[i], header = TRUE)
    # add the file's rows to the main data frame
    dat <- rbind(dat, temp)
  }
  # find the mean of the pollutant, making sure to remove NA values
  return(mean(dat[, pollutant], na.rm = TRUE))
}


Part 2

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.

This question was easier than part 1, and it is simple to understand: they just want a report of the number of complete cases in each data file. It reminds me of the nobs function in SAS, another statistical package that is widely used in business, whereas R seems to be used primarily in academia and research.

Steps

  1. read in the files
  2. remove the NAs from the set
  3. count the number of rows
  4. create a new data set with two columns: the monitor's ID number and the number of observations
complete <- function(directory, id = 1:332) {
  # create a list of files
  filesD <- list.files(directory, full.names = TRUE)
  # create an empty data frame
  dat <- data.frame()
  for (i in id) {
    # read in the file
    temp <- read.csv(filesD[i], header = TRUE)
    # delete rows that do not have complete cases
    temp <- na.omit(temp)
    # count all of the rows with complete cases
    tNobs <- nrow(temp)
    # record the monitor id alongside its number of complete cases
    dat <- rbind(dat, data.frame(i, tNobs))
  }
  return(dat)
}
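The core of steps 2 and 3 can be sketched on an invented data frame; na.omit() drops any row containing an NA, and complete.cases() flags the rows that would survive:

```r
# toy monitor data with missing values (values invented)
temp <- data.frame(sulfate = c(2.1, NA, 4.3), nitrate = c(0.4, 0.6, NA))

# na.omit() returns only the fully observed rows
kept <- na.omit(temp)

# complete.cases() returns a logical vector, one entry per row
flags <- complete.cases(temp)
```

Here only the first row has both values, so one row survives either way; the two functions are interchangeable for this counting task.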


Part 3

Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0. A prototype of this function follows.

Part 3 was a doozy. It's similar to part 1 in that we are basically aggregating data, but this time we are aggregating it inside a vector instead of a data frame.

  1. read in the files
  2. remove the NAs from the set
  3. check whether the number of complete cases is greater than the threshold
  4. find the correlation between the two pollutants

To combine data frames row-wise we used rbind (combine rows); its counterpart cbind combines columns. Each correlation here is a single number, though, so to grow a plain vector we append with c() (concatenate) instead.
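A tiny illustration of that vector-building pattern, using invented numbers; a perfectly linear pair has a correlation of 1, and c() appends it to the results vector:

```r
# start with an empty numeric vector, as in the corr function
dat <- vector(mode = "numeric", length = 0)

# a perfectly linear toy relationship
r <- cor(c(1, 2, 3, 4), c(2, 4, 6, 8))

# append the single correlation value with c(), not rbind/cbind
dat <- c(dat, r)
```

Each file that passes the threshold contributes one more element this way, and if none pass, the vector stays at length 0, exactly as the prompt requires.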

corr <- function(directory, threshold = 0) {
  # create a list of file names
  filesD <- list.files(directory, full.names = TRUE)
  # create an empty numeric vector
  dat <- vector(mode = "numeric", length = 0)
  for (i in 1:length(filesD)) {
    # read in the file
    temp <- read.csv(filesD[i], header = TRUE)
    # delete rows with NAs
    temp <- temp[complete.cases(temp), ]
    # count the number of complete observations
    csum <- nrow(temp)
    # if the number of rows is greater than the threshold
    if (csum > threshold) {
      # find the correlation between nitrate and sulfate for that file
      # and append it to the vector with c(); since dat is not a data
      # frame we cannot use rbind or cbind
      dat <- c(dat, cor(temp$nitrate, temp$sulfate))
    }
  }
  return(dat)
}

In retrospect, this assignment is very useful. In most of the data science courses throughout this specialization you will be selecting, aggregating, and doing basic statistics with data.
