Skip to main content

Initial data analysis functions

Create a function in R:

Function function() allows us to create specific functions. Between parenthesis we put the name of the parameters we want to create to be used in our function.

    Function(parameter1, parameter2, ...){
        statements
        return(object) }

NameFunction0 = function(x,y){
  return((x+y)/2)
}
NameFunction0(4,5)
## [1] 4.5
If we want use a parameter that may or may not be specified we use parameter = NULL:
NameFunction1 = function(x,y = NULL) {   
  if(!is.null(y)) x = x+y
  return(x)
}
NameFunction1(5,10) #with 2 paramers (x,y)
## [1] 15
NameFunction1(5) #with only only parameter (x)
## [1] 5


Initial Data Analysis Functions

Here we have created some basic functions that can be used for initial data analysis of our data:
INLIERS returns the values of a vector that are between the vales start and end:
INLIERS = function(x,start,end){
    x1 = x[x>start & x<end]
    x2 = sort(x1)
return(x2)}

x = c(1:100, NA, NA)
x
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100  NA  NA
INLIERS(x, 2,55)
##  [1]  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [24] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## [47] 49 50 51 52 53 54
OUTLIERS returns the values of a vector that are outside the vales start and end:
OUTLIERS = function(x,s,e){
  n = x[is.na(x) == TRUE]
  x1 = x[x<s | x>e]
  x2 = sort(x1)
  return(c(x2,n))
}

OUTLIERS(x, 2,55)
##  [1]   1  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
## [18]  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88
## [35]  89  90  91  92  93  94  95  96  97  98  99 100  NA  NA
The following functions OUTMatrix() and INMatrix() do the same as the funcitons INLIERS() and OUTLIERS() but instead of working with a vector, here we work with a matrix.
head(iris) #here we are using the iris dataset
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
OUTMatrix = function(x,y,s,e){
   mat = x[(y<s | y>e),]
   return(mat)
   }
OM = OUTMatrix(iris, iris$Sepal.Length,5,6)
head(OM)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
INMatrix = function(x,y,s,e){
  mat = x[(y>s & y<e),]
  return(mat)
}
IM = INMatrix(iris, iris$Sepal.Length,5,6)
head(IM)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 15          5.8         4.0          1.2         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 17          5.4         3.9          1.3         0.4  setosa
INFOOULIERS() returns a summary info about the values that are outside the range of the values selected (s,e):
INFOOUTLIERS = function(x,s,e){
  x1 = x[x<s | x>e]
  n = x[is.na(x) == TRUE]
  x2 = sort(x1)
  x3 = unique(x2)
  n1 = unique(n)
  x4 = min(na.omit(x1))
  x5 = max(na.omit(x1))
  x6 = length(x1)
  return(cat("\n", "SORTED OUTLIERS:", x2, n, "||", "\n", "OUTLIER VALUES:", x3, n1, "||","\n", "MIN:", x4," ", "MAX:", x5,"||","\n", "QUANTITY OF OUTLIERS:", x6, "||"))
}
x = c(1:50, 1:50, NA, NA)
x
##   [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
##  [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
##  [47] 47 48 49 50  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
##  [70] 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42
##  [93] 43 44 45 46 47 48 49 50 NA NA
INFOOUTLIERS(x,4,45)
## 
##  SORTED OUTLIERS: 1 1 2 2 3 3 46 46 47 47 48 48 49 49 50 50 NA NA || 
##  OUTLIER VALUES: 1 2 3 46 47 48 49 50 NA || 
##  MIN: 1   MAX: 50 || 
##  QUANTITY OF OUTLIERS: 18 ||
OUTLIERSDIF() returns summary values (min, 1stquantile, median, mean, 3rdquantil, max) taking into account all the data, and also, with the values from the range (s,e) removed, This allows us to compare how the summary values of the data have been modified when removing the values from the range (s,e).
OUTLIERSDIF = function(x,s,e){
  x = x[]
  x = x[!is.na(x)]
  y = x[x>s & x<e]
  x1 = min(x)
  x2 = quantile(x,0.25)
  x3 = median(x)
  x4 = mean(x)
  x5 = quantile(x,0.75)
  x6 = max(x)
  y1 = min(y)
  y2 = quantile(y,0.25, na.rm = TRUE)
  y3 = median(y)
  y4 = mean(y)
  y5 = quantile(y,0.75, na.rm = TRUE)
  y6 = max(y)
return(cat("\n", "ALL DATA SUMMARY:", "MIN:", x1, "-", "1stQUANT:", x2 , "-", "MEDIAN:", x3 ,"-", "MEAN:", x4 ,"-", "3rdQUANT", x5 ,"-", "MAX:", x6 , "\n", "\n", "OUTLIERS REMOVED:","MIN:", y1, "-", "1stQUANT:", y2, "-",  "MEDIAN:", y3, "-",  "MEAN:", y4, "-",  "3rdQUANT", y5, "-",  "MAX:", y6))
}

OUTLIERSDIF(x, 7,44)
## 
##  ALL DATA SUMMARY: MIN: 1 - 1stQUANT: 13 - MEDIAN: 25.5 - MEAN: 25.5 - 3rdQUANT 38 - MAX: 50 
##  
##  OUTLIERS REMOVED: MIN: 8 - 1stQUANT: 16.75 - MEDIAN: 25.5 - MEAN: 25.5 - 3rdQUANT 34.25 - MAX: 43
GRAPHOUTLIERS() return a boxplot that compares the summary values when using all the data and when using the data within the range values (s,e). For further information about boxplot values: http://dataworldblog.blogspot.com.es/2017/06/univariate-graphs-part-3-boxplot.html
GRAPHOUTLIERS = function(x,s,e){
  x = x[]
  y = x[x>s & x<e]
  boxplot(x, y, notch = TRUE, main = "GRAPHOUTLIERS", col = c("pink", "orange"), names = c("All data", "Without outliers"))
}

GRAPHOUTLIERS(x,7,44)


Popular posts from this blog

Support Vector Machines (SVM) in R (package 'kernlab')

Support Vector Machines (SVM) learning combines of both the instance-based nearest neighbor algorithm and the linear regression modeling. Support Vector Machines can be imagined as a surface that creates a boundary (hyperplane) between points of data plotted in multidimensional that represents examples and their feature values. Since it is likely that the line that leads to the greatest separation will generalize the best to the future data, SVM involves a search for the Maximum Margin Hyperplane (MMH) that creates the greatest separation between the 2 classes. If the data ara not linearly separable is used a slack variable, which creates a soft margin that allows some points to fall on the incorrect side of the margin. But, in many real-world applications, the relationship between variables are nonlinear. A key featureof the SVMs are their ability to map the problem to a higher dimension space using a process known as the Kernel trick, this involves a process of constructing ne

Initial Data Analysis (infert dataset)

Initial analysis is a very important step that should always be performed prior to analysing the data we are working with. The data we receive most of the time is messy and may contain mistakes that can lead us to wrong conclusions. Here we will use the dataset infert , that is already present in R. To get to know the data is very important to know the background and the meaning of each variable present in the dataset. Since infert is a dataset in R we can get information about the data using the following code: require(datasets) ?infert #gives us important info about the dataset inf <- infert #renamed dataset as 'inf' This gives us the following information: Format 1.Education: 0 = 0-5 years, 1 = 6-11 years, 2 = 12+ years 2.Age: Age in years of case 3.Parity: Count 4.Number of prior induced abortions: 0 = 0, 1 = 1, 2 = 2 or more 5.Case status: 1 = case 0 = control 6.Number of prior spontaneous abortions: 0 = 0, 1 = 1, 2

Ant Colony Optimization (part 2) : Graph optimization using ACO

The Travelling Salesman Problem (TSP) is one of the most famous problems in computer science for studying optimization, the objective is to find a complete route that connects all the nodes of a network, visiting them only once and returning to the starting point while minimizing the total distance of the route. The problem of the traveling agent has an important variation, and this depends on whether the distances between one node and another are symmetric or not, that is, that the distance between A and B is equal to the distance between B and A, since in practice is very unlikely to be so. The number of possible routes in a network is determined by the equation: (𝒏−𝟏)! This means that in a network of 5 nodes the number of probable routes is equal to (5-1)! = 24, and as the number of nodes increases, the number of possible routes grows factorially. In the case that the problem is symmetrical the number of possible routes is reduced to half: ( (𝒏−𝟏)! ) / 𝟐 The complexity o