r,group-by,dataframes,dplyr,outliers

Here's a method using base R:

    element <- sample(letters[1:5], 1e4, replace = TRUE)
    value <- rnorm(1e4)
    df <- data.frame(element, value)
    means.without.ols <- tapply(value, element, function(x) {
      mean(x[!(abs(x - median(x)) > 2 * sd(x))])
    })

And using dplyr:

    df1 <- df %>%
      group_by(element) %>%
      filter(!(abs(value - median(value)) > 2 * sd(value))) %>%
      summarise_each(funs(mean), value)

Comparison of results:...
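For comparison, the same group-wise rule (drop values more than 2 SDs from the group median, then average) can be sketched in pandas; the synthetic data mirrors the R example, but none of this comes from the original answer:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "element": rng.choice(list("abcde"), 10_000),
    "value": rng.normal(size=10_000),
})

def trimmed_mean(x):
    # keep only values within 2 standard deviations of the group median
    keep = (x - x.median()).abs() <= 2 * x.std()
    return x[keep].mean()

means_without_outliers = df.groupby("element")["value"].apply(trimmed_mean)
```

The `apply(trimmed_mean)` call plays the role of the `filter` + `summarise_each` chain: each group is filtered and reduced in one pass.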

No, it cannot identify outliers. It is just a preprocessing method, which brings the outliers towards the majority of the data. Excerpt from the Applied Predictive Modeling book: If a model is considered to be sensitive to outliers, one data transformation that can minimize the problem is the spatial sign...
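A minimal numpy sketch of the spatial sign transformation described in the book: center and scale each predictor, then project every sample onto the unit sphere so no single point can dominate by magnitude (the example data is made up):

```python
import numpy as np

def spatial_sign(X):
    # center and scale each column, then divide each row by its
    # Euclidean norm, projecting every sample onto the unit sphere
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / norms

X = np.array([[1.0, 2.0], [2.0, 1.0], [100.0, -50.0]])  # last row is extreme
S = spatial_sign(X)
# every transformed sample now has unit length, so the extreme row
# is pulled toward the rest of the data rather than identified
```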

Here's a replacement for the outliers part. It's about 5x faster for your sample data on my computer.

    >>> pd.DataFrame(np.where(np.abs(df) > df.mean(), 2, df), columns=df.columns)
         a   b
    0  NaN   2
    1    2   3
    2    3  -4
    3    4   5
    4    5   6
    5    0   2
    6  ...

python-2.7,cluster-analysis,hierarchical-clustering,outliers,dbscan

If finding the appropriate value of epsilon is a major problem, the real problem may lie long before that: you may be using the wrong distance measure all along, or you may have a preprocessing problem. Your code looks a lot like a naive preprocessing approach - and that...

Looking inside the mvOutlier function, it looks like it doesn't save the chi-squared values. Right now your test code is treating xcoord as a y-value, and assumes that the actual x value is 1:2. Thankfully the chi-squared value is a fairly simple calculation, as it is rank-based in this case....

Set the weights of those points to zero, then update the model:

    w <- abs(rstudent(lm1)) < 3 & abs(cooks.distance(lm1)) < 4/nrow(lm1$model)
    lm2 <- update(lm1, weights = as.numeric(w))

This is probably a weak approach statistically, but at least the code isn't too hard......

r,data,statistics,analytics,outliers

I believe that "outlier" is a very dangerous and misleading term. In many cases it means a data point which should be excluded from analysis for a specific reason. Such a reason could be that a value is beyond physical boundaries because of a measurement error, but not that "it...

I think this may work. The dropout function loops iteratively to test for outliers. For each element you pass in, it returns 1 if the element is not an outlier; otherwise it returns the p-value (< .05) from the outlier test.

    library(outliers)
    dropout <- function(x) {
      if (length(x) < 2) return...

matlab,plot,statistics,outliers

This is a fairly general problem with lots of approaches; usually you will use some a priori knowledge of the underlying system to make it tractable. For instance, if you expect to see the pattern above - a fast drop, a linear section (up or down), and a fast...

You can use the by function in order to group the dataframe in smaller subsets and subsequently perform function calls on the individual subgroups. During these function calls you can easily remove the outliers from each of the subsets and return the results. Next, you can obtain the resulting dataframe...

algorithm,statistics,median,standard-deviation,outliers

Based on the description you have provided, the problem can be split into two parts: (1) finding and excluding statistical outliers from the data set, and (2) sorting the resulting values in descending (or just in any) order. The general solution to the first problem, with an example using Microsoft Excel, is described at:...
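As a language-agnostic sketch of the two steps, here is a Python version; the specific outlier rule (more than 2 sample standard deviations from the median) is an assumption, not necessarily the one in the linked article:

```python
import statistics

def filtered_sorted(values, k=2.0):
    # step 1: exclude values more than k standard deviations from the median
    med = statistics.median(values)
    sd = statistics.stdev(values)
    kept = [v for v in values if abs(v - med) <= k * sd]
    # step 2: sort the remaining values in descending order
    return sorted(kept, reverse=True)

data = [3, 1, 4, 1, 5, 9, 2, 6, 100]
print(filtered_sorted(data))  # → [9, 6, 5, 4, 3, 2, 1, 1]
```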

How about this code:

    set.seed(1)
    mat <- matrix(rnorm(100), ncol = 10)
    temp <- abs(apply(mat, 1, scale))
    mat[temp > 2]
    ## [1]  1.9803999  0.2670988 -1.2765922

I took 2 standard deviations for your Z limit. First I create a random matrix, then I scale it row by row (the '1' argument of the...

Here's a try. Turn the outliers data frame into a named vector:

    out <- outliers$outlier
    names(out) <- outliers$subject

Then use it as a lookup table to select all the rows of data where the RT column is less than the outlier value for the subject:

    data[data$RT < out[as.character(data$subject)], ]

The...

r,performance,optimization,bigdata,outliers

There are a few ways of optimizing the function, but as your question stands, the operation isn't that slow. Anyway, without resorting to data.table, dplyr, or parallel programming, we can still get a modest speed increase simply by rewriting your function:

    replace_outliers2 = function(x, na.rm = TRUE, ...) {...

So just write a function that directly computes the quantile, then directly applies clipping to each column. The <- conditional assignment inside your lapply call is bogus; you want ifelse to return a vectorized expression for the entire column at once. ifelse is your friend for vectorization.

    # Make up some...
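The same vectorized pattern expressed in numpy terms: compute the quantile once per column, then clip the whole column in a single operation (the 95th-percentile cutoff is an assumption):

```python
import numpy as np

def clip_at_quantile(col, q=0.95):
    # compute the cutoff once for the column...
    cutoff = np.quantile(col, q)
    # ...then clip every element in one vectorized operation
    # (the numpy analogue of a vectorized ifelse)
    return np.minimum(col, cutoff)

col = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])
clipped = clip_at_quantile(col)
```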

You could try:

    df %>%
      filter(lag(value) != lead(value) | (value - lag(value)) %in% c(0, NA))

You might also be interested in the lag and lead functions from dplyr. Edit: thanks @Frank for a couple of modifications...

You could use stat_summary to customize the appearance, e.g.

    ggplot(dfmelt, aes(x = bin, y = value, fill = variable)) +
      stat_summary(geom = "boxplot",
                   fun.data = function(x) setNames(
                     quantile(x, c(0.05, 0.25, 0.5, 0.75, 0.95)),
                     c("ymin", "lower", "middle", "upper", "ymax")),
                   position = "dodge")

...

r,function,outliers,memorization

You just need to update temp for any indices that aren't outliers:

    test <- function(x) {
      temp <- x[1]
      st1 <- numeric(length(x))
      for (i in 2:length(x)) {
        if (!is.na(x[i]) & !is.na(x[i-1]) & abs(x[i] - temp) > 20) {
          st1[i] <- 1
        } else {
          temp <- x[i]
        }
      }
      return(st1)
    }
    myts[,2] <- apply(as.data.frame(myts[,1]), 2, test)...
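The key idea - compare each point against the last accepted value rather than its immediate predecessor - can be sketched in Python (the threshold of 20 follows the R code above; NA handling is omitted):

```python
def flag_jumps(xs, threshold=20):
    # flag points that jump more than `threshold` away from the
    # last accepted (non-outlier) value, not just the previous point
    flags = [0] * len(xs)
    last_good = xs[0]
    for i in range(1, len(xs)):
        if abs(xs[i] - last_good) > threshold:
            flags[i] = 1          # outlier: keep comparing against last_good
        else:
            last_good = xs[i]     # accept this point as the new baseline
    return flags

print(flag_jumps([10, 12, 90, 95, 14, 16]))  # → [0, 0, 1, 1, 0, 0]
```

Without the `last_good` update rule, the point after a run of outliers (14 here) would itself be flagged, because it sits far from its immediate predecessor (95).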

Update: missing values in the input data broke the histogram visualizer. This has been fixed and will work in the next release (missing values will simply be ignored, though; there won't be a separate histogram bar to indicate the number of missing values. Contributions are welcome!) The class label...

sparse-matrix,sparse,outliers,elki

ArrayAdapterDatabaseConnection is designed for dense data. For sparse data, it does not make much sense to first encode it into a dense array, then re-encode it into sparse vectors. Consider reading the data as sparse vectors directly to avoid overhead. The error you are seeing has a different reason, though:...

Not that I would suggest you do this, but you can change the statistical summary used to draw the boxplot, and replace any of the stats with your own statistics. For example, to do as you asked and draw the upper bound of the box at the 0.8 quantile of...

python-2.7,scipy,curve-fitting,outliers,best-fit-curve

You are most probably speaking about recursive regression (which is quite easy in Matlab). For Python, try scipy.optimize.curve_fit. For a simple degree-3 polynomial fit, this would work, based on numpy.polyfit and poly1d:

    import numpy as np
    import matplotlib.pyplot as plt

    points = np.array([(1, 1), (2, 4),...
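A complete, minimal version of such a polynomial fit (the sample points are invented and plotting is omitted):

```python
import numpy as np

# made-up (x, y) sample points, the last one roughly an outlier
points = np.array([(1, 1), (2, 4), (3, 9), (4, 16), (5, 40)], dtype=float)
x, y = points[:, 0], points[:, 1]

# fit a degree-3 polynomial and build a callable from the coefficients
coeffs = np.polyfit(x, y, 3)
poly = np.poly1d(coeffs)

# residuals could then be used to detect and drop outliers before refitting
residuals = y - poly(x)
```

Refitting after dropping large-residual points, and repeating until the fit stabilizes, is the iterative scheme the answer alludes to.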

To remove the outliers, you must set the option outline to FALSE. Let's assume that your data are the following:

    data <- data.frame(a = c(seq(0, 1, 0.1), 3))

Then, you use the boxplot function:

    res <- boxplot(data, outline = FALSE)

In the res object, you have several pieces of information about your data. Among these,...

You're better off using the outlier function directly, to successively remove outliers:

    replaceoutliers <- function(x, threshold) {
      t(apply(x, 1, function(row) {
        exclude <- rep(FALSE, length(row))
        repeat {
          outliers <- outlier(row[!exclude], logical = TRUE)
          exclude[!exclude] <- outliers
          if (sd(row[!exclude]) < threshold) break
        }
        row[exclude] <- mean(row)
        row
      }))
    }

Here, outliers are successively...
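The same loop in Python terms: repeatedly drop the single most extreme value until the standard deviation of what remains falls below a threshold (this mimics, but is not, the outliers package's outlier()):

```python
import statistics

def successively_remove_outliers(row, threshold):
    # repeatedly drop the value farthest from the mean until the
    # standard deviation of the remainder falls below `threshold`
    kept = list(row)
    while len(kept) > 2 and statistics.stdev(kept) >= threshold:
        mean = statistics.fmean(kept)
        kept.remove(max(kept, key=lambda v: abs(v - mean)))
    return kept

print(successively_remove_outliers([1, 2, 3, 2, 50, 2], threshold=2))
# → [1, 2, 3, 2, 2]
```

The R version additionally writes the row mean back into the excluded positions; here the trimmed values are simply returned.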

The ResultWriter is some of the oldest code in ELKI, and needs to be rewritten. It's rather generic - it tries to figure out how to best serialize output as text. If you want some specific format, or a specific subset, the proper way is to write your own ResultHandler....