
Best way to go about normalizing 'spike-like curves' in data by 1cedrake in AskStatistics

[–]multi-mod 0 points (0 children)

The first thing you need to do is collect data at different ambient temperatures. Next, I would try a linear regression with CPU usage as an additional explanatory variable. The coefficients may be a bit messed up since your explanatory variables are collinear, but the predictive ability might still be fine.
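
A minimal sketch of what that could look like in R, using simulated data as a stand-in (the column names here are hypothetical):

set.seed(1)
# Simulated stand-in for the real measurements
readings <- data.frame(ambient = runif(100, 15, 30))
readings$cpu_usage <- 20 + 2 * readings$ambient + rnorm(100, sd = 5)  # collinear with ambient
readings$temp <- 5 + 0.8 * readings$ambient + 0.3 * readings$cpu_usage + rnorm(100)

fit <- lm(temp ~ ambient + cpu_usage, data = readings)
summary(fit)  # individual coefficients may be unstable under collinearity; predictions can still be fine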

How can I correlate my data? by rhxxn in AskStatistics

[–]multi-mod 1 point (0 children)

A linear regression would still work in this case. Just make sure to do your due diligence in verifying that the assumptions of the test are met.
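
In R, the built-in diagnostic plots are a quick way to eyeball those assumptions (the data here is a placeholder):

# Placeholder data; substitute your own variables
d <- data.frame(x = rnorm(50))
d$y <- 2 * d$x + rnorm(50)

fit <- lm(y ~ x, data = d)
par(mfrow = c(2, 2))
plot(fit)  # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage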

What approach should I start with for b? by AndreasKralj in AskStatistics

[–]multi-mod 7 points (0 children)

There are 729 possible combinations, but only one is correct, so the answer is 1/729.

How to explain Wilcoxon and Mannwhitney test? by o-rka in AskStatistics

[–]multi-mod 1 point (0 children)

> Sure, but my point was that since it's not a test for medians, this is no more surprising than getting equal means or equal lower quartiles and a significant p-value.
>
> It's a test for something other than medians, and when you impose conditions that make it one, you also make it a test for any other reasonable location measure (at least the ones that exist for your distribution). There's nothing special about medians when it comes to WMW -- it's no more related to the WMW than means are.

I agree, and this is more of a failure on my part in trying to convey the intention of my answer. I always try to keep the curse of knowledge in mind when answering questions: over time we forget what it was like to not have the knowledge we now take for granted. This results in answers that are technically well written but impenetrable to those curious enough to ask the question in the first place. I try to simplify my answers to a level I think the audience could reasonably understand, but I sometimes bend too far in that direction to the detriment of the answer. Parsimony gone too far.

I definitely welcome a kick in the rear when it does happen. Scientific communication can be a bit tricky at times when you try to explain something to a person outside the field.

How to explain Wilcoxon and Mannwhitney test? by o-rka in AskStatistics

[–]multi-mod 1 point (0 children)

You are correct, my wording could have been a little clearer. What I wanted to convey was that if the shapes are different, you could have identical medians but a significant p-value. That may or may not be interesting depending on what the test is being used to measure. I point this out because people tend to think these tests are panaceas for strangely distributed data, but their incorrect use can lead to erroneous conclusions. I have updated my original post to make this clearer.

How should I average these ratios? by RifRifRif in AskStatistics

[–]multi-mod 1 point (0 children)

Are the two values proportional? For example, for every one-hour increase in useful time, does wasted time decrease by exactly one hour?

How to explain Wilcoxon and Mannwhitney test? by o-rka in AskStatistics

[–]multi-mod 1 point (0 children)

You need to be somewhat careful with these tests. If the shapes of the distributions being compared are similar, they are more or less tests of differences in median/position. On the other hand, if the shapes of the distributions differ, these tests also become sensitive to differences in distribution shape. By this I mean you could have identical medians but still get a significant p-value from the test. This is why it is advisable to first visualize the data with a histogram or density plot to make sure the p-value is answering the question you think it is.

In the first case above, your null hypothesis is that there is no difference in the median/position of the groups. In the second case, the null hypothesis is that the shapes and/or the medians/positions of the distributions are the same.
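
To make this concrete, here is a small R simulation (the distributions are chosen arbitrarily for illustration) where both groups have a median of zero, yet the test rejects because the shapes differ:

set.seed(1)
x <- rexp(2000) - log(2)  # right-skewed, median 0
y <- rnorm(2000)          # symmetric, median 0

median(x); median(y)  # both approximately 0
wilcox.test(x, y)     # tiny p-value despite the equal medians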

*edited for clarity, see the post from /u/efrique below

Question about study design and sampling (scenario included) by shinracorp_ in AskStatistics

[–]multi-mod 1 point (0 children)

First, it's important to define the minimal difference between groups that you would consider biologically relevant. This will help determine the power your study will have at a given sample size. You may find that, for example, your power is only 10%, making the study itself somewhat pointless.
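
For a rough sense of this, base R's power.t.test gives the power of a two-sample t-test at a given sample size and effect size (the numbers below are placeholders):

# Hypothetical example: 15 people per group, smallest biologically
# relevant difference of 0.5 standard deviations
power.t.test(n = 15, delta = 0.5, sd = 1, sig.level = 0.05)
# power comes out around 0.25, far below the usual 0.8 target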

Next, you should consider stratified sampling of groups. By this I mean semi-randomly splitting the groups to ensure equal representation of relevant sub-populations. For example, if by chance you assign all women to one group and all men to the other, you are now contending with a potentially large gender effect on top of your treatment. If stratified correctly, the effects of confounding variables such as age or gender can be appropriately controlled for. This also helps inform sample size: if you find there are 10 relevant subpopulations and only 15 people in each group, it will be difficult to control for certain confounding factors because of the limited group sizes.

Three types of research designs can be used to compare differences.. by sassafrasfly76 in AskStatistics

[–]multi-mod 0 points (0 children)

With an ordinal response variable I wouldn't necessarily start with ANOVA or a t-test; those tests are designed primarily with continuous data in mind.

Ordinal logistic regression or a rank-based test such as Kruskal-Wallis would be a better starting point.
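
A sketch of both options in R (MASS ships with R; the data here is a placeholder):

set.seed(1)
# Placeholder data: an ordinal 1-5 rating compared across three groups
d <- data.frame(
  rating = factor(sample(1:5, 90, replace = TRUE), ordered = TRUE),
  group  = rep(c("A", "B", "C"), each = 30)
)

kruskal.test(as.numeric(rating) ~ group, data = d)  # rank-based comparison

library(MASS)
polr(rating ~ group, data = d)  # proportional odds (ordinal) logistic regression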

Statistical test advice: comparing results for binary categories, with multiple dividing lines by SigmoidSquare in AskStatistics

[–]multi-mod 4 points (0 children)

For what reason are you changing the definition of your data-splitting method? By repeatedly changing it and looking for significance, you inflate the chance of a type I error.

Accuracy and Precision for Non-Normally Distributed Data by nizarghozali in AskStatistics

[–]multi-mod 1 point (0 children)

You'll never have complete certainty of the true value, but there are ways to quantify a range of values that could plausibly contain the true value.

You may be interested in generating a confidence interval for your precision and accuracy measurements. To simplify the meaning of a 95% confidence interval: if you repeated your sampling procedure many times, about 95 out of every 100 confidence intervals generated would contain the true population parameter.

Since it appears your data has a somewhat complicated structure, you should consider generating a confidence interval through bootstrapping. A bootstrap sample is a sample of the same size as your original sample, generated by sampling from your original data with replacement. To get a bootstrapped CI, you generate 10,000+ bootstrap samples, calculate your statistic (like accuracy or precision) for each of them, and then take the 2.5% and 97.5% quantiles of the resulting list of values.
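
A bare-bones version of that in R, using a generic statistic as a stand-in for your accuracy or precision measure:

set.seed(1)
x <- rexp(200)  # stand-in for your data

# Statistic of interest; substitute your accuracy/precision calculation
stat <- function(s) mean(s)

boots <- replicate(10000, stat(sample(x, replace = TRUE)))
quantile(boots, c(0.025, 0.975))  # 95% percentile bootstrap CI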

How can I tell if a point estimate of a difference in means between two groups shows some central tendency? by MNAAAAA in AskStatistics

[–]multi-mod 0 points (0 children)

A question such as this is best answered with Bayesian statistics. Since you are approaching it with a t-test, I highly recommend the R package BEST, which provides a Bayesian equivalent of the t-test. Importantly for you, it will give you a posterior distribution of the difference in means.
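
If you go that route, the basic usage is roughly the following (the two groups here are simulated placeholders; BEST needs the JAGS library installed, and the MCMC run can take a minute):

library(BEST)

# Placeholder data for the two groups being compared
group1 <- rnorm(40, mean = 1.0)
group2 <- rnorm(40, mean = 0.5)

out <- BESTmcmc(group1, group2)  # runs the MCMC
plot(out)  # posterior distribution of the difference in means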

Here is a great stack exchange post discussing this very topic.

How many phones to win HQ? by dimitry88 in AskStatistics

[–]multi-mod 2 points (0 children)

Yea, that's right. There are 3^12 (531,441) possible combinations of answers, with only one of those combinations being correct.
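
Or in R:

3^12      # 531441 possible answer combinations
1 / 3^12  # chance of one random guesser getting all 12 right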

Correct for variability in independent variable in paired t-test (or analogous) by iseekknwldg in AskStatistics

[–]multi-mod 0 points (0 children)

Perform a logistic regression with X and Y as explanatory variables for your A/B groups. The coefficient and p-value you get for Y will then be estimated while holding X constant.
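
In R that would look something like this (placeholder data and names):

set.seed(1)
# Placeholder data: A/B group membership with two explanatory variables
d <- data.frame(group = rbinom(100, 1, 0.5), X = rnorm(100), Y = rnorm(100))

fit <- glm(group ~ X + Y, family = binomial, data = d)
summary(fit)  # the Y row gives its coefficient and p-value with X held constant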

Unsure how to approach my statistics calculations for a personal project. by Amzer97 in AskStatistics

[–]multi-mod 3 points (0 children)

The data type and structure will be different for each category you would want to compare, so unfortunately we can't provide a one-size-fits-all answer to this question. Your best bet would be to come to us with the explicit comparisons you want to do, and then tell us how the data is formatted for each category.

Comparing data from a table vs in data from our clinic? by DamnYellowKnight in AskStatistics

[–]multi-mod 1 point (0 children)

This is fairly easy to do in R.

First, you want to make a variable with your data (there are numerous ways to do this. I'm just showing a quick example):

# Build the 2x2 contingency table. matrix() fills column-wise, so
# c(150, 1000) becomes the "male" column and c(50, 1000) the "female" column.
data <- matrix(c(150, 1000, 50, 1000), nrow=2, ncol=2)
rownames(data) <- c("clinic", "report")
colnames(data) <- c("male", "female")

The contents of your variable now look like the following.

       male female
clinic  150     50
report 1000   1000

You can now run the chi-squared test of independence on that variable with your data.

chisq.test(data)

And you get your results, which intuitively match what you would expect from just looking at the contingency table.

    Pearson's Chi-squared test with Yates' continuity correction

data:  data
X-squared = 44.552, df = 1, p-value = 2.477e-11

Any statistic test to run after logistic regression? by irishrapist in AskStatistics

[–]multi-mod 1 point (0 children)

You may want to consider survival analysis in this case. It sounds more like what you are after.
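
A generic starting point with the survival package (which ships with R), using one of its bundled example datasets:

library(survival)

# lung is an example dataset bundled with the package
fit <- coxph(Surv(time, status) ~ sex, data = lung)
summary(fit)  # hazard ratio for the group effect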

Not sure if I can run analysis on a data set... need advice by medicatedanxiety in AskStatistics

[–]multi-mod 1 point (0 children)

Your best bet would be a mixed-effects linear regression with student as your random effect. I would code each individual test simply as online or not.
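
With the lme4 package, that model would look roughly like this (the data and column names are simulated placeholders):

library(lme4)

set.seed(1)
# Simulated stand-in: 30 students, 6 tests each, alternating online/in-person
tests <- data.frame(
  student = factor(rep(1:30, each = 6)),
  online  = rep(c(TRUE, FALSE), 90)
)
student_effect <- rnorm(30, sd = 5)
tests$score <- 70 + student_effect[as.integer(tests$student)] +
  2 * tests$online + rnorm(180, sd = 3)

fit <- lmer(score ~ online + (1 | student), data = tests)
summary(fit)  # random intercept per student, fixed effect for online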

Median vs. Average: when is each appropriate? by tacocat627 in AskStatistics

[–]multi-mod 1 point (0 children)

No problem, glad I could help.

As a side note, you said that you were unsure what a histogram is. The graph that /u/not_really_redditing provided is a histogram, which is a great way of visualizing continuous data like income.
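
If you ever want to make one yourself, it's a single line in R (the incomes below are simulated, just for illustration):

# Simulated right-skewed incomes
income <- rlnorm(10000, meanlog = log(60000), sdlog = 0.6)
hist(income, breaks = 50, main = "Income distribution")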

In a random draw of 20 numbers, "16" hasn't been drawn for 30 drawings. Odds higher/the same for the next draw? by [deleted] in AskStatistics

[–]multi-mod 1 point (0 children)

> the payoff depends on how many people have those numbers (the pool is shared), and if there's lots of people who believe in the gambler's fallacy (which there are), then you actually will be better off choosing numbers that came up a lot (especially recently), even if the draw is completely fair.

This gave me a chuckle. I never considered that betting against the gambler's fallacy in a lottery where winnings are shared could slightly increase the chance of a higher payout. It's a fun little way to think about it, and I thank you for pointing it out.

Median vs. Average: when is each appropriate? by tacocat627 in AskStatistics

[–]multi-mod 2 points (0 children)

/u/not_really_redditing covered this below. Often when assessing income you have a small part of the population that makes much more than the majority. Because of this, the mean tends to be higher than the median, since the median is less affected by these extreme values. Without access to the raw data this is somewhat speculative, but that would be my best bet.

In any case, the two statistics together tell a more interesting story than either one could convey alone.

If I may speculate, I would hazard a guess that the majority of people make around the median of $73,000, but a right skew in the data, caused by a small number of people making a large amount of money, pushes the mean about $20,000 higher than what most people are making.
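
A quick R illustration of that mechanism, using simulated (not real) incomes:

set.seed(1)
# Right-skewed incomes: most people near the median, a few very high earners
income <- rlnorm(1e5, meanlog = log(73000), sdlog = 0.7)

median(income)  # about 73,000
mean(income)    # about 93,000 -- pulled roughly $20,000 higher by the tail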

I'm confused on something rather 'beginner' level regarding variable / appropriate test for a research design? by [deleted] in AskStatistics

[–]multi-mod 1 point (0 children)

Variable 1 (nationality) and variable 2 (vote intention) should both probably be treated as unordered factors (nominal), since neither has a natural ordering.

If you want to predict vote intention by nationality, you may want to consider a multinomial regression.
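
With the nnet package (which ships with R), a sketch on placeholder data:

library(nnet)

set.seed(1)
# Placeholder data: vote intention across three nationalities
d <- data.frame(
  vote        = factor(sample(c("party_a", "party_b", "party_c"), 300, replace = TRUE)),
  nationality = factor(sample(c("country_x", "country_y", "country_z"), 300, replace = TRUE))
)

fit <- multinom(vote ~ nationality, data = d)
summary(fit)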

Median vs. Average: when is each appropriate? by tacocat627 in AskStatistics

[–]multi-mod 11 points (0 children)

There isn't really a hard-and-fast rule; it depends highly on the structure of the data and the question you want to answer.

If you really want to be careful, there is nothing stopping you from reporting all relevant summary statistics, since they often provide different parts of the story. If you want to be extra safe, generate a graph such as a histogram to let people visualize the data.

Chance of average being below certain value? by pb7090 in AskStatistics

[–]multi-mod 1 point (0 children)

> I have data points (~50 let's say) and want to know from that data what is the chance that an average of 10 of them chosen randomly will be below a certain number.

To get the probability by simulation, you repeatedly sample 10 values from your 50 and record whether their mean is below your threshold. Do this 10,000+ times, and the proportion of samples whose mean falls below the threshold is the approximate probability of the event.
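
In R the whole simulation is a few lines (the data and threshold below are placeholders):

set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)  # stand-in for your ~50 data points
threshold <- 9.5                   # the "certain number"

below <- replicate(10000, mean(sample(x, 10)) < threshold)
mean(below)  # approximate probability the mean of 10 is below the threshold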

> or, 10 new data points created in a similar fashion

This is a bit more difficult, and /u/efrique goes into it a bit in his post.

Confidence intervals using Likert scale responses, does this example make sense? by yavaran in AskStatistics

[–]multi-mod 0 points (0 children)

On second look it appears you are correct. Their wording in the entire paragraph was so odd that it threw me off.