12
Stickied post

The last year has really taken off in terms of opportunities for me, and I've been a less-than-ideal mod. I've had the survey results sitting in an R script since last summer but never got around to finishing the analysis.

What's the community's interest in finishing the analysis? It wasn't the greatest survey, but there are definitely some interesting results. I can either post the data here publicly or PM it to whoever is interested.

EDIT:

Data: https://ufile.io/09qpl

Codebook: https://ufile.io/wxs7e

Let me know if those links work. I wanted to be able to upload anonymously without linking to my Google Drive or Dropbox.

12
23 comments
1

Machine learning is one of the most searched keywords on any search engine right now, and the reason is clear: the benefits of applying it reach into nearly every industry. Machine learning means having computers learn from data to find patterns and generate business insights. In e-commerce it is even more relevant because of the volume of digitally generated, user-specific data points. We read daily about big companies using machine learning in their business decisions, and with current technology it is accessible to any small or medium enterprise as well. Yet thousands of companies are still not capturing the value machine learning can generate. Below, we briefly discuss the most useful applications of machine learning in e-commerce.


  • User churn prediction: Using customers' historical transaction data and other behavioural traits, machine learning can predict each customer's probability of churning. Knowing which specific customers are about to churn lets you engage them at the right time and reduce churn; machine learning plays a pivotal role here.
  • Recommendation engine: Up-selling and cross-selling driven by machine-learning basket analysis can boost revenue. Everyone knows Amazon's product recommendations; one report put the share of Amazon's revenue that comes from recommendations at 27%. That figure alone suggests how powerful a recommender engine can be.
  • Customer lifetime value vs. customer acquisition cost: Understanding customer LTV is crucial for any business. Using RFM features (recency, frequency, and monetary value), machine learning can estimate customer LTV and support strategic decisions about acquisition channels and acquisition costs.
  • Customer segmentation: Statistical segmentation groups users into specific types so you can better understand your customer base: which types of users are most profitable, and who buys the most. Answers like these create a solid foundation for strategic business decisions (see the sketch after this list).
  • Marketing campaign optimisation: Every marketing campaign has a cost. To manage the marketing budget well, you need to analyse which campaigns are doing well and why; machine learning works well for figuring this out.
  • Spatial analytics: Matching supply to demand across space and time can be very productive in any business. Machine learning can predict demand and supply so the business can act to close the gap.
  • Product inventory optimisation: Another use case is inventory management: with demand prediction, a business can run lean and reduce storage and waiting costs for its products.
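
As a minimal sketch of the customer-segmentation idea above, here is an RFM clustering in base R with made-up values; the column names and the choice of 4 clusters are illustrative assumptions, not a prescription.

```r
# Hypothetical RFM table: one row per customer (all values made up)
set.seed(42)
rfm <- data.frame(
  recency_days = rexp(500, rate = 1/60),      # days since last order
  frequency    = rpois(500, lambda = 4) + 1,  # number of orders
  monetary     = rlnorm(500, meanlog = 4)     # total spend
)

# Standardise so no single dimension dominates the distance metric
rfm_scaled <- scale(rfm)

# k-means with an assumed k = 4 segments; in practice k would be chosen
# with an elbow plot or silhouette scores
seg <- kmeans(rfm_scaled, centers = 4, nstart = 25)

# Average profile of each segment, e.g. to spot "high value, lapsing" customers
aggregate(rfm, by = list(segment = seg$cluster), FUN = mean)
```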

The above are key areas where any e-commerce firm can make better business decisions using machine learning. In addition, fraud detection, customer service, voice analytics, web page and content selection analytics, image recognition, and much more can help managers make better business decisions.

This post was originally published on the official DataToBiz blog.

1
comment
9

General question: in a population of n people, a proportion p have a certain condition. I sample s of them at random. What is the chance that the majority of my sample matches the majority of the population as a whole?

Specific example: a country has 1,000,000 people. 600,000 of them prefer Whipplescrumptious Fudgemallow Delight, and 400,000 prefer Nutty Crunch Surprise. I call 5 random people from the phone book and ask them their favourite Wonka bar. Most of the time, the majority of my five people (three or four or five of them), will prefer Whipplescrumptious Fudgemallow Delight. In this example, n = 1,000,000, p = 0.6, s = 5.

I have made some progress on this, but am looking for a general formula for any values of n, p, and s.

In the example, I was able to calculate there is a 68.26% chance my sample will reflect the majority. We can get this by adding:

C(s,0) * (p^(s-0)) * ((1-p)^0)    #this is the chance all 5 people in my sample will prefer Whipplescrumptious Fudgemallow Delight. This reduces to just p^s

+ C(s,1) * (p^(s-1)) * ((1-p)^1)  #this is the chance 4 will

+ C(s,2) * (p^(s-2)) * ((1-p)^2)  #this is the chance 3 will

where C(s,k) = s!/(k!(s-k)!) is the binomial coefficient ("s choose k"). This version stops after three terms because 3 constitutes a majority of 5. I just need to generalise this to any s.
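
A quick way to check the general formula numerically, assuming sampling with replacement (or n large relative to s, as in the 1,000,000-person example) so the binomial applies; for small populations a hypergeometric could be swapped in.

```r
# Probability that a majority of a sample of size s agrees with the
# population majority, when a proportion p holds the majority view.
# Assumes s is odd so "majority" is unambiguous.
majority_match <- function(p, s) {
  k <- ceiling(s / 2)                    # smallest count that is a majority
  sum(dbinom(k:s, size = s, prob = p))   # P(at least k of s prefer the majority option)
  # equivalently: 1 - pbinom(k - 1, s, p)
}

majority_match(p = 0.6, s = 5)   # 0.68256, matching the 68.26% above
```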

9
6 comments
2

I don't understand this point:

In probability theory and statistics, given a stochastic process X = (X_t), the autocovariance is a function that gives the covariance of the process with itself at pairs of time points. With the usual notation E for the expectation operator, if the process has the mean function μ_t = E[X_t], then the autocovariance is given by

K_XX(t, s) = Cov(X_t, X_s) = E[(X_t − μ_t)(X_s − μ_s)]

where t and s are two time periods or moments in time.

Is the autocovariance a measurement of HOW MUCH a variable changes from its mean over a given time period? If so, why does Wikipedia use two time measurements (t and s)?

Wikipedia says "where t and s are two time periods or moments in time." What does this mean? Could someone give a real-life example of a stochastic process, and explain how I can build something in R or Excel to measure how far the variable is from its average? Thank you.
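
A small base-R illustration of the t-and-s idea: simulate a stochastic process (an AR(1) here, purely as an example) and look at the covariance between the series and a lagged copy of itself; `acf(..., type = "covariance")` does the same thing for all lags at once.

```r
set.seed(1)
# Example stochastic process: AR(1), x_t = 0.7 * x_{t-1} + noise
x <- arima.sim(model = list(ar = 0.7), n = 1000)

# Autocovariance at lag 3, i.e. Cov(X_t, X_s) with s = t - 3:
# average of (x_t - mean)(x_{t-3} - mean) over the series
n <- length(x)
lag <- 3
mu <- mean(x)
sum((x[(lag + 1):n] - mu) * (x[1:(n - lag)] - mu)) / n

# Same quantity for lags 0..20 in one call
acf(x, lag.max = 20, type = "covariance")
```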

2
2 comments
11

Main effect of subgroup 1: There was a reduction in rates of post-cesarean endometritis (=outcome) for women undergoing a cesarean (=participants) after being in labor (=subgroup 1) who received vaginal preparation (=intervention), from 11.1% in the control group to 4.7% in the vaginal preparation group (RR 0.41, 95% CI 0.19 to 0.89). No main effect in subgroup 2: There was not a clear difference in post-cesarean endometritis for women who were not in labor (subgroup 2) (RR 1.00, 95% CI 0.35 to 2.84).
However, there were no clear differences between these two subgroups as indicated by the subgroup interaction test (Test for subgroup differences: Chi² = 1.80, df = 1 (P = 0.18), I² = 44.3%).
How do I interpret these results?

11
3 comments
0

I'm working on a question, and I'm pulling out my hair because it doesn't seem to matter what I do, I cannot seem to input the values into my calculator correctly. This is the question data:

n1: 51 n2: 46

X-bar1: 3.6 x-bar2: 2.8

S1: 0.75 S2: 1.2

Formula 1; tdf: (xbar1-xbar2)/

SQRT of (S1^2/n1 + S2^2/n2)

= (3.6 - 2.8)/SQRT of (0.75^2/51 + 1.2^2/46) = 3.8892

This is where I'm having problems:

Formula 2; df: (S1^2/n1 + S2^2/n2)^2/

(S1^2/n1)^2/(n1-1)+(S2^2/n2)^2/(n2-1)

(0.75^2/51 + 1.2^2/46)^2 = 0.0018

(0.75^2/51)^2/(51-1) + (1.2^2/46)^2/(46-1) = ... 0

It doesn't matter how I punch these numbers into my calculator (BAII, which I am required to use in my course), it always returns a result of 0. Punching the formula into my TI-84, exactly the same way returns a result of 4.6105.

I don't know what is happening. Please help :(
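
For a sanity check independent of either calculator, here is the same arithmetic in R with every intermediate term kept visible; with these summary statistics the Welch-Satterthwaite df comes out around 74, which may be worth comparing against what the course solution expects.

```r
# Summary statistics from the question
n1 <- 51; n2 <- 46
xbar1 <- 3.6; xbar2 <- 2.8
s1 <- 0.75; s2 <- 1.2

v1 <- s1^2 / n1          # 0.01103
v2 <- s2^2 / n2          # 0.03130

# Welch t statistic
t_stat <- (xbar1 - xbar2) / sqrt(v1 + v2)                   # about 3.89

# Welch-Satterthwaite degrees of freedom
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))     # about 74

c(t = t_stat, df = df)
```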

0
1 comment
2

I am a veterinary medical student doing a retrospective cohort study using data (more than 8000 test results) given to me by the lab I work at, which tests antibody levels in dogs. The goal of the project is to examine how antibody levels change over time and hopefully help shed light on an appropriate interval for rechecking antibody levels against these viruses. I suspect that age and time since last vaccination for the viruses will influence protection status. I am using Prism and SPSS, but have no mentor for statistics in my department.

1) I have sorted the dogs into groups based on time since vaccination (0-1 years after vaccination, 1-2 years, 2-3 years, etc.) and performed a Pearson's chi squared test to evaluate dogs that were protected versus unprotected for both viruses. My first question is: should I actually be using chi squared test for trend since the data is ordinal? I don't truly understand the difference between Pearson's chi squared test and the chi square test for trend.

2) Based on what I have done so far there was no difference in the time since vaccination groups for one virus (virus 1) but there was a relationship for the other virus (virus 2). For virus 2, there is one particular group that I strongly suspect is responsible for the significance (as when it is removed p becomes no longer significant). What is the best post hoc testing I can use to determine which groups have values that vary significantly from the expected values? Does the best post hoc testing change depending on whether I use Pearson's chi squared test or the chi squared test for trend?

3) I am also using logistic regression to try to assess if age is a confounder and if there is interaction between age and time since vaccination. The results of the test tell me that for virus 1 aging is protective (which makes sense based on the pathology of this virus) and that time since vaccination is a risk factor for being unprotected. Does this contradict what my Pearson's chi square test told me (that there was no difference between expected and observed between groups) and have I made some mistake? Or is it that age and time since vaccination sort of "cancel one another out" because as time passes from vaccination (more at risk for unprotection) the dog also gets older (less at risk for unprotection) so it seems like there is no difference between groups?

4) Finally I want to move beyond protection status and assess the relationship between the quantitative values generated by the antibody tests and time since vaccination. I have plotted the same time since vaccination groups (0-1 years after vaccination, 1-2 years, 2-3 years, etc.) vs the geometric mean of each group and used Spearman's correlation, and there is a very clear relationship for both viruses (exponential decay). I would like to generate an equation of best fit for these data but am not sure what form of regression to use. This is ordinal and continuous data rather than entirely continuous data, so I am feeling a bit confused.
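
For question 1, a minimal sketch of the difference between the two tests, using made-up counts for the time-since-vaccination groups (the numbers below are placeholders, not your data): the trend test treats the groups as ordered and asks whether the proportion protected drifts with that ordering, whereas Pearson's chi-squared ignores the ordering entirely.

```r
# Hypothetical counts per time-since-vaccination group (0-1, 1-2, 2-3, 3-4 yrs)
protected <- c(180, 150, 120, 90)   # dogs testing protected
totals    <- c(200, 180, 160, 140)  # dogs tested in each group

# Pearson's chi-squared: treats the 4 groups as unordered categories
tab <- rbind(protected, unprotected = totals - protected)
chisq.test(tab)

# Chi-squared test for trend (Cochran-Armitage style): uses the ordering 1 < 2 < 3 < 4
prop.trend.test(protected, totals, score = 1:4)
```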

Thank you so much to those who took the time to read this. It will hopefully help protect dogs from getting sick in the future.

2
11 comments
0

I can't get my head around the following concepts: 1. Bayesian inference, 2. Likelihood, 3. Maximum likelihood.
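
Not an answer in itself, but a tiny concrete example sometimes helps pin the words down: for 7 heads in 10 coin flips, the likelihood is a function of the unknown probability p given the fixed data, the maximum-likelihood estimate is the p that makes that data most probable (0.7 here), and Bayesian inference multiplies the same likelihood by a prior and renormalises. A minimal sketch in base R:

```r
# Data: 7 heads in 10 flips
heads <- 7; flips <- 10

# Likelihood of p given the data (binomial), viewed as a function of p
lik <- function(p) dbinom(heads, size = flips, prob = p)

# Maximum likelihood estimate: the p that maximises lik(p)
optimize(lik, interval = c(0, 1), maximum = TRUE)$maximum   # about 0.7

# Bayesian inference with a flat Beta(1, 1) prior: the posterior is
# prior * likelihood renormalised, which here is Beta(8, 4)
curve(dbeta(x, 1 + heads, 1 + flips - heads), from = 0, to = 1,
      xlab = "p", ylab = "posterior density")
```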

0
10 comments
0

Hi, after a linear / non-linear regression, is it necessary to do an analysis of residuals to know whether the model is adequate?

Noob here. Could you please explain why you would do an analysis of residuals, and what purpose it serves?
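
A minimal sketch of what a residual check looks like in R, using the built-in `cars` data; the point of the plots is to see whether the residuals look like unstructured noise (roughly constant spread, no curvature, roughly normal), which is what the usual inference for the model assumes.

```r
# Fit a simple linear regression on a built-in data set
fit <- lm(dist ~ speed, data = cars)

# Residuals vs fitted values: curvature or a funnel shape suggests the
# model form or the constant-variance assumption is off
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of residuals: strong departures from the line
# suggest non-normal errors
qqnorm(resid(fit)); qqline(resid(fit))

# Or let R produce its standard diagnostic plots in one go
par(mfrow = c(2, 2)); plot(fit)
```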

0
3 comments
4

I am a BI developer with 3 years of working experience. I have worked on ETL and reporting. I think the next logical step for advancing my career is data science. I have a fairly good understanding of business, and my goal is to work in a techno-business role in the near future. I do not have any technical certifications, and I think getting certified in statistics will be beneficial no matter what course my career takes. I have an intermediate understanding of R and can work my way through problems. Currently, I am planning to take the Statistics with R Specialization. I just want to know your views on the course syllabus, and I am open to suggestions/alternatives to this certification. Also, feel free to recommend any other technical certification that would be a plus for my goal.

4
4 comments
30

Well I've been job-hunting for almost a year, and still unable to gain a foothold in anything resembling a statistics career - data analyst, business analyst, statistician, data scientist, etc. I am over halfway finished towards my MS in statistics and fearful that I'm wasting my time on this degree. If I have no experience, who will hire me for what I'm really worth? I already went through that with my bachelors degree, and have been trying not to repeat it, but nothing I do is working. I'm not able to take an internship, as I would lose my health insurance (hooray USA). I live near NYC which is a major job hub, but have been rejected twice from positions in NYC (I'm not a Brooklyn-based hipster and don't have that aesthetic that they want).

30
30 comments
9

I'm trying to model the outbreak of a disease. I have data that tells me who got infected and when, and a bunch of interesting covariates (sex, household, etc).

The kind of model I'm fitting is a Reed-Frost epidemic model. The infection can be modeled at different levels -- the simplest is where you have a single parameter q describing the probability of transmission per day, for everyone in the entire community. Or you can have multiple parameters q_1, q_2, etc, for probability of transmission within different (possibly overlapping) subgroups (e.g. households, workplaces, etc).

One of the questions I'm trying to answer is how granular do we need to be to adequately describe the propagation of the disease? Do we need to account for every single subgroup, or will just a few suffice?

I read this article about Hierarchical Bayesian Modelling and it seemed to be relevant to what I'm trying to do. According to this, it seems as though I can postulate all my q's as coming from a common group distribution with some hyperparameters, which would let me evaluate the tradeoff between coarse and fine-grained modeling.

Is this correct? This is the first time I've done anything serious with Bayesian modeling and it's quite overwhelming.
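
Not a fitted hierarchical model, just a simulation sketch of the idea you describe: subgroup-level q's drawn from one common distribution governed by hyperparameters (a Beta here, which is my assumption, with households as the subgroups). Fitting the real thing would go the other way round — put priors on the hyperparameters and learn how much the q's actually vary, which is exactly the coarse-vs-fine tradeoff you mention.

```r
set.seed(7)

# Hyperparameters of the group-level distribution: every household's
# per-contact transmission probability q_h is drawn from the same Beta
a <- 2; b <- 18                    # implies a mean q of about 0.10
n_households <- 50
q_h <- rbeta(n_households, a, b)   # household-specific transmission probabilities

# One Reed-Frost chain for a single household of size n with one initial case:
# each susceptible escapes all I infectives independently, so
# P(infected in a generation) = 1 - (1 - q)^I
simulate_household <- function(q, n = 5) {
  S <- n - 1; I <- 1; total <- 1
  while (I > 0 && S > 0) {
    new_I <- rbinom(1, S, 1 - (1 - q)^I)
    S <- S - new_I; I <- new_I; total <- total + new_I
  }
  total                              # final size of the household outbreak
}

final_sizes <- sapply(q_h, simulate_household)
table(final_sizes)
```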

9
6 comments
1

I'm trying to figure out what to do if I have access to a variable on which I can condition for one case but not the other. Here is the problem setup:

A test comes back telling me my cat has either an anxiety disorder or an allergy (let's say the test has no false positives). Now I'd like to estimate the odds of the two disorders. I know P(allergy) for cats in general, and I know P(anxiety) for cats in general. So at first glance, I could just say P(anxiety|positive) = P(anxiety)/(P(allergy)+P(anxiety)).

But now, let's say I also know P(anxiety|female) and I know it's much lower than P(anxiety). It seems to me I can't just say:

P(anxiety|positive and female) = P(anxiety|female) / (P(anxiety|female) + P(allergy)).

Or can I do that?
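
One way to see the issue with made-up numbers (all hypothetical): the proposed formula mixes a female-specific rate in the numerator with a general-population rate in the denominator. If sex is informative about either disorder, it seems safer to condition both terms on it, which requires knowing P(allergy|female) as well.

```r
# All numbers below are hypothetical, chosen only to illustrate the comparison
p_anx        <- 0.10   # P(anxiety) in cats generally
p_all        <- 0.05   # P(allergy) in cats generally
p_anx_female <- 0.04   # P(anxiety | female)
p_all_female <- 0.08   # P(allergy | female): also needed, assumed here

# Ignoring sex entirely
p_anx / (p_anx + p_all)                          # 0.667

# Mixing a female-specific numerator with a general-population denominator
p_anx_female / (p_anx_female + p_all)            # 0.444

# Conditioning both disorders on sex
p_anx_female / (p_anx_female + p_all_female)     # 0.333
```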

1
2 comments
4

Not sure if this is the right subreddit; sorry if it is not.

So I am on a site that has a clan ranking system. We unfortunately lost the formula when a member quit, and all we have now is old rankings (a few years' worth, updated about twice a month) and the data that was put into them. Is it possible to use this to figure out the formula? If so, how?
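
If the lost formula was (or can be approximated by) a weighted sum of the inputs, ordinary least squares on the old rankings will recover the weights. This is only a sketch under that assumption, with made-up column names standing in for whatever your site recorded; your real data would have many more rows than this toy example.

```r
# Hypothetical data: one row per clan per ranking update, with the inputs
# that fed the formula and the score it produced (placeholder values)
old <- data.frame(
  wins     = c(40, 22, 35, 10, 28),
  losses   = c( 5, 18,  9, 20, 11),
  activity = c(90, 60, 75, 30, 80),
  score    = c(505, 214, 401, 70, 357)   # the ranking value the old formula output
)

# If score = a*wins + b*losses + c*activity + intercept, lm() recovers a, b, c
fit <- lm(score ~ wins + losses + activity, data = old)
coef(fit)

# An R-squared near 1 would suggest a linear formula really does explain the
# old scores; a poor fit would mean the formula involved something nonlinear
summary(fit)$r.squared
```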

Any help is greatly appreciated.

Again sorry if this was posted in the wrong subreddit.

4
3 comments
2

Reading on the Minitab website, I found something I don't understand:

Example of the distribution of weights

The continuous normal distribution can describe the distribution of weight of adult males. For example, you can calculate the probability that a man weighs between 160 and 170 pounds.

https://support.minitab.com/en-us/minitab-express/1/distribution_plot_normal_weight_shade_middle.xml_Graph_cmd1o1.png

Distribution plot of the weight of adult males

The shaded region under the curve in this example represents the range from 160 to 170 pounds. The area of this range is 0.136; therefore, the probability that a randomly selected man weighs between 160 and 170 pounds is 13.6%. The entire area under the curve equals 1.0.

QUESTION: How was it calculated that the entire area under the curve equals 1? And how was the probability of 0.136 (13.6%) for the shaded (red) range calculated?
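
The 0.136 comes from the normal CDF: the shaded area is F(170) − F(160). Minitab's example uses a specific mean and standard deviation for adult male weight (stated with the plot); the values below are assumptions used only to show the mechanics, so they give roughly 0.15 rather than exactly 0.136.

```r
# pnorm() is the normal CDF, P(X <= x); the shaded area is a difference
# of two CDF values. Mean and sd below are assumptions for illustration;
# plug in the mean and sd printed on the Minitab plot to reproduce 0.136.
mu <- 180; sigma <- 20

pnorm(170, mean = mu, sd = sigma) - pnorm(160, mean = mu, sd = sigma)
# about 0.15 with these assumed parameters

# The total area under any density integrates to 1 by definition, e.g.
integrate(dnorm, lower = -Inf, upper = Inf, mean = mu, sd = sigma)$value
# 1
```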

2
1 comment
3

This is a study on the prevalence of various types of anxiety disorders in the population. In the article, Figure 2 has "Cumulative age of onset distribution" on the Y axis. I am unable to comprehend what the Y axis means, and was wondering if someone could explain it to me.

Article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3018839/
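
In case a toy version helps: a cumulative age-of-onset distribution is just the empirical CDF of onset ages among people who eventually develop the disorder — at each age on the x-axis, the y-axis gives the proportion of lifetime cases whose onset had already occurred by that age. The ages below are made up.

```r
# Hypothetical ages of onset for people who eventually develop the disorder
onset_age <- c(6, 9, 11, 13, 13, 15, 17, 19, 22, 24, 28, 31, 35, 42, 50)

# Empirical cumulative distribution: proportion of cases with onset by each age
plot(ecdf(onset_age),
     xlab = "Age", ylab = "Cumulative proportion of cases with onset",
     main = "Toy cumulative age-of-onset distribution")

# e.g. the proportion of cases whose onset occurred by age 20
ecdf(onset_age)(20)
```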

If this is the wrong place to post this, can someone guide me to the correct thread.

Thanks!

3
2 comments
2

https://imgur.com/a/TKr7DyA

Question (part C) and solutions are in above imgur link:

More specifically, I cannot understand why the integral was crafted as such, with limits for x from 1 to x and for y from 0 to 1. Can someone please explain how these limits (and hence the integral) were chosen and perhaps give me some insight as to how to craft the integrals for different probabilities.

2
1 comment
1

I am currently working on a paper that measures the impact of a scholarship program at a private school versus students who do not get the scholarship and have to graduate elsewhere.

My hypothesis is that there is a rather strong relationship between getting the scholarship and graduating.

What kind of association tests can I perform if my data is qualitative? Variables are

STATUS: 1 = got scholarship; 2 = no scholarship
GRADUATED: 1 = yes; 2 = no
ENROLLED IN COLLEGE: 1 = yes; 2 = no
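
With two binary variables like these, the standard association test is a chi-squared test on the 2x2 table (or Fisher's exact test when expected cell counts are small). Not a prescription, just a sketch of how it runs with placeholder counts:

```r
# Placeholder counts: rows = scholarship status, columns = graduated yes/no
tab <- matrix(c(45,  5,    # scholarship:    45 graduated, 5 did not
                30, 20),   # no scholarship: 30 graduated, 20 did not
              nrow = 2, byrow = TRUE,
              dimnames = list(status    = c("scholarship", "no scholarship"),
                              graduated = c("yes", "no")))

chisq.test(tab)     # chi-squared test of association
fisher.test(tab)    # exact alternative if any expected cell count is small

# The same idea works for STATUS vs ENROLLED IN COLLEGE
```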

PS. Edit: I am not trying, by any means, to get homework help. I just need clarification on which test is more appropriate so I can run things myself.

1
1 comment
12

I work in biotech, and thus frequently confront high-dimensional (little n large p) problems in my work.

My go-to tool is generally elastic net logistic regression. I've continued to interpret the output of these models as probabilities in my analyses; however, I'm starting to wonder whether that's really appropriate.

In high-dimensional binary classification problems you always have perfect linear separation, which means that without regularization the coefficients of your linear model shoot off to ±∞. This means that the coefficients I obtain when I regularize are shrunk/offset by an infinite amount.

This, in practice (I'm not sure about theoretically), seems to invalidate what I think is the core property of logistic regression that lets you interpret its output as probabilities (beyond their being contained in the interval [0, 1]): namely, that among samples with predicted output 0 ≤ p ≤ 1, the expected probability of belonging to the non-dummy class is p.
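
One pragmatic way to probe whether the shrunken outputs still behave like probabilities is a calibration check on held-out data: bin the predicted probabilities and compare each bin's mean prediction to the observed event rate. A rough sketch with the glmnet package and simulated data (the simulation and the bin count are arbitrary choices, not a recommended recipe):

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 1000                          # little n, large p
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))   # only 2 informative features

train <- sample(n, 150)
fit <- cv.glmnet(x[train, ], y[train], family = "binomial", alpha = 0.5)

# Predicted probabilities on held-out samples
p_hat <- as.vector(predict(fit, newx = x[-train, ], s = "lambda.min",
                           type = "response"))

# Calibration: within bins of predicted probability, does the observed
# event rate roughly match the mean prediction?
bins <- cut(p_hat, breaks = 4, include.lowest = TRUE)
data.frame(mean_pred = tapply(p_hat, bins, mean),
           obs_rate  = tapply(y[-train], bins, mean))
```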

12
14 comments
11

What do you do when you have hit the jackpot and all 20 measured statistics are significant with a p<0.05?

Do you still divide by 20 and ignore all 20 because they're not significant at the corrected significance level?
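
For what it's worth, Bonferroni compares each p-value to 0.05/20 (equivalently, multiplies the p-values by 20), so anything below 0.0025 still survives; a less conservative option is the Benjamini-Hochberg false discovery rate adjustment. Base R's p.adjust() does both; the p-values below are hypothetical.

```r
# 20 hypothetical p-values, all below 0.05
p <- c(0.001, 0.002, 0.004, 0.008, 0.010, 0.012, 0.015, 0.018, 0.020, 0.022,
       0.025, 0.028, 0.030, 0.033, 0.036, 0.040, 0.042, 0.045, 0.047, 0.049)

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
p.adjust(p, method = "bonferroni")

# Benjamini-Hochberg: controls the false discovery rate instead of the
# family-wise error rate, so fewer results are discarded
p.adjust(p, method = "BH")

# How many remain significant at 0.05 after each correction
sum(p.adjust(p, method = "bonferroni") < 0.05)
sum(p.adjust(p, method = "BH") < 0.05)
```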

11
32 comments
50

Very helpful R code, all in one place, by studywalk

50
6 comments
6

Edited to include scatterplot

Up front: I can give more details if needed, but I think my question is pretty basic: did I accidentally skew my independent variables so badly that I can't use them?

I did a study where I dove along a river collecting mussels. Before we even look at the mussels in the study, I want to make sure I'm using the independent factors correctly. I recorded the depth I dove, and also what type of bottom there was (rock, sand...all standardized into a continuous scale). All of this was done to see if depth, bottom type, and river mile (distance along the river) had any impact on mussels.

What I found is that I unintentionally dove deeper at downstream sites than upstream. It looks like this

This is strange as I did not move in one direction (I dove upstream some weeks, bounced downstream, back to the middle...it was based on logistics). The regression shows an R2 of 0.106, and the analysis of variance shows a significant p value.

So my question is: am I unable to analyze the dependent mussel data (size, weight, %adults, etc) with depth and river mile as separate independent variables?

P.S. an ANOVA examining the effects of depth and river mile on bottom type resulted in a significant interaction.
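
A hedged sketch of how you might quantify the problem rather than abandon the variables: check how strongly depth and river mile are correlated, then see how much the coefficient estimates and standard errors change when both are in the model (variance inflation factors do this formally). The variable names and simulated values below are placeholders for your data.

```r
# Placeholder data standing in for the dive records (columns assumed:
# depth, river_mile, and a mussel response such as shell size)
set.seed(3)
river_mile <- runif(120, 0, 60)
depth      <- 2 + 0.05 * river_mile + rnorm(120, sd = 1)   # mimics the observed drift
size       <- 40 + 0.3 * depth - 0.1 * river_mile + rnorm(120, sd = 4)
mussels    <- data.frame(depth, river_mile, size)

# How collinear are the two predictors?
cor(mussels$depth, mussels$river_mile)

# Coefficients alone vs together; large swings or inflated SEs signal trouble
summary(lm(size ~ depth, data = mussels))$coefficients
summary(lm(size ~ river_mile, data = mussels))$coefficients
both <- lm(size ~ depth + river_mile, data = mussels)
summary(both)$coefficients

# Variance inflation factors (car package); values well above ~5-10 are a warning
# install.packages("car") if needed
library(car)
vif(both)
```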

6
23 comments
0

Is it true more often than not that:

the ith highest number in a distribution divided by the mean of the ith highest numbers < the jth highest number in the same distribution divided by the mean of the jth highest numbers, where i > j.

Is it always true for normal distributions?

0
14 comments
1

I'm presenting a report at work and need somewhere to input all my findings and create bell curves, line graphs, etc. Nothing too fancy, just simple and effective, preferably free or extremely cheap.

1
3 comments