The Mod Team has decided that it would be nice to put together a list of recommended books, similar to the podcast list.
Please post any books that you have found particularly interesting or helpful for learning during your career. Include the title along with an author or a link.
Welcome to this week's 'Entering & Transitioning' thread!
This thread is a weekly sticky post meant for any questions about getting started, studying, or transitioning into the data science field.
This includes questions around learning and transitioning such as:
We encourage practicing Data Scientists to visit this thread often and sort by new.
You can find the last thread here:
Competition is high and opportunities are scarce at these big companies; for example, I realized Facebook doesn't let you apply to more than 3 jobs in 3 months...
I'm a Product-focused generalist based in London, pretty good at lots of things (statistics, programming, analytics, engineering, getting sh*t done) but with no specialism.
Just started looking for a new job, but it seems pretty brutal out there. I've been outperforming my teammates, many of whom are specialists, ever since I started, but it seems most companies only want specialists (think ML, NLP, CV, etc.) these days and don't see the value in generalists. What gives?
Have most companies already picked off all the low-hanging fruit and are now looking for incremental gains in predictive models? Or, as people have been saying for a while, do they not actually know what they want?
And what does this mean for the future of generalists and Product-focused Data Scientists?
Been reading for a while, and here goes my first post. I understand this is more of a career-specific post, but I hope it will cover questions for others in similar shoes as well.
A bit about me: I'm currently a rising senior at a liberal arts college majoring in CS. International student, no tech internships but a couple of business internships (consulting/finance). Some experience with Python and R, math up to linear algebra. I've taken basic stats courses and done a couple of DS projects for class. No experience with ML.
I am hoping to land a full-time job in DS, but I feel there are limitations at the entry level. A few of my friends who've pursued DS have found themselves on the DS teams of big tech companies, tech startups, and business analytics companies, but most were out of luck.
Given my background and lack of expertise, what should I be preparing to get a DS job, and how? I originally had an internship for this summer, which got cancelled last minute due to company circumstances, so I have a free summer right now.
Any advice will be helpful. Thanks.
I'm wondering how to go about doing this: in my dataframe, I have a column with missing values. How would I run a function that learns from the data in that specific column and fills the gaps with the resulting output? Let's say I'd use KNN and fill the missing data with KNN's output, for both categorical data (simply using the mode seems to over-represent the majority category) and numerical data.
All of this in Python 3 if possible!
Thanks a lot for your help.
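One way to sketch this in Python 3 is with scikit-learn's `KNNImputer`, which fills each gap from the nearest complete rows. The dataframe and column names below are made-up illustrations, not from the original question:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with gaps in two numeric columns (illustrative values).
df = pd.DataFrame({
    "age":    [25, np.nan, 30, 22, 28],
    "income": [40, 42, np.nan, 38, 41],
})

# Each missing cell is filled from the 2 nearest rows (by the other columns).
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

Note that `KNNImputer` only handles numeric input; for categorical columns, one common workaround (an assumption here, not the only approach) is to ordinal-encode them, impute, and then round back to the nearest category.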
It seems to me like pretty much everyone agrees that data cleaning and wrangling represent the lion's share of the work for most data science projects, but it makes up a relatively small percentage of the content I see in forums and blogs.
So, here's a wide open thread on data cleaning and wrangling! Here are a few questions to kick things off:
What are your "must do" items when you get your hands on a big new data set?
What tools do you like for data cleaning?
Do you think of data cleaning and data wrangling (joining, aggregating, etc) as two different steps in your process, or as parts of a single "data preparation" step?
How has your approach to data cleaning and/or wrangling changed with experience? Any epiphanies or illustrative blunders that you'd like to share?
And my favorite:
What do you wonder how other people do?
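To kick off the "must do" question with something concrete, here is a sketch of a typical first pass in pandas (the toy data is purely illustrative):

```python
import pandas as pd

# Stand-in for a freshly loaded dataset.
df = pd.DataFrame({
    "age":  [34.0, None, 29.0, 29.0],
    "city": ["NY", "NY", "LA", "LA"],
})

print(df.shape)                    # how much data is there?
print(df.dtypes)                   # did columns load with sane types?
print(df.isna().sum())             # missing values per column
print(df.duplicated().sum())       # exact duplicate rows
print(df.describe(include="all"))  # quick distribution sanity check
```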
I am going through this course right now (auditing). I was wondering if there is a subreddit for it, or any other place for discussion, since Coursera now asks for money for the assignments (which is totally fair). If so, please let me know.
Hey, I'm trying to segment different socioeconomic groups using clustering based on some features.
After selecting the features to use, I standardized them all (z-score) and created the clusters. The next step is to analyze the clusters based on their means and compare them (right?), but I'm not sure how to analyze their properties based on the z-score.
These are mean features of buildings. My questions are:
1) Would it be correct to average the rows and do the analysis based on that? For example, if the mean of the first row is 2, would it be correct to state that cluster 1 has almost none of those buildings?
2) How much z-score variation would you consider before stating that one type of building appears more than another?
Any advice is welcome, and feel free to ask questions. Thanks!
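One way this pipeline is often sketched (synthetic data here, and KMeans is just one clustering choice): standardize, cluster, then convert each cluster's mean z-scores back to original units, which are usually easier to interpret than raw z-scores.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups of 3-feature rows, clearly separated.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])

# Z-score standardization.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / sigma

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)

for k in range(2):
    # Un-standardize the cluster mean so it reads in original units.
    print(k, Z[labels == k].mean(axis=0) * sigma + mu)
```

A rough rule of thumb: a cluster mean around z = 0 is near the overall average, while |z| around 1 or more means that feature is notably high or low for that cluster; the back-transformed means make the comparison concrete.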
I'm tasked with building a baseline for credit repayments. Predicting exactly how much each segment of consumers will pay based on historical data so that we can calculate an uplift.
The current method is as follows: split consumers into segments, average their percent liquidation for each month of payment and assignment, and that is what you would expect to continue collecting with no interference.
I've been trying to wrap my head around how this method could be improved, if at all. When I do a 70/30 train/test split, I get an MSE of 35% on liquidation, which is quite large (liquidation ranges 0-100%, and the average consumer liquidates somewhere around 50). I have a hard time understanding how reliable this could be if it is that inaccurate.
However, given that I am predicting liquidation for tens of thousands of people, I'm guessing the errors will average out to near 0? Or am I missing something?
The company has built a very successful business, so obviously they are doing something right. I'm just lost as to how I can improve the forecast, if that is at all possible. I'm testing regressions, decision trees, and random forests to inform my segmentation decisions, but I'm still getting a large MSE of ~30% when I try to predict my test sets.
Do I just accept that financial analysis will always have large errors? I'm kind of stuck in my understanding here.
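For context, the segment-average baseline described above can be sketched like this (synthetic data and made-up column names; the real segmentation is presumably richer). Reporting RMSE alongside MSE helps, since RMSE is on the same 0-100 scale as liquidation:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 1000 consumers in 5 segments, liquidation 0-100.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "segment": rng.integers(0, 5, n),
    "liquidation": rng.normal(50, 20, n).clip(0, 100),
})

# 70/30 train/test split.
train = df.sample(frac=0.7, random_state=0)
test = df.drop(train.index)

# Baseline: predict each segment's training-set mean liquidation.
seg_means = train.groupby("segment")["liquidation"].mean()
pred = test["segment"].map(seg_means)

mse = ((test["liquidation"] - pred) ** 2).mean()
print("MSE:", round(mse, 1), "RMSE:", round(mse ** 0.5, 1))
```

On the "average out" intuition: per-person errors don't shrink the per-person MSE, but independent errors do partly cancel when summed over a large portfolio, which is one reason an aggregate forecast can be usable even when individual predictions are noisy.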
Hi all! I was offered a contract job opportunity from a company; they asked how much I will charge and I have no idea. I studied engineering, and I’m currently studying a master’s in data science at one of the top 10 universities in the world. The job will essentially be to apply machine learning and other techniques to improve an algorithm they developed, assess the data they have gathered with the algorithm to see what information can be extracted, and oversee the feedback capture process from clients who have used the algorithm. I am based in London. How much do you think I could charge? Thanks!
I am currently working on a project for a small local company. I have created several scripts that take some data they create mostly by hand as input and do various transformations on it using pandas. Many of the scripts also use Selenium browser automation. The scripts are only run every so often, as needed.
Currently they interface with the scripts locally using Jupyter notebooks with widget controls. However, I know this approach is inelegant and I would ultimately like for them to be able to use the scripts on their own through a simple UI without needing to install a development environment on their local machines.
What approach would you recommend for packaging these scripts as a simple application? Ideally not requiring any local install.
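One stdlib-only way to sketch the "simple UI, no local install" idea: host the scripts on a single machine and serve a one-button web page, so users only need a browser. The `run_report` function below is a placeholder standing in for the existing pandas/Selenium pipeline:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def run_report():
    # Placeholder for the existing pandas/Selenium logic.
    return "report complete"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve a minimal page with a single "Run report" button.
        body = "<form method='post'><button>Run report</button></form>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

    def do_POST(self):
        # Run the pipeline and show its result.
        self.send_response(200)
        self.end_headers()
        self.wfile.write(run_report().encode())

# To serve locally on port 8000:
# HTTPServer(("", 8000), Handler).serve_forever()
```

Third-party tools like Streamlit or Flask make this considerably nicer, but the shape is the same: one hosted entry point wrapping the scripts, nothing installed on the users' machines.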
I have heard that many people have an inaccurate impression of what data scientists do in their everyday work... that they spend only about 5% of their time doing math. Is that true?
I just made this rough draft real quick; I usually use spreadsheets for drafts of ideas before coding stuff. If I like it and end up using it, I might code it. I did it in Apple's Numbers because A) I shamelessly love the simplicity and elegance of the app, and B) it allows for checkboxes and star ratings, which Excel doesn’t. If you don’t have a Mac, you can use it free on iCloud.com.
But anyway, it runs a cost-benefit analysis on your tasks and then you get a nice weighted column of what you should work on first. I mean, you’d have to tweak it. The formula I used was Importance of Task ÷ (Duration in hours × Value of your time) + Cost. You have to figure out what your time is worth. I think your hourly rate at your job is a good number to use. In other words, if you average $50/hour, you say, “If I weren’t doing this task that takes 1 hour, I could be making $50.”
This is also a good way to figure out whether you should do a job yourself or outsource it. If it takes you an hour to cut your lawn but you can get a neighborhood kid to do it for $20, it’s a better use of your resources to outsource it.
Anyway, lemme know what you think and if anything should be added.
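The scoring idea above can be sketched in a few lines. One assumption up front: the formula's parenthesization is read here as importance divided by *total* cost (time cost plus direct cost), which makes the lawn example come out the right way; the task names and numbers are made up:

```python
def task_score(importance, duration_hours, hourly_rate, cost=0.0):
    # Total cost of doing the task: your time at your hourly rate,
    # plus any direct out-of-pocket cost.
    total_cost = duration_hours * hourly_rate + cost
    return importance / total_cost

# Lawn example from the post: DIY for an hour at $50/hr,
# versus paying a neighborhood kid $20 and spending no time.
diy = task_score(importance=3, duration_hours=1.0, hourly_rate=50.0)
outsource = task_score(importance=3, duration_hours=0.0,
                       hourly_rate=50.0, cost=20.0)
print(diy, outsource)  # outsourcing scores higher, so it wins
```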
Does anyone know of any anonymized, open EMR/EHR data available for researchers? Everything I've found online is either restricted to an institution's own employees or doesn't contain all the data one would expect to find.
I'm starting a blog where I go through (painstakingly) each step of various data science projects. Since there seems to be a lot of interest in what data scientists actually do, is that something that this sub would be interested in? I'm asking now because I genuinely don't want to blog spam you guys.
Hello, I will be more specific about the question here:
I am familiar with using R and packages such as tidyr, dplyr, and ggplot2, and with tasks such as filtering through data and merging data sets.
That being said, what sort of work would I be able to find with this? I have no knowledge of statistics or the math needed for it. I only know how to work with data sets and graph them, not really analyze them all that well.
Thanks for any advice!
Hi everyone, long time lurker, first-time poster.
I'm a blogger in the personal finance community and I've been teaching myself some data science techniques. I finally learned how to scrape Twitter and collect a decent amount of data to run analysis on. I used R to scrape the tweets and Tableau to analyze the data.
I posted my work here: https://www.msolife.com/how-firey-is-your-city/
Do you have any comments or suggestions? Since I'm teaching myself, there's still a lot I don't know about. I'm excited to keep learning!
We currently do our reporting in Excel, and I am trying to modernize/automate as much as I can. Our data is now in an online database (it used to be in Excel), and I can pull, clean, and tab the data in R. I'd rather write something in R to do all the calculations for our monthly reports than have the calculations done in Excel (which is what we are currently doing).
Other than creating a variable for each item, is there a way to save multiple cross tabs into one sheet in Excel?
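In R, openxlsx's `writeData` has a `startRow` argument that lets you stack several tables on one sheet; the same idea sketched in pandas looks like this (the data and column names are illustrative, and `to_excel` needs the openpyxl package for .xlsx output):

```python
import pandas as pd

# Toy data standing in for the monthly report pull.
df = pd.DataFrame({
    "region":  ["N", "N", "S", "S"],
    "product": ["A", "B", "A", "B"],
    "channel": ["web", "store", "web", "store"],
})

tabs = [
    pd.crosstab(df["region"], df["product"]),
    pd.crosstab(df["region"], df["channel"]),
]

# Write each cross tab to the same sheet, advancing the start row.
with pd.ExcelWriter("monthly_report.xlsx") as writer:
    row = 0
    for tab in tabs:
        tab.to_excel(writer, sheet_name="report", startrow=row)
        row += len(tab) + 3  # leave blank rows between tables
```

The key is the one running `row` counter instead of a variable per table, which is what makes the loop scale to any number of cross tabs.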