The Mod Team has decided that it would be nice to put together a list of recommended books, similar to the podcast list.
Please post any books that you have found particularly interesting or helpful for learning during your career. Include the title with either an author or link.
Welcome to this week's 'Entering & Transitioning' thread!
This thread is a weekly sticky post meant for any questions about getting started, studying, or transitioning into the data science field.
This includes questions around learning and transitioning such as:
We encourage practicing Data Scientists to visit this thread often and sort by new.
You can find the last thread here:
Hi, if anyone is interested, could I ask for some feedback on my resume? I am currently looking for a data science internship. In particular, I am targeting the data science internship at Microsoft: https://careers.microsoft.com/us/en/job/473873/Intern-opportunities-for-students-Data-Applied-Sciences
I have built my resume based on the job description provided by Microsoft. Here it is: https://imgur.com/a/oPcUjaJ
Any help will be appreciated! Thank you.
Below is the job post for the position I am looking at.
Review and manage structured and unstructured data
Inventory data received and record metadata
Transform data using scripting languages such as SQL, PowerShell, C#, or Python
Perform basic data imports into Microsoft SQL Server
Validate imported data and perform preliminary data manipulation and analysis
International compliance standards
Data privacy laws
I feel like the descriptions are rather broad.
Also, this job position is for an internship.
I learned a bit of Python this past summer and I am reading a book titled Data Science from Scratch. So you could say that I already know a very limited amount of Python and data science, but only in theory. As for SQL, I know what it is and I learned a bit a couple of years ago.
The company is looking for someone educated in Management Information Systems, Computer Science, or something similar. I come from an accounting background and I am strongly interested in "data analyst" type jobs, because I believe nothing will convince employers that I have some proficiency in data science, alongside my accounting background, better than actually having worked as a data analyst. I believe that will give me an advantage when I later look for "financial analyst" type jobs.
Strictly speaking, I don't meet most of the qualifications right now. How long would it take me to gain enough knowledge to apply for a position like the one described above?
I am currently enrolled in a bootcamp (NYCDSA) that is to start in late September. The bootcamp has amazing reviews but is mostly taken by PhD/Master’s grads. I am starting to panic because this is not a cheap decision to make and I can’t afford to not receive a job offer shortly after the program ends. I am wondering if completing a Master’s program in CS or Stats would be a wiser investment? My end goal is to get a job as a data scientist. I would obviously hope to get a job as a data analyst after completion of the bootcamp and then eventually work up to a data scientist. Is a Master’s still more valuable?
I have been playing with this telco churn dataset from Kaggle. It was my first time analyzing churn, and my first time "publishing" a notebook (https://www.kaggle.com/larissaleite/basic-telco-churn-analysis). I have been searching for ways to improve classification on this dataset, but even after some feature engineering, hyperparameter tuning, and cross-validation, the accuracy/ROC/F1-score stay roughly the same. Is there something I am missing? Can somebody give me some more insight on this? I had thought about downsampling the majority class, but decided it wouldn't be worth it since the dataset is not that imbalanced. Any tips or suggestions are highly appreciated!
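One quick sanity check before more tuning is to compare the model against the majority-class baseline; if the gap is small, the features themselves may be close to their limit. A minimal stdlib sketch, with made-up labels at roughly the telco set's ~27% churn rate:

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the majority class.
    If a model barely beats this, extra tuning rarely helps much."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Hypothetical labels, roughly the telco set's class balance
labels = [1] * 27 + [0] * 73
print(baseline_accuracy(labels))  # 0.73
```

If plain accuracy is only a little above this baseline, metrics that account for the minority class (ROC-AUC, F1, or precision/recall per class) tell you more than accuracy does.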
Suppose I have two continuous targets (y_1 & y_2) that domain knowledge suggests should be affected by the same factors. I take a group of mostly numeric features (X), scale the numeric ones between 0-1, and build two linear regression models:
y_1 = X B^(1) + B_0^(1)
y_2 = X B^(2) + B_0^(2)
(each model has its own coefficient vector and intercept)
After checking each model's assumptions:
What valid comparisons can I make of a feature's coefficients in model 1 vs model 2?
How would comparisons be affected by a feature being significant in one model but not the other?
Thanks in advance for any help.
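For what it's worth, since the features are scaled to the same 0-1 range in both models, each coefficient is in "change in y per full range of x" units, so magnitudes can be compared across the two models (always with standard errors in mind). A toy sketch, fitting one feature against two targets via closed-form simple OLS:

```python
def ols_slope_intercept(xs, ys):
    """Closed-form simple OLS: slope = cov(x, y) / var(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var
    return slope, my - slope * mx

# Toy data: the same scaled feature driving two targets
x = [0.0, 0.25, 0.5, 0.75, 1.0]
y1 = [1.0, 1.5, 2.0, 2.5, 3.0]   # slope 2 over the 0-1 range
y2 = [1.0, 2.0, 3.0, 4.0, 5.0]   # slope 4 over the same range
b1, _ = ols_slope_intercept(x, y1)
b2, _ = ols_slope_intercept(x, y2)
print(b1, b2)  # 2.0 4.0
```

On toy data like this, one could say the feature moves y_2 twice as much as y_1 over its full range. With real data, that comparison only means something alongside the coefficients' confidence intervals, and a coefficient that is non-significant in one model shouldn't have its sign or size interpreted there at all.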
This is a bit of a hard question, I tried googling and searching within this subreddit and think I'm not asking the right way.
Are there any standards or tools for describing data sets using easy and consistent metadata and then visualizing, organizing, browsing, and navigating through them?
I work for a firm that has many large datasets managed in various ways, with fairly static metadata described largely using Project Open Data. However, I and many team members create small analytical subsets all the time, for a particular question or join or whatnot. They are typically disconnected, as the analysis is point-in-time. Sometimes they are built into database systems like SQL Server, which has its own metadata and ways to search and describe a dataset. Usually, though, it's SAS, or R, or CSV, or some relatively small file from 100MB to a few gigs. This leads to a lot of difficulty in keeping track of things, collaborating with co-workers, reusing someone else's work, etc.
Some folks compensate by describing the extracts using datapackages or w3c or whatever individuals choose and storing the file where it can be indexed by an internal search engine (usually SOLR).
I'd like to find an easy or OSS tool that can consistently store metadata, point to the source dataset, and allow browsing, searching, and visualizing stats like linkages and user activity. But it should also work at an individual level, to organize personal collections that grow over time.
There are commercial data management products like Collibra, and Socrata has some internal enterprise data products, but they are expensive and require a central install and purchase. I basically want GitHub for data, but I can't host externally.
So I'm attempting to fit and validate a model using the h2o automl function (in R) and running into an issue. Wondering if anyone could help shed some light on this.
automl_models_h2o <- h2o.automl(
  x = x,
  y = y,
  training_frame = train,
  leaderboard_frame = validate,
  max_runtime_secs = 30
)
The prediction is binary (1/0), and the predictor (x) variables are 5 categorical variables, which I'm defining as factors, and 2 numerical variables, which I'm defining as numeric.
My issue is that a separate, simple logistic regression model I'm running is outperforming the output of the h2o model (looking at the confusion matrices and accuracy scores). I think it's because the h2o automl is not playing nice with the numeric variables for some reason...? I've gone through the h2o booklet and found this excerpt:
"To convert an integer into a non-ordered factor (also called an enum or categorical), use as.factor() with the name of the R reference object in parentheses, followed by the number of the column to convert in brackets."
Does h2o automl only work well if you define all of your variables as factors?
Appreciate any insight!
Link to article, it's HBR so has a monthly limit on the number of free articles.
I'm an analyst turned data scientist turned Product Manager turned Manager of a small team of data scientists, all in the tech industry. I've been in the field for a little over 10 years, so what the article said resonates with me a lot.
I've gone from an era where we had to transfer large datasets as CSVs over FTP and run our models on a remote SAS server, to scalable machine learning in the cloud. But the one thing that hasn't changed is the need to translate complex business problems into data problems, and then translate the results back into useful insights to convince the suits to make changes.
I see data scientists fresh out of school who want nothing to do with exploratory data analysis; they want to get straight to "building models". For them, the article has an answer that's phrased much better than I could have managed.
These days a great deal of machine learning and deep learning is being automated, as we learned when we dedicated an episode to automated machine learning, and heard from Randal Olson, lead data scientist at Life Epigenetics. One result of this rapid change is that the vast majority of my guests tell us that the key skills for data scientists are not the abilities to build and use deep-learning infrastructures. Instead they are the abilities to learn on the fly and to communicate well in order to answer business questions, explaining complex results to nontechnical stakeholders. Aspiring data scientists, then, should focus less on techniques than on questions. New techniques come and go, but critical thinking and quantitative, domain-specific skills will remain in demand.
I started as an ML engineer (first job), and I mostly work on my own. My boss had to take a couple months of leave because he just had a kid, so I've been left to my own devices. I have a CS BS with an AI concentration, so I feel quite limited in what I can do, even though I think I understand the basics well. The past few weeks have been good, and I've been making steady progress in learning about gcloud services and frameworks I didn't use in school. However, I'm at the stage where I need to improve my model, and I feel like 1) I'm blowing money on experiments that haven't amounted to anything, 2) I'm running out of ideas to test, and 3) I'm facing somewhat high expectations from my coworkers in other teams.
I was thinking about this yesterday, but I think I'm not being methodical enough in how I'm setting up my experiments. I've done a research internship before where my supervisor would sit down, do analysis with me, and plan out experiments for me to try, and I'm realizing how much I wish I had that now. My past week has been learning how to use ML Engine and running maybe 20-30 experiments trying different parameters, models, and data distributions to little avail. I feel like I'm sitting in a control room with thousands of buttons that I don't recognize or understand. My plan today is to write out an experiment schedule and take notes on everything. My question is, how do you develop models in a productive way?
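One low-tech way to be more methodical is to keep a single experiment log, one row per run, so results stay comparable instead of living in scattered terminal output. A minimal sketch (the column names and values here are made up):

```python
import csv
import io

# One row per experiment: what changed, and what it scored.
# io.StringIO stands in for a real log file on disk.
log = io.StringIO()
writer = csv.DictWriter(log, fieldnames=["run", "model", "lr", "val_auc"])
writer.writeheader()
writer.writerow({"run": 1, "model": "xgb", "lr": 0.1, "val_auc": 0.81})
writer.writerow({"run": 2, "model": "xgb", "lr": 0.01, "val_auc": 0.84})
print(log.getvalue().splitlines()[0])  # run,model,lr,val_auc
```

The discipline matters more than the tooling: change one thing per run, record it before looking at the result, and review the log before queueing the next batch, so each experiment tests a hypothesis rather than a hunch.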
I am doing an MS in analytics. I have done a good amount of projects on ML. While applying for data science jobs, I noticed that almost all of the roles require SQL. I have done a MOOC on SQL, but I am missing a project from my resume, which I believe is important to showcase my SQL skills during an interview. I can't seem to find a good project idea, or figure out what kind of analysis I could do with one. Any project ideas that might help me fill the gap?
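One portfolio option is to load any public CSV (e.g. a Kaggle sales or transactions set) into SQLite and build the analysis around joins and aggregations. A self-contained sketch, with made-up table and column names:

```python
import sqlite3

# In-memory database; a project version would load a real CSV instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 120.0), ("east", 90.0), ("west", 200.0)],
)

# The kind of question a SQL project can answer: revenue by region.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM orders "
    "GROUP BY region ORDER BY total DESC"
).fetchall()
print(rows)  # [('east', 210.0), ('west', 200.0)]
```

Scaling this up (multiple tables, joins, window functions, a short write-up of the findings) makes a credible interview artifact without any infrastructure beyond SQLite.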
I am trying to learn Bayesian statistics and wanted to understand the difference as well. If anyone could point me to a good resource, that would be helpful.
We're looking to build out an in-house data-curation team to create labelled data for our ML models, and are a bit stuck on what job title to advertise for. I want it to be accurate but also relatable. Possible ideas:
I'm thinking of steering away from "data entry" since the work is a little different (and will possibly require more training) than what people might think of with these jobs ... a previous attempt at hiring from the "data entry" tag on Upwork was a major fail...
I’m not sure if this is the right sub for this so please suggest somewhere else to post if not.
I have a set of data for 10 people who took the same test of 10 questions twice. Each question has a score (let's say out of 100, for example). They took the test before and after an event, so the idea is to see how the event affected their answers. Each question is significantly different, so there is no sense displaying each person's total score for the whole test.
Can anyone suggest some good ways to visualise this? I'm thinking of a plot for each person showing their scores for each question, or maybe a plot showing average scores for each question... but this could quickly turn into many plots, or something complex and hard to understand.
The data is currently sitting in Excel, but I'm fairly comfortable with Python and pretty good with Java/JavaFX if there are better ways to use those.
Many thanks in advance.
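One option that keeps the plot count down is a slope chart: mean score per question before vs after, one line per question. A stdlib sketch of the reshaping step, with randomly generated scores standing in for the real data:

```python
import random

# Hypothetical scores: scores[person][question], for the before
# and after runs of the 10-question test taken by 10 people.
random.seed(0)
before = [[random.randint(40, 90) for _ in range(10)] for _ in range(10)]
after = [[random.randint(50, 100) for _ in range(10)] for _ in range(10)]

def question_means(scores):
    """Mean score per question, averaged over people."""
    n_people = len(scores)
    return [sum(person[q] for person in scores) / n_people
            for q in range(len(scores[0]))]

# One number per question: how much the mean moved after the event.
deltas = [a - b for b, a in zip(question_means(before),
                                question_means(after))]
print(len(deltas))  # one mean change per question -> 10
```

Those two lists of per-question means are exactly what a slope chart (or a paired bar chart) needs, and matplotlib or pandas can draw it directly from them; per-person plots can then be kept as an appendix rather than the main view.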
I lurk on this subreddit and find most of the posts helpful. I have saved all the interesting posts for when I'm ready to do data science. Thank you for all of that. It's a wonderful community.
I graduated with a Computer Science and Engineering degree in 2015, and due to the soul-crushing pedagogy of my relatively poor college, I switched to art. I made short films, wrote poetry, and did everything to avoid regular employment. Sometime around the turn of this year, I realized I had actually loved computer science before college and thought of giving it another chance. I applied to German unis and got summarily rejected.
Now I'm self-learning, and my progress has been okay. I haven't got to the machine learning part yet, but I have been doing EDA and visualization in R.
My concern is that by the time I start implementing ML, the market will have moved on to something else. We can already see SPSS and other packages offering point-and-click sales prediction and the like.
Is it worth it, objectively? This is not an existential crisis, just an honest question about whether I should consider academia instead, since that's the only field where machines and software can't automate away the humans.
We always hear about all the things beginner data scientists are missing, but we never hear about this angle.
EDIT: Maybe I asked the question wrong, but it seems to have devolved into another generic "what do new data folks not understand that experienced folks do?" thread. Although some folks seem to find it useful, it has been done to death.
Background: I live in New Orleans. For those unfamiliar, it is more like a northern Caribbean city than it is southern U.S. Wild shit happens here daily that would break a normal city's news cycle for weeks.
ANYWAY, among the many things we offer, we are known for supremely horrible road conditions. New Orleans' potholes have almost won multiple naming competitions for sports teams (look it up).
Hypothesis: Because of this, I decided to explore the geography of our worst roads compared to our best. I have a theory that the best roads, and those that are fixed the fastest, happen to line the wealthiest boroughs, while those in lower-income areas are neglected longer.
HOW I NEED HELP: I have found useful data sources on this (http://www.city-data.com/zipmaps/New-Orleans-Louisiana.html; https://roadwork.nola.gov/home/) but need some help in how to best incorporate them for an interactive analysis.
TL;DR: I'd like to test whether, and to what degree, the wealth of an area impacts how likely road projects are 1) to be planned, 2) worked on, and 3) completed (as well as the average time to completion by ZIP code). The finished product would ideally be an interactive graph overlaying ZIP-code wealth and roads highlighted for work. Open to collaboration as a noob. PLZ HELPPP
My background: I have a methodological background (PhD student in epidemiology and biostatistics), am very knowledgeable about R and importing/cleaning/managing data, but not so much with mapping or GIS.
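Before getting into GIS at all, the core hypothesis can be tested with a flat join keyed by ZIP code: income on one side, road-project counts on the other, then a correlation. A stdlib sketch with made-up numbers (the real figures would come from the two sources above):

```python
# Hypothetical values keyed by ZIP code; the real ones would come
# from city-data.com (income) and roadwork.nola.gov (projects).
income_by_zip = {"70112": 25000, "70115": 61000, "70118": 52000}
completed_projects_by_zip = {"70112": 3, "70115": 11, "70118": 8}

def pearson(xs, ys):
    """Pearson correlation: cov(x, y) / (sd(x) * sd(y))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

zips = sorted(income_by_zip)
r = pearson([income_by_zip[z] for z in zips],
            [completed_projects_by_zip[z] for z in zips])
print(round(r, 2))
```

The same joined table, once built from the real data in R, is also what a mapping layer (e.g. leaflet with ZIP-code polygons) would sit on top of for the interactive version.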
I'm working on a project where the goal is to automatically update a tableau dashboard hosted on tableau server.
Currently, reports are created 1am every night and are stored on an AWS server. I want to streamline a process to pull the data from the server, scrub it with R, and upload it to the proper tableau dashboard.
My idea was to create a small Linux server using a Raspberry Pi and have it run an initial script to pull the data, then a script to scrub it, store both the raw and clean data with a date stamp, and write a generic "Current Data" file that Tableau will read from and refresh. Is that even possible, before I go down this rabbit hole?
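The archiving step itself is straightforward in a short script. A sketch of the "dated copy plus fixed Current Data file" idea (paths and filenames are placeholders, and a real version would pull the report down from AWS first):

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path

# A temp directory stands in for wherever the Pi stores reports.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "report.csv"
raw.write_text("id,value\n1,10\n")  # placeholder for the pulled report

# Keep a date-stamped archive copy, and overwrite a fixed-name file
# that Tableau's data source always points at.
stamped = workdir / f"report_{date.today():%Y-%m-%d}.csv"
shutil.copy(raw, stamped)
shutil.copy(raw, workdir / "current_data.csv")
print(sorted(p.name for p in workdir.iterdir()))
```

Run nightly from cron after the 1am report lands, with the scrubbing step (R via Rscript, or Python) between the pull and the copies, this is well within what a Raspberry Pi can handle for files of modest size.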
Have you encountered a data product that you find useful for your work, play or that you just admire? Google search comes to mind as the epitome of putting data to use but I’m also thinking about Ahrefs for marketing/SEO and Datadog for stack insights. What are some other products or categories of products you’ve seen and been impressed by?
I just saw these videos from a conference the guy from Data School made and I really liked the format, looks like a great way to learn how to think like a data scientist.
Just to be clear, I am not looking for tutorials: I like the format where someone has to analyze a dataset and explains all the steps they follow. Any ideas on how I can find more material like this?
I am planning to do a demo of reporting capabilities for an enterprise customer, but they are not willing to load their internal data into our platform. However, they will allow me to join them onsite with their DBA.
Is there a way to run data discovery/profiling on their Postgres 10 instance within a particular schema, and then use different tools to generate fake data that matches those data profiles? With that, we can load the fake data into our reporting platform and address their security concerns.
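In its simplest form, profiling plus synthesis is two small steps: record each column's range or distinct values, then sample fake rows from that profile. A stdlib sketch with made-up rows (in practice the profile would come from MIN/MAX and SELECT DISTINCT queries against their schema):

```python
import random

# Stand-in for rows sampled from the customer's database.
real_rows = [
    {"age": 34, "plan": "basic"},
    {"age": 51, "plan": "pro"},
    {"age": 29, "plan": "basic"},
]

def profile(rows):
    """Record a min/max range for numeric columns and the set of
    distinct values for everything else."""
    prof = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        if all(isinstance(v, (int, float)) for v in values):
            prof[col] = ("numeric", min(values), max(values))
        else:
            prof[col] = ("categorical", sorted(set(values)))
    return prof

def synthesize(prof, n, seed=0):
    """Sample n fake rows that respect the recorded profile."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        row = {}
        for col, p in prof.items():
            if p[0] == "numeric":
                row[col] = rng.randint(p[1], p[2])
            else:
                row[col] = rng.choice(p[1])
        out.append(row)
    return out

fake = synthesize(profile(real_rows), 5)
print(len(fake))  # 5
```

A production version would add value distributions, null rates, and cross-column relationships, but even this shape of output (valid ranges, valid categories, zero real values) is often enough to satisfy a security review for a demo.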
I just did my first data science internship and don't really have an idea about the industry other than the company I interned with.
So I wanted to know if there are companies that are generally more desirable to work at (Example, in compsci it would be Google, Microsoft, etc) and if those don't really exist in the data science industry then how should I decide which company to work at?
I just started using RMarkdown (have been working with R for a year now, so familiar with basic code writing in R); it's really handy but my documents don't look nice and structured. For example:
- How can you change the color/thickness of line page breaks?
- How can you change the color of certain pieces of text (to emphasize a word/sentence)?
Any other tips are also welcome :)
Hello, I'm starting grad school in the fall and will be taking some statistics, data mining, and machine learning classes over the next year. However they generally have pre-requisites of undergrad-level linear algebra and probability and statistics. While I'm familiar with the topics I never took formal classes in either, so was wondering if you all had any recommendations for online resources covering the basics that I could use to brush up for the coming year? Thanks.