2
Stickied post
Moderator of r/datasets

Show off, complain, and generally have a chat here.
Discuss whatever you've been playing with lately (datasets, visualisations, mining projects, etc.).
Also feel free to share or ask for tips and suggestions, and in general talk about services/tools/sites you find interesting.

P.S: Suggestions for this subreddit are always welcome.

2
3 comments
42

Just came across this huge list which has everything I have ever seen and more!

https://dreamtolearn.com/ryan/1001_datasets

42
1 comment
3

I need hotel data for a majority of the top 1000 cities in the US. All I need is city and average hotel cost, although it'd be great if it were also broken out by star rating. Any ideas? If there's a website I could scrape/crawl, that would be acceptable too. Thank you!

3
1 comment
0

Hi, I was wondering if anyone knows where I can find longitudinal data on the size of New Orleans (in terms of area). Ideally this would be annual data from 2000 to roughly 2015. I have only been able to find data for the most recent year.

The purpose of this is to track shoreline changes over the past 15 years.

0
comment
3

I found an old timer in my shed, and it's turning on during the daytime rather than at night. I'm curious to see if my theory is right. Any help is appreciated.

Maybe I'm looking for the difference in sunset/sunrise times over the years.
Any time lost in between?
I'm honestly not sure how to even start this.

3
2 comments
1

I have a crawler that was working great before the last patch.

My search builds URLs from specific NAICS/funding agency combinations and feeds them to the crawler. However, as of last night it just loops through the first combination indefinitely.

For example,

<title>FPDS-NG search results for
<![CDATA[
: PRINCIPAL_NAICS_CODE:541990 FUNDING_AGENCY_ID:3600 SIGNED_DATE:[2018/07/01,2018/07/22]
]]>
</title>
<link rel="alternate" type="text/html" href="https://www.fpds.gov/ezsearch/search.do?s=FPDS&indexName=awardfull&templateName=1.5.1&q=PRINCIPAL_NAICS_CODE%3A541990+FUNDING_AGENCY_ID%3A3600+SIGNED_DATE%3A%5B2018%2F07%2F01%2C2018%2F07%2F22%5D&start=5000"/>
<link rel="first" type="text/html" href="https://www.fpds.gov/ezsearch/FEEDS/ATOM?s=FPDS&FEEDNAME=PUBLIC&VERSION=1.5.1&q=PRINCIPAL_NAICS_CODE%3A541990+FUNDING_AGENCY_ID%3A3600+SIGNED_DATE%3A%5B2018%2F07%2F01%2C2018%2F07%2F22%5D&start=0"/>
<link rel="last" type="text/html" href="https://www.fpds.gov/ezsearch/FEEDS/ATOM?s=FPDS&FEEDNAME=PUBLIC&VERSION=1.5.1&q=PRINCIPAL_NAICS_CODE%3A541990+FUNDING_AGENCY_ID%3A3600+SIGNED_DATE%3A%5B2018%2F07%2F01%2C2018%2F07%2F22%5D&start=40"/>
<link rel="previous" type="text/html" href="https://www.fpds.gov/ezsearch/FEEDS/ATOM?s=FPDS&FEEDNAME=PUBLIC&VERSION=1.5.1&q=PRINCIPAL_NAICS_CODE%3A541990+FUNDING_AGENCY_ID%3A3600+SIGNED_DATE%3A%5B2018%2F07%2F01%2C2018%2F07%2F22%5D&start=4990"/>
<modified/>

As you can see, start=0 is the first page and start=40 is the last page. If you look this query up on fpds.gov using their front end, there are 43 entries, so each page holds 10 results starting at start=0. My crawler should end at start=40 and go on to the next NAICS/funding agency combo. This has been working perfectly for over a year and a half, but after yesterday's update it isn't working, and I was wondering if anyone else has seen something similar.
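One way to guard against this kind of endless loop is to read the start= value from the rel="last" link and stop as soon as the crawler's own offset passes it. The sketch below is only an illustration in Python: the function names are made up, the page size of 10 is taken from the description above, and it assumes the feed keeps exposing rel="last" the way the snippet shows.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Matches <link rel="..." ... href="..."> as it appears in the feed snippet.
LINK_RE = re.compile(r'<link\s+rel="(?P<rel>[^"]+)"[^>]*?href="(?P<href>[^"]+)"')


def link_offsets(feed_text):
    """Map each link rel ("first", "last", ...) to the integer start= in its href."""
    offsets = {}
    for match in LINK_RE.finditer(feed_text):
        # A well-formed feed escapes "&" as "&amp;"; undo that before parsing.
        href = match.group("href").replace("&amp;", "&")
        query = parse_qs(urlsplit(href).query)
        if "start" in query:
            offsets[match.group("rel")] = int(query["start"][0])
    return offsets


def page_starts(last_start, page_size=10):
    """List every start= offset from 0 up to and including the last page."""
    return list(range(0, last_start + page_size, page_size))


def past_last_page(current_start, offsets):
    """True when the crawler has gone beyond the feed's advertised last page."""
    last = offsets.get("last")
    return last is not None and current_start > last
```

With the feed above, the "last" offset is 40, so the crawler would request start=0, 10, 20, 30 and 40 and then move on to the next combination. A request at start=5000 would be flagged as past the end.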

1
comment
1

Hi. I've downloaded hyperGAN and I'm looking for a dataset with some theme-based images (preferably 128x128, 64x64 or 32x32 since I don't have a very strong GPU) to train it. Any help would be appreciated.

1
comment
7

I'm looking for as much data as I can find about religious populations. I found some on Pew Research, but I was hoping to get a year-by-year breakdown for every country. Any help would be much appreciated!

7
2 comments
3

I have searched the internet for anything. I am not looking for CT or MRI scans, etc. Just cancer cells under microscopes.

I could use cancer types like

Squamous Cell Carcinoma (Skin cancer)

Glioblastoma Multiforme (Tumor)

and things like that!

Something like this..

Thanks in advance!

3
4 comments
9

Sorry, this should be so simple, but I've struck out after 2 hours of looking. I've found charts galore but not the raw data.

I'm looking for a simple chart of the global income or wealth distribution by percentile. So 1st percentile 0.01%, 2nd percentile 0.02%, etc.

Or if I could just get the income or wealth by percentile, I could do the distribution myself.

I found this for the UK which is brilliant. I just want the global version of it.

9
2 comments
4

Hi!

I'm gathering info about botnet traffic detection for my computer engineering final-year project.

I have read some articles on this subject already. But almost every one of them works on a private dataset, usually collected from private networks.

So I'd like to have a dataset on which I could work on botnet traffic detection, and maybe botnet classification.

Thanks in advance, see ya!

4
4 comments
5

Is there any place where I can find a dataset that has defaults as a measurement? I'm trying to forecast default probabilities, so it is imperative that it includes defaults as well as non-defaults.

5
5 comments
0

The more info the better. APR info and the length of any introductory no-interest period would be great. Basically every credit card company or financial institution. XML, CSV, JSON or any other format is fine. Thanks.

0
3 comments
38

Scraped the fifa.com statistics page for each game into CSV.

https://gitlab.com/djh_or/2018-world-cup-stats/blob/master/world_cup_2018_stats.csv

Columns are: `Game,Group,Team,Opponent,Home/Away,Score,WDL,Pens?,Goals For,Goals Against,Pen Shootout For,Pen Shootout Against,Attempts,On-Target,Off-Target,Blocked,Woodwork,Corners,Offsides,Ball possession %,Pass Accuracy %,Passes,Passes Completed,Distance Covered km,Balls recovered,Tackles,Blocks,Clearances,Yellow cards,Red Cards,Second Yellow Card leading to Red Card,Fouls Committed`

There's a little Ruby scraper in that repo along with the source pages, but the CSV above is what it produces.
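For anyone who wants to poke at the file, here is a rough Python sketch that totals goals per team using the `Team` and `Goals For` columns listed above. It assumes one row per team per game and that `Goals For` always parses as an integer; neither has been checked against the actual file, and `goals_by_team` is just an illustrative name.

```python
import csv
from collections import defaultdict


def goals_by_team(csv_file):
    """Sum the "Goals For" column per "Team" from an open CSV file object."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        # Assumption: "Goals For" holds a plain integer in every row.
        totals[row["Team"]] += int(row["Goals For"])
    return dict(totals)


# Usage, assuming the CSV has been downloaded next to the script:
#
#     with open("world_cup_2018_stats.csv", newline="") as f:
#         print(goals_by_team(f))
```

Reading through `csv.DictReader` keeps the column names visible in the code, which helps with a header this wide.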

38
8 comments
0

I'm doing a research project and I'm looking for more (or better) resources. I'm part of a publication being written on expat compensation rates around the world and the hassle factor. I'm taking a data science approach, looking at each country and what expatriates get annually. A bit on what expatriates are: they are people who leave their countries to pursue education or work, and then return to their original country. What I'm looking for is databases to access (e.g. government websites, university databases) that contain survey data, pay rate analytics, and anything else that could be of use to my research. I'll be glad to explain more if you have any questions, as most people don't know what expatriates, self-initiated expatriates, and returnees are. Thank you :).

0
comment
6
Crossposted by
pushshift.io
3 days ago

Edit:

I have received a lot of great advice so far and have created a new Patreon page for Pushshift. This will help keep track of the amount of donations that Pushshift receives (which I feel should be transparent for the community). My first goal is $1,500 per month which would be sufficient to pay the bills and for the daily maintenance necessary to keep things running smoothly.

The Patreon page is located here: https://www.patreon.com/pushshift


Hello! I am not always the best when it comes to fund-raising and pursuing the best avenues for getting donations so I will reach out to you guys. I am reaching out for ideas on how to raise money to keep these services alive and healthy (and also to continue to improve the API and add more features).

The Pushshift.io API and the data dumps I provide (for Reddit, Twitter, and other data sources) require a significant time investment from me and also a significant amount of funding. Just for hardware maintenance and purchasing new hardware to keep up with the level of data I ingest, I have spent over $25,000. There are also recurring monthly expenses for power, bandwidth, etc.

Unfortunately, donations have been sporadic lately. For the previous 4 weeks, I've gotten less than $100 in donations, which isn't even enough to cover the monthly ISP bill.

To give some insight into my commitment to this project (the original primary aim was to help academic institutions and researchers interested in researching social media discourse, etc.), I left my full-time job with the National Democratic Institute last year around August to focus on this project full-time. I simply love data and helping out the academic community and wanted to spend more time focusing on open-source projects and getting involved in other projects that focus on making our world a better place. I spent some time late last year and earlier this year working with the CivilServant project. I had a family emergency earlier this year which caused me to have to leave that project (quick note -- CivilServant, run by Nathan Matias, is an amazing project and I highly suggest checking it out!).

My goal is to raise $3-5k monthly to both maintain the current services that Pushshift.io offers and also to improve the existing services and add new ones as well. I am currently not even averaging 1/10th of that amount. The largest donation I have received was from the Pineapple Fund which generously contributed $10,000 towards the project (that was a huge help -- thank you to whoever you are!) A bare-minimum of $1.5k per month would be enough to keep the present project alive, though.

If I cannot find some means to increase funding for this project, I will sadly have to shut down the project at some point (if it comes to that, I will do my best to give some advance notice so that others who depend on this service can transition off of it). I am reaching out to the community for ideas on how to get more serious about raising funds for this project and would greatly appreciate any suggestions that you have.

Thank you!

  • Jason Baumgartner
18 points
6
comment
2

Hi,

I'm after a dataset of around 1-2k records for the #CROvENG match. I don't have the technical skills to utilize the APIs, so I'm hoping someone here does.

Mostly after just username, tweet, date, number of likes and retweets. Just a general

2
comment
5

Looking for something with more emphasis on race results than betting data but anything is welcomed. Thanks!

5
4 comments
2

I am interested in data from companies like Ticketmaster and Pollstar that track ticket sales, pricing, attendance, etc. for live shows. I am pretty sure the financial data for any publicly traded company is available online, but I am more curious about the datasets that are used internally.

2
comment
1

Hi, my project is to build SQL injection detection using machine learning. I need a dataset of SQL injection tautologies only. If I can't get one, can you give tips on how to generate a dataset myself?

1
2 comments
3

Hey guys, I'm looking for a dataset that consists of anywhere from 500-1000 images of faces, along with a description of the face (anywhere from 1-2 sentences, or even simple keywords). For example, an image like this one:

https://i.redd.it/zjqgz1cyyya11.jpg

Could be paired with a description such as: "Black male, short goatee, short moustache, wide face, left eye blue, brown right eye, bushy eyebrows, black hair", or something along the lines of that. Does anyone know of a dataset like this? Thanks.

3
2 comments
19

I crawled the results from the soccer magazine Kicker and gathered the data, as described here:

http://www.camminady.org/gathering-every-bundesliga-game-ever-played-in-a-single-csv-file/

Edit: I updated the file. It contains the team names coded as unique IDs. Same for day of the week. Date and time were converted to integer formats. Thus all data is in a numeric form and can be used in some machine learning algorithm.
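To give a rough idea of what this kind of numeric encoding can look like, here is a small Python sketch. It is not the script used for the file: the ID order (first appearance) and the YYYYMMDD integer format are assumptions chosen only for illustration, and both function names are hypothetical.

```python
from datetime import datetime


def encode_ids(values):
    """Give each distinct value an integer ID, in order of first appearance."""
    ids = {}
    for value in values:
        # setdefault only assigns a new ID the first time a value is seen.
        ids.setdefault(value, len(ids))
    return ids


def date_to_int(text, fmt="%Y-%m-%d"):
    """Turn a date string into a YYYYMMDD integer (assumed format)."""
    return int(datetime.strptime(text, fmt).strftime("%Y%m%d"))
```

The same `encode_ids` helper would work for team names and for days of the week, since both are just small sets of labels.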

19
5 comments
Community Details

39.0k

Subscribers

136

Online

A place to share, find, and discuss Datasets.

r/datasets Rules
1. Self-promotion / no disclosure
2. Not original source
3. Inappropriate Survey
4. Low Effort