MDS Newsletter #20
We have crossed 1,000 subscribers, all thanks to your constant love & support!
We have something amazing lined up for you all. We are launching the inaugural edition of the "MDS Rocketship Awards" 🚀 tomorrow. Stay tuned on MDS Twitter to find out who the "MDS Rocketships" are. Let's go 🚀
Let's dive into this week's edition!
Featured category this week
'The customer is king', and how well you know your 'king' determines whether you can take your game up a notch.
The best way to know your customer is by leveraging their data and using it for future customer interactions. This is where Customer Data Platforms come into the picture.
Read this article and tweet thread by Brian Lu, Director of Product at RudderStack, which explain what Customer Data Platforms are, what they do, and how to evaluate them.
Featured tools this week
Here are this week's featured companies from the Modern Data Stack.
- Observable is a collaborative data canvas, powered by the community, that helps you explore, analyze, and explain data.
Observable reimagines how we can make sense of data for this modern, connected world, making it more approachable, accessible, social and fun.
Headquartered in San Francisco, California, Observable was founded in 2017.
Observable has raised a total of $46.1M over 2 rounds. Their latest funding was raised on Jan 13, 2022, from a Series B round.
- Amundsen is an open-source data discovery and metadata-driven application for improving the productivity of data analysts, data scientists, and engineers when interacting with data.
Headquartered in San Francisco, California, Amundsen was founded in 2019.
Reverse ETL wars
The Reverse ETL space is getting hot & spicy 🔥🌶.
It started in Nov '21, when Census published a two-part benchmark report (Pt 1 & Pt 2) claiming to be faster than Hightouch: according to Census, it was able to sync 4,444 records/second, whereas Hightouch was able to sync only 102 records/second (making Hightouch roughly 44x slower).
Hightouch disputed the claims made by Census and published its own benchmark, stating that the numbers Census reported for Hightouch's performance did not match its own evaluation of the platform and that Hightouch is in fact 27x faster.
Lastly, Castled Data, an open-source Reverse ETL platform, published its own benchmarks, claiming to be 642% faster than Census and 1324% faster than Hightouch 👀
This reminds us of the database benchmark wars, and these benchmarks should be read with caution.
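To compare the headline figures above on one scale, here is a minimal Python sketch. The function names are our own, and reading "X% faster" as a throughput multiplier of 1 + X/100 is an assumption; each vendor may define "faster" differently, which is one more reason to be cautious.

```python
def speedup_ratio(fast_rate: float, slow_rate: float) -> float:
    """How many times higher one throughput is than another."""
    return fast_rate / slow_rate


def percent_faster_to_multiplier(percent: float) -> float:
    """Convert 'X% faster' into an 'N times' multiplier.

    Assumption: 'X% faster' means throughput is (1 + X/100) times the baseline.
    """
    return 1 + percent / 100


# Census's reported figures: 4,444 vs 102 records/second
print(round(speedup_ratio(4444, 102), 1))

# Castled's reported figures, expressed as multipliers
print(round(percent_faster_to_multiplier(642), 2))
print(round(percent_faster_to_multiplier(1324), 2))
```

Under these assumptions, the Census figures work out to a ratio of about 43.6 (rounded up to 44x in the report), and the Castled claims translate to roughly 7.4x and 14.2x.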
Good reads & resources
Keep pace with the latest developments in the modern data stack. Always be learning!
- Modern Data Stack for Startups: Much about an early-stage startup and its growth is unpredictable, and how its data volume will scale is one of those things. This makes it difficult for startups to choose the right tools to build their data stack. In this article, Tejas Agrawal shares his views on how early-stage startups should approach choosing the right tools for their data stack.
- Good Data Citizenship Doesn’t Work: Just like responsible citizens alone can't create a prosperous society if the leaders are not willing to put in the work, a great leader can't make a difference if citizens are not helping by being responsible. The same goes for "Data Citizenship" & "Data Leadership".
In this article, Benn Stancil explains why "Data Citizenship" alone doesn't work. He discusses in detail how, with a mix of good citizenship, good product, and good processes, we can build a data society that works for, and is trusted by, everyone.
- Top 8 SQL Functions to Clean Raw Data: Working with messy data is not an easy task. Applying multiple transformations just to make this data ready for your model is a tedious and time-consuming process.
However painful this process is, you can't take it lightly: "messy data in, messy data out"! Your data models are only as effective and concise as the data you’re loading into them.
In this article, Madison Schott briefly discusses how you can extract raw data from business APIs using Airbyte and ingest it into a data warehouse. She also points out some common data quality issues when extracting raw data from Google Sheets to a data warehouse and shares some popular SQL string functions to clean this data and make it ready for your analysts.
- Data Engineers don’t need to be Superman: It's a misconception that data engineers need to know the nitty-gritty of how every data tool works to be good at their job. In this article, Andreas Kretz explains why a data engineer doesn't need Superman-like abilities. Using a blueprint, he shows how, at each phase ("Connect", "Store", & "Visualize"), knowing how to work well with just one tool is enough to do the job well.
- Do You Really Need a Feature Store?: Though it appears that ML teams at world-class organisations like Netflix, Uber, Airbnb, etc. have built their own "Feature Store", it's no longer necessary to build an in-house feature store, as a number of solutions are available: Tecton.ai, Hopsworks, and SageMaker by Amazon, to name a few. But why have a Feature Store at all? Is it an absolute necessity for every ML team? Or is it overkill most of the time?
In this article, Lak Lakshmanan argues that in most cases, feature stores add unnecessary complexity. He briefly discusses situations that require a "Feature Store" and spells out the "concrete situations" where a feature store is not required. Using a decision chart, he shares a better way to decide whether you need a feature store or not.
Latest funding news
The latest happenings in the VC world for data stack companies.
- RudderStack raises $56M in Series B Round of Funding!
RudderStack is a Customer Data Platform helping businesses to improve their analytics and marketing efforts.
This round of funding was led by Insight Partners with continued support from Kleiner Perkins and S28 Capital. Read here.
- Onehouse, a pre-built lakehouse foundation, raises $8M in seed funding!
Onehouse delivers a new bedrock for your data, through a cloud-native managed lakehouse service built on Apache Hudi.
This round of funding was co-led by Greylock and Addition. Read here.
- Popsink raises €880k to help data teams build real-time data services.
Popsink is building the next generation of data processing tools for the modern data stack. Come discover the power of real-time ETLs and automate your decision-making.
This round of funding was led by Seedcamp to accelerate Popsink's mission to make building real-time data services easy and accessible to any data team. Read here.
Upcoming data stack events & webinars
Upcoming conferences, summits, and webinars for you. Start networking!
- Tecton is hosting "apply()", an online ML data engineering event, on February 10, 2022, 8:30 AM - 1:30 PM PT.
"apply()" is an event for data and ML teams to discuss the practical data engineering challenges faced when building ML for the real world. Participants will share best practices, tools, and emerging architectures they use to successfully build and manage production ML applications.
Register here for the event.
- TDWI is organising a virtual event on 'Managing Cloud Data Platforms' on February 16-17, 2022, 8:30 AM - 1:00 PM PT.
The speakers will show you how to manage cloud data platforms to support agile data delivery for business decision-making. TDWI research will expose trends and opportunities based on the latest quantitative research, with expert analysis and recommended best practices.
Register here for the event.
- Twilio is hiring a 'Data and BI Engineer'
Location: Remote - US
Stack: Looker, Tableau, Datadog, Power BI
- Patreon is hiring a 'Database Reliability Engineer'
Stack: AWS, Linux
- Chargebee is hiring a 'Senior Data Engineer'
Stack: S3, Flink, Hive, Kafka
- Mythical Games is hiring a 'Data Infrastructure Engineer'
Location: Remote - US
Stack: Kafka, CockroachDB, Snowplow
- dbt is hiring an 'Analytics Engineering Manager'
Stack: Lyft, Monzo, GitLab, Snowflake
What's 🔥 on Twitter
What's trending on Twitter in the Data Stack world!
Just for fun
If you dig this newsletter, share it with your friends in the data space. It will take you 10 seconds to share, but it took us 10 hours to prepare. Send us some love 💖
If you have any suggestions, want to feature an article, want to list a data engineering job, or have a fun meme that you made for the data community to enjoy, hit us up! If it's good, we'll include it in our next edition 😎
We're building a platform to bring together people in the data community to learn everything about building and operating a Modern Data Stack. It's pretty cool - do check it out :)