A startup has many areas that demand its founders' attention. Depending on the phase of the start-up, data science may be given different levels of importance. However, early investment in data science often has a high impact on profitability. In this article, we will review the possibilities of using data science technology in startups. We will evaluate how startups can build data pipelines and leverage a data platform in order to harness the power of data.

Data science in start-ups, your benefits!

Business is becoming data-centric, but the biggest challenge start-ups face is getting the data. Unlike larger companies, start-ups are rarely flush with data accumulated over time, so their data scientists have to build the architecture from scratch. The first step is to have a dedicated person or service provider set up and build the data acquisition architecture for the start-up business. The main areas to cover are:

  • Sources of data extraction
  • Strategy and tools to build Data Pipelines
  • Developing KPIs for data
  • Visualizing tools for developing insights
  • Building models
  • Testing and Validating to improve performance

Sources of data extraction

The user base and the event logs generated when users access the application are the two starting points for data extraction. The user base can be further divided into active users and their sessions, inactive users and their drop-off points, and the details of the events or transactions that active users perform. The data to be collected is based on these parameters. Additionally, certain domain-specific attributes are required to gauge the number of users and their usage patterns. Even simple insights into the user drop-off rate are highly useful for improving the solution and increasing engagement.

Trackers are critical to acquiring this data in an organised manner. The best way to do this is to write tracking specifications that identify the attributes to capture, and then implement the corresponding events. Tracking events are essential on the client side because they send data to the server, where it is used for analysis and for the development of your data products. Early-stage startups usually suffer from data starvation. Embedding event trackers in your product is therefore the best approach to collecting data at a steady pace and making the product better.
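A tracking specification can be as simple as a table of event names and the attributes each event must carry. The sketch below is a minimal illustration of that idea, not a reference to any particular tracking product; the event names, attribute names and the build_event helper are all hypothetical.

```python
import json
from datetime import datetime, timezone

# Hypothetical tracking specification: event name -> attributes it must carry.
TRACKING_SPEC = {
    "session_start": {"user_id", "platform"},
    "purchase": {"user_id", "item_id", "amount"},
}

def build_event(name, **attributes):
    """Check an event against the spec and serialise it for sending to the server."""
    if name not in TRACKING_SPEC:
        raise ValueError(f"unknown event: {name}")
    missing = TRACKING_SPEC[name] - set(attributes)
    if missing:
        raise ValueError(f"{name} is missing attributes: {sorted(missing)}")
    payload = {
        "event": name,
        "sent_at": datetime.now(timezone.utc).isoformat(),
        "attributes": attributes,
    }
    return json.dumps(payload)

# The client would send this JSON string to the collection endpoint.
event_json = build_event("purchase", user_id="u1", item_id="sku-9", amount=19.99)
print(event_json)
```

Rejecting events that do not match the specification on the client keeps malformed records out of the raw data from the start.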

Strategy and tools to build Data Pipelines

A data pipeline helps to process the collected data for quick and meaningful analysis. A good and healthy data pipeline has several distinct characteristics:

  • Near ‘real-time’ delivery – access and process data in minutes or seconds
  • Flexible querying – support longer batch queries or quick but interactive queries
  • Scalability – the pipeline should keep up as start-ups add and accumulate data while they grow
  • Alerts and errors – timely alerts for delivery or reception errors, missing data, etc.
  • Testing for speed – the pipeline should be easy to test for performance, anonymously, including database connections
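As one illustration of the alerting characteristic, a pipeline can raise an alert when no data has been received for too long. This is a minimal sketch under assumptions: the check_freshness function and the five-minute threshold are hypothetical, and a real pipeline would route the message to a monitoring tool rather than print it.

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_event_at, now, max_silence=timedelta(minutes=5)):
    """Return an alert message if no event arrived within max_silence, else None."""
    silence = now - last_event_at
    if silence > max_silence:
        minutes = int(silence.total_seconds() // 60)
        return f"ALERT: no events received for {minutes} minutes"
    return None

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(check_freshness(now - timedelta(minutes=2), now))   # data is fresh, no alert
print(check_freshness(now - timedelta(minutes=12), now))  # silence exceeds the threshold
```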

Developing KPIs for data

A strong pipeline starts with recognising the types of data that flow through it:

  • Raw Data – Raw data has no schema applied and no particular format attached to it. Events are tracked and stored as raw data, and the schema is applied at a much later stage.
  • Processed Data – With the implementation of schemas over the raw data, it becomes processed data. It is encoded in specified formats and is stored in a different location in the data pipeline.
  • Cooked Data – A summary of the processed data, which can contain multiple attributes based on usage data.
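The step from raw to processed data can be sketched as applying a schema to each raw record and setting aside records that do not fit. The schema, field names and apply_schema helper below are hypothetical and only illustrate the idea.

```python
import json

# Hypothetical schema: field name -> type used to coerce the raw value.
PURCHASE_SCHEMA = {"user_id": str, "item_id": str, "amount": float}

def apply_schema(raw_line, schema):
    """Parse one raw JSON event and coerce its fields; return None if it does not fit."""
    try:
        record = json.loads(raw_line)
        return {name: cast(record[name]) for name, cast in schema.items()}
    except (ValueError, KeyError, TypeError):
        return None

raw_events = [
    '{"user_id": "u1", "item_id": "sku-9", "amount": "19.99"}',
    '{"user_id": "u2", "item_id": "sku-3"}',  # missing amount
    'not json at all',                        # unparseable
]

# Only records that match the schema become processed data.
processed = []
for line in raw_events:
    row = apply_schema(line, PURCHASE_SCHEMA)
    if row is not None:
        processed.append(row)

print(processed)
```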

KPIs, or key performance indicators, capture engagement, retention and growth in order to determine the usefulness of the changes applied to the product or business model of the start-up. This also involves data engineering and standalone analysis. However, one should focus on implementing reproducible reports and dashboards that track product or business performance. The KPIs are then available on demand and do not need to be compiled manually every time they are required.
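To make KPIs reproducible, they can be computed by code rather than compiled by hand. The sketch below computes daily active users and a simplified next-day retention from a list of activity events. The sample data and function names are hypothetical, and the retention definition used here (share of users active on one day who return the next day) is one of several in common use.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical processed events: (user_id, activity date).
events = [
    ("u1", date(2024, 1, 1)), ("u2", date(2024, 1, 1)), ("u3", date(2024, 1, 1)),
    ("u1", date(2024, 1, 2)), ("u3", date(2024, 1, 2)), ("u4", date(2024, 1, 2)),
]

def daily_active_users(events):
    """Map each day to the set of distinct users seen that day."""
    active = defaultdict(set)
    for user_id, day in events:
        active[day].add(user_id)
    return active

def next_day_retention(active, day):
    """Share of users active on `day` who were also active the following day."""
    cohort = active.get(day, set())
    if not cohort:
        return 0.0
    returned = cohort & active.get(day + timedelta(days=1), set())
    return len(returned) / len(cohort)

active = daily_active_users(events)
dau = {day: len(users) for day, users in active.items()}
retention = next_day_retention(active, date(2024, 1, 1))
print(dau, round(retention, 2))
```

Running the same functions on a schedule gives the same numbers every time, which is what lets a dashboard serve the KPIs on demand.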

Visualizing tools for developing insights

Generating Reports

R is one of the most popular programming languages for data science. While R is widely used for creating plots and building web applications, it is also used for automated report generation. Some useful approaches to building reports with R are creating base plots directly, generating reports with R Markdown, and using Shiny to create interactive visualizations.

ETLs for Data Transformation

ETL stands for Extract, Transform and Load. The main role of ETL is to transform raw data into processed data, and processed data into cooked data, which takes the form of aggregated data. One of the key components of a pipeline is the raw events table, from which ETL processes produce the processed data. A collection of ETLs can be scheduled to run on the data pipeline, and there are various tools that can assist in monitoring and managing complex data.
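As a small illustration of the last transformation, the sketch below aggregates processed purchase rows into one cooked summary row per day. The row layout and the cook_daily_revenue function are hypothetical. In practice this step is often written in SQL and run by a scheduler.

```python
from collections import defaultdict

# Hypothetical processed purchase rows (schema already applied).
processed = [
    {"user_id": "u1", "day": "2024-01-01", "amount": 20.0},
    {"user_id": "u2", "day": "2024-01-01", "amount": 5.0},
    {"user_id": "u1", "day": "2024-01-02", "amount": 12.5},
]

def cook_daily_revenue(rows):
    """Aggregate processed rows into one summary row per day."""
    totals = defaultdict(lambda: {"purchases": 0, "revenue": 0.0, "buyers": set()})
    for row in rows:
        summary = totals[row["day"]]
        summary["purchases"] += 1
        summary["revenue"] += row["amount"]
        summary["buyers"].add(row["user_id"])
    return [
        {
            "day": day,
            "purchases": s["purchases"],
            "revenue": s["revenue"],
            "buyers": len(s["buyers"]),  # distinct users who bought that day
        }
        for day, s in sorted(totals.items())
    ]

cooked = cook_daily_revenue(processed)
for row in cooked:
    print(row)
```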

Exploratory Data Analysis for your Data Product

After setting up your data pipeline, the next step is to explore the data and gain insights about improving your product. With Exploratory Data Analysis or EDA, you can understand the shape of your data, find relationships between data features and gain insights about the data.

Some of the methods of analyzing the data are –

Summary Statistics – To better understand the dataset with mean, median, mode, variance, quartiles etc.

Data Plotting – Provide a graphical overview of the data through line charts, histograms, bar plots and pie charts, applying log transforms where the data is not normally distributed.

Correlation of Labels – Find which features are correlated within the dataset by comparing each feature with the others, with the goal of finding the features that correlate with a single feature of interest (the label).
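These three methods can be tried on a small sample using only the Python standard library. The sketch below is illustrative: the sample values are made up, and the pearson helper is a hand-written Pearson correlation rather than a library call.

```python
import math
import statistics

# Hypothetical per-user data: sessions per week and spend.
sessions = [1, 2, 2, 3, 5, 8, 13]
spend = [0.0, 4.0, 5.0, 9.0, 20.0, 41.0, 90.0]

# Summary statistics for one feature.
summary = {
    "mean": statistics.mean(sessions),
    "median": statistics.median(sessions),
    "mode": statistics.mode(sessions),
    "variance": statistics.variance(sessions),  # sample variance
    "quartiles": statistics.quantiles(sessions, n=4),
}

# A log transform compresses the long right tail of skewed data;
# log1p is used so that zero values stay defined.
log_spend = [math.log1p(value) for value in spend]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

r = pearson(sessions, spend)
print(summary)
print(round(r, 3))
```

A coefficient close to 1 or -1 suggests a strong linear relationship between the two features, while a value near 0 suggests little linear relationship.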

Building Statistical Models

Machine learning is used to make predictions by programmatically classifying the data. With predictive modeling tools, user behavior can be forecast and the products or business model tailored to how users behave.

For example, if the startup has identified a recommendation system as an opportunity, then a predictive model can recommend products or content to users based on their buying or watch history. Here again, there are two prevalent methods:

  • Supervised Learning – the development of a prediction model based on labeled data, mostly using regression and classification techniques. Regression is used to predict continuous values, while classification assigns values to classes to identify the likelihood of the outcome of a variable.
  • Unsupervised Learning – applied where the data is not labeled, using clustering and segmentation techniques.

Eager and lazy models are two ways of applying machine learning to the data sets. The eager model forms its rulesets at training time itself. The lazy model defers generating rulesets until prediction time, and is therefore often preferred in building real-time application systems, as the model reflects modifications or changes in the data without retraining.
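As one illustration of the difference, a nearest-centroid classifier does its work at training time by summarising the data into one centroid per label, while a k-nearest-neighbours classifier only stores the data and does its work when a prediction is requested. The sketch below is a minimal pure-Python version of both. The class names, the sample points (sessions per week, purchases) and the labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

class NearestCentroid:
    """Eager learner: summarises the training data into one centroid per label."""
    def fit(self, points, labels):
        grouped = defaultdict(list)
        for point, label in zip(points, labels):
            grouped[label].append(point)
        self.centroids = {
            label: tuple(sum(axis) / len(group) for axis in zip(*group))
            for label, group in grouped.items()
        }
        return self

    def predict(self, point):
        return min(self.centroids, key=lambda label: math.dist(point, self.centroids[label]))

class KNearestNeighbours:
    """Lazy learner: only stores the data; the work happens at prediction time."""
    def __init__(self, k=3):
        self.k = k
        self.points, self.labels = [], []

    def add(self, point, label):
        self.points.append(point)
        self.labels.append(label)

    def predict(self, point):
        nearest = sorted(zip(self.points, self.labels), key=lambda pair: math.dist(point, pair[0]))
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]

points = [(1, 0), (2, 1), (1, 1), (8, 5), (9, 6), (7, 5)]
labels = ["casual", "casual", "casual", "loyal", "loyal", "loyal"]

eager = NearestCentroid().fit(points, labels)
lazy = KNearestNeighbours(k=3)
for point, label in zip(points, labels):
    lazy.add(point, label)

before = lazy.predict((4, 3))
# New observations are usable immediately, with no retraining step.
for point in [(4, 3), (4, 2), (5, 3)]:
    lazy.add(point, "loyal")
after = lazy.predict((4, 3))
print(eager.predict((2, 0)), before, after)
```

The eager model would need fit to be called again to take the three new points into account, whereas the lazy model reflects them on the next prediction, at the cost of scanning all stored points each time.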

Crafsol has extensive experience in running machine learning tools for building prediction models, such as Weka, BigML, R and Scikit-learn (Python).

Testing and Validating to improve performance

Data warehouses and marts are not static entities and must be re-architected from time to time. However, the biggest measure of the success of data science in a start-up is its use and benefits. Every organisation that takes up data science runs a risk of low utilisation, either because the insights are not aligned with its needs or because they are not available in time. This is especially true for start-ups, which are in continuous change at multiple levels, from business model to data acquisition.


Data science is essential to making better products and improving customer experience. Startups should invest in quality data acquisition and its systematic processing from the very beginning. Essentials such as building data pipelines to assist in faster processing of the data are equally important to ensure a strong foundation for data-driven decisions. A strong initial investment can go a long way in creating a sustainable competitive edge for the start-up's business model and solution. It also demonstrates a scientific approach to decision-making when interacting with key stakeholders, including customers and investors.

Crafsol has been advising and consulting start-ups on the use of machine learning and business intelligence to improve customer experience. We work as a partner with fast-growing start-ups in India, the USA and Australia to help them establish a strong data science practice early in their business.