Data flows from left to right, from upstream to downstream: upstream means closer to the data sources, and downstream leads to insights and dashboards.
We recently completed an end-to-end data pipeline project for a $100M+ ecommerce retailer. The architecture mirrors what we use for most projects, because we typically work end-to-end from a data flow perspective.
Here's the complete technical walkthrough of how we built their data pipeline.
Key takeaways
- Scattered reporting across tools and unclear Key Performance Indicators (KPIs) created decision-making bottlenecks for this $100M+ retailer
- Three-layer architecture (raw, intermediate, marts) preserves source data while creating business-ready analytics
- Extract, Load, Transform (ELT) approach loads data first, then transforms using Snowflake's compute power
- Portable replaced Fivetran for better cost-effectiveness while data build tool (dbt) Core handles all transformations
- Dimensional modeling with facts and dimensions enables analysts to extract meaningful business insights using Structured Query Language (SQL)
Problems we were solving
Before building this architecture, the $100M+ ecommerce retailer faced critical data challenges that were slowing down decision-making across the organization.
Reporting scattered across various dashboards and tools
Teams were pulling data from Shopify dashboards, Amazon Seller Central, Gorgias support metrics, Klaviyo email reports, and various Google Sheets. Each tool had its own metrics and timeframes. Getting a complete picture required manually combining information from multiple sources.
Unclear definitions for core Key Performance Indicators (KPIs)
Revenue numbers differed between platforms. Customer counts varied depending on which system you checked. Support teams and sales teams had different definitions for customer satisfaction metrics. Without consistent definitions, strategic discussions often became debates about which numbers were "correct."
Slow reporting turnaround and low trust in existing reports
Generating monthly reports took days of manual work. By the time insights were ready, market conditions had already changed. Teams questioned data accuracy because they couldn't trace how numbers were calculated or which sources they came from.
These problems meant the company was making decisions based on incomplete, inconsistent, or outdated information.
Understanding the Data Sources
This ecommerce company operates across multiple platforms where they conduct business and want to get insights from their data.
Ecommerce Sales Platforms
The company sells products through Shopify and Amazon. They connect to these platforms through Application Programming Interfaces (APIs) to get transaction data, though each provides data in different formats.
Customer Experience Data
Understanding customer satisfaction requires additional data sources beyond sales platforms.
Okendo provides customer reviews and ratings. Gorgias handles customer support data. The same customers who buy products also contact support and leave reviews, but this data doesn't exist in Shopify or Amazon. It requires third-party tools like Okendo and Gorgias.
Subscription and Marketing Data
Recharge manages subscription data for their recurring revenue business model. Klaviyo tracks email marketing campaigns and customer engagement. Northbeam provides advertising attribution data.
Internal Business Data
Some business-specific information that defines their operations lives in Google Sheets. This represents another type of data source that needs integration.
You want all of these sources because together they define the business. The goal is to bring everything into one structured place.
Data Ingestion Strategy
You can bring data in from sources manually, but data ingestion tools make the process much easier.
This client originally had Fivetran, but Portable made more sense after analysis. As a consulting company, we determined that Portable provided better value for their specific data sources and requirements.
Portable handles the ingestion process of bringing data from data sources into the data warehouse.
Three-Layer Architecture in Snowflake
We use Snowflake as the data warehouse. Data gets ingested into Snowflake, though you could ingest it in many other ways and into many other destinations.
Raw Layer
We usually ingest data into a raw database. Technically, it's not even a database but a schema inside a database.
Inside the main database (called Analytics or the company name), we have multiple schemas. The raw schema stores data exactly as it comes from sources.
Shopify data arrives in a structured format - like a spreadsheet with organized cells arranged in tables. Amazon doesn't provide tables; its data comes in other formats, such as JavaScript Object Notation (JSON) files. So we're ingesting that semi-structured data as JSON files in the raw schema.
Raw data means you don't do anything to the data. You just bring it from the source without modifications.
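As a rough sketch of what the raw layer can look like in Snowflake (the schema, table, and stage names below are illustrative assumptions, not the client's actual objects), the landing spot for Amazon data can be a single VARIANT column that holds each JSON payload untouched:

```sql
-- Illustrative raw-layer objects in Snowflake; all names here are assumptions
create schema if not exists analytics.raw;

-- Each Amazon payload lands as-is in a VARIANT column, with no parsing or cleanup
create table if not exists analytics.raw.amazon_orders (
    loaded_at timestamp_ntz default current_timestamp(),
    payload   variant
);

-- An ingestion tool normally handles loading; a manual equivalent, assuming a JSON stage exists, is:
copy into analytics.raw.amazon_orders (payload)
from (select $1 from @analytics.raw.amazon_stage)
file_format = (type = 'json');
```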
Intermediate Layer
After raw ingestion, we do modeling work. In Extract, Transform, Load (ETL) terminology, you've already extracted from sources using ingestion tools, and now you're in the transformation layer.
We follow something like a medallion architecture, where data moves from raw to intermediate. The intermediate layer prepares the data in a cleaner, more curated form.
The semi-structured JSON data from Amazon becomes organized tables. Data that was scattered across different formats gets standardized.
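A minimal dbt model for this step might look like the sketch below. It assumes the raw Amazon table is declared as a dbt source and that the payloads use field names similar to Amazon's order data; the actual fields will vary:

```sql
-- models/intermediate/int_amazon_orders.sql (illustrative; source and field names are assumptions)
-- Turns raw Amazon JSON payloads into a typed, tabular model using Snowflake's semi-structured syntax
with raw_orders as (

    select payload
    from {{ source('raw', 'amazon_orders') }}

)

select
    payload:AmazonOrderId::string            as order_id,
    payload:PurchaseDate::timestamp_ntz      as ordered_at,
    payload:OrderTotal.Amount::number(18, 2) as order_total,
    payload:OrderTotal.CurrencyCode::string  as currency
from raw_orders
```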
Marts Layer
From intermediate, we continue the transformation into a third layer called marts. We separate the marts into different categories based on client needs.
We might have a sales mart with information from both Shopify and Amazon sources. Customer support marts combine support data. Logistics marts handle fulfillment data. Summary tables provide high-level metrics.
When data reaches this marts layer, it's very curated and ready for business use.
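A sales mart model that combines both channels can be as simple as a union of the curated intermediate models. This is a sketch with assumed model and column names, not the client's actual code:

```sql
-- models/marts/sales/fct_orders.sql (illustrative; model and column names are assumptions)
-- One orders fact across channels, built from the curated intermediate models
with shopify_orders as (

    select order_id, ordered_at, order_total, 'shopify' as sales_channel
    from {{ ref('int_shopify_orders') }}

),

amazon_orders as (

    select order_id, ordered_at, order_total, 'amazon' as sales_channel
    from {{ ref('int_amazon_orders') }}

)

select * from shopify_orders
union all
select * from amazon_orders
```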
An ELT Pipeline, Not ETL
This data flow is more correctly defined as ELT, not ETL. It's Extract, Load, then Transform.
You first extract from data sources and load into Snowflake's raw schema without changing anything. Then you perform the transformations to create the other two layers (intermediate and marts).
The transformation happens inside the data warehouse using its compute power, rather than processing data before loading it.
Transformation with dbt
dbt (data build tool) handles the transformation work from raw to intermediate and from intermediate to marts.
Because we use dbt Core, we have fine-grained control over the transformations. We can implement comprehensive data testing and quality checks, and we can choose how each model is materialized - as a table, a view, or another materialization - depending on requirements.
dbt manages all the transformation logic using SQL and provides testing capabilities to ensure data quality throughout the pipeline.
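For instance, a model can declare its own materialization in a config block, while column-level tests such as unique and not_null are declared in the model's YAML properties file. The snippet below is a sketch with assumed names, not the client's actual model:

```sql
-- Illustrative dbt model showing a per-model materialization choice (names are assumptions)
-- Tests such as unique and not_null on order_id would be declared in the accompanying YAML file
{{ config(materialized='table') }}

select
    order_id,
    ordered_at,
    order_total,
    sales_channel
from {{ ref('fct_orders') }}
where ordered_at >= dateadd(day, -90, current_date())
```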
Data Models and Relationships
When data reaches the marts layer, we define connections and relationships between tables using what we call dimensional modeling - facts and dimensions.
Data models define how data connects in the marts layer. Facts and dimensions have defined relationships to each other, and this structure enables analysts to write SQL queries that join tables together to extract meaningful business information for analysis.
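As an example of what that enables, an analyst could answer a question like "revenue by customer segment and channel" with a single join between a fact and a dimension. The table and column names below are hypothetical:

```sql
-- Illustrative analyst query against the marts layer (table and column names are hypothetical)
select
    dim_customer.customer_segment,
    fct_orders.sales_channel,
    count(distinct fct_orders.order_id) as orders,
    sum(fct_orders.order_total)         as revenue
from analytics.marts.fct_orders as fct_orders
join analytics.marts.dim_customer as dim_customer
    on fct_orders.customer_id = dim_customer.customer_id
group by 1, 2
order by revenue desc
```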
Version Control with GitHub
All transformation work gets represented in a versioned system. We use GitHub for version control, treating data engineering work like software development.
This approach ensures that changes to transformation logic are tracked, reviewed, and can be rolled back if needed. Multiple team members can collaborate on the pipeline without conflicts.
From Data to Dashboards
The final step connects the curated data to the dashboards and reports where business insights are extracted.
Everything flows to dashboards or reports that teams use for decision-making. The marts layer provides the foundation for these analytical tools.
Technical Architecture Summary
This end-to-end data flow represents a modern approach to data engineering:
- Sources: Multiple platforms (Shopify, Amazon, Okendo, Gorgias, Recharge, Klaviyo, Northbeam, Google Sheets)
- Ingestion: Portable replaces Fivetran for better cost-effectiveness
- Warehouse: Snowflake handles both structured and semi-structured data
- Transformation: dbt Core manages the three-layer transformation process
- Version Control: GitHub tracks all changes and enables collaboration
- Output: Business intelligence dashboards and reports
The three-layer medallion architecture (raw, intermediate, marts) ensures that source data is preserved while creating business-ready analytics. ELT methodology leverages cloud warehouse compute for efficient processing.
FAQ
Why use ELT instead of ETL for this pipeline?
ELT extracts and loads data into the raw schema first, then transforms using the data warehouse's compute power. This preserves source data and leverages Snowflake's processing capabilities more efficiently.
What's the difference between the three layers in the architecture?
Raw stores data exactly as it comes from sources without changes. Intermediate applies transformations to make data more curated and structured. Marts organize data by business function like sales, support, or logistics.
Why did you choose Portable over Fivetran?
After analysis, Portable made more sense for this client's specific data sources and provided better value. It handled the exact connectors needed for their platforms.
How does dimensional modeling work in the marts layer?
We use facts and dimensions with defined relationships. This enables analysts to write SQL queries that join tables together to extract meaningful business information.
What role does dbt play in this architecture?
dbt handles all transformations from raw to intermediate and intermediate to marts. It provides precise control, data testing capabilities, and multiple materialization options for tables and views.
Summary
This data pipeline architecture demonstrates how modern data stacks handle complex ecommerce requirements. The ELT approach preserves raw data while enabling flexible transformations using cloud warehouse capabilities.
Multiple data sources flow through standardized layers that separate concerns - raw preservation, technical transformation, and business organization.
The combination of Portable, Snowflake, dbt Core, and GitHub creates a maintainable system that handles diverse data formats while delivering reliable business insights.
This approach works for most data projects because it follows proven patterns for moving data from sources to insights in a scalable, governed way.