Skip to content

big-data-spaces/data-virtualization-primer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Virtualization Primer

Data Virtualization

According to Wikipedia:

Data virtualization is an approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located 1 and can provide a single customer view  (or single view of any other entity) of the overall data 2.

Related Topics

Advantages of Data Virtualization

  • Keeps control by the source party (department, company)
  • Reduces the (even: legal) risk of redundant, old or erroneous data
  • Reduces the workload for copying/moving data
  • Does not presuppose or impose a single data model / schema / format
  • Supports transactional reading and writing
  • Allows to analyze across multiple domains

Challenges of Data Virtualization

  • Differences in source and consumer formats and semantics
  • Declarative approach based on the target outcome
  • Adapting to changing environments (source and target)
  • Networking stability to the source systems
  • Variying qualities of source data
  • Managing the security and privacy of the source data

Data Virtualization Concepts

Abstraction

Abstract the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.

Virtualized Data Access

Connect to different data sources and make them accessible from a common logical data access point.

Transformation

Transform, improve quality, reformat, aggregate etc. source data for consumer use.

Federation

Combine result sets from across multiple source systems.

Delivery

Publish result sets as views and/or data services executed by client application or users when requested.

Lets make an example

Run Data and Virtualization Services

Type in the following command in your terminal

docker compose up

Terminal

If you run docker on your local computer, you may open http://localhost:9047 in your browser. If you run docker in Codespaces, you should open the forwarded address for port 9047.

Codespaces Port Forwarding

Setup the Data Virtualizer

When you run dremio for the first time, you have to enter some admin account details.

Dremio Admin Account

Then you will see an empty workspace which has no

  • Spaces (which is the equivalent of "views" or "output schemas")
  • Sources (which is the equivalent of "tables" or "input schemas")

Dremio Workspace

Add a (Public, Internet, Big) Data Source

Next we will create a source which points to an open dataset

Dremio Add Source

which is located in a public S3 bucket

Dremio S3 Source

For each source (connection) you can enable various options to optimize the scanning (both meta-data and data). These settings allow to specifiy a sweet spot between changing sources and caching the raw data access.

Dremio Source Options

Further Dremio Source Options

Finally, we declare a folder or a file to contain a data (table) and propose a format.

Mount Folder as Table

Choose Source Format

Afterwards, we can immediately preview the data which describes individual taxi trips in New York City.

Dremio Parquet Preview

Add a (Internal, Classic, Small) Data Source

Now, let us add another source, this time to a "local" database.

Dremio Add DB Source

Dremio PostgreSQL Source

Again, you can browse through all available tables

Dremio Browse Tables

and preview the content of the table which describes the weather conditions on individual days in New York City.

Dremio Table Preview

Transform and Publish Federated Views

To publish a special analysis which we will perform to bring together the (public) taxi trips and the (private) weather conditions, we first create a new output space.

Dremio Add Space

In that space, we will then create a view which uses SQL to mangle the two sources

CREATE OR REPLACE VIEW "New York"."TripWeather" AS 
    SELECT trips.*, weather.*
      FROM "New York Taxi Trips"."samples.dremio.com"."NYC-taxi-trips" AS trips
      JOIN "New York City Weather".public."nyc_weather" AS weather
        ON (weather."date" = DATE_TRUNC('DAY',trips."pickup_datetime"))

Dremio Create View

Since the resulting view will be used in analytic settings, we may specify "dimensional" reflections (i.e. pre-aggregated cache tables) which will speedup any query afterwards. Note that reflections will be refreshed (upon some settings) when the sources change.

Dremio Add Reflection

Finally, we can make a correlation analysis between the maximal temperature at a certain day and the average cost (including tips) of the taxi trips. There is a maximum at 68.0 degree fahrenheit.

SELECT 
    FLOOR("tempmax"/2.0)*2.0 AS "TEMP", 
    SUM("total_amount")/COUNT(*) AS AVG_INCOME 
  FROM "New York".TripWeather 
 GROUP BY FLOOR("tempmax"/2.0)
 ORDER BY FLOOR("tempmax"/2.0) ASC

Products and Resources

Products

Resources

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages