Schema-on-Write vs Schema-on-Read

Since the inception of Relational Databases in the 70’s, schema on write has be the defacto procedure for storing data to be analyzed. However recently there has been a shift to use a schema on read approach, which has led to the exploding popularity of Big Data platforms and NoSQL databases.
Any data management system belongs to one of two types:

Schema-on-write:

Probably a lot of you have already worked with relational databases and you understand that once we have configured the schemas, created the tables, we can begin to ingest the data. Remember just because the data is structured doesn’t mean it starts out that way. It is likely to be something like bulk upload data from a text or csv file whose structure we know in advance because it somehow matches the schema of the tables, and once the data is loaded into the table, we can begin to execute analytical queries on our tables. This is the schema-on-write, the approach in which we define the columns, data format, relationships of columns, etc. before the actual data can be injected.

Here, we can't inject data until the table is created and we can't create tables until we understand the schema of the data that will be in this table. This is impossible until we understand the entities that this data represents to correctly reflect their relationships in the tables. This also leads to problems with changing the data. For example, a source text file has changed, or someone has added data or changed the column type. Then we need to drop the table and load all the data again. This is normal when we are talking about a small amount of data, it is normal if there are no foreign keys. But if there are external keys, and we have, for example, 800 GB of data? That would be a real problem, and with that approach, it would take days to make such simple changes. Also, the downside of the strictly defined schema is that the data has been modified and structured for a specified limited purpose and cannot be reused for future uses that we do not know yet.

Schema-on-read:

With the problems with traditional relational database methods and growing data volumes, another approach was born. Here we upload data as it arrives without any changes or transformations. It is a schema-on-read, it has fast data ingestion because data shouldn't follow any internal schema. it's just copying/moving files. This type of data handling is more flexible in case of big data, unstructured data, or frequent schema changes. So if we analyze the data and try to understand its structure and figure out another way to interpret it, we can simply change the code that is handling the data. We do not need to change the schemas and reload all the data in the data storage. But since the data does not go through strict ETLs and transformation into strict data storage schemas, there can be a lot of missing or invalid data, duplicates, and many other problems that can lead to inaccurate or incomplete query results. However, you should understand that some level of schema design is inevitable. One way or another, it is necessary both for understanding the data structure in order to search for information in the data, to check the incoming data, and to manage the data. But not all data falls under this process.

As always, there is no silver bullet. Like most things in engineering / development, it depends on the use case. Is the workload mostly data supporting a dashboard where the results need to be fast and repetitive? It’s going to need to use a schema on write database. Will there be a lot of unknowns with the data and constant new sources? Sounds like a schema on read will work. e.t.c.

Key Differences: Since schema on read allows for data to be inserted without applying a schema should it become the defacto database? No, there are pros and cons for schema on read and schema on write. For example when structure of the data is known schema on write is perfect because it can return results quickly. See the comparison below for a quick overview:


If you still have questions or just want to chat about Tech stuffs, contact me and i will be glad to help!.

Comments

Popular posts from this blog

AWS: Benefits of using Amazon S3

Google To Acquire Looker For $2.6 Billion

Python - GUI - Tkinter(Bar & Pie Chart)