Best File Formats for Loading Data into Snowflake on Cloud (October 31, 2019)


Snowflake Load Performance and File Formats

Snowflake supports multiple file formats for loading data, including CSV, JSON, AVRO, ORC, PARQUET, and XML. For our benchmarking, we considered only CSV, AVRO, PARQUET, and ORC. Since our core objective was to migrate traditional warehouses which are flat in nature, it did not make sense to use JSON or XML.

All file formats were used to load the same amount of data (number of records) with the same amount of resources from the same S3 bucket. Since this is a relative comparison, the actual data size, cluster size, etc. do not matter. The diagram below shows the load time comparison across the file formats.

CSV is the winner, but Avro is a close second. Frankly, we were expecting Avro to perform better than CSV, but it was a bit slower. The only explanation we can think of is the additional cast operations (variant to target type) that you need to perform in the case of Avro.
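To make the cast overhead concrete, here is a minimal sketch of the two load styles using snowflake-connector-python. The stage, table, and column names are made up for illustration; only the shape of the SQL matters.

```python
# Sketch: an Avro load pulls each field out of the VARIANT and casts it to the
# target type, while a CSV load maps columns positionally with no casts.
# Stage/table names and credentials are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)

AVRO_COPY = """
COPY INTO customer
FROM (
    SELECT $1:id::NUMBER,          -- each field is extracted from the VARIANT
           $1:name::VARCHAR,       -- and cast to the target column type
           $1:signup_date::DATE
    FROM @my_stage/customer_avro/
)
FILE_FORMAT = (TYPE = 'AVRO')
"""

CSV_COPY = """
COPY INTO customer
FROM @my_stage/customer_csv/
FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"')
"""

cur = conn.cursor()
cur.execute(AVRO_COPY)   # per-field casts -> slightly slower in our tests
cur.execute(CSV_COPY)    # positional mapping, no casts
cur.close()
conn.close()
```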

Parquet and ORC are 150 to 200% slower. Again, this is just a guess, but it is probably because even though Snowflake is a columnar store, its load function accesses the data row by row. Since Parquet and ORC are not well suited to row-based access, the load is slower.

Before finally deciding on the file format that works best for your requirement, you might want to consider these factors as well:

  1. File Size – For each file format, we found that the best load performance was with 80 MB file chunks. It may vary in your case. Per Snowflake recommendations, files should be between 10 and 100 MB for best performance.
  2. Complex and binary data types – If you have complex or binary (varbyte, BLOB, etc.) data types to migrate, you will have to apply proper encoding for CSV. Check the Snowflake documentation for supported encodings (hex worked best in our tests). Avro is more straightforward in this case because it does not require any encoding.
  3. Validation errors – If you want to capture load error messages and records using Snowflake’s validation function, then CSV is the only option (see the sketch below). That’s because all other formats load data into variant columns, on which you need to apply conversion functions when loading from stage to target. These conversions are treated as transformations, and loads with transformations are not supported by validation.
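As a sketch of the validation point, this is roughly how rejected records can be pulled back after a plain CSV COPY (no transformations). Table, stage, and connection details are placeholders.

```python
# Sketch: capture rejected records after a plain CSV COPY, the scenario in
# which Snowflake's VALIDATE table function can be used.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="my_user", password="...")
cur = conn.cursor()

cur.execute("""
    COPY INTO customer
    FROM @my_stage/customer_csv/
    FILE_FORMAT = (TYPE = 'CSV')
    ON_ERROR = 'CONTINUE'            -- keep loading, collect errors per record
""")

# Return the records that were rejected by the most recent COPY into this table.
cur.execute("SELECT * FROM TABLE(VALIDATE(customer, JOB_ID => '_last'))")
for rejected in cur.fetchall():
    print(rejected)                  # error message, file, line, rejected record

cur.close()
conn.close()
```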

Recap on Findings for Snowflake Load

Based on our experience, we recommend that CSV and Avro should be the preferred formats for loading data into Snowflake. Even if you are planning to keep a copy of data on object storage (S3, etc.) in ORC or Parquet format for direct query, it would be advisable to create another temporary copy in CSV or Avro just for Snowflake load.

We hope that these findings are helpful in your efforts to migrate data from your on-premise data warehouses to Snowflake. Bitwise offers a full range of cloud migration services to move your data and applications to the cloud efficiently, including a framework that helps expedite migration from legacy database applications to Snowflake by combining best-of-breed technologies from the leading cloud providers with Bitwise-built utilities that fill the gaps in cloud provider tools and services.

Contact us to learn more about how we can accelerate your migration efforts.

Practical Challenges of Large-Scale Cloud Data Migrations (October 24, 2019)


Observed Gaps in Data Migration Tools

When you are assessing these tools, everything looks straightforward, but once you start large-scale implementations you may begin to observe some gaps. Some of these gaps can impact the timeline, effort, and overall quality of your migration. Below is a list of the gaps that we observed. Have a look and decide whether to include them in your assessment phase itself so that your implementation runs smoothly, or at least so that you are better prepared.

1. Throttling and Elasticity

You need to apply throttling to data extraction processes because of the limited resources available on your source database. Since you are migrating away from it, you don’t want to spend more to add resources to the source system, and at the same time you don’t want your data migration processes to go all guns out on the source database and impact the performance of your production applications and loads. You can control this by limiting the parallelism of the extraction processes, but that’s not an optimal solution. Most warehouses have a variable load throughout the day, which means you shouldn’t be running a constant number of processes all the time.

To optimally use the source database, you need a data migration tool that senses the load on the source database and throttles the data extraction processes accordingly, scaling them up or down. In practice, we had to create scripts to throttle the extraction processes, as none of the tools provided such a feature.
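A rough sketch of the kind of script we mean follows. The load check, thresholds, and extraction function are illustrative stand-ins, not a particular tool’s API.

```python
# Rough sketch of source-aware throttling: check how busy the source database
# is and size the pool of concurrent extraction workers accordingly.
import random
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 16
MIN_WORKERS = 2

def get_source_load() -> float:
    """Placeholder: in practice, query a monitoring view on the source DB
    (active sessions, CPU, etc.) and normalize to 0.0-1.0. Simulated here."""
    return random.random()

def run_extraction(table: str) -> None:
    """Placeholder for the real extraction job (e.g. unload a table to S3)."""
    time.sleep(1)
    print(f"extracted {table}")

def throttled_extract(tables: list) -> None:
    pending = list(tables)
    while pending:
        load = get_source_load()
        # Fewer workers when the source is busy, more when it is idle.
        workers = max(MIN_WORKERS, int(MAX_WORKERS * (1.0 - load)))
        batch, pending = pending[:workers], pending[workers:]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(run_extraction, batch))
        time.sleep(30)  # re-check source load before the next batch

# throttled_extract([f"table_{i}" for i in range(40)])
```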

2. Efficient Data Validation

Data validation provides assurance that data is loaded without any consistency, truncation, formatting, or data type issues. If your source and target are different databases, it becomes all the more important to perform data validation, and you can’t validate thousands of tables manually. In our opinion it is always good to perform full data validation; however, it depends on how critical the data is. We have observed the following issues with tools regarding data validation:

  1. Many tools don’t provide any data validation features at all, and those that do provide only limited validation (as explained in the points below).
  2. They could validate data only if tables had primary keys.
  3. Even the tools that can validate on the primary key perform so slowly that the feature is effectively unusable, and furthermore they double the load on the source database.

The right tool should provide efficient data validation without putting extra load on the source database, irrespective of whether the table has a primary key or not.
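One approach we have found practical (not a feature of any particular tool) is to compare row counts and per-column aggregates computed on each side, which works without a primary key and requires only one scan per table per side. A minimal sketch, with the checks and query helper left as placeholders:

```python
# Sketch: lightweight source/target comparison using row counts and column
# aggregates, so validation works without a primary key and adds only one
# scan on each side. The checks and run_query helpers are placeholders.

CHECKS = {
    "row_count": "SELECT COUNT(*) FROM {table}",
    "amount_sum": "SELECT SUM(amount) FROM {table}",
    # A column-level fingerprint; the exact aggregate differs per database
    # (e.g. a hash aggregate where the engine supports one).
    "name_len_sum": "SELECT SUM(LENGTH(name)) FROM {table}",
}

def profile(run_query, table: str) -> dict:
    """run_query is a callable that executes SQL and returns a scalar."""
    return {name: run_query(sql.format(table=table)) for name, sql in CHECKS.items()}

def validate(run_source_query, run_target_query, table: str) -> list:
    src = profile(run_source_query, table)
    tgt = profile(run_target_query, table)
    return [f"{table}.{name}: source={src[name]} target={tgt[name]}"
            for name in CHECKS if src[name] != tgt[name]]
```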

3. Automatic Configuration and Scheduling

While most tools provide certain features to configure and schedule data migration, it still takes a considerable amount of manual effort. Imagine that you have thousands of tables, and for each table you need to manually set up a pipeline, specify primary keys (for validation), a watermark column (for CDC), load modes (based on partitions), and so on, and then schedule these tables at particular times or priorities. It would be nice if the tool could derive these configurations automatically; otherwise, you again need to create custom scripts to automate these things.
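The custom scripts we ended up writing looked roughly like the sketch below: read primary keys and candidate watermark columns from the source catalog and emit a per-table configuration. The information_schema queries are generic and will need adapting to your specific source database.

```python
# Sketch: generate per-table pipeline configuration from the source catalog
# instead of entering it by hand. Queries are generic information_schema SQL
# and may need adjusting per source database; run_query is a placeholder.

PK_SQL = """
SELECT kcu.table_name, kcu.column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON tc.constraint_name = kcu.constraint_name
WHERE tc.constraint_type = 'PRIMARY KEY'
"""

WATERMARK_SQL = """
SELECT table_name, column_name
FROM information_schema.columns
WHERE data_type IN ('timestamp', 'timestamp without time zone', 'datetime')
"""

def build_configs(run_query) -> dict:
    """run_query executes SQL and returns rows as (table, column) tuples."""
    pks, watermarks = {}, {}
    for table, column in run_query(PK_SQL):
        pks.setdefault(table, []).append(column)
    for table, column in run_query(WATERMARK_SQL):
        watermarks.setdefault(table, column)   # first timestamp column as candidate
    return {
        table: {
            "primary_key": pks.get(table, []),          # used for validation/upserts
            "watermark_column": watermarks.get(table),  # used for CDC
            "load_mode": "upsert" if table in pks else "append",
        }
        for table in set(pks) | set(watermarks)
    }

# configs = build_configs(my_run_query)  # my_run_query: your DB query helper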

4. Fault Tolerance and Restartability

Most tools work well when they are on a smooth path and nothing is failing, but things become very difficult to manage when many processes start to fail. We have observed the following issues with fault tolerance and restartability:

  1. On restart, the tool does not resume from the point of failure but starts again from extraction.
  2. There were instances where duplicate records were loaded on restart. These issues required a lot of manual hand-holding whenever a load failed.
  3. It was very hard or impossible to get a detailed error message pinpointing the cause of a failure; only generic messages were available.

The right tool should perform automatic restart from the point of failure and it should ensure that there is no duplicate loading of records.
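One generic pattern that achieves this (again, not a feature of any specific tool) is chunk-level checkpointing: record each successfully loaded chunk in a control store and skip it on rerun. A minimal sketch, using SQLite purely as an illustrative control store:

```python
# Sketch: chunk-level checkpointing so a rerun resumes at the failed chunk and
# never reloads a chunk that already succeeded. SQLite is used as the control
# store only for illustration; extract_and_load() is a placeholder.
import sqlite3

def extract_and_load(table: str, chunk_id: int) -> None:
    print(f"loading {table} chunk {chunk_id}")   # placeholder for the real work

def run_with_checkpoints(table: str, chunk_ids: list) -> None:
    ctl = sqlite3.connect("migration_control.db")
    ctl.execute("CREATE TABLE IF NOT EXISTS done (tbl TEXT, chunk INTEGER, "
                "PRIMARY KEY (tbl, chunk))")
    completed = {row[0] for row in
                 ctl.execute("SELECT chunk FROM done WHERE tbl = ?", (table,))}
    for chunk in chunk_ids:
        if chunk in completed:
            continue                     # already loaded in a previous run
        extract_and_load(table, chunk)
        ctl.execute("INSERT INTO done VALUES (?, ?)", (table, chunk))
        ctl.commit()                     # checkpoint only after a successful load
    ctl.close()
```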

5. Upserts for CDC

Data syncing is almost always required, since you want to continue running the source system until your target system is stable enough. Change data capture (CDC) is needed for data syncing, and this feature is available in many tools. However, many tools don’t support updates/upserts on the target for data syncing; they only support append. In data warehouses there are often a large number of dimension tables, which are typically loaded with upserts, so we had to develop extra scripts to perform the upserts. The right tool should have options to configure tables for append/upsert/replace mode for CDC.
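The extra scripts we mention were essentially MERGE statements. A simplified example of applying one CDC batch to a dimension with upsert semantics is sketched below; table and column names are made up, the exact MERGE syntax varies slightly between target warehouses, and run_sql is a placeholder for your database helper.

```python
# Sketch: apply a staged CDC batch to a dimension table with upsert semantics.
# Names are illustrative; MERGE syntax differs a little across targets.

MERGE_SQL = """
MERGE INTO dim_customer AS tgt
USING stage_customer_changes AS src
   ON tgt.customer_id = src.customer_id
WHEN MATCHED AND src.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET
     tgt.name = src.name,
     tgt.email = src.email,
     tgt.updated_at = src.updated_at
WHEN NOT MATCHED AND src.op <> 'D' THEN INSERT
     (customer_id, name, email, updated_at)
     VALUES (src.customer_id, src.name, src.email, src.updated_at)
"""

def apply_cdc_batch(run_sql) -> None:
    run_sql(MERGE_SQL)                                # upsert + delete in one pass
    run_sql("TRUNCATE TABLE stage_customer_changes")  # clear the applied batch
```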

6. Generate/Respond to Events for Integration

Sometimes data migration and syncing is just one part of your overall pipeline. You may have an upstream process that should trigger the data syncing process, or you may want to trigger or notify a downstream process once data syncing is done. As trivial as it may sound, many tools do not provide such integration endpoints, nor can they be integrated with an external scheduler.
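The kind of hook we mean can be as small as the sketch below: notify a downstream process over HTTP when a sync batch finishes. The endpoint is a made-up placeholder; in practice it might just as easily be an SNS publish, a queue message, or a scheduler API call.

```python
# Sketch: notify a downstream process when a sync batch completes.
# The webhook URL is a hypothetical placeholder.
import json
import urllib.request

def notify_downstream(batch_id: str, status: str) -> None:
    payload = json.dumps({"batch_id": batch_id, "status": status}).encode()
    req = urllib.request.Request(
        "https://example.internal/hooks/sync-complete",   # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

# notify_downstream("customer_2019_10_24", "succeeded")
```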

7. Data Type Support

Depending on your source and target databases, there may be some tricky data types that do not have a direct mapping in the target database. Many tools also have limitations in data type support per source or target database. You should run a metadata query on your source database and make sure that all of its data types are supported.
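The metadata query we mean looks roughly like this against a generic information_schema; the exact catalog views differ per database, and the supported-type list is purely illustrative.

```python
# Sketch: inventory the data types in use on the source so unsupported or
# unmapped types surface before the migration starts. Written against a
# generic information_schema; catalog views differ per source database.

TYPE_INVENTORY_SQL = """
SELECT data_type, COUNT(*) AS column_count
FROM information_schema.columns
WHERE table_schema NOT IN ('information_schema')
GROUP BY data_type
ORDER BY column_count DESC
"""

# Types we already know how to map to the target; anything else needs a decision.
SUPPORTED = {"integer", "bigint", "numeric", "character varying", "date", "timestamp"}

def unsupported_types(run_query) -> list:
    """run_query executes SQL and returns (data_type, column_count) rows."""
    return [(dtype, count) for dtype, count in run_query(TYPE_INVENTORY_SQL)
            if dtype.lower() not in SUPPORTED]
```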

Closing the Migration Tool Gaps

We hope that our experiences with large-scale cloud data migration initiatives leave you better equipped to assess your migration tool. At Bitwise, we realized that these gaps were making the process of migrating on-premise data to the cloud difficult for our customers, so we developed utilities and accelerators to streamline cloud data migrations.

Learn more about our cloud migration experiences and the solutions we developed to close the gaps with available tools.

Reduce Data Latency and Refine Processes with Hadoop Data Ingestion (May 18, 2016)


Hadoop data ingestion comes with challenges such as:

  1. There can be many different source types, such as OLTP systems generating events, batch systems generating files, RDBMS systems, web-based APIs, and more.
  2. Data may arrive in different formats, such as ASCII text, EBCDIC and COMP fields from mainframes, JSON, and Avro.
  3. Data often needs to be transformed before persisting on Hadoop. Common transformations include data masking, converting data to a standard format, applying data quality rules, encryption, etc. (see the sketch after this list).
  4. As more and more data is ingested into Hadoop, metadata plays an important role. There is no point in having large volumes of data without knowing what is available. Discovering data and its key aspects like format, schema, owner, refresh rate, source, and security policy should be kept simple and easy. Features like custom tagging, a data set registry, and a searchable repository can make life much easier. The need of the hour is a data set registry and data governance tool that can communicate with the data ingestion tool to pass and use this metadata.
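To make the transformation point concrete, here is a tiny sketch of masking a sensitive field as records flow through an ingestion step, before anything lands on the Hadoop side. Field names and the hashing choice are illustrative, and the actual write to HDFS/Hive is left out because it depends on your cluster setup.

```python
# Tiny sketch of an on-the-fly ingestion transformation: mask a sensitive
# field before the record is persisted to the landing zone. Field names and
# the hashing choice are illustrative only.
import hashlib
from typing import Iterable, Iterator

def mask_ssn(records: Iterable[dict]) -> Iterator[dict]:
    for record in records:
        ssn = record.get("ssn", "")
        # Replace the raw value with a one-way hash so it never lands on disk.
        record["ssn"] = hashlib.sha256(ssn.encode()).hexdigest()[:16]
        yield record

incoming = [{"id": 1, "name": "Jane", "ssn": "123-45-6789"}]
for masked in mask_ssn(incoming):
    print(masked)          # hand off to the HDFS/Hive writer in a real pipeline
```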

At present, there are many tools available for ingesting data into Hadoop. Some are good for specific use cases: for example, Apache Sqoop is a great tool to export/import data from RDBMS systems, Apache Falcon is a good option for a data set registry, and Apache Flume is preferred for ingesting real-time event streams, with many commercial alternatives as well. A few of the available tools are general purpose, like Spring XD (now Spring Cloud Data Flow) and Gobblin. The selection of options can be overwhelming, and you certainly need the right tool for your job.

But none of these tools solves all the challenges, so enterprises have to use multiple tools for data ingestion. Over time they also create custom tools or wrappers on top of existing tools to meet their needs. Furthermore, all of these tools use text-based configuration files (mostly XML), which are not very convenient or user-friendly to work with. All of this results in a lot of complexity and overhead in maintaining data ingestion applications.

Looking at these gaps, and to enable our clients to streamline Hadoop adoption, Bitwise has developed a GUI-based tool for data ingestion and transformation on Hadoop. With a convenient drag-and-drop GUI, it enables developers to quickly develop end-to-end data pipelines from a single tool. Apart from multiple source and target options, it also has many pre-built transformations, ranging from the usual data warehousing operations to machine learning and sentiment analysis. The tool is loaded with the following data ingestion features:

  • Pluggable Sources and Targets – As new source and target systems emerge, it’s convenient to integrate them with the ingestion framework
  • Scalability – It’s scalable to ingest huge amounts of data at high velocity
  • Masking and Transforming On The Fly – Transformations like masking and encryption can be applied on the fly as data moves through the pipeline
  • Data Quality – Data quality checkpoints can be applied before data is published
  • Data Lineage and Provenance – Detailed data lineage and provenance can be tracked
  • Searchable Metadata – Datasets and their metadata are searchable, with the option to apply custom tags

Bitwise’s Hadoop data ingestion and transformation tool can save enormous effort in developing and maintaining data pipelines. Stay tuned for subsequent posts that explore the other phases of the data value chain.
