Traditional ETL vs ELT on Hadoop


ETL

ETL stands for Extract, Transform and Load. The ETL process typically extracts data from the source/transactional systems, transforms it to fit the model of the data warehouse, and finally loads it into the data warehouse.

The transformation step involves cleansing, enriching and reshaping the data to produce the desired output.

Data is usually dumped to a staging area after extraction. In some cases, the transformations might be applied on the fly and the data loaded into the target system without an intermediate staging area.

The diagram below illustrates a typical ETL process.

ETL Process

The development process usually starts from the output and works backward, as the data model for the target system (i.e. the data warehouse) is predefined.

Since the data model for the data warehouse is predefined, only the relevant and important data is pulled from the source system and loaded into the data warehouse.
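The overall pattern can be illustrated with a short, hypothetical sketch, expressed in PySpark purely for illustration rather than as a representation of any particular ETL tool; the connection strings, table names and column names below are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: pull only the columns the warehouse model needs (placeholder JDBC source).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://source-host:5432/sales")  # hypothetical source
          .option("dbtable", "orders")
          .option("user", "etl_user").option("password", "***")
          .load()
          .select("order_id", "customer_id", "amount", "order_date"))

# Transform: cleanse and enrich to fit the warehouse model.
transformed = (orders
               .dropna(subset=["order_id", "amount"])
               .withColumn("order_year", F.year("order_date")))

# Load: write into the warehouse target table (placeholder JDBC target).
(transformed.write.format("jdbc")
 .option("url", "jdbc:postgresql://dwh-host:5432/warehouse")  # hypothetical target
 .option("dbtable", "fact_orders")
 .option("user", "etl_user").option("password", "***")
 .mode("append")
 .save())
```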

Advantages of ETL Process

  • Ease of development: Since the process usually involves developing from the output backward and loading just the relevant data, the complexity and time involved in development are reduced.
  • Process maturity: This process has been the norm for data warehouse development for over two decades. The ETL process is quite mature, with multiple production implementations and well-defined best practices and processes.
  • Tools availability: A large number of tools that implement ETL are available, which provides flexibility in choosing the most appropriate tool.
  • Availability of expertise: The decades of existence and extensive adoption of the ETL process have ensured an abundant supply of ETL experts.

Disadvantages of ETL Process

  • Limited flexibility: The ETL process loads only the data identified as important at design time. If an additional data attribute is needed, or a new attribute is introduced in the source system, the entire ETL routine has to be updated and re-engineered. This adds to the time and cost of developing and maintaining the ETL process.
  • Hardware: Most ETL tools come with their own hardware requirements. They have proprietary execution engines that do not use the existing data warehouse hardware, which leads to additional costs.
  • Cost: The maintenance, hardware and licensing costs of the ETL tools add to the total cost of operating and maintaining the ETL process.
  • Limited to relational data: Traditional ETL tools are mostly limited to processing relational data. They are unable to process semi-structured and unstructured data such as social media feeds, log files, etc.

ELT

ELT stands for Extract, Load, and Transform.

As opposed to loading just the transformed data into the target system, the ELT process loads the entire source data set into the data lake.

This results in faster load times. Optionally, the load process can also apply some basic validation and data cleansing rules.

The data is then transformed for analytical reporting on demand. Though the ELT process has been in practice for some time, it has only become popular with the rise of Hadoop.

The diagram below illustrates a typical ELT process on Hadoop.

ELT Process on Hadoop
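A minimal sketch of this pattern, assuming Spark on Hadoop with placeholder HDFS paths and column names: land the full source extract in the data lake first, then derive analytical outputs on demand.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Load: land the full source extract in the data lake as-is (placeholder paths).
raw = spark.read.option("header", True).csv("hdfs:///landing/sales/orders/")
raw.write.mode("overwrite").parquet("hdfs:///datalake/raw/orders/")

# Transform on demand: derive what the analysts need when they need it.
spark.read.parquet("hdfs:///datalake/raw/orders/").createOrReplaceTempView("raw_orders")
monthly = spark.sql("""
    SELECT customer_id,
           date_format(order_date, 'yyyy-MM') AS order_month,
           SUM(amount) AS total_amount
    FROM raw_orders
    GROUP BY customer_id, date_format(order_date, 'yyyy-MM')
""")
monthly.write.mode("overwrite").parquet("hdfs:///datalake/curated/monthly_sales/")
```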

Advantages of ELT Process

  • Separation of concerns: The ELT process separates the loading and transformation tasks into independent blocks, minimizing the interdependencies between them. This makes project management easier, as the project can be broken down into manageable chunks, and it minimizes risk because a problem in one area does not affect the other.
  • Flexible and future-proof: In an ELT implementation, the entire data set from the source systems is already available in the data lake. This, combined with the isolation of the transformation process, makes it far easier to incorporate future requirements into the warehouse structure.
  • Utilizes existing hardware: Hadoop uses the same hardware for storage and for processing, which helps cut down additional hardware costs.
  • Cost-effective: All of the points above, combined with the open-source Hadoop framework, considerably cut the cost of operating and maintaining the ELT process.
  • Not limited to relational data: With Hadoop, ELT processes can handle semi-structured and unstructured data.

Disadvantages of ELT Process

  • Process maturity: Though the ELT process has been around for a while, it has not been widely adopted. It is, however, gaining popularity and adoption with the rise of Hadoop, and collaboration across the industry on ELT best practices is increasing.
  • Tools availability: As a result of limited adoption, the number of tools available to implement ELT processes on Hadoop is currently limited. One tool aimed at overcoming this limitation is Hydrograph, which was created specifically for developing ELT processes in the big data ecosystem.
  • Availability of expertise: The limited adoption of ELT also affects the availability of ELT experts; experts in ELT on Hadoop are currently scarce. However, this is changing fast, as the popularity and adoption of Hadoop, and of ELT on Hadoop, increase the number of people working with these technologies.

The Way Forward

Though the ETL process and traditional ETL tools have served data warehouse needs well, the changing nature of data and its rapidly growing volume have stressed the need to move to Hadoop.

Apart from the obvious benefits of cost-effectiveness and scalability of Hadoop, ELT on Hadoop provides flexibility in the data processing environment.

Transitioning from traditional ETL tools and traditional data warehouse environments to ELT on Hadoop is a big challenge – a challenge almost all enterprises are currently facing.

Apart from being a change in environment and technical skillset, it requires a change in mindset and approach.

ELT is not as simple as rearranging the letters. On one hand, you have developers with years of ETL tool experience and business knowledge; on the other, you have the long-term benefit of moving to ELT on Hadoop.

Training the existing workforce, which is conversant with drag-and-drop GUI-based tools, to work in Java programming is a time-consuming challenge.

In order to bridge this technology gap, Bitwise contributed to the development of Hydrograph, an open-source ELT tool on Hadoop.

Hydrograph

Hydrograph is a desktop-based ELT tool with drag-and-drop functionality for creating data processing pipelines, like any legacy ETL tool. The biggest differentiator for Hydrograph, however, is that it is built solely for ELT on the Hadoop ecosystem (including engines such as Spark and Flink).

Hydrograph has a gentle learning curve for existing ETL developers, which enables enterprises to quickly migrate to ELT processing on Hadoop or Spark. Hydrograph’s plug-and-play architecture makes the data processing pipelines independent of the underlying execution engine, making the ETL processes obsolescence-proof.

To learn more about Hydrograph, check out our on-demand webinar. If you are ready to take a deeper dive, access Hydrograph on GitHub now.

Empower your Data and Ensure Continuity of Operations with Hadoop Administration


Planning

A Hadoop administration team’s responsibilities start when a company kicks off its Hadoop POC. An experienced team like Bitwise lays out a roadmap right at the beginning to help scale from POC to production with minimal waste of the initial investment, along with effective guidance on investment decisions, whether for in-house infrastructure, the POC itself or PaaS options.

For any organization, understanding the estimated investment is essential in the initial phases. Capacity planning and estimation is the next step after successful completion of the POC. Choosing the right combination of storage and compute hardware, network interconnect, operating system, storage configuration/disk performance and network setup plays an important role in overall cluster performance. Similarly, special consideration is required for the master and worker node hardware configuration. The right balance of needs versus wants comes only with years of implementation experience.

Deployment

Once you have the hardware defined and in place, the next stage is the planning and deployment of the Hadoop cluster. This involves configuring the OS with the recommended changes to suit the Hadoop stack, configuring SSH and disks, choosing and installing a Hadoop distribution (Cloudera, Hortonworks, MapR or Apache Hadoop) as per the requirements, and meeting the configuration requirements of the Hadoop daemons for optimized performance. All of these setups vary based on the size of your cluster, so it is imperative to configure and deploy only after covering all the aspects and prerequisites.

Another important aspect is designing the cluster from a development perspective, covering the various environments (Dev, QA, Prod, etc.), and from a usage perspective, i.e. access security and data security.

Managing a Hadoop Cluster

After implementation of the Hadoop cluster, the Hadoop admin team needs to maintain the health and availability of the cluster around the clock. Common tasks include managing the name node, data nodes, HDFS and MapReduce jobs, which form the core of the Hadoop ecosystem. Impact to any of these components can degrade cluster performance. For example, the unavailability of a data node, say due to a network issue, will cause HDFS to re-replicate the under-replicated blocks, which brings a lot of overhead and can slow the cluster down, or even make it inaccessible if multiple data nodes disconnect.
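A simple way to keep an eye on exactly these symptoms is to script a periodic check around the standard admin CLI. The sketch below is a minimal example, assuming the `hdfs` command is on the path and the user has admin rights; the exact wording of the report varies across Hadoop versions, and the alerting is a placeholder.

```python
import re
import subprocess

def hdfs_health_summary():
    """Run `hdfs dfsadmin -report` and pull out dead-node and replication counters."""
    report = subprocess.run(
        ["hdfs", "dfsadmin", "-report"],
        capture_output=True, text=True, check=True
    ).stdout

    dead = re.search(r"Dead datanodes\s*\((\d+)\)", report)
    under = re.search(r"Under replicated blocks:\s*(\d+)", report)
    return {
        "dead_datanodes": int(dead.group(1)) if dead else 0,
        "under_replicated_blocks": int(under.group(1)) if under else 0,
    }

if __name__ == "__main__":
    summary = hdfs_health_summary()
    # Placeholder alerting: in practice this would feed a monitoring system.
    if summary["dead_datanodes"] or summary["under_replicated_blocks"]:
        print("WARNING:", summary)
    else:
        print("Cluster healthy:", summary)
```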

The name node is another important component of a Hadoop cluster and acts as a single point of failure. Consequently, it is important that backups of the fsimage and edit logs are taken periodically, using the secondary name node's checkpoints, so that the cluster can recover from a name node failure. Other administrative tasks include:

  • Managing HDFS quotas at the application or user level (see the sketch after this list)
  • Configuring the scheduler (FIFO, Fair or Capacity) and resource allocation for different services such as YARN, Hive, HBase, HDFS, etc.
  • Upgrading and applying patches
  • Configuring logging for effective debugging in case of failures or performance issues
  • Commissioning and decommissioning nodes
  • User management
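As a small illustration of the quota task above, space and name quotas are applied with `hdfs dfsadmin`; the paths and limits in this sketch are placeholders.

```python
import subprocess

def set_user_quota(hdfs_path: str, space_quota: str = "500g", name_quota: int = 100000):
    """Apply a space quota (raw storage) and a name quota (file/directory count)
    to a user or application directory. The values here are illustrative placeholders."""
    subprocess.run(["hdfs", "dfsadmin", "-setSpaceQuota", space_quota, hdfs_path], check=True)
    subprocess.run(["hdfs", "dfsadmin", "-setQuota", str(name_quota), hdfs_path], check=True)

# Example: cap a project directory (hypothetical path).
set_user_quota("/user/analytics/project_x", space_quota="2t", name_quota=500000)
```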

Hardening your Hadoop Cluster

Productionization of a Hadoop cluster mandates implementation of hardening measures. Hardening of Hadoop typically covers:

  1. Configuring Security: This is one of the most crucial configurations required to make your cluster enterprise-ready, and it can be classified at the user and data levels (illustrated in the sketch after this list).
    1. User Level: User security addresses the authentication (who am I) and authorization (what can I do) parts of the security implementation, along with configuring access control over resources. Kerberos handles the authentication protocol between client/server applications and is commonly synced with LDAP for easier management. Different distributions recommend different authorization mechanisms. For example, Cloudera integrates well with Sentry, which provides fine-grained authorization for Hive and Impala; further integration with HDFS ACLs extends the same access controls to other services such as Pig and HBase.
    2. Data Level: HDFS transparent encryption provides another layer of security for data at rest. For some organizations this is mandatory in order to comply with various government and financial regulations, and having transparent encryption built into HDFS makes that compliance easier.
  2. High Availability: The name node, as mentioned earlier, is a single point of failure, and its unavailability makes the whole cluster unavailable, which is not acceptable for a production cluster. Name node HA mitigates this risk with a standby node that automatically takes over from the primary name node in case of failure.
  3. Name Node Scaling: This is mostly applicable to large clusters. Since the name node keeps the namespace in memory, with a large volume of files its memory can become a bottleneck. HDFS federation helps resolve this by allowing multiple name nodes, each managing a part of the HDFS namespace.
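The sketch below shows what the user-level and data-level controls look like in practice, assuming the relevant encryption key already exists in the Hadoop KMS; the directory names, group name and key name are placeholders.

```python
import subprocess

def run(cmd):
    """Thin wrapper so each hardening step is logged before it runs."""
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

# User level: grant the analytics group read/execute on a shared dataset via HDFS ACLs.
run(["hdfs", "dfs", "-setfacl", "-m", "group:analytics:r-x", "/data/shared/customer"])
run(["hdfs", "dfs", "-getfacl", "/data/shared/customer"])

# Data level: create an encryption zone for data at rest
# (assumes a key named 'pii_key' already exists in the Hadoop KMS).
run(["hdfs", "dfs", "-mkdir", "-p", "/data/secure/pii"])
run(["hdfs", "crypto", "-createZone", "-keyName", "pii_key", "-path", "/data/secure/pii"])
```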

Monitoring

Proactive monitoring is essential to maintain the health and availability of the cluster. General monitoring tasks include watching cluster nodes and networks for CPU, memory and network bottlenecks, among other things. The Hadoop administrator should be able to track the health of the system, monitor workloads and work with the development team to implement new functionality. Failure to do so can severely impact the health of the system and the quality of the data, and ultimately affects business users' ease of access and decision-making capability.
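Part of this can be automated by polling the name node's JMX servlet for filesystem metrics. The sketch below assumes network access to the name node web UI; the hostname is a placeholder, the port differs between Hadoop 2.x (50070) and 3.x (9870), and attribute names can vary slightly across versions.

```python
import requests

NAMENODE_JMX = "http://namenode.example.com:9870/jmx"  # hypothetical host; 50070 on Hadoop 2.x

def namenode_fs_state():
    """Fetch FSNamesystem metrics (capacity, node counts, replication) over JMX."""
    beans = requests.get(
        NAMENODE_JMX,
        params={"qry": "Hadoop:service=NameNode,name=FSNamesystemState"},
        timeout=10,
    ).json()["beans"][0]
    return {
        "capacity_used_pct": 100.0 * beans["CapacityUsed"] / beans["CapacityTotal"],
        "live_datanodes": beans["NumLiveDataNodes"],
        "dead_datanodes": beans["NumDeadDataNodes"],
        "under_replicated_blocks": beans["UnderReplicatedBlocks"],
    }

if __name__ == "__main__":
    print(namenode_fs_state())
```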

Performance Optimization and Tuning

Performance tuning and identifying bottlenecks is one of the most vital tasks for a Hadoop administrator. Given the distributed nature of the system and the multitude of configuration files and parameters, it may take hours or even days to identify and resolve a bottleneck if the investigation does not start in the right direction. Often the root cause lies at a different end of the system than the one the application points to. This can be counterbalanced with the help of an expert who has a detailed understanding of the Hadoop ecosystem as well as the application. Moreover, optimized resource usage (CPU, memory) is essential for effective utilization of the cluster and aids in distributing resources between different Hadoop components such as HDFS, YARN and HBase. To overcome such challenges, it is important to have statistics in place in the form of benchmarks, to tune configuration parameters for best performance, and to have strategies and tools ready for rapid resolution.
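Such baseline statistics are often gathered with the standard MapReduce benchmarks. The sketch below is a rough illustration that times a TeraGen/TeraSort run; the jar location is a placeholder that depends on the distribution and version.

```python
import subprocess
import time

# Placeholder jar location; the actual path depends on the distribution and version.
EXAMPLES_JAR = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar"

def timed(cmd):
    """Run a benchmark command and return its elapsed wall-clock time in seconds."""
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start

# Generate 10 million 100-byte rows (about 1 GB), then sort them.
gen_secs = timed(["hadoop", "jar", EXAMPLES_JAR, "teragen",
                  "10000000", "/benchmarks/teragen"])
sort_secs = timed(["hadoop", "jar", EXAMPLES_JAR, "terasort",
                   "/benchmarks/teragen", "/benchmarks/terasort"])

print(f"teragen: {gen_secs:.1f}s, terasort: {sort_secs:.1f}s")
```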

This blog is part of the Hadoop administration blog series and aims to provide a high-level overview of Hadoop administration and the roles, responsibilities and challenges a Hadoop admin faces. In future editions, we will delve further into the points above and the various aspects of Hadoop infrastructure management responsibilities, and look at how each phase plays an important role in administering an enterprise Hadoop cluster. For more on how we can help, visit

Language for Business Made Easy with Ab Initio Express>IT


Ab Initio Express>IT Architecture

To leverage the benefits of Express>IT, first let’s understand the architecture and the components of this product with the help of an example.

Ab Initio Express>IT Architecture

If a business wants to report on all the employees in the organization having a salary greater than $100K p.a., the Ab Initio technical user uses the Ab Initio ACE component to set up the framework, where the sources are defined in the form of an Employee Payroll table, the required datasets, reference tables, the target table/file and the relevant fields. The result is an ACE template.

Subsequently, the business user can access these tables, fields and elements through the ACE template using the Ab Initio BRE component. The business user can then select any additional fields needed, along with the business rule of salary > $100K, to produce the desired output.

Furthermore, the BRE puts the business user in the driver’s seat when it comes to verifying the business rules. The business users are not only able to put the rules directly into the system, they are also able to immediately see the results of applying those rules to the test data. If they don’t like what they see, they can instantly change the rules, saving an enormous amount of time.
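To make the example concrete, the rule the business user expresses in BRE amounts to a simple filter over the payroll source. The sketch below is only an illustration of that logic, not Ab Initio/BRE code; the path and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bre-rule-illustration").getOrCreate()

# Placeholder source: the Employee Payroll table landed as Parquet.
payroll = spark.read.parquet("hdfs:///data/hr/employee_payroll/")

# The business rule ("annual salary greater than $100K"), expressed as a plain filter,
# plus the fields selected for the report.
high_earners = (payroll
                .filter(payroll.annual_salary > 100_000)
                .select("employee_id", "name", "department", "annual_salary"))

high_earners.show()
```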

How Can Bitwise Help?

Bitwise has built an integrated component framework on top of Express>IT that complements its functionality and gives the business user an environment conducive to producing output quickly. The framework offers a complete suite of rule management and governance capabilities.

Let's dig deeper to understand the flow and how it works:

Ab Initio Express>IT Flow

  • Rule Generator Utility
    • A business user creates a text/Excel file with the rules to be implemented. After the generator analyzes the file, the built transformation is imported into BRE.
  • Validation Template
    • A GUI-based tool that enables business users to compare the output of Express>IT against any source of data, i.e. a table or file. It also provides a reconciliation facility against data from legacy systems (see the sketch after this list).
  • Publishing Results on the Forum
    • A GUI-based utility that gives business users the flexibility to post their output, along with their ruleset, on any given portal/forum.
  • Upload Utility
    • Enables businesses to upload files from either Windows or Unix and helps users create their own version of the file to be used in the code.
  • Test Bed
    • An automated process to slice production data. It enables:
      • Running the business rulesets multiple times against a snapshot of production data
      • Availability of several months of data at any given point in time
      • Email notifications to business users
  • Backup and Recovery
    • Maintains results along with rulesets and helps the user avoid re-running a rule that has already been run
    • Also acts as a centralized data repository for outputs and rulesets
As much as the world has become a smaller place, technology is also giving people with diverse skill sets the opportunity to come together and work toward a common objective. Ab Initio, in the form of the Business Rule Engine, has given business users this platform to "Express IT".

Indeed, the language of business has been made easy by Ab Initio!

Reduce Data Latency and Refine Processes with Hadoop Data Ingestion


Hadoop data ingestion comes with challenges such as:

  1. There can be many different source types: OLTP systems generating events, batch systems generating files, RDBMS systems, web-based APIs, and more.
  2. Data may arrive in different formats, such as ASCII text, EBCDIC and COMP fields from mainframes, JSON and Avro.
  3. Data often needs to be transformed before it is persisted on Hadoop. Common transformations include data masking, conversion to a standard format, applying data quality rules and encryption.
  4. As more and more data is ingested into Hadoop, metadata plays an important role. There is no point in having large volumes of data without knowing what is available. Discovering data and its key aspects, such as format, schema, owner, refresh rate, source and security policy, should be kept simple and easy. Features like custom tagging, a data set registry and a searchable repository make life much easier. What is needed is a data set registry and data governance tool that can communicate with the ingestion tool to pass on and use this metadata.

At present, many tools are available for ingesting data into Hadoop. Some are good for specific use cases: Apache Sqoop is a great tool for importing/exporting data from RDBMS systems, Apache Falcon is a good option for a data set registry, and Apache Flume is preferred for ingesting real-time event streams of data; there are many commercial alternatives as well. A few of the tools, such as Spring XD (now Spring Cloud Data Flow) and Gobblin, are general purpose. The range of options can be overwhelming, and you certainly need the right tool for the job.
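To illustrate just the RDBMS-import slice of this, the sketch below pulls a relational table into the data lake using Spark's JDBC reader rather than Sqoop itself; the connection details, table name, partition bounds and landing path are all placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdbms-ingest-sketch").getOrCreate()

# Placeholder connection details for the source RDBMS.
customers = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://source-host:3306/crm")
             .option("dbtable", "customers")
             .option("user", "ingest_user")
             .option("password", "***")
             .option("numPartitions", 8)              # parallel reads
             .option("partitionColumn", "customer_id")
             .option("lowerBound", 1)
             .option("upperBound", 10_000_000)
             .load())

# Land the raw extract in the data lake (placeholder path).
customers.write.mode("overwrite").parquet("hdfs:///datalake/raw/crm/customers/")
```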

But none of these tools solves all of the challenges, so enterprises end up using multiple tools for data ingestion. Over time they also create custom tools, or wrappers on top of existing tools, to meet their needs. Furthermore, all of these tools rely on text-based configuration files (mostly XML), which are not very convenient or user-friendly to work with. All of this results in a lot of complexity and overhead in maintaining data ingestion applications.

Looking at these gaps, and to enable our clients to streamline Hadoop adoption, Bitwise has developed a GUI-based tool for data ingestion and transformation on Hadoop. With a convenient drag-and-drop GUI, it enables developers to quickly develop end-to-end data pipelines from a single tool. Apart from multiple source and target options, it also has many pre-built transformations, ranging from the usual data warehousing operations to machine learning and sentiment analysis. The tool offers the following data ingestion features:

  • Pluggable Sources and Targets – as new source and target systems emerge, it is easy to integrate them with the ingestion framework
  • Scalability – it scales to ingest huge amounts of data at high velocity
  • Masking and Transforming on the Fly – transformations like masking and encryption can be applied on the fly as data moves through the pipeline (see the sketch after this list)
  • Data Quality – data quality checkpoints can be applied before data is published
  • Data Lineage and Provenance – detailed data lineage and provenance can be tracked
  • Searchable Metadata – datasets and their metadata are searchable, with the option to apply custom tags
Bitwise's Hadoop data ingestion and transformation tool can save enormous effort in developing and maintaining data pipelines. Stay tuned for subsequent posts that explore the other phases of the data value chain.

Understanding the Hadoop Adoption Roadmap


Stage 1: Understanding and Identifying Business Cases

As with every technology switch, the first stage is usually understanding the new technology and tool stack, as well as communicating the benefits that the end user and the organization will see. At this stage, looking closely at your current system helps identify the business cases that are redundant or can be merged, so that only the things that matter are brought over. This also helps build a priority list of projects that reflect definite business use cases. You need to define the key indicators of success here, and the performance and success criteria of the legacy system you are currently running need to be revamped as well. The business stakeholders are key to refining the business SLAs.

Stage 2: Warming Up to the Technology Stack

Next, bring in the technology stack for people to familiarize themselves with. Build a playground, or a rough development environment, where developers and analysts can experiment and innovate without the fear of bringing down the business. This allows the data modelers and DBAs to design an optimal warehouse and lets the ETL developers learn the pitfalls, ensuring they establish best practices before heading into full-fledged project work.

Stage 3: Converting the Old to New

A key element of Hadoop adoption is running an efficient conversion from old to new. Identify dark or missing data elements early, build coding standards and optimization techniques, automate as much as possible to reduce conversion errors, and validate the correctness of the conversion against the legacy system. Bitwise recommends a Proof -> Pilot -> Production path to conversion, where we nibble away at the legacy applications and build a repeatable framework before biting off a big chunk of business requirements.

Stage 4: Maintenance and Support

Once in production, Hadoop needs what every production system in the world needs: maintenance and support. Things break down and undergo upgrades or deprecation. What is needed is a dedicated team to keep track of the Hadoop ecosystem. Besides regular application production work, a support team structure is required to ensure availability and reliability of the environment.

Backed by extensive experience and having worked with Fortune 500 companies, we at Bitwise have made adopting Hadoop a walk in the park for our clients and have enabled effective usage of Hadoop to meet their ELT and analytics needs. Have a look at our Excellerators to see how organizations worldwide are unlocking the real value of Hadoop with our proven methodology.

Crossing Over Big Data's Trough of Disillusionment


Defining this Trough of Disillusionment

Enterprises are feeling the pressure that they should be doing "something" with Big Data. A few organizations have figured it out and are creating breakthrough insights. However, there is a much larger set that has perhaps reached the stage of installing, say, 10 Hadoop nodes and is wondering "now what?"

Per Gartner, this is the phase where excitement over the latest technology gives way to confusion or ambiguity, referred to as the "Trough of Disillusionment."

Data Democracy – The Foundation for Big Data

Use cases involving analytics or data mining with an integrated social media component are being pitched to enterprise executives. These use cases appear "cool" and compelling upfront, but thorough analysis reveals that they miss necessary considerations such as data/information security, privacy regulations and data lineage from an implementation perspective, and in addition fail to build a compelling ROI case.

One needs to realize that for any "cool" use case to generate eventual ROI, it is very important to focus on Big Data integration, i.e. access, preparation and availability of the data (see "firms must not overlook importance of big data integration"). Doing so empowers enterprises to implement any use case that makes the most sense to their particular business.

"Data Democracy" should be the focus. This focus also helps address the technology challenge of handling ever-growing enterprise data efficiently and leverages the scalable and cost-effective nature of these technologies, for an instant ROI.

Concept to Realization – Real Issues

Once this is understood, the next step is to figure out a way to introduce these new technologies to achieve the above goals in the least disruptive and most cost-effective way. In fact, enterprises are looking at ETL as a standard use case for Big Data technologies like Hadoop. Using Hadoop as a data integration or ETL platform requires developing data integration applications in programming frameworks such as MapReduce. This presents a new challenge of combining Java skill sets with ETL design and implementation expertise. Most ETL designers do not have Java skills, as they are used to working in a tool environment, and most Java developers do not have experience handling large volumes of data, resulting in massive overheads of training, maintenance and "firefighting" coding issues. This can cause massive delays and soak up valuable resources, and it solves only half the problem.

Moreover, even after making the investments in hardware and skill sets like MapReduce, when the underlying technology platforms inevitably advance, development teams are forced to rewrite their applications to leverage those advancements.

Concept to Realization – a Possibility?

Yes, it is. One of the key criteria for any data integration development environment on Hadoop is code abstraction: allowing users to specify the data integration logic as a series of transformations chained together in a directed acyclic graph that models how users think about data movement, making it significantly simpler to comprehend and change than a series of MapReduce scripts.
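This is the model that engines such as Spark expose directly. The sketch below, with placeholder paths and column names, chains transformations into a directed acyclic graph that the engine only executes when the final action runs:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dag-sketch").getOrCreate()

# Each step below is a node in a directed acyclic graph; Spark only executes
# the whole chain when the final action (the write) is triggered.
result = (spark.read.parquet("hdfs:///datalake/raw/transactions/")        # source
          .filter(F.col("status") == "SETTLED")                           # filter
          .withColumn("amount_usd", F.col("amount") * F.col("fx_rate"))   # derive
          .groupBy("account_id")                                          # aggregate
          .agg(F.sum("amount_usd").alias("total_usd")))

result.write.mode("overwrite").parquet("hdfs:///datalake/curated/account_totals/")
```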

Another important feature to look for is technology insulation: provisions in the design to swap the run-time environment, such as Hadoop, for whatever technologies are prevalent in the future.

Conclusion

The "3 V's" of Big Data implementations are well defined (Volume, Variety and Velocity) and relatively quantifiable. We should begin to define a 4th 'V', for "Value." The fourth is equally important, or in some cases more important, but less tangible and less quantifiable.

Having said that, jumping off a diving board into a pool of Big Data does not have to be a lonely job. The recommended approach is to seek help from Big Data experts like Bitwise to assess whether you really need Big Data. If yes, which business areas will you target for the first use case, and which DI platform will you use? And lastly, how will you calculate the ROI of the Big Data initiative?

Why Do ETL Tools Still Have a Heartbeat


ETL is a well-known and effective technique for integrating data.

ETL tools have been available for a long time, and data integration projects frequently employ them. Over time, they have evolved to include capabilities like automation, scheduling and error handling. As a result, ETL tools are now a well-established and dependable means of data integration.

A variety of data sources and targets are supported by ETL tools.

Databases, cloud storage, APIs and files are just a few examples of the numerous data sources and targets available to modern businesses. ETL solutions can readily connect to these systems using standardized protocols and APIs because they are built to work with a wide variety of sources and targets. ETL tools also offer the transformations needed to change the data's format, making it easier to integrate data from various sources.

ETL software offers a complete data integration solution.

Data extraction, transformation and loading are all handled by ETL tools, which makes them a comprehensive solution for data integration. These tools also provide a range of features for handling errors, validating data and managing data quality.

ETL tools are therefore an all-in-one data integration solution, making them well suited to large-scale data integration projects.

In some situations, ETL tools perform better than ELT.

ELT (Extract, Load, Transform) is a more recent method of data integration in which the data is first loaded into the target system and then transformed. Even though ELT has grown in acceptance recently, ETL is still preferred in some circumstances. For instance, ETL can offer better performance when the data source is large, since it can filter, combine and transform the data at the source. Processing times improve because less data needs to be loaded into the target system.
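As a rough illustration of filtering and aggregating at the source before loading (a sketch using Spark's JDBC reader with a pushed-down query; the connection details, table and SQL are placeholders), only the reduced result ever crosses the wire:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-sketch").getOrCreate()

# The filter and aggregation run inside the source database; only the
# summarized rows are extracted and loaded into the target.
summary_query = """
    (SELECT region, product_id, SUM(amount) AS total_amount
     FROM sales
     WHERE sale_date >= '2023-01-01'
     GROUP BY region, product_id) AS regional_sales
"""

regional = (spark.read.format("jdbc")
            .option("url", "jdbc:oracle:thin:@//source-host:1521/SALESDB")  # hypothetical source
            .option("dbtable", summary_query)
            .option("user", "etl_user")
            .option("password", "***")
            .load())

regional.write.mode("overwrite").parquet("hdfs:///warehouse/regional_sales/")
```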

ETL tools can also be combined with other data integration methods, such as data streaming and data virtualization. For instance, ETL tools can be used to load data from a legacy system into a data warehouse, while real-time data from the same system is captured via data streaming and integrated into the warehouse using ETL. This enables businesses to employ the most effective data integration method for each scenario.

Summary

In summary, ETL tools will still be in demand in 2023 because they offer a dependable and proven means of data integration. They handle a variety of data sources and targets, provide an all-inclusive data integration solution, and can be combined with other data integration approaches.
