Data continues to grow in importance for customer insights, projecting trends, and training artificial intelligence (AI) and machine learning (ML) algorithms. In a quest to capture every data source, data researchers maximize the scale and scope of available data by consolidating all corporate data in one location.
On the other hand, all of that critical data in one place makes an attractive target for hackers, who continuously probe defenses looking for weaknesses, and the penalties for data breaches can be enormous. IT security teams need a system that can distinguish between categories of data and isolate and secure each against misuse.
Data lakes provide the current solution for maximizing data availability and protection. Data managers and data security teams at large enterprises can choose from many different data lake vendors to suit their needs.
However, while anyone can create a data lake, not everyone will have the resources to achieve scale, extract value, and protect their resources on their own. Fortunately, vendors offer robust tools that permit smaller teams to obtain the benefits of a data lake without requiring the same resources to manage them.
See the Top Data Lake Solutions
What Are Data Lakes?
Data lakes create a single repository for an organization’s raw data. Data feeds bring in data from databases, SaaS platforms, web crawlers, and even edge devices such as security cameras or industrial heat pumps.
Similar to a giant hard drive, data lakes can also incorporate folder structures and apply security to specific folders to limit access, read/write privileges, and deletion privileges for users and applications. However, unlike a hard drive, a data lake should be able to grow indefinitely and never force data to be deleted because of space restrictions.
Data lakes support all data types, scale automatically, and support a wide range of analytics, from built-in features to external tools supported by APIs. Analytic tools can perform metadata or content searches or categorize data without changing the underlying data itself.
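As a simple illustration of these ideas, the sketch below builds a read-only metadata catalog over a hypothetical lake folder structure. The mount point, zone names, and file layout are all assumptions for the example; a production lake would typically live in object storage rather than on a local filesystem.

```python
from pathlib import Path

# Hypothetical lake root; a real lake would usually be object storage (e.g., s3://...).
LAKE_ROOT = Path("/mnt/data-lake")

def build_catalog(root: Path) -> list[dict]:
    """Scan the lake and record metadata only; no file contents are modified."""
    records = []
    for path in root.rglob("*"):
        if path.is_file():
            rel = path.relative_to(root)
            records.append({
                "path": str(rel),
                "zone": rel.parts[0],             # top-level folder, e.g., raw/ or curated/
                "format": path.suffix.lstrip("."),
                "bytes": path.stat().st_size,
            })
    return records

# Metadata search: find all raw CSV feeds without touching the data itself.
catalog = build_catalog(LAKE_ROOT)
raw_csvs = [r for r in catalog if r["zone"] == "raw" and r["format"] == "csv"]
```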
Self-service Data Lake Tools
Technically, if a company can fit all of its data onto a single hard drive, that is the equivalent of a data lake. However, most organizations have astronomically more data than that, and large enterprises need huge repositories.
Some organizations create their own data lakes in their own data centers. This endeavor requires much more investment in:
- Capital expense: buildings, hardware, software, access control systems
- Operational expense: electrical power, cooling systems, high-capacity internet/network connections, maintenance and repair costs
- Labor expense: IT and IT security employees to maintain the hardware, physical security
Vendors in this category provide the tools a team needs to create its own data lake. Organizations choosing these options will need to supply more time, expense, and expertise to build, integrate, and secure their data lakes.
Apache: Hadoop & Spark
The Apache open-source projects provide the basis for many cloud computing tools. To create a data lake, an organization could combine Hadoop and Spark to create the base infrastructure and then consider related projects or third-party tools in the ecosystem to build out capabilities.
Apache Hadoop provides scalable distributed processing of large data sets with unstructured or structured data content. Hadoop provides the storage solution and basic search and analysis tools for data.
Apache Spark provides a scalable open-source engine that batches data, streams data, performs SQL analytics, trains machine learning algorithms, and performs exploratory data analysis (EDA) on huge data sets. Spark offers deeper analysis tools for more sophisticated examinations of the data than are available in a basic Hadoop deployment.
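As a rough sketch of how Spark works on top of a Hadoop data lake, the following PySpark snippet reads raw JSON from an HDFS path and runs SQL analytics over it without rewriting the underlying files. The path, view name, and columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-eda").getOrCreate()

# Read raw JSON straight from the lake; Spark infers a schema on the fly.
events = spark.read.json("hdfs:///lake/raw/events/")
events.createOrReplaceTempView("events")

# SQL analytics over the raw files; the originals are never modified.
daily = spark.sql("""
    SELECT date(event_time) AS day, count(*) AS event_count
    FROM events
    GROUP BY date(event_time)
    ORDER BY day
""")
daily.show()
```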
Hewlett Packard Enterprise (HPE) GreenLake
The HPE GreenLake service provides pre-integrated hardware and software that can be deployed in internal data centers or in colocation facilities. HPE handles the heavy lifting for the deployment and charges clients based upon their usage.
HPE will monitor usage, scale the deployment of the Hadoop data lake based upon need, and provide support for the design and deployment of other applications. This service turbocharges a typical internal deployment of Hadoop by outsourcing some of the labor and expertise to HPE.
Cloud Data Lake Tools
Cloud data lake tools provide the infrastructure and the basic tools needed to provide a turn-key data lake. Customers use built-in tools to attach data feeds, storage, security, and APIs to access and explore the data.
Depending on the options selected, some software packages will already be integrated into the data lake at launch. A customer that selects a cloud option can begin ingesting data immediately, with no waiting for shipping, hardware installation, software installation, and so on.
However, in an attempt to maximize the customizability of the data lake, these tools tend to push more responsibility onto the customer. Connecting data feeds, attaching external data analytics, and applying security will be more manual processes than with full-service solutions.
Some data lake vendors provide data lakehouse tools that attach to the data lake and provide an interface for data analysis and transfer. Other add-on tools may also be available to deliver the features found in full-service solutions.
Customers can either choose a bare-bones data lake and do more of the heavy lifting themselves or pay extra for features that build out a more full-service version. These vendors also tend not to encourage multi-cloud development, focusing instead on driving more business toward their own cloud platforms.
Amazon Web Services (AWS) Data Lake
AWS provides an enormous range of options for cloud infrastructure. Its data lake offering provides an automatically configured collection of core AWS services to store and process raw data.
Incorporated tools permit users or apps to analyze, govern, search, share, tag, and transform subsets of data internally or with external users. Federated templates integrate with Microsoft Active Directory to incorporate existing data segregation rules already deployed internally within a company.
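As a hedged illustration of the core services typically involved, the sketch below uses the boto3 SDK to land a raw file in S3 and query it in place with Athena. The bucket, database, table, and output location are hypothetical.

```python
import boto3

# Land a raw file in the lake's S3 storage (hypothetical bucket and key).
s3 = boto3.client("s3")
s3.upload_file("events.json", "example-lake-raw", "events/2024/events.json")

# Query the raw zone in place with Athena, serverless SQL over S3
# (assumes a table has already been defined over the raw data).
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT count(*) FROM events",
    QueryExecutionContext={"Database": "lake_db"},
    ResultConfiguration={"OutputLocation": "s3://example-lake-results/"},
)
```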
Google Cloud
Google offers data lake solutions that can house an entire data lake or simply help process a data lake workload from an external source (typically internal data centers). Google Cloud claims that moving from an on-premises Hadoop deployment to a Google Cloud-hosted deployment can lower costs by 54%.
Google offers its own BigQuery analytics that captures data in real-time using a streaming ingestion feature. Google supports Apache Spark and Hadoop migration, integrated data science and analytics, and cost management tools.
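A minimal sketch of streaming ingestion with the google-cloud-bigquery client library might look like the following; the project, dataset, table, and row schema are hypothetical, and credentials are assumed to come from the environment.

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up credentials from the environment
table_id = "example-project.lake_dataset.sensor_events"  # hypothetical table

rows = [
    {"device_id": "pump-01", "temp_c": 21.4, "event_time": "2024-01-15T09:30:00Z"},
    {"device_id": "pump-02", "temp_c": 19.8, "event_time": "2024-01-15T09:30:05Z"},
]

# Streaming insert: rows become queryable within seconds of arrival.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print(f"Some rows failed to insert: {errors}")
```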
Microsoft Azure
Microsoft’s Azure Data Lake solution deploys Apache Spark and Apache Hadoop as fully managed cloud offerings, as well as other analytic clusters such as Hive, Storm, and Kafka. Azure Data Lake includes Microsoft solutions for enterprise-grade security, auditing, and support.
Azure Data Lake integrates easily with other Microsoft products or existing IT infrastructure and is fully scalable. Customers can define and launch a data lake very quickly and use their familiarity with other Microsoft products to intuitively navigate through options.
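For illustration, a minimal sketch with the azure-storage-file-datalake SDK creates a folder hierarchy and uploads a raw feed. The account, container, and file names are hypothetical, and a real deployment would authenticate with azure-identity credentials rather than a key literal.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical storage account with hierarchical namespace enabled.
service = DataLakeServiceClient(
    account_url="https://examplelake.dfs.core.windows.net",
    credential="<account-key>",
)

# A container acting as the lake's raw zone, with true folders inside it.
fs = service.get_file_system_client("raw")
fs.create_directory("feeds/sensors")

file_client = fs.get_file_client("feeds/sensors/readings.csv")
with open("readings.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```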
See the Top Big Data Storage Tools
Full-service Data Lake Tools
Full-service data lake vendors add layers of security and user-friendly GUIs, and they constrain some features in favor of ease of use. These vendors may also build analysis features into their offerings to provide extra value.
Some companies cannot or strategically choose not to store all of their data with a single cloud provider. Other data managers may simply want a flexible platform or might be trying to stitch together data resources from acquired subsidiaries that used different cloud vendors.
Most of the vendors in this category do not offer data hosting; they act as agnostic data managers and promote multi-cloud data lakes. However, some of these vendors do offer their own cloud solutions: a fully integrated, full-service offering that can access multiple clouds or transition the data to their fully controlled platform.
Cloudera Data Platform
Cloudera Data Platform provides unifying software to ingest and manage a data lake potentially spread across public and private cloud resources. Cloudera optimizes workloads based on analytics and machine learning and provides integrated interfaces to secure and govern platform data and metadata.
Cohesity
Cohesity’s Helios offers a unified platform that provides data lake and analysis capabilities. The platform may be licensed as a SaaS solution, as software for self-hosted data lakes, or for partner-managed data lakes.
Databricks
Databricks provides data lakehouse and data lake solutions built on open-source technology with integrated security and data governance. Customers can explore data, build models collaboratively, and access preconfigured ML environments. Databricks works across multiple cloud vendors and manages the data repositories through a consolidated interface.
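As an illustration of the open-source technology layer, the sketch below uses the Delta Lake format (via the delta-spark package) to write a governed table and read back an earlier snapshot. The local paths and sample data are hypothetical; on Databricks itself, a preconfigured `spark` session is provided.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Local Delta-enabled session; Databricks clusters come preconfigured.
builder = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write data once as an ACID Delta table (hypothetical path and data).
df = spark.createDataFrame([(1, "open"), (2, "closed")], ["order_id", "status"])
df.write.format("delta").mode("overwrite").save("/tmp/lake/curated/orders")

# Readers get consistent snapshots; time travel recovers earlier versions.
latest = spark.read.format("delta").load("/tmp/lake/curated/orders")
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/lake/curated/orders")
```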
Domo
Domo provides a platform that enables a full range of data lake solutions, from storage to application development. Domo can augment existing data lakes, or customers can host their data on the Domo cloud.
IBM
IBM’s cloud-based data lakes can be deployed on any cloud and build governance, integration, and virtualization into the core principles of the solution. IBM data lakes can access IBM’s pioneering Watson AI for analysis, as well as many other IBM tools for queries, scalability, and more.
Oracle
Oracle’s Big Data Service deploys a private version of Cloudera’s cloud platform and integrates it with Oracle’s own Data Lakehouse solution and the Oracle cloud platform. Oracle builds on its mastery of database technology to provide strong tools for data queries, data management, security, governance, and AI development.
Snowflake
Snowflake provides a full-service data lake solution that can integrate storage and computing solutions from AWS, Microsoft, or Google. Data managers do not need to know how to set up, maintain, or support servers and networks, so they can use Snowflake without first establishing any cloud databases.
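A minimal sketch with the snowflake-connector-python package shows how little infrastructure work is involved: just a connection and SQL. The account, warehouse, database, and table names are hypothetical.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    user="analyst",
    password="<password>",
    account="example-account",   # hypothetical Snowflake account identifier
    warehouse="ANALYTICS_WH",    # compute scales independently of storage
    database="LAKE_DB",
)

cur = conn.cursor()
cur.execute("SELECT event_type, COUNT(*) FROM raw_events GROUP BY event_type")
for event_type, n in cur.fetchall():
    print(event_type, n)
conn.close()
```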
Also read: Snowflake vs. Databricks: Big Data Platform Comparison
Choosing a Data Lake Strategy and Architecture
Data analytics continues to rise in importance as companies find more uses for wider varieties of data. Data lakes provide an option to store, manage, and analyze all data sources for an organization even as they try to figure out what is important and useful.
This article provides an overview of the different strategies for deploying data lakes and the different technologies available. The list of vendors is not comprehensive, and new competitors are constantly entering the market.
Don’t start by selecting a vendor. Start instead with an understanding of the company resources available to support a data lake.
If the available resources are small, the company will likely need to pursue a full-service option over an in-house data center. However, many other important characteristics play a role in determining the optimal vendor, such as:
- Business use case
- AI compatibility
- Searchability
- Compatibility with data lakehouse or other data searching tools
- Security
- Data governance
Once established, data lakes can be moved, but this could be a very expensive proposition since most data lakes will be enormous. Organizations should take their time and try test runs on a smaller scale before they commit fully to a single vendor or platform.
Read next: 10 Top Data Companies