Business Intelligence Archives | IT Business Edge

How Revolutionary Are Meta’s AI Efforts?

Kashyap Vyas — Mon, 08 Aug 2022 18:45:59 +0000

Mark Zuckerberg introduced Facebook’s rebranding to Meta at the company’s annual Connect event last year to reposition the company for the “new internet,” the metaverse.

The metaverse has been around for some time as a kind of urban legend, perhaps aptly described in the 2011 science fantasy book Ready Player One. That was until some of the biggest names in tech started investing heavily in related technologies, including virtual and augmented reality (VR/AR), Internet of Things (IoT), and artificial intelligence (AI).

Today AI is one of the most exciting technology fields to work on. Zuckerberg said the metaverse is something he’s wanted to work on since even before the conception of Facebook. And the company’s Meta AI research lab is on the cutting edge of both AI and the metaverse.

The new direction was met with some trepidation. The social media site Facebook, which retains its name, is already infamous for its black-box algorithms, which are going to grow more and more complex under the Meta AI efforts.

While this news is undoubtedly exciting for enthusiasts and futurists, what are the practical real-world implications beyond the metaverse hype?

See the Top Artificial Intelligence (AI) Software

How Far Along is Meta AI?

Despite the initial excitement, Meta AI is years away from reaching its ultimate goal of having a fully operational metaverse. Meta AI is Meta’s longest scheduled project, perhaps as big in development time as it is in ambition. But that doesn’t mean it can be ignored, as it has many partially developed applications.

Meta is presently leading the AI race, with many of its applications in direct competition with giants across different categories. Meta is working on a voice interface like Apple’s Siri or Google Voice Assistant. Meta is also competing with VR gaming consoles such as Hololens with the acquisition of Oculus.

Additionally, some of Meta’s projects include an AI-based image tagging and image search algorithm that was shown to beat the FBI’s image recognition ability at the IEEE Computer Vision Conference. Meta AI can also generate text predictions in messages like Google assistant through DeepText.

Meta AI Controversies

Facebook has been guilty of a number of ethical failings over the years and has been called out for prioritizing profit over safety by Frances Haugen, a former product manager at the company. On more than one occasion, it has been reported that this prioritization has led to the spread of misinformation and hate speech, which tends to engage more people, and the company has been accused of ignoring the negative effects of social media on teens and other groups.

A huge chunk of Meta revenue comes from Facebook ads. However, the ads are only valuable if the content you find on the site is engaging. Reports suggest that Facebook algorithms unilaterally favor engagement and, as a result, might promote misinformation and hate speech.

In response to these claims, Meta announced several changes in June 2022, including:

The Responsible AI organization is set to join the Social Impact team.
The AI for Product teams working to protect users on any of the Meta-owned platforms will move to the product engineering team and focus on improving recommendations and making content more relevant, including ads and commerce services.
The Meta AI research team, FAIR, will join the Reality Labs Research to serve as foundation and support, with a mission to drive breakthroughs in AI through “research excellence, open science, and broad collaboration.”

The Latest Developments in Meta AI

Meta has made the news multiple times since its rechristening, and not all of it has been troublesome. Just this May, Meta announced that it has created a massive new language model that will be freely available to researchers around the globe.

In a move that is not typical of a for-profit organization, Meta surprised its followers, saying that the idea was “democratizing access to large-scale language models, to maintain the integrity and prevent misuse.”

This will be the first time a fully trained language model of this size and scope will be accessible to researchers. The move has been well-received by critics of privately owned and funded research on AI models.

Meta has been full of surprises this year, releasing its AI Research SuperCluster (RSC), the world’s fastest AI supercomputers, earlier this year. The supercomputer will accelerate AI research and help Meta build the metaverse. New announcements from the AI effort come frequently, the latest an AI chatbot.

Also read: Python for Machine Learning: A Tutorial

What Areas Will Meta AI Influence?

What influence might Meta AI have on the evolving virtual world of the metaverse?

Marketing

Your meta life could involve digital clothing and world-building, and marketers would have to consider it while strategizing on sales through the metaverse. Game developers and marketers are no strangers to in-game marketing and have displayed custom skins, virtual locations, in-game advertisements, and promotional items. The metaverse would be no different.

Culture

The metaverse brings together an alternate online subculture pushed forward by gamers, buyers, and the brands serving them.

In December 2019, GTA V allowed players to dress up as protesters in Hong Kong. Gamers put on black clothes, with yellow hard hats and gas masks to start a riot in-game. Chinese players reciprocated by dressing up as the police and fighting back.

The GTA incident was a unique phenomenon, showing unexpected ways people might use the metaverse. People from different cultures can unite as groups driven by a shared sentiment, creating a unique cultural phenomenon.

Economy

The metaverse is the advent of a shared, virtual economy. People might behave differently in the metaverse than in real life, creating alternate spending patterns.

Whenever the metaverse rolls towards global adoption, people from different economies will be on a shared platform. The metaverse might facilitate in-game trade. Brands might turn into pseudo-super-powers. Only time will tell how a metaverse might affect what we know about economics.

Why We Need to Take Meta AI Seriously

If it is not Meta AI, it could be any other AI project, but the virtual world platform race is on. What makes Meta unique is its application-oriented approach. Meta has the benefit of being the parent company to the world’s most prominent content churning machine, which allows Meta to leverage user-generated data to build its AI.

However, there don’t seem to be any signs of significant disruption, and growth appears to be moving slowly and steadily. So when someone says Meta AI is a big deal, it is so in scope but perhaps more evolutionary than revolutionary. Only time will tell how world-changing it will be as we wait for more updates from the social media – and now AI – giant.

Data Lake Strategy Options: From Self-Service to Full-Service

Chad Kime — Mon, 08 Aug 2022 14:21:00 +0000

Data continues to grow in importance for customer insights, projecting trends, and training artificial intelligence (AI) or machine learning (ML) algorithms. In a quest to fully encompass all data sources, data researchers maximize the scale and scope of data available by dumping all corporate data into one location.

On the other hand, having all that critical data in one place can be an attractive target for hackers, who continuously probe defenses looking for weaknesses, and the penalties for data breaches can be enormous. IT security teams need a system that allows for security to differentiate between different categories of data to isolate and secure it against misuse.

Data lakes provide the current solution to maximizing data availability and protection. For large enterprises, their data managers and data security teams can choose from many different data lake vendors to suit their needs.

However, while anyone can create a data lake, not everyone will have the resources to achieve scale, extract value, and protect their resources on their own. Fortunately, vendors offer robust tools that permit smaller teams to obtain the benefits of a data lake without requiring the same resources to manage them.

See the Top Data Lake Solutions

What are Data Lakes?

Data lakes create a single repository for an organization’s raw data. Data feeds bring in data from databases, SaaS platforms, web crawlers, and even edge devices such as security cameras or industrial heat pumps.

Similar to a giant hard drive, data lakes also can incorporate folder structures and apply security to specific folders to limit access, read/write privileges, and deletion privileges to users and applications. However, unlike a hard drive, data lakes should be able to grow in size forever and never require a deletion of data because of space restrictions.

Data lakes support all data types, scale automatically, and support a wide range of analytics, from built-in features to external tools supported by APIs. Analytic tools can perform metadata or content searches or categorize data without changing the underlying data itself.

Self-service Data Lake Tools

Technically, if a company can fit all of its data onto a single hard drive, that is the equivalent of a data lake. However, most organizations have astronomically more data than that, and large enterprises need huge repositories.

Some organizations create their own data lakes in their own data centers. This endeavor requires much more investment in:

Capital expense: buildings, hardware, software, access control systems
Operational expense: electrical power, cooling systems, high-capacity internet/network connections, maintenance and repair costs
Labor expense: IT and IT security employees to maintain the hardware, physical security

Vendors in this category provide tools needed for a team to create their own data lake. Organizations choosing these options will need to supply more time, expenses, and expertise to build, integrate, and secure their data lakes.

Apache: Hadoop & Spark

The Apache open-source projects provide the basis for many cloud computing tools. To create a data lake, an organization could combine Hadoop and Spark to create the base infrastructure and then consider related projects or third-party tools in the ecosystem to build out capabilities.

Apache Hadoop provides scalable distributed processing of large data sets with unstructured or structured data content. Hadoop provides the storage solution and basic search and analysis tools for data.

Apache Spark provides a scalable open-source engine that batches data, streams data, performs SQL analytics, trains machine learning algorithms, and performs exploratory data analysis (EDA) on huge data sets. Apache Spark provides deep analysis tools for more sophisticated examinations of the data than available in the basic Hadoop deployment.

Hewlett Packard Enterprise (HPE) GreenLake

The HPE GreenLake service provides pre-integrated hardware and software that can be deployed in internal data centers or in colocation facilities. HPE handles the heavy lifting for the deployment and charges clients based upon their usage.

HPE will monitor usage and scale the deployment of the Hadoop data lake based upon need and provide support for design and deployment of other applications. This service turbo-charges a typical internal-deployment of Hadoop by outsourcing some of the labor and expertise to HPE.

Cloud Data Lake Tools

Cloud data lake tools provide the infrastructure and the basic tools needed to provide a turn-key data lake. Customers use built-in tools to attach data feeds, storage, security, and APIs to access and explore the data.

After selecting options, some software packages will already be integrated into the data lake upon launch. When a customer selects a cloud option, it will immediately be ready to intake data and will not need to wait for shipping, hardware installation, software installation, etc.

However, in an attempt to maximize the customizability of the data lake, these tools tend to push more responsibility to the customer. Connecting data feeds, external data analytics, or applying security will be more manual a process than compared with full-service solutions.

Some data lake vendors provide data lakehouse tools to attach to the data lake and provide an interface for data analysis and transfer. There may also be other add-on tools available that provide the features available in full-service solutions.

Customers can choose either the bare-bones data lake and then do more heavy lifting or pay extra for features that create the more full-service version. These vendors also tend not to encourage multi-cloud development and focus on driving more business towards their own cloud platforms.

Amazon Web Services (AWS) Data Lake

AWS provides enormous options for cloud infrastructure. Their data lake offering provides an automatically-configured collection of core AWS services to store and process raw data.

Incorporated tools permit users or apps to analyze, govern, search, share, tag, and transform subsets of data internally or with external users. Federated templates integrate with Microsoft Active Directory to incorporate existing data segregation rules already deployed internally within a company.

Google Cloud

Google offers data lake solutions that can house an entire data lake or simply help process a data lake workload from an external source (typically internal data centers). Google Cloud claims that moving from an on-premises Hadoop deployment to a Google Cloud-hosted deployment can lower costs by 54%.

Google offers its own BigQuery analytics that captures data in real-time using a streaming ingestion feature. Google supports Apache Spark and Hadoop migration, integrated data science and analytics, and cost management tools.

Microsoft Azure

Microsoft’s Azure Data Lake solution deploys Apache Spark and Apache Hadoop as fully-managed cloud offerings as well as other analytic clusters such as Hive, Storm, and Kafka. Azure data lake includes Microsoft solutions for enterprise-grade security, auditing, and support.

Azure Data Lake integrates easily with other Microsoft products or existing IT infrastructure and is fully scalable. Customers can define and launch a data lake very quickly and use their familiarity with other Microsoft products to intuitively navigate through options.

See the Top Big Data Storage Tools

Full-service Data Lake Tools

Full-service data lake vendors add layers of security, user-friendly GUIs, and constrain some features in favor of ease-of-use. These vendors may provide additional analysis features built into their offerings to provide additional value.

Some companies cannot or strategically choose not to store all of their data with a single cloud provider. Other data managers may simply want a flexible platform or might be trying to stitch together data resources from acquired subsidiaries that used different cloud vendors.

Most of the vendors in this category do not offer data hosting and act as agnostic data managers and promote using multi-cloud data lakes. However, some of these vendors offer their own cloud solutions and offer a fully integrated full-service offering that can access multiple clouds or transition the data to their fully-controlled platform.

Cloudera Cloud Platform

Cloudera’s Data Platform provides a unifying software to ingest and manage a data lake potentially spread across public and private cloud resources. Cloudera optimizes workloads based on analytics and machine learning as well as provides integrated interfaces to secure and govern platform data and metadata with integrated interfaces.

Cohesity

Cohesity’s Helios platform offers a unified platform that provides data lake and analysis capabilities. The platform may be licensed as a SaaS solution, as software for self-hosted data lakes, or for partner-managed data lakes.

Databricks

Databricks provides data lake house and data lake solutions built on open source technology with integrated security and data governance. Customers can explore data, build models collaboratively, and access preconfigured ML environments. Databricks works across multiple cloud vendors and manages the data repositories through a consolidated interface.

Domo

Domo provides a platform that enables a full range of data lake solutions from storage to application development. Domo augments existing data lakes or customers can host data on the Domo cloud.

IBM

IBM cloud-based data lakes can be deployed on any cloud and builds governance, integration, and virtualization into the core principles of their solution. IBM data lakes can access IBM’s pioneering Watson AI for analysis as well as access many other IBM tools for queries, scalability, and more.

Oracle

Oracle’s Big Data Service deploys a private version of Cloudera’s cloud platform and integration with their own Data Lakehouse solution and the Oracle cloud platform. Oracle builds on their mastery of database technology to provide strong tools for data queries, data management, security, governance, and AI development.

Snowflake

Snowflake provides a full service data lake solution that can integrate storage and computing solutions from AWS, Microsoft, or Google. Data managers do not need to know how to set up, maintain, or support servers and networks and therefore can use Snowflake without previously establishing any cloud databases.

Also read: Snowflake vs. Databricks: Big Data Platform Comparison

Choosing a Data Lake Strategy and Architecture

Data analytics continues to rise in importance as companies find more uses for wider varieties of data. Data lakes provide an option to store, manage, and analyze all data sources for an organization even as they try to figure out what is important and useful.

This article provides an overview of different strategies to deploy data lakes and different technologies available. The list of vendors is not comprehensive and new competitors are constantly entering the market.

Don’t start by selecting a vendor. First start with an understanding of company resources available to support a data lake.

If the available resources are small, the company will likely need to pursue a full-service option over an in-house data center. However, many other important characteristics play a role in determining the optimal vendor, such as:

Business use case
AI compatibility
Searchability
Compatibility with data lakehouse or other data searching tools
Security
Data governance

Once established, data lakes can be moved, but this could be a very expensive proposition since most data lakes will be enormous. Organizations should take their time and try test runs on a smaller scale before they commit fully to a single vendor or platform.

What’s New With Google Vertex AI?

Kashyap Vyas — Tue, 26 Jul 2022 15:00:00 +0000

Sundar Pichai introduced Vertex AI to the world during the Google I/O 2021 conference last year, placing it against managed AI platforms such as Amazon Web Services (AWS) and Azure in the global AI market.

The Alphabet CEO once said, “Machine learning is a core, transformative way by which we’re rethinking how we’re doing everything.”

A November 2020 study by Gartner predicted a near-20% growth rate for managed services like Vertex AI. Gartner said that as enterprises invest more in mobility and remote collaboration technologies and infrastructure, growth in the public cloud industry will be sustained through 2024.

Vertex AI replaces legacy services like AI Platform Training and Prediction, AI Platform Data Labeling, AutoML Natural Language, AutoML Vision, AutoML Video, AutoML Tables, and Deep Learning Containers. Let’s take a look at how the platform has fared and what’s changed over the last year.

Also read: Top Artificial Intelligence (AI) Software

What Is Google Vertex AI?

Google Vertex AI is a cloud-based third-party machine learning (ML) platform for deploying and maintaining artificial intelligence (AI) models. The machine learning operations (MLOps) platform blends automated machine learning (AutoML) and AI Platform into a unified application programming interface (API), client library, and user interface (UI).

Previously, data scientists had to run millions of datasets to train algorithms. But the Vertex technology stack does the heavy lifting now. It has the computing power to solve complex problems and easily do billions of iterations. Vertex also comes up with the best algorithms for specific needs.

Vertex AI uses a standard ML workflow consisting of stages like data collection, data preparation, training, evaluation, deployment, and prediction. Although Vertex AI has many features, we’ll look at some of its key features here.

Whole ML Workflow Under a Unified UI Umbrella: Vertex AI comes with a unified UI and API for every Google Cloud service based on AI.
Integrates With Common Open-Source Frameworks: Vertex AI blends easily with commonly used open-source frameworks like PyTorch and TensorFlow and supports other ML tools through custom containers.
Access to Pretrained APIs for Different Datasets: Vertex AI makes it easy to integrate video, images, translation, and natural language processing (NLP) with existing applications. It empowers people with minimal expertise and effort to train ML models to meet their business needs.
End-to-End Data and AI Integration: Vertex AI Workbench enables Vertex AI to integrate natively with Dataproc, Dataflow, and BigQuery. As a result, users can either develop or run ML models in BigQuery or export data from BigQuery and execute ML models from Vertex AI Workbench.

Also read: The Future of Natural Language Processing is Bright

What’s Included in the Latest Update?

Google understands research is the only way to become an AI-first organization. Many of Google’s product offerings initially started as internal research projects. DeepMind’s AlphaFold project led to running protein prediction models in Vertex AI.

Similarly, researching neural networks provided the groundwork for Vertex AI NAS, which allows data science teams to train models with lower latency and power requirements. Therefore, empathy plays a significant role when AI use cases are considered. Some of the latest offerings within Vertex AI from Google include:

Reduction Server

According to Google, the AI training Reduction Server is an advanced technology that optimizes the latency and bandwidth of multisystem distributed training, which is a way of diversifying ML training across multiple machines, GPUs (graphics processing units), CPUs (central processing units), or custom chips. As a result, it reduces time and uses fewer resources to complete the training.

Tabular Workflows

This feature aims to customize the ML model creation process. Tabular Workflows let the users decide which parts of the workflow they want AutoML technology to handle and which side they like to engineer themselves.

Vertex AI lets elements of Tabular Workflow be integrated into existing pipelines. Google also added the latest managed algorithms, including advanced research models like TabNet, advanced algorithms for feature selection, model distillation, and many more functions.

Serverless Apache Spark

Vertex AI has been integrated with serverless Apache Spark, a unified open-source yet large-scale data analytics engine. Vertex AI users can easily engage in a serverless Spark session for interactive code development.

The partnership of Google and Neo4j enables Vertex users to analyze data features in Neo4j’s platform and then deploy ML models with Vertex. Similarly, the collaboration between Labelbox and Google made it possible to access Labelbox’s data-labeling services for various datasets—images and text among the few—from the Vertex dashboard.

Example-based Explanations

When data turns into mislabelled data, Example-based Explanations offer a better solution. The new feature of Vertex leverages Example-based Explanations to diagnose and solve data issues.

Problem-Solving With Vertex AI

Google claims that Vertex AI requires 80% fewer lines of coding than other platforms to train AI/ML models with custom libraries, and its custom tools support advanced ML coding. Vertex AI’s MLOps tools eliminate the complexity of self-service model maintenance, streamlining ML pipeline operations and Vertex Feature Store to serve, share, and use advanced ML features.

Data scientists with no formal AI/ML training can use Vertex AI, as it offers tools to manage data, create prototypes, experiment, and deploy ML models. It also allows them to interpret and monitor the AI/ML models in production.

A year after the launch of Vertex, Google is aligning itself toward real-world applications. The company’s mission is solving human problems, as showcased at Google I/O. This likely means that its efforts will be directed toward finding a transformative way of doing things through AI.

The post What’s New With Google Vertex AI? appeared first on IT Business Edge.

Data Lake vs. Data Warehouse: What’s the Difference?

Aminu Abdullahi — Mon, 25 Jul 2022 15:00:00 +0000

Data lakes and data warehouses are two of the most popular forms of data storage and processing platforms, both of which can be employed to improve a business’s use of information.

However, these tools are designed to accomplish different tasks, so their functions are not exactly the same. We’ll go over those differences here, so you have a clear idea of what each one entails and choose which would suit your business needs.

See the Top Data Lake Solutions and Top Data Warehouses

What is a data lake?

A data lake is a storage repository that holds vast raw data in its native format until it is needed. It uses a flat architecture to store data, which makes it easier and faster to query data.

Data lakes are usually used for storing big datasets. They’re ideal for large files and great at integrating diverse datasets from different sources because they have no schema or structure to bind them together.

How does a data lake work?

A data lake is a central repository where all types of data can be stored in their native format. Any application or analysis can then access the data without the need for transformation.

The data in a data lake can be from multiple sources and structured, semi-structured, or unstructured. This makes data lakes very flexible, as they can accommodate any data. In addition, data lakes are scalable, so they can grow as a company’s needs change. And because data lakes store files in their original formats, there’s no need to worry about conversions when accessing that information.

Moreover, most companies using a data lake have found they can use more sophisticated tools and processing techniques on their data than traditional databases. A data lake makes accessing enterprise information easier by enabling the storage of less frequently accessed information close to where it will be accessed. It also eliminates the need to perform additional steps to prepare the data before analyzing it. This adds up to much faster query response times and better analytical performance.

Also read: Snowflake vs. Databricks: Big Data Platform Comparison

What is a data warehouse?

A data warehouse is designed to store structured data that has been processed, cleansed, integrated, and transformed into a consistent format that supports historical reporting and analysis. It is a database used for reporting and data analysis and acts as a central repository of integrated data from one or more disparate sources that can be accessed by multiple users.

A data warehouse typically contains historical data that can be used to generate reports and analyze trends over time and is usually built with large amounts of data taken from various sources. The goal is to give decision-makers an at-a-glance view of the company’s overall performance.

How does a data warehouse work?

A data warehouse is a system that stores and analyzes data from multiple sources. It helps organizations make better decisions by providing a centralized view of their data. Data warehouses are typically used for reporting, analysis, predictive modeling, and machine learning.

To build a data warehouse, data must first be extracted and transformed from an organization’s various sources. Then, the data must be loaded into the database in a structured format. Finally, an ETL tool (extract, transform, load) will be needed to put all the pieces together and prepare them for use in analytics tools. Once it’s ready, a software program runs reports or analyses on this data.

Data warehouses may also include dashboards, which are interactive displays with graphical representations of information collected over time. These displays give people working in the company real-time insights into business operations, so they can take action quickly when necessary.

Also read: Top Big Data Storage Products

Differences between data lake and data warehouse

When storing big data, data lakes and data warehouses have different features. Data warehouses store traditional transactional databases and store data in one table with structured columns. Comparatively, a data lake is used for big data analytics. It stores raw unstructured data that can be analyzed later for insights.

Parameters	Data lake	Data warehouse
Data type	Unstructured data	Processed data
Storage	Data are stored in their raw form regardless of the source	Data is analyzed and transformed
Purpose	Big data analytics	Structured data analysis
Database schema	Schema-on-read	Schema-on-write
Target user group	Data scientist	Business or data analysts
Size	Stores all data	Only structured data

Data type: Unstructured data vs. processed data

The main difference between the two is that in a data lake, the data is not processed before it is stored, while in a data warehouse it is. A data lake is a place to store all structured and unstructured data, and a data warehouse is a place to store only structured data. This means that a data lake can be used for big data analytics and machine learning, while a data warehouse can only be used for more limited data analysis and reporting.

Storage: Stored raw vs. clean and transformed

The data storage method is another important difference between a data lake and a data warehouse. A data lake stores raw information to make it easier to search through or analyze. On the other hand, a data warehouse stores clean, processed information, making it easier to find what is needed and make changes as necessary. Some companies use a hybrid approach, in which they have a data lake and an analytical database that complement each other.

Purpose: Undetermined vs. determined

The purposes of a data lake’s data are undetermined. Businesses can use the data for any purpose, whereas data warehouse data is already determined and in use. Hence why data lakes have more flexible data structures compared to data warehouses.

Where data lakes are flexible, data warehouses have more structured data. In a warehouse, data is pre-structured to fit a specific purpose. The nature of these structures depends on business operations. Moreover, a warehouse may contain structured data from an existing application, such as an enterprise resource planning (ERP) system, or it may be structured by hand based on user needs.

Database schema: Schema-on-read vs schema-on-write

A data warehouse follows a schema-on-write approach, whereas a data lake follows a schema-on-read approach. In the schema-on-write model, tables are created ahead of time to store data. If how the table is organized has to be changed or if columns need to be added later on, it’s difficult because all of the queries using that table will need to be updated.

On the other hand, schema changes are expensive and take a lot of time to complete. The schema-on-read model of a data lake allows a database to store any information in any column it wants. New data types can be addcolumns, and existing columns can be changed at any time without affecting the running systemed as new . However, if specific rows need to be found quickly, this could become more difficult than schema-on-write systems.

Users: Data scientist vs. business or data analysts

A data warehouse is designed to answer specific business questions, whereas a data lake is designed to be a storage repository for all of an organization’s data with no particular purpose. In a data warehouse, business users or analysts can interact with the data in a way that helps them find the answers they need to gain valuable insight into their operation.

On the other hand, there are no restrictions on how information can be used in a data lake because it is not intended to serve one single use case. Users must take responsibility for curating the data themselves before any analysis takes place and ensuring it’s of good quality before storing it in this format.

Size: All data up to petabytes of space vs. only structured data

The size difference is due to the data warehouse storing only structured data instead of all data. The two types of storage differ in many ways, but they are the most prevalent. The first way they differ is in their purpose: Data lakes store all data, while warehouses store only structured data.

Awareness of what type of storage is needed can help determine if a company should start with a data lake or a warehouse. A company may start with an enterprise-wide information hub for raw data and then use a more focused solution for datasets that have undergone additional processing steps.

Data lake vs. data warehouse: Which is right for me?

A data lake is a centralized repository that allows companies to store all of its structured and unstructured data at any scale, whereas a data warehouse is a relational database designed for query and analysis.

Determining which is the most suitable will depend on a company’s needs. If large amounts of data needs to be stored quickly, then a data lake is the way. However, a data warehouse is more appropriate if there is a need for analytics or insights into specific application data.

A successful strategy will likely involve implementing both models. A data lake can be used for storing big volumes of unstructured and high-volume data while a data warehouse can be used to analyze specific structured data.

The post Data Lake vs. Data Warehouse: What’s the Difference? appeared first on IT Business Edge.

10 Top Data Companies

Tom Taulli — Sun, 24 Jul 2022 11:38:00 +0000

The term “data company” is certainly broad. It could easily include giant social networks like Meta. The company has perhaps one of the world’s most valuable data sets, which includes about 2.94 billion monthly active users (MAUs). Meta also has many of the world’s elite data scientists on its staff.

But for purposes of this article, the term will be narrower. The focus will be on those operators that build platforms and tools to leverage data – one of the most important technologies in enterprises these days.

Yet even this category still has many companies. For example, if you do a search for data analytics on G2, you will see results for over 2,200 products.

So when coming up with a list of top data companies, it will be, well, imperfect. Regardless, there are companies that are really in a league of their own, from established names to fast-growing startups, publicly traded and privately held. Let’s take a look at 10 of them.

Also see out picks for Top Data Startups.

Databricks

In 2012, a group of computer scientists at the University of California, Berkeley, created the open source project, Apache Spark. The goal was to develop a distributed system for data over a cluster of machines.

From the start, the project saw lots of traction, as there was a huge demand for sophisticated applications like deep learning. The project’s founders would then go on to create a company called Databricks.

The platform combines a data warehouse and data lakes, which are natively in the cloud. This allows for much more powerful analytics and artificial intelligence applications. There are more than 7,000 paying customers, such as H&M Group, Regeneron and Shell. Last summer, the ARR (annual recurring revenue) hit $600 million.

About the same time, Databricks raised $1.6 billion in a Series H funding and the valuation was set at a stunning $38 billion. Some of the investors included Andreessen Horowitz, Franklin Templeton and T. Rowe Price Associates. An IPO is expected at some point, but even before the current tech stock downturn, the company seemed in no hurry to test the public markets.

We’ve included Databricks on our lists of the Top Data Lake Solutions, Top DataOps Tools and the Top Big Data Storage Products.

SAS

SAS (Statistical Analysis System), long a private company, is one of the pioneers of data analytics. The origins of the company actually go back to 1966 at North Carolina State University. Professors created a program that performed statistical functions using the IBM System/360 mainframe. But when government funding dried up, SAS would become a company.

It was certainly a good move. SAS would go on to become the gold standard for data analytics. Its platform allows for AI, machine learning, predictive analytics, risk management, data quality and fraud management.

Currently, there are 80,800 customers, which includes 88 of the Top 100 on the Fortune 500. There are 11,764 employees and revenues hit $3.2 billion last year.

SAS is one of the world’s largest privately-held software companies. Last summer, SAS was in talks to sell to Broadcom for $15 billion to $20 billion. But the co-founders decided to stay independent and despite having remained private since the company’s 1976 founding, are planning an IPO by 2024.

It should surprise absolutely no one that SAS made our list of the top data analytics products.

Snowflake

Snowflake, which operates a cloud-based data platform, pulled off the largest IPO for a software company in late 2020. It raised a whopping $3.4 billion. The offering price was $120 and it surged to $254 on the first day of trading, bringing the market value to over $70 billion. Not bad for a company that was about eight years old.

Snowflake stock would eventually go above $350. But of course, with the plunge in tech stocks, the company’s stock price would also come under extreme pressure. It would hit a low of $110 a few weeks ago.

Despite all this, Snowflake continues to grow at a blistering pace. In the latest quarter, the company reported an 85% spike in revenues to $422.4 million and the net retention rate was an impressive 174%. The customer base, which was over 6,300, had 206 companies with capacity arrangements that led to more than $1 million in product revenue in the past 12 months.

Snowflake started as a data warehouse. But the company has since expanded on its offerings to include data lakes, cybersecurity, collaboration, and data science applications. Snowflake has also been moving into on-premises storage, such as querying S3-compatible systems without moving data.

Snowflake is actually in the early stages of the opportunity. According to its latest investor presentation, the total addressable market is about $248 billion.

Like Databricks, Snowflake made our lists of the best Data Lake, DataOps and Big Data Storage tools.

Splunk

Founded in 2003, Splunk is the pioneer in collecting and analyzing large amounts of machine-generated data. This makes it possible to create highly useful reports and dashboards.

A key to the success of Splunk is its vibrant ecosystem, which includes more than 2,400 partners. There is also a marketplace that has over 2,400 apps.

A good part of the focus for Splunk has been on cybersecurity. By using real-time log analysis, a company can detect outliers or unusual activities.

Yet the Splunk platform has shown success in many other categories. For example, the technology helps with cloud migration, application modernization, and IT modernization.

In March, Splunk announced a new CEO, Gary Steele. Prior to this, he was CEO of Proofpoint, a fast-growing cloud-based security company.

On Steele’s first earnings report, he said: “Splunk is a system of record that’s deeply embedded within customers’ businesses and provides the foundation for security and resilience so that they can innovate with speed and agility. All of this translated to a massive, untapped, unique opportunity, from which I believe we can drive long-term durable growth while progressively increasing operating margins and cash flow.”

Cloudera

While there is a secular change towards the cloud, the reality is that many large enterprises still have significant on-premises footprints. A key reason for this is compliance. There is a need to have much more control over data because of privacy requirements.

But there are other areas where data fragmentation is inevitable. This is the case for edge devices and streaming from third parties and partners.

For Cloudera – another one of our top data lake solutions – the company has built a platform that is for the hybrid data strategy. This means that customers can take full advantage of their data everywhere.

Holger Mueller at Constellation Research praises Cloudera’s reliance on the open source Apache Iceberg technology for the Cloudera Data Platform.

“Open source is key when it comes to most infrastructure-as-a-service and platform-as-a-service offerings, which is why Cloudera has decided to embrace Apache Iceberg,” Mueller said. “Cloudera could have gone down a proprietary path, but adopting Iceberg is a triple win. First and foremost, it’s a win for customers, who can store their very large analytical tables in a standards-based, open-source format, while being able to access them with a standard language. It’s also a win for Cloudera, as it provides a key feature on an accelerated timeline while supporting an open-source standard. Last, it’s a win for Apache, as it gets another vendor uptake.”

Last year, Cloudera reported revenues over $1 billion. Among its thousands of customers, they include over 400 governments, the top ten global telcos and nine of the top ten healthcare companies.

Also read: Top Artificial Intelligence (AI) Software for 2022

MongoDB

The founders of MongoDB were not from the database industry. Instead, they were pioneers of Internet ad networks. The team – which included Dwight Merriman, Eliot Horowitz and Kevin Ryan – created DoubleClick, which launched in 1996. As the company quickly grew, they had to create their own custom data stores and realized that traditional relational databases were not up to the job.

There needed to be a new type of approach, which would scale and allow for quick innovation. So when they left DoubleClick after selling the company to Google for $3.1 billion, they went on to develop their own database system. It was based on an open source model and this allowed for quick distribution.

The underlying technology relied on a document model and was called NoSQL. It provided for a more flexible way for developers to code their applications. It was also optimized for enormous transactional workloads.

The MongoDB database has since been downloaded more than 265 million times. The company has also added the types of features required by enterprises, such as high performance and security.

During the latest quarter, revenues hit $285.4 million, up 57% on a year-over-year basis. There are over 33,000 customers.

To keep up the growth, MongoDB is focused on taking market share away from the traditional players like Oracle, IBM and Microsoft. To this end, the company has built the Relational Migrator. It visually analyzes relational schemas and transforms them into NoSQL databases.

Confluent

When engineers Jay Kreps, Jun Rao and Neha Narkhede worked at LinkedIn, they had difficulties creating infrastructure that could handle data in real time. They evaluated off-the-shelf solutions but nothing was up to the job.

So the LinkedIn engineers created their own software platform. It was called Apache Kafka and it was open sourced. The software allowed for high-throughput, low latency data feeds.

From the start, Apache Kafka was popular. And the LinkedIn engineers saw an opportunity to build a company around this technology in 2014. They called it Confluent.

The open source strategy was certainly spot on. Over 70% of the Fortune 500 use Apache Kafka.

But Confluent has also been smart in building a thriving developer ecosystem. There are over 60,000 meet-up members across the globe. The result is that developers outside Confluent have continued to build connectors, new functions and patches.

In the most recent quarter, Confluent reported a 64% increase in revenues to $126 million. There were also 791 customers with $100,000 or more in ARR (Annual Recurring revenue), up 41% on a year-over-year basis.

Datadog

Founded in 2010, Datadog started as an operator of a real-time unified data platform. But this certainly was not the last of its new applications.

The company has been an innovator – and has also been quite successful getting adoption for its technologies. The other categories Datadog has entered include infrastructure monitoring, application performance monitoring, log analysis, user experience monitoring, and security. The result is that the company is one of the top players in the fast-growing market for observability.

Datadog’s software is not just for large enterprises. In fact, it is available for companies of any size.

Thus, it should be no surprise that Datadog has been a super-fast grower. In the latest quarter, revenues soared by 83% to $363 million. There were also about 2,250 customers with more than $100,000 in ARR, up from 1,406 a year ago.

A key success factor for Datadog has been its focus on breaking down data silos. This has meant much more visibility across organizations. It has also allowed for better AI.

The opportunity for Datadog is still in the early stages. According to analysis from Gartner, spending on observability is expected to go from $38 billion in 2021 to $53 billion by 2025.

See the Top Observability Tools & Platforms

Fivetran

Traditional data integration tools rely on Extract, Transform and Load (ETL) tools. But this approach really does not handle modern challenges, such as the sprawl of cloud applications and storage.

What to do? Well, entrepreneurs George Fraser and Taylor Brown sought out to create a better way. In 2013, they cofounded Fivetran and got the backing of the famed Y Combinator program.

Interestingly enough, they originally built a tool for Business Intelligence (BI). But they quickly realized that the ETL market was ripe for disruption.

In terms of the product development, the founders wanted to greatly simplify the configuration. The goal was to accelerate the time to value for analytics projects. Actually, they came up with the concept of zero configuration and maintenance. The vision for Fivetran is to make “business data as accessible as electricity.”

Last September, Fivetran announced a stunning round of $565 million in venture capital. The valuation was set at $5.6 billion and the investors included Andreessen Horowitz, General Catalyst, CEAS Investments, and Matrix Partners.

Tecton

Kevin Stumpf and Mike Del Balso met at Uber in 2016 and worked on the company’s AI platform, which was called Michelangelo ML. The technology allowed the company to scale thousands of models in production. Just some of the use cases included fraud detection, arrival predictions and real-time pricing.

This was based on the first feature store. It allowed for quickly spinning up ML features that were based on complex data structures.

However, this technology still relied on a large staff of data engineers and scientists. In other words, a feature store was mostly for the mega tech operators.

But Stumpf and Del Balso thought there was an opportunity to democratize the technology. This became the focus of their startup, Tecton, which they launched in 2019.

The platform has gone through various iterations. Currently, it is essentially a platform to manage the complete lifecycle of ML features. The system handles storing, sharing and reusing feature store capabilities. This allows for the automation of pipelines for batch, streaming and real-time data.

In July, Tecton announced a Series C funding round for $100 million. The lead investor was Kleiner Perkins. There was also participation from Snowflake and Databricks.

The post 10 Top Data Companies appeared first on IT Business Edge.

The Toll Facial Recognition Systems Might Take on Our Privacy and Humanity

Zephin Livingston — Fri, 22 Jul 2022 18:54:44 +0000

Artificial intelligence really is everywhere in our day-to-day lives, and one area that’s drawn a lot of attention is its use in facial recognition systems (FRS). This controversial collection of technology is one of the most hotly-debated among data privacy activists, government officials, and proponents of tougher measures on crime.

Enough ink has been spilled on the topic to fill libraries, but this article is meant to distill some of the key arguments, viewpoints, and general information related to facial recognition systems and the impacts they can have on our privacy today.

What Are Facial Recognition Systems?

The actual technology behind FRS and who develops them can be complicated. It’s best to have a basic idea of how these systems work before diving into the ethical and privacy-related concerns related to using them.

How Do Facial Recognition Systems Work?

On a basic level, facial recognition systems operate on a three-step process. First, the hardware, such as a security camera or smartphone, records a photo or video of a person.

That photo or video is then fed into an AI program, which then maps and analyzes the geometry of a person’s face, such as the distance between eyes or the contours of the face. The AI also identifies specific facial landmarks, like forehead, eye sockets, eyes, or lips.

Finally, all these landmarks and measurements come together to create a digital signature which the AI compares against its database of digital signatures to see if there is a match or to verify someone’s identity. That digital signature is then stored on the database for future reference.

Use Cases of Facial Recognition Systems

A technology like facial recognition is broadly applicable to a number of different industries. Two of the most obvious are law enforcement and security.

With facial recognition software, law enforcement agencies can track suspects and offenders unfortunate enough to be caught on camera, while security firms can utilize it as part of their access control measures, checking people’s faces as easily as they check people’s ID cards or badges.

Access control in general is the most common use case for facial recognition so far. It generally relies on a smaller database (i.e. the people allowed inside a specific building), meaning the AI is less likely to hit a false positive or a similar error. Plus, it’s such a broad use case that almost any industry imaginable could find a reason to implement the technology.

Facial recognition is also a hot topic in the education field, especially in the U.S. where vendors pitch facial recognition surveillance systems as a potential solution to the school shootings that plague the country more than any other. It has additional uses in virtual classroom platforms as a way to track student activity and other metrics.

In healthcare, facial recognition can theoretically be combined with emergent tech like emotion recognition for improved patient insights, such as being able to detect pain or monitor their health status. It can also be used during the check-in process as a no-contact alternative to traditional check-in procedures.

The world of banking saw an increase in facial recognition adoption during the COVID-19 pandemic, as financial institutions looked for new ways to safely verify customers’ identities.

Some workplaces already use facial recognition as part of their clock-in-clock-out procedures. It’s also seen as a way to monitor employee productivity and activity, preventing folks from “sleeping on the job,” as it were.

Companies like HireVue were developing software using facial recognition that can determine the hireability of prospective employees. However, it discontinued the facial analysis portion of its software in 2021. In a statement, the firm cited public concerns over AI and a growing devaluation of visual components to the software’s effectiveness.

Retailers who sell age-restricted products, such as bars or grocery stores with liquor licenses, could use facial recognition to better prevent underaged customers from buying these products.

Who Develops Facial Recognition Systems?

The people developing FRS are many of the same usual suspects who push other areas of tech research forward. As always, academics are some of the primary contributors to facial recognition innovation. The field was started in academia in the 1950s by researchers like Woody Bledsoe.

In a modern day example, The Chinese University of Hong Kong created the GaussianFace algorithm in 2014, which its researchers reported had surpassed human levels of facial recognition. The algorithm scored 98.52% accuracy compared to the 97.53% accuracy of human performance.

In the corporate world, tech giants like Google, Facebook, Microsoft, IBM, and Amazon have been just some of the names leading the charge.

Google’s facial recognition is utilized in its Photos app, which infamously mislabeled a picture of software engineer Jacky Alciné and his friend, both of whom are black, as “gorillas” in 2015. To combat this, the company simply blocked “gorilla” and similar categories like “chimpanzee” and “monkey” on Photos.

Amazon was even selling its facial recognition system, Rekognition, to law enforcement agencies until 2020, when they banned the use of the software by police. The ban is still in effect as of this writing.

Facebook used facial recognition technology on its social media platform for much of the platform’s lifespan. However, the company shuttered the software in late 2021 as “part of a company-wide move to limit the use of facial recognition in [its] products.”

Additionally, there are firms who specialize in facial recognition software like Kairos, Clearview AI, and Face First who are contributing their knowledge and expertise to the field.

Is This a Problem?

To answer the question of “should we be worried about facial recognition systems,” it will be best to look at some of the arguments that proponents and opponents of facial recognition commonly use.

Why Use Facial Recognition?

The most common argument in favor of facial recognition software is that it provides more security for everyone involved. In enterprise use cases, employers can better manage access control, while lowering the chance of employees becoming victims of identity theft.

Law enforcement officials say the use of FRS can aid their investigative abilities to make sure they catch perpetrators quickly and more accurately. It can also be used to track victims of human trafficking, as well as individuals who might not be able to communicate such as people with dementia. This, in theory, could reduce the number of police-caused deaths in cases involving these individuals.

Human trafficking and sex-related crimes are an oft-spoken refrain from proponents of this technology in law enforcement. Vermont, the state with the strictest bans on facial recognition, peeled back their ban slightly to allow for its use in investigating child sex crimes.

For banks, facial recognition could reduce the likelihood and frequency of fraud. With biometric data like what facial recognition requires, criminals can’t simply steal a password or a PIN and gain full access to your entire life savings. This would go a long way in stopping a crime for which the FTC received 2.8 million reports from consumers in 2021 alone.

Finally, some proponents say, the technology is so accurate now that the worries over false positives and negatives should barely be a concern. According to a 2022 report by the National Institute of Standards and Technology, top facial recognition algorithms can have a success rate of over 99%, depending on the circumstances.

With accuracy that good and use cases that strong, facial recognition might just be one of the fairest and most effective technologies we can use in education, business, and law enforcement, right? Not so fast, say the technology’s critics.

Why Ban Facial Recognition Technology?

First, the accuracy of these systems isn’t the primary concern for many critics of FRS. Whether the technology is accurate or not is inessential.

While Academia is where much research on facial recognition is conducted, it is also where many of the concerns and criticisms are raised regarding the technology’s use in areas of life such as education or law enforcement.

Northeastern University Professor of Law and Computer Science Woodrow Hartzog is one of the most outspoken critics of the technology. In a 2018 article Hartzog said, “The mere existence of facial recognition systems, which are often invisible, harms civil liberties, because people will act differently if they suspect they’re being surveilled.”

The concerns over privacy are numerous. As AI ethics researcher Rosalie A. Waelen put it in a 2022 piece for AI & Ethics, “[FRS] is expected to become omnipresent and able to infer a wide variety of information about a person.” The information it is meant to infer is not necessarily information an individual is willing to disclose.

Facial recognition technology has demonstrated difficulties identifying individuals of diverse races, ethnicities, genders, and age. This, when used by law enforcement, can potentially lead to false arrests, imprisonments, and other issues.

As a matter of fact, it already has. In Detroit, Robert Williams, a black man, was incorrectly identified by facial recognition software as a watch thief and falsely arrested in 2020. After being detained for 30 hours, he was released due to insufficient evidence after it became clear that the photographed suspect and Williams were not the same person.

This wasn’t the only time this happened in Detroit either. Michael Oliver was wrongly picked by facial recognition software as the one who threw a teacher’s cell phone and broke it.

A similar case happened to Nijeer Parks in late 2019 in New Jersey. Parks was detained for 10 days for allegedly shoplifting candy and trying to hit police with a car. Facial recognition falsely identified him as the perpetrator, despite Parks being 30 miles away from the incident at the time.

There is also, in critics’ minds, an inherently dehumanizing element to facial recognition software and the way it analyzes the individual. Recall the aforementioned incident wherein Google Photos mislabeled Jacky Alciné and his friend as “gorillas.” It didn’t even recognize them as human. Given Google’s response to the situation was to remove “gorilla” and similar categories, it arguably still doesn’t.

Finally, there comes the issue of what would happen if the technology was 100% accurate. The dehumanizing element doesn’t just go away if Photos can suddenly determine that a person of color is, in fact, a person of color.

The way these machines see us is fundamentally different from the way we see each other because the machines’ way of seeing goes only one way. As Andrea Brighenti said, facial recognition software “leads to a qualitatively different way of seeing … .[the subject is] not even fully human. Inherent in the one way gaze is a kind of dehumanization of the observed.”

In order to get an AI to recognize human faces, you have to teach it what a human is, which can, in some cases, cause it to take certain human characteristics outside of its dataset and define them as decidedly “inhuman.”

That said, making facial recognition technology more accurate for detecting people of color only really serves to make law enforcement and business-related surveillance better. This means that, as researchers Nikki Stevens and Os Keyes noted in their 2021 paper for academic journal Cultural Studies, “efforts to increase representation are merely efforts to increase the ability of commercial entities to exploit, track and control people of colour.”

Final Thoughts

Ultimately, how much one worries about facial recognition technology comes down to a matter of trust. How much trust does a person place in the police or Amazon or any random individual who gets their hands on this software and the power it provides that they will only use it “for the right reasons”?

This technology provides institutions with power, and when thinking about giving power to an organization or an institution, one of the first things to consider is the potential for abuse of that power. For facial recognition, specifically for law enforcement, that potential is quite large.

In an interview for this article, Frederic Lederer, William & Mary Law School Chancellor Professor and Director of the Center for Legal & Court Technology, shared his perspective on the potential abuses facial recognition systems could facilitate in the U.S. legal system:

“Let’s imagine we run information through a facial recognition system, and it spits out 20 [possible suspects], and we had classified those possible individuals in probability terms. We know for a fact that the system is inaccurate and even under its best circumstances could still be dead wrong.

If what happens now is that the police use this as a mechanism for focusing on people and conducting proper investigation, I recognize the privacy objections, but it does seem to me to be a fairly reasonable use.

The problem is that police officers, law enforcement folks, are human beings. They are highly stressed and overworked human beings. And what little I know of reality in the field suggests that there is a large tendency to dump all but the one with the highest probability, and let’s go out and arrest him.”

Professor Lederer believes this is a dangerous idea, however:

“…since at minimum the way the system operates, it may be effectively impossible for the person to avoid what happens in the system until and unless… there is ultimately a conviction.”

Lederer explains that the Bill of Rights guarantees individuals a right to a “speedy trial.” However, court interpretations have borne out that arrested individuals will spend at least a year in prison before the courts even think about a speedy trial.

Add to that plea bargaining:

“…Now, and I don’t have the numbers, it is not uncommon for an individual in jail pending trial to be offered the following deal: ‘plead guilty, and we’ll see you’re sentenced to the time you’ve already been [in jail] in pre-trial, and you can walk home tomorrow.’ It takes an awful lot of guts for an individual to say ‘No, I’m innocent, and I’m going to stay here as long as is necessary.’

So if, in fact, we arrest the wrong person, unless there is painfully obvious evidence that the person is not the right person, we are quite likely to have individuals who are going to serve long periods of time pending trial, and a fair number of them may well plead guilty just to get out of the process.

So when you start thinking about facial recognition error, you can’t look at it in isolation. You have to ask: ‘How will real people deal with this information and to what extent does this correlate with everything else that happens?’ And at that point, there’s some really good concerns.”

As Lederer pointed out, these abuses already happen in the system, but facial recognition systems could exacerbate these abuses and even increase them. They can perpetuate pre-existing biases and systemic failings, and even if their potential benefits are enticing, the potential harm is too present and real to ignore.

Of the viable use cases of facial recognition that have been explored, the closest thing to a “safe” use case is ID verification. However, there are plenty of equally effective ID verification methods, some of which use biometrics like fingerprints.

In reality, there might not be any “safe” use case for facial recognition technology. Any advancements in the field will inevitably aid surveillance and control functions that have been core to the technology from its very beginning.

For now, Lederer said he hasn’t come to any firm conclusions as to whether the technology should be banned. But he and privacy advocates like Hartzog will continue to watch how it’s used.

Read Next: What’s Next for Ethical AI?

The post The Toll Facial Recognition Systems Might Take on Our Privacy and Humanity appeared first on IT Business Edge.

Top ETL Tools 2022

Collins Ayuya — Thu, 14 Jul 2022 23:05:45 +0000

In this data-driven age, enterprises leverage data to analyze products, services, employees, customers, and more, on a large scale. ETL (extract, transform, load) tools enable highly scaled sharing of information by bringing all of an organization’s data together and avoiding data silos.

What are ETL Tools?

Extract, transform, and load a data management process for collecting data from multiple sources to support discovery, analysis, reporting, and decision-making. ETL tools are instruments that automate the process of turning raw data into information that can deliver actionable business intelligence. They extract data from underlying sources, transform data to satisfy the data models enterprise repositories, and load data into its target destination.

“Transform” is perhaps the most important part of ETL: Making sure all data is in the proper type and format for its intended use. The term has been around since the 1970s and typically has referred to data warehousing, but now is also used to power Big Data analytics applications.

Also read: Best Big Data Tools & Software for Analytics

Choosing ETL Tools

There are a variety of factors that determine which ETL tool suits your needs best. Let’s explore some of the most relevant ones.

Business goals

Your business goals are the most vital consideration when choosing ETL tools. The data integration needs of the business require ETL tools that ensure speed, flexibility, and effectiveness.

Use case

Client use cases determine what kind of ETL tools to implement. For instance, where the implementation covers different use cases or involves different cloud options, modern ETL approaches trump older ETL approaches.

Capabilities

A good ETL tool should not only be flexible enough to read and write data regardless of location but also enable users to switch providers without long delays.

Integration

An organization’s scope and frequency of integration efforts determine the kind of ETL tools they require. Organizations with more intensive tasks may require more integrations daily. They should ensure the tools they choose satisfy their integration needs.

Data sources

Data sources determine the type of ETL tools to be implemented, as some organizations may need to work with only structured data while others may have to consider both structured and unstructured data or specific data types.

Budget

Considering your budget as you research prospective ETL solutions is crucial, as costs can rise considerably with ETL tools that need lots of data mapping and manual coding. Knowing not only the ETL tool but what supporting activities you will be required to pay for is key to ensuring you get the right ETL tool working optimally.

Top ETL Tools

Here are our picks for the top ETL tools based on our survey and analysis of the market.

Oracle Data Integrator

Oracle Data Integrator (ODI) is a comprehensive data integration platform that encompasses data integration requirements such as high-volume, high-performance batch loads, SOA-enabled data services, and event-driven trickle-feed integration processes. It is part of Oracle’s data integration suite of solutions for data quality, cloud data, metadata management, and big data preparation.

Oracle Data Integrator offers support for both unstructured and structured data and is available as both an enterprise ETL tool and a cloud-based ETL tool.

Key Differentiators

High-Performance Data Transformation: ODI offers high-performance data transformation through powerful ETL that minimizes the performance impact on source systems. It also lowers cost by using the power of the database system CPU and memory to carry out transformations instead of using independent ETL transformation servers.
Out-of-the-Box Integrations: The Enterprise Edition of ODI provides a comprehensive selection of prebuilt connectors. Its modular design offers developers greater flexibility when connecting diverse systems.
Heterogeneous System Support: ODI offers heterogeneous system support with integrations for big data, popular databases and other technologies.

Cons: ODI may require advanced IT skills for data manipulation, as implementation may prove to be complex. Licensing also may prove to be expensive for smaller organizations and teams. Furthermore, it lacks the drag-and-drop features characteristic of other ETL tools.

Azure Data Factory

Azure Data Factory simplifies hybrid data integration through a serverless and fully managed integration service that allows users to integrate all their data.

The service provides more than 90 built-in connectors at no extra cost and allows users to simply construct not only ETL processes but also ELT processes, transforming the data in the data warehouse. These processes can be constructed through coding or through an intuitive code-free environment. The tool also improves overall efficiency through autonomous ETL processes and improved insights across teams.

Key Differentiators

Code-Free Data Flows: Azure Data Factory offers a data integration and transformation layer that accelerates data transformation across users’ digital transformation initiatives. Users can prepare data, build ETL and ELT processes, and orchestrate and monitor pipelines code-free. Intelligent intent-driven mapping automates copy activities to transform faster.
Built-in Connectors: Azure Data Factory provides one pay-as-you-go service to save users from the challenges of cost, time, and the number of solutions associated with ingesting data from multiple and heterogeneous sources. It offers over 90 built-in connectors and underlying network bandwidth of up to 5 Gbps throughput.
Modernize SSIS in a Few Clicks: Data Factory enables organizations to rehost and extend SSIS in a handful of clicks.

Con: The tool supports some data hosted outside of Azure, but it primarily focuses on building integration pipelines connecting to Azure and other Microsoft resources in general. This is a limitation for users running most of their workloads outside of Azure.

Talend Open Studio

Talend helps organizations understand the data they have, where it is, and its usage by providing them with the means to measure the health of their data and evaluate how much their data supports their business objectives.

Talend Open Studio is a powerful open-source ETL tool designed to enable users to extract, standardize and transform datasets into a consistent format for loading into third-party applications. Through its numerous built-in business intelligence tools, it can provide value to direct marketers.

Key Differentiators

Graphical Conversion Tools: Talend’s graphical user interface (GUI) enables users to easily map data between source and destination areas by selecting the required components from the palette and placing them into the workspace.
Metadata Repository: Users can reuse and repurpose work through a metadata repository to improve both efficiency and productivity over time.
Database SCD Tools: Tracking slowly changing dimensions (SCD) can be helpful for keeping a record of historical changes within an enterprise. For databases such as MSSQL, MySQL, Oracle, DB2, Teradata, Sybase, and more, this feature is built-in.

Cons: Installation and configuration can take a significant amount of time due to the modular nature of the tool. Additionally, to realize its full benefits, users may be required to upgrade to the paid version.

Informatica PowerCenter

Informatica is a data-driven company passionate about creating and delivering solutions that expedite data innovations. PowerCenter is Informatica’s data integration product, which is a metadata-driven platform with the goals of improving the collaboration between business and IT teams and streamlining data pipelines.

Informatica enables enterprise-class ETL for on-premises data integration while providing top-class ETL, ELT, and elastic Spark-based data processing for every cloud data integration needed through artificial intelligence (AI)-powered cloud-native data integration.

Key Differentiators

PowerCenter Integration Service: PowerCenter Integration Service assists to read and manage the integration’s workflow, which in turn delivers multiple integrations according to the needs of the organization.
Optimization Engine: Informatica’s Optimization Engine sends users’ data processing tasks to the most cost-effective destination, whether traditional ETL, Spark serverless processing, cloud ecosystem pushdown, or cloud data warehouse pushdown. This ensures the right processing is chosen for the right job, ensuring controlled and optimized costs.
Advanced Data Transformation: Informatica PowerCenter offers advanced data transformation to help unlock the value of non-relational data through exhaustive parsing of JSON, PDF, XML, Internet of Things (IoT), machine data, and more.

Con: For higher volumes, the computational resource requirement may be high.

Microsoft SSIS

Microsoft SQL Server Integration Services (SSIS) is a platform for developing enterprise-grade data transformation and integration solutions to solve complex business problems.

Integration Services can be used to handle these problems by downloading or copying files, loading data warehouses, managing SQL data and objects, and cleansing and mining data. SSIS can extract data from XML files, Flat files, SQL databases, and more. Through a GUI, users can build packages and perform integrations and transformations.

Key Differentiators

Transformations: SSIS offers a rich set of transformations such as business intelligence (BI), row, rowset, split and join, auditing, and custom transformations.
SSIS Designer: SSIS Designer is a graphical tool that can be used to build and maintain Integration Service packages. Users can use it to construct the control flow and data flows in a package as well as to add event handlers to packages and their objects.
Built-in Data Connectors: SSIS supports diverse built-in data connectors that enable users to establish connections with data sources through connection managers.

Cons: SSIS has high CPU memory usage and performance issues with bulk data workloads. The tool also requires technical expertise, as the manual deployment process can be complex.

AWS Glue

AWS Glue is a serverless data integration service that simplifies the discovery, preparation, and combination of data for analytics, application development, and machine learning. It possesses the data integration capabilities that enterprises require to analyze their data and put it to use in the shortest time possible. ETL developers and data engineers can visually build, execute, and monitor ETL workflows through AWS Glue Studio.

Key Differentiators

ETL Jobs at Scale: AWS Glue enables users to simply run and manage ETL jobs at scale, as it automates a significant part of the effort required for data integration.
ETL Jobs Without Coding: Through AWS Glue Studio, users can visually create, execute, and monitor AWS ETL jobs. They can create ETL jobs that move and transform data through a drag-and-drop editor, and AWS Glue will automatically generate the code.
Event-Driven ETL Pipelines: AWS Glue enables users to build event-driven ETL pipelines, as Glue can run ETL jobs as new data arrives.

Con: Since AWS Glue is made for AWS console and its products, it makes it difficult to use for other technologies.

Integrate.io

Integrate.io is a data integration solution and ETL provider that offers customers all the tools they require to customize their data flows and deliver better data pipelines for improved insights and customer relationships. This ETL service is compatible with data lakes and connects with most major data warehouses, proving that it is one of the most flexible ETL tools available.

Key Differentiators

Rapid, Low-Code Implementation: Integrate.io enables users to transform their data with little to no code, offering them the flexibility that alleviates the complexities of dependence on extensive coding or manual data transformations.
Reverse ETL: Integrate.io’s low-code Reverse ETL platform enables users to convert their data warehouses into the heartbeats of their organizations by providing actionable data across users’ teams. Users can focus less on data preparation and more on actionable insights.
Single Source of Truth: Users have the ability to combine their data from all of their sources and send them a single destination with Integrate.io. A single source of truth for customer data enables organizations to save time, optimize their insights, and improve their market opportunities.

Con: The tool does not support on-premises solutions.

Hevo Data

Hevo Data is a no-code data pipeline that simplifies the ETL process and enables users to load data from any data source, including software-as-a-service (SaaS) applications, databases, streaming services, cloud storage, and more.

Hevo offers over 150 data sources, with more than 40 of them available for free. The tool also enriches and transforms data into a format ready for analysis without users writing a single line of code.

Key Differentiators

Near Real-Time Replication: Near real-time replication is available to users of all plans. For database sources, it is available via pipeline prioritization, while for SaaS sources, it is dependent on API (application programming interface) call limits.
Built-in Transformations: Hevo allows users to format their data on the fly with its drag-and-drop preload transformations and to generate analysis-ready data in their warehouses using post-load transformation.
Reliability at Scale: Hevo provides top-class fault-tolerant architecture with the ability to scale with low latency and zero data loss.

Con: Some users report that Hevo is slightly complex, especially concerning operational support.

Comparing the Top ETL Tools

Tool	Mapping	Drag and Drop	Reporting	Auditing	Automation
Oracle Data Integrator	✔	X	✔	✔	✔
Azure Data Factory	✔	✔	✔	✔	✔
Talend Open Studio	✔	✔	✔	✔	✔
Informatica PowerCenter	✔	✔	✔	✔	✔
Microsoft SSIS	✔	X	✔	✔	✔
AWS Glue	✔	✔	✔	✔	✔
Integrate.io	✔	✔	✔	✔	✔
Hevo Data	✔	✔	X	✔	✔

The post Top ETL Tools 2022 appeared first on IT Business Edge.

Snowflake vs. Databricks: Big Data Platform Comparison

Surajdeep Singh — Thu, 14 Jul 2022 19:16:49 +0000

The extraction of meaningful information from Big Data is a key driver of business growth.

For example, the analysis of current and past product and customer data can help organizations anticipate customer demand for new products and services and spot opportunities they might otherwise miss.

As a result, the market for Big Data tools is ever-growing. In a report last month, MarketsandMarkets predicted that the Big Data market will grow from $162.6 billion in 2021 to $273.4 billion in 2026, a compound annual growth rate (CAGR) of 11%.

A variety of purpose-built software and hardware tools for Big Data analysis are available on the market today. To make sense of all that data, the first step is acquiring a robust Big Data platform, such as Snowflake or Databricks.

Current Big Data analytics requirements have forced a major shift in Big Data warehouse and storage architecture, from the conventional block- and file-based storage architecture and relational database management systems (RDBMS) to more scalable architectures like scale-out network-attached storage (NAS), object-based storage, data lakes, and data warehouses.

Databricks and Snowflake are at the forefront of those changing data architectures. In some ways, they perform similar functions—Databricks and Snowflake both made our lists of the Top DataOps Tools and the Top Big Data Storage Products, while Snowflake also made our list of the Top Data Warehouse Tools—but there are very important differences and use cases that IT buyers need to be aware of, which we’ll focus on here.

What is Snowflake?

Snowflake for Data Lake Analytics is a cross-cloud platform that enables a modern data lake strategy. The platform improves data performance and provides secure, quick, and reliable access to data.

Snowflake’s data warehouse and data lake technology consolidates structured, semi-structured, and unstructured data onto a single platform, provides fast and scalable analytics, is simple and cost-effective, and permits safe collaboration.

Key differentiators

Store data in Snowflake-managed smart storage with automatic micro-partitioning, encryption at rest and in transit, and efficient compression.
Support multiple workloads on structured, semi-structured, and unstructured data with Java, Python, or Scala.
Access data from existing cloud object storage instances without having to move data.
Seamlessly query, process, and load data without sacrificing reliability or speed.
Build powerful and efficient pipelines with Snowflake’s elastic processing engine for cost savings, reliable performance, and near-zero maintenance.
Streamline pipeline development using SQL, Java, Python, or Scala with no additional services, clusters, or copies of data to manage.
Gain insights into who is accessing what data with a built-in view, Access History.
Automatically identify classified data with Classification, and protect it while retaining analytical value with External Tokenization and Dynamic Data Masking.

Pricing: Enjoy a 30-day free trial, including $400 worth of free usage. Contact the Snowflake sales team for product pricing details.

What is Databricks?

The Databricks Lakehouse Platform unifies your data warehousing and artificial intelligence (AI) use cases onto a single platform. The Big Data platform combines the best features of data lakes and data warehouses to eliminate traditional data silos and simplify the modern data stack.

Key differentiators

Databricks Lakehouse Platform delivers the strong governance, reliability, and performance of data warehouses along with the flexibility, openness, and machine learning (ML) support of data lakes.
The unified approach eliminates the traditional data silos separating analytics, data science, ML, and business intelligence (BI).
The Big Data platform is developed by the original creators of Apache Spark, MLflow, Koalas, and Delta Lake.
Databricks Lakehouse Platform is being developed on open standards and open source to maximize flexibility.
The multicloud platform’s common approach to security, data management, and governance helps you function more efficiently and innovate seamlessly.
Users can easily share data, build modern data stacks, and avoid walled gardens, with unrestricted access to more than 450 partners across the data landscape.
Partners include Qlik, RStudio, Tableau, MongoDB, Sparkflows, HashiCorp, Rearc Data, and TickSmith.
Databricks Lakehouse Platform provides a collaborative development environment for data teams.

Pricing: There’s a 14-day full trial in your cloud or a lightweight trial hosted by Databricks. Reach out to Databricks for pricing information.

Snowflake vs. Databricks: What Are the Differences?

Here, in our analysis, is how the Big Data platforms compare:

Features	Snowflake	Databricks
Scalability	✔	✔
Integration	✔	✔
Customization		✔
Ease of Deployment	✔	✔
Ease of Administration and Maintenance	✔	✔
Pricing Flexibility		✔
Ability to Understand Needs	✔	✔
Quality of End-User Training	✔	✔
Ease of Integration Using Standard Application Programming Interfaces (APIs) and Tools	✔	✔
Availability of Third-Party Resources		✔
Data Lake	✔	✔
Data Warehouse	✔	✔
Service and Support	✔	✔
Willingness to Recommend	✔	✔
Overall Capability Score		✔

Choosing a Big Data Platform

Organizations need resilient and reliable Big Data management, analysis and storage tools to reliably extract meaningful insights from Big Data. In this guide, we explored two of the best tools in the data lake and data warehouse categories.

There are a number of other options for Big Data analytics platforms, and you should find the one that best meets your business needs. Explore other tools such as Apache Hadoop, Apache HBase, NetApp Scale-out NAS and others before making a purchase decision.

Further reading:

The post Snowflake vs. Databricks: Big Data Platform Comparison appeared first on IT Business Edge.

5 Top VCs For Data Startups

Tom Taulli — Tue, 12 Jul 2022 14:47:56 +0000

It seems that the boom times for venture capital are over. This is not just the sentiment of the media or analysts. Keep in mind that a variety of venture capitalists agree that the slowdown is real – and could last a few years.

Just look at Sequoia. On May 16, the top VC firm made a presentation to its portfolio companies entitled “Adapting to Endure.” It noted that the economy was at a “crucible moment” and founders need to be careful with cash burn rates.

Despite all this, top venture capitalists understand that some of the best opportunities come during hard times. Besides, there remain plenty of secular trends that will continue to drive growth.

One is data. There’s little argument from CEOs that this is a strategic asset. However, there needs to be effective tools to get value from data, and that will continue to drive investment in data startups for some time to come.

Here we’ll look at five of the top venture capital firms for data – along with some insight into where they see current investment opportunities.

Accel

Founded in 1983, Accel has invested in many categories over the years, like consumer, media, security, ecommerce and so on. But the firm has also shown strong data chops.

Its most iconic investment occurred in the summer of 2005. Accel agreed to invest $12.7 million in Facebook – which is now called Meta – for a 10.7% stake.

In terms of its enterprise data deals, they include companies like UiPath, Cloudera, Atlassian and Slack. As for recent investments, there is the $60 million funding of Cyera. The company has built a cloud-native data security platform that evaluates whether data – on AWS, Azure and GCP — is sensitive and vulnerable to risk. This is all done in real time.

Accel just raised a mega $4 billion fund that is focused on late-stage deals, an impressive display of confidence by the firm’s limited partners (LPs). This is certainly a contrarian bet as this category of investments has softened during the past year. But with valuations much more attractive, the timing could actually be good for Accel.

Greylock

Another name with some staying power, 24-year-old Greylock Partners focuses on enterprise and consumer software companies. The investments span early seed levels to later stages. In fact, the firm will incubate some of its deals at its offices. This was the case with companies like Palo Alto Networks, Workday and Sumo Logic.

One of Greylock’s best deals was for LinkedIn. The firm invested in the startup – when it had fewer than one million members – a year after its founding in 2004.

Then in 2016, Microsoft agreed to acquire LinkedIn for $26.5 billion. Reid Hoffman, who is the cofounder of LinkedIn, is currently a partner at Greylock.

An interesting recent funding for a data startup is for Baseten. The company’s system allows for fast and easy migration of machine learning to production applications. It automates the complex backend and MLOps processes. Greylock participated in the seed and Series A financings.

Sequoia

Sequoia is one of the pioneers of the venture capital industry. Don Valentine founded the firm in 1972 and he raised his first fund a couple years later. It wasn’t easy, as he had to convince investors about the potential benefits of investing in startups. At the time, it was a fairly radical concept for institutions.

But Valentine had a knack for finding the next big thing. For example, he was an early investor in Atari and Apple.

This was just the beginning. Sequoia would go on to have one of the best track records in venture capital. Just some of its huge winners include Snowflake, Stripe, WhatsApp, ServiceNow, Cisco, Yahoo! and Google.

No doubt, a big part of the investment thesis for Sequoia is on data. For example, in early June the firm led a $4.5 million seed round for CloseFactor. The startup leverages sophisticated machine learning to customize sales pitches and target the right prospects. The system has shown 2-to-4 times improvements in the quality of pipelines.

Also read: Top 7 Data Management Trends to Watch in 2022

Andreessen Horowitz

It usually takes at least a decade to become an elite venture firm. The reason is that early-stage investments generally need lots of time to generate breakout returns.

But for Andreessen Horowitz, it was able to become an elite firm within a few years. Then again, it certainly helped that its founders are visionary entrepreneurs Marc Andreessen and Ben Horowitz.

Yet they also set out to disrupt the traditional model for venture capital. For example, it set out to operate like a Hollywood talent agency. Andressen Horowitz hired specialists to help entrepreneurs with many parts of their business, such as PR, sales, marketing, and design.

The formula has been a winner. Some of Andressen Horowitz’s notable investments include: Stripe, Databricks, Plaid, Figma, Tanium and GitHub. And yes, many other venture capital firms have replicated the model.

As for a recent data deal from Andreessen Horowitz, there is the $100 million Series D funding for Imply Data (the valuation came to $1.1 billion). The founders of the company are the creators of Apache Druid, which is an open source database for analytics applications. With Imply, it has focused on the large market for developers building analytics applications.

Andreessen Horowitz certainly has lots of fire power for many more deals. In January, the company announced $9 billion in new capital for venture opportunities, growth stage and biotech.

Lightspeed

Lightspeed got its start at the depths of the dotcom bust – October 2000. But the timing would be propitious. The firm had fresh capital and the valuations were much more attractive.

In the early days, Lightspeed was focused on consumer startups. For example, it was an early investor in Snapchat. Lightspeed contributed $485,000 in the seed round.

However, during the past decade, Lightspeed has upped its game with enterprise software and infrastructure opportunities. Some of its standout deals include AppDynamics, MuleSoft, and Nutanix.

Among recent data deals for Lightspeed, Redpanda Data is one that stands out. The venture capital firm led a $50 million Series B round. Redpanda has built a streaming platform for developers. Think of it as a system of record for real-time and historical data.

In 2020, Lightspeed raised three funds for a total of $4.2 billion. The firm is now seeking about $4.5 billion for its next set of financing vehicles.

The post 5 Top VCs For Data Startups appeared first on IT Business Edge.

Business Intelligence Archives | IT Business Edge

How Revolutionary Are Meta’s AI Efforts?

How Far Along is Meta AI?

Meta AI Controversies

The Latest Developments in Meta AI

What Areas Will Meta AI Influence?

Marketing

Culture

Economy

Why We Need to Take Meta AI Seriously

Data Lake Strategy Options: From Self-Service to Full-Service

What are Data Lakes?

Self-service Data Lake Tools

Apache: Hadoop & Spark

Hewlett Packard Enterprise (HPE) GreenLake

Cloud Data Lake Tools

Amazon Web Services (AWS) Data Lake

Google Cloud

Microsoft Azure

Full-service Data Lake Tools

Cloudera Cloud Platform

Cohesity

Databricks

Domo

IBM

Oracle

Snowflake

Choosing a Data Lake Strategy and Architecture

What’s New With Google Vertex AI?

What Is Google Vertex AI?

What’s Included in the Latest Update?

Reduction Server

Tabular Workflows

Serverless Apache Spark

Example-based Explanations

Problem-Solving With Vertex AI

Data Lake vs. Data Warehouse: What’s the Difference?

What is a data lake?

How does a data lake work?

What is a data warehouse?

How does a data warehouse work?

Differences between data lake and data warehouse

Data type: Unstructured data vs. processed data

Storage: Stored raw vs. clean and transformed

Purpose: Undetermined vs. determined

Database schema: Schema-on-read vs schema-on-write

Users: Data scientist vs. business or data analysts

Size: All data up to petabytes of space vs. only structured data

Data lake vs. data warehouse: Which is right for me?

10 Top Data Companies

Databricks

SAS

Snowflake

Splunk

Cloudera

MongoDB

Confluent

Datadog

Fivetran

Tecton

The Toll Facial Recognition Systems Might Take on Our Privacy and Humanity

What Are Facial Recognition Systems?

How Do Facial Recognition Systems Work?

Use Cases of Facial Recognition Systems

Who Develops Facial Recognition Systems?

Is This a Problem?

Why Use Facial Recognition?

Why Ban Facial Recognition Technology?

Final Thoughts

Top Data Lake Solutions for 2022

Benefits of Data Lake Solutions

Common Features of Data Lake Solutions

The Best Cloud Data Lake Solutions

Snowflake

Key Differentiators

Cost

Databricks

Key Differentiators

Costs

Cloudera data lake service