Confused about key terms in data science and AI?
You’re not alone. There are lots of important terms to know if you’re looking to further understand, pilot, and scale AI in your career and at your company.
Don’t worry, though. Our friends at Pandata have you covered. Pandata empowers organizations to design and develop human-centered AI and machine learning solutions. They’re data science experts and they’re AI translators—they know how to describe complex topics in simple ways.
They even put together a handy data science and AI glossary, which we’re republishing from their site with their permission. It has 20 terms you need to know if you want to better understand and use AI.
Algorithm: A series of instructions, or a recipe, for manipulating data to achieve an end goal. We use programming languages such as Python or R to implement algorithms. Algorithms can range from simple addition to extremely complex neural networks.
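As a minimal illustration (our own sketch, not from the glossary), here is a simple algorithm in Python: a recipe that turns a stream of numbers into their running averages.

```python
def running_average(values):
    """Return the average seen after each new value arrives."""
    total = 0.0
    averages = []
    for count, value in enumerate(values, start=1):
        total += value                # accumulate the sum so far
        averages.append(total / count)  # average over values seen so far
    return averages

print(running_average([2, 4, 6, 8]))  # [2.0, 3.0, 4.0, 5.0]
```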
Artificial Intelligence (AI): A solution that learns to recognize and react to patterns, emulating traditionally human tasks such as understanding language, recommending business actions, and synthesizing large amounts of information. AI works best when it assists humans by learning very repetitive tasks that depend on large amounts of information.
Business Intelligence: Data should be used to derive actionable business intelligence. Our primary goal is to use data to contribute to business value. This is done through statistical analysis, data visualization/reporting, and machine learning.
Dashboard: An interactive data visualization, or series of visualizations, that allows stakeholders to explore various dimensions of data. We develop dashboards using tools such as Tableau or Power BI, with a focus on clarity and ease of use so that the end user can independently drill into data details or explore high-level summaries.
Data Engineering: Planning, designing, and implementing information systems. This includes data storage as well as the pipelines that data scientists use to access and transform data.
Data Enrichment: An organization’s data can be augmented in ways that improve business insight and empower predictive analytics. We use extensive knowledge of open-source data to supplement and enrich your proprietary data sources.
Data Lake vs. Data Warehouse: Where you store your data depends on what type of data you have. A data lake holds raw, unprocessed data, often with varying structures and no defined relationships between sources. A data warehouse stores structured, relational data drawn from many sources, not just one.
Data Science: Data science exists at the intersection of math, statistics, computer programming, and business. It is the application of these tools to provide insight and value from data.
Data Visualization and Reporting: Data can be used to provide business intelligence, but if a stakeholder cannot understand it, that intelligence is difficult to convert into business value. Visualization and reporting bridge that gap, and are also necessary when presenting results from statistical analysis or machine learning.
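As a quick sketch, a basic chart in Python with matplotlib (the revenue figures here are made up for illustration):

```python
import matplotlib.pyplot as plt

# Hypothetical monthly revenue figures for illustration only.
months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 128, 150]

fig, ax = plt.subplots()
ax.bar(months, revenue)
ax.set_xlabel("Month")
ax.set_ylabel("Revenue ($k)")
ax.set_title("Monthly Revenue")
plt.show()
```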
Deep Learning: Using architectures such as deep neural networks to perform machine learning. When the situation calls for it, deep learning can outperform classical methods and provide state-of-the-art performance. We find that deep learning is most useful with sequential data, image data, or learning from simulated environments.
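For a small-scale illustration, the sketch below trains a modest neural network on image data with scikit-learn's MLPClassifier; true deep learning typically uses dedicated frameworks and much larger architectures, so treat this as a toy stand-in.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Image data: 8x8 handwritten digits flattened into feature vectors.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small multi-layer network; deep models stack many more such layers.
model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out images
```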
Exploratory Data Analysis (EDA): A critical early stage in any data-related project, EDA involves exploring the available data and summarizing its main characteristics, often using visualizations. It can provide additional insight into the data set and generate ideas and hypotheses to explore with more formal statistical modeling.
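A typical first pass at EDA in Python with pandas might look like the following sketch (the file sales.csv and the order_total column are hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Load a hypothetical CSV of sales records.
df = pd.read_csv("sales.csv")

print(df.shape)          # number of rows and columns
print(df.dtypes)         # column types
print(df.describe())     # summary statistics for numeric columns
print(df.isna().sum())   # missing values per column

# A quick visual check of one numeric column's distribution.
df["order_total"].hist(bins=30)
plt.show()
```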
ETL (Extract, Transform, Load): To prepare a cleaned data set for querying and further use, ETL extracts data from one or more sources, transforms it into a proper format or structure, and loads it into a target database.
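A minimal ETL sketch in Python, assuming a hypothetical raw_orders.csv source and a local SQLite database as the target:

```python
import sqlite3
import pandas as pd

# Extract: pull raw records from a hypothetical CSV export.
raw = pd.read_csv("raw_orders.csv")

# Transform: fix types, drop incomplete rows, derive a new column.
raw["order_date"] = pd.to_datetime(raw["order_date"])
clean = raw.dropna(subset=["customer_id", "amount"]).copy()
clean["amount_usd"] = clean["amount"].round(2)

# Load: write the cleaned table into the target database.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("orders", conn, if_exists="replace", index=False)
```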
Feature Selection: Used in machine learning, feature selection keeps only the relevant features in a data set. Removing redundant or irrelevant features simplifies models and reduces training time.
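For example, a quick feature-selection sketch with scikit-learn, keeping only the 10 features most associated with the target:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Score each feature against the target and keep the top 10.
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)  # (569, 30) -> (569, 10)
```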
Hadoop: Apache Hadoop is an open-source software framework for distributed storage and processing of data. Hadoop benefits from distributing files across the nodes of a cluster and processing data in parallel across multiple nodes. It can be deployed on local computer clusters, in the cloud (using services like Amazon’s AWS or Microsoft’s Azure), or as a hybrid of the two.
Machine Learning: Machine learning algorithms allow computers to learn from data in order to perform specific tasks. Most often, this is some form of prediction or optimization, although it can also be useful for general pattern mining.
Natural Language Processing (NLP): Much of the world’s data comes in the form of natural language, which is often unstructured. We combine classical methods and modern deep learning to gain actionable insights and predictive analytics from text data in all forms.
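As a toy sketch, a text classifier built with scikit-learn (the example sentences and labels are made up; real projects need far more data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny, made-up labeled examples for illustration only.
texts = ["great service, very happy",
         "terrible experience, never again",
         "quick and helpful support",
         "slow response and rude staff"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Turn text into numeric features, then fit a classifier on them.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["the staff was helpful"]))  # predicted label for new text
```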
Pattern Mining: Although pattern mining is useful across all forms of machine learning, it is most useful in “unsupervised” settings, when data cannot naturally be used for predictive analytics. It often provides business intelligence on its own and can serve as a stepping stone to predictive analytics.
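A bare-bones pattern-mining sketch in plain Python, counting which items co-occur across hypothetical shopping baskets:

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; pattern mining finds items that co-occur.
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

# Count every pair of items appearing together in a basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pairs are candidate patterns worth investigating.
print(pair_counts.most_common(3))
```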
Recommender System: Recommender systems predict a user’s preferences based on inputs such as the user’s historical preferences or the preferences of similar users. Common uses include the suggestions generated by streaming services like Spotify and YouTube and the product recommendations generated by Amazon and many other e-commerce sites.
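A simplified user-based sketch in Python: predict a user’s rating for an unseen item from the ratings of similar users (the ratings matrix is made up for illustration):

```python
import numpy as np

# Hypothetical ratings matrix: rows are users, columns are items (0 = unrated).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Predict user 0's rating for item 2 from the ratings of users 1 and 2,
# weighting each neighbor's rating by its similarity to user 0.
target_user, target_item = 0, 2
sims = [cosine(ratings[target_user], ratings[u]) for u in (1, 2)]
neighbor_ratings = [ratings[1, target_item], ratings[2, target_item]]
prediction = np.dot(sims, neighbor_ratings) / sum(sims)
print(round(prediction, 2))
```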
Supervised Machine Learning: Supervised machine learning uses training data that includes both inputs and the expected outputs. Once trained, the model accepts a previously unseen input and predicts the output based on the function developed during training. Common supervised learning algorithms include decision trees, linear and logistic regression, and k-nearest neighbors; common applications include predicting future patterns and classifying categories.
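A short supervised-learning sketch with scikit-learn, training a decision tree on labeled data and scoring it on held-out examples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labeled training data: measurements (input) and species (expected output).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)

# Evaluate predictions on previously unseen inputs.
print(model.score(X_test, y_test))  # accuracy on held-out data
```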
Unsupervised Machine Learning: Unsupervised learning is more exploratory in nature. Output categories are not included in the training set, and a common goal is to find previously undetected patterns. A common example of an unsupervised learning algorithm is k-means, and common applications include clustering and anomaly detection.
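A minimal unsupervised-learning sketch with scikit-learn: k-means discovering two clusters in unlabeled points.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled points: no output categories are provided.
points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # discovered cluster assignments
print(kmeans.cluster_centers_)  # e.g., [[10., 2.], [1., 2.]]
```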
Statistical Analysis: Statistical analysis is most often used to gather high-level knowledge of data, which in turn motivates further business intelligence efforts. It can create actionable business intelligence on its own or in combination with a reporting solution, and it is often considered a necessary ingredient for machine learning.
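As one small example, a two-sample t-test with SciPy on hypothetical A/B test results:

```python
from scipy import stats

# Hypothetical conversion rates from an A/B test.
variant_a = [0.12, 0.15, 0.11, 0.14, 0.13]
variant_b = [0.16, 0.18, 0.15, 0.17, 0.19]

# Test whether the two variants' means differ more than chance would explain.
t_stat, p_value = stats.ttest_ind(variant_a, variant_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```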