Tag: datascience


Empirical Discovery: Concept and Workflow Model

June 20th, 2014 — 12:00am

Concept models are a powerful tool for articulating the essential elements and relationships that define new or complex things we need to understand.  We’ve previously defined empirical discovery as a new method, looking at antecedents, and also comparing and contrasting the distinctive characteristics of Empirical Discovery with other knowledge creation and insight seeking methods.  I’m now sharing our concept model of Empirical Discovery, which identifies the most important actors, activities, and outcomes of empirical discovery efforts, to complement the written definition by illustrating how the method works in practice.

Empirical discovery concept model from Joe Lamantia

In this model, we illustrate the activities of the three kinds of people most central to discovery efforts: Insight Consumers, Data Scientists, and Data Engineers.  We have robust definitions of all the major actors involved in discovery (used to drive product development), and may share some of these personas, profiles, and snapshots subsequently.  For reading this model, understand Insight Consumers as the people who rely on insights from discovery efforts to effect and manage the operations of the business.  Data Scientists are the sensemakers who achieve insights and create data products and analytical models through discovery efforts.  Data Engineers enable discovery efforts by building the enterprise data analysis infrastructure necessary for discovery, and often implement the outcomes of empirical discovery by building new tools based on the insights and models Data Scientists create.

A key assumption of this model is that discovery is by definition an iterative and serendipitous method, relying on frequent back-steps and unpredictable repetition of activities as a necessary aspect of how discovery efforts unfold.  This model also assumes the data, methods, and tools shift during discovery efforts, in keeping with the evolution of motivating questions, and the achievement of interim outcomes.  Similarly, discovery efforts do not always involve all of these elements.

To keep the essential structure and relationships between elements clear and in the foreground, we have not shown all of the possible iterative loops or repeated steps.  Some closely related concepts are grouped together, to allow reading the model on two levels of detail.

For a simplified view, follow the links between named actors and groups of concepts shown with colored backgrounds and labels.  In this reading, an Insight Consumer articulates questions to a Data Scientist, who combines domain knowledge with the Empirical Discovery Method (yellow) to direct the application of Analytical Tools (blue) and Models (salmon) to Data Sets (green) drawn from Data Sources (magenta).  The Data Scientist shares Insights resulting from discovery efforts with the Insight Consumer, while Data Engineers may implement the models or data products created by the Data Scientist by turning them into tools and infrastructure for the rest of the business.  For a more detailed view of the specific concepts and activities common to empirical discovery efforts, follow the links between the individual concepts within these named groups.  (Note: there are two kinds of connections: solid arrows indicating definite relationships and, for the Data Sets and Models groups, dashed arrows indicating possible paths of evolution.  More on this to follow.)

Another way to interpret the two levels of detail in this model is as descriptions of formal vs. informal implementations of the empirical discovery method.  People and organizations who take a more formal approach to empirical discovery may require explicitly defined artifacts and activities that address each major concept, such as predictions and experimental results.  In less formal approaches, Data Scientists may implicitly address each of the major concepts and activities, such as framing hypotheses, or tracking the states of data sets they are working with, without any formal artifact or decision gateway.  This situational flexibility follows from the applied nature of the empirical discovery method, which does not require scientific standards of proof and reproducibility to generate valued outcomes.

The story begins in the upper right corner, when an Insight Consumer articulates a belief or question to a Data Scientist, who then translates this motivating statement into a planned discovery effort that addresses the business goal. The Data Scientist applies the Empirical Discovery Method (concepts in yellow); possibly generating a hypothesis and accompanying predictions which will be tested by experiments, choosing data from the range of available data sources (grouped in magenta), and selecting initial analytical methods consistent with the domain, the data sets (green), and the analytical or reference models (salmon) they will work with.  Given the particulars of the data and the analytical methods, the Data Scientist employs specific analytical tools (blue) such as algorithms and statistical or other measures, based on factors such as expected accuracy, and speed or ease of use.  As the effort progresses through iterations, or insights emerge, experiments may be added or revised, based on the conclusions the Data Scientist draws from the results and their impact on starting predictions or hypotheses.

For example, an Insight Consumer who works in a product management capacity for an on-line social network, with a business goal of increasing users’ level of engagement with the service, wishes to identify opportunities to recommend that users establish new connections with other similar and possibly known users, based on unrecognized affinities in their posted profiles.  The Data Scientist translates this business goal into a series of experiments investigating predictions about which aspects of user profiles more effectively predict the likelihood of creating new connections in response to system-generated recommendations for similarity.  The Data Scientist frames experiments that rely on data from the accumulated logs of user activities within the network that have been anonymized to comply with privacy policies, selecting specific working sets of data to analyze based on awareness of the shape and nature of the attributes that appear directly in users’ profiles both across the entire network, and among pools of similar but unconnected users. The Data Scientist plans to begin with analytical methods useful for predictive modeling of the effectiveness of recommender systems in network contexts, such as measurements of the affinity of users’ interests based on semantic analysis of social objects shared by users within this network and also publicly in other online media, and also structural or topological measures of relative position and distance from the field of network science.  The Data Scientist chooses a set of standard social network analysis algorithms and measures, combined with custom models for interpreting user activity and interest unique to this network.  The Data Scientist has predefined scripts and open source libraries available for ready application to data (MLlib, Gephi, Weka, Pandas, etc.) in the form of Analytical Tools, which she will combine in sequences according to the desired analytical flow for each experiment.
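To make the scenario a little more tangible, here is a minimal sketch of one possible first experiment: ranking pairs of unconnected users by the overlap of their profile interests, alongside a simple structural measure (the number of connections they share).  Everything in it is an assumption for illustration: the user names, the interest sets, the connections, the `min_affinity` threshold, and the choice of Jaccard similarity as the affinity measure.  It is not the method or tooling of any particular network.

```python
from itertools import combinations

# Hypothetical profile interests and existing connections (illustrative data only).
profiles = {
    "ana": {"cycling", "jazz", "python"},
    "ben": {"jazz", "python", "chess"},
    "cat": {"cooking", "hiking"},
    "dev": {"cycling", "jazz", "python", "chess"},
}
connections = {
    frozenset(("ana", "ben")),
    frozenset(("ben", "cat")),
    frozenset(("cat", "dev")),
}


def jaccard(a, b):
    """Share of interests two users have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def neighbors(user):
    """Users directly connected to `user`."""
    return {
        other
        for pair in connections
        if user in pair
        for other in pair
        if other != user
    }


def candidate_pairs(min_affinity=0.3):
    """Rank unconnected pairs by profile affinity, with shared-connection counts."""
    rows = []
    for u, v in combinations(sorted(profiles), 2):
        if frozenset((u, v)) in connections:
            continue  # already connected, nothing to recommend
        affinity = jaccard(profiles[u], profiles[v])
        if affinity >= min_affinity:
            shared = len(neighbors(u) & neighbors(v))
            rows.append((u, v, round(affinity, 2), shared))
    # Highest affinity first; ties keep their alphabetical order.
    return sorted(rows, key=lambda row: row[2], reverse=True)


for row in candidate_pairs():
    print(row)
```

In a real effort the affinity measure would more likely come from semantic analysis of shared social objects, as described above, and the ranked pairs would only become useful once tested against whether users actually accept the recommendations.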

The nature of analytical engagement with data sets varies during the course of discovery efforts, with different types of data sets playing different roles at specific stages of the discovery workflow.  Our concept map simplifies the lifecycle of data for purposes of description, identifying five distinct and recognizable ways data are used by the Data Scientist, with five corresponding types of data sets.  In some cases, formal criteria on data quality, completeness, accuracy, and content govern which stage of the data lifecycle any given data set is at.  In most discovery efforts, however, Data Scientists themselves make a series of judgements about when and how the data in hand is suitable for use.  The dashed arrows linking the five types of data sets capture the approximate and conditional nature of these different stages of evolution.  In practice, discovery efforts begin with exploration of data that may or may not be relevant for focused analysis, but which requires some direct engagement and attention to rule in or out of consideration. Focused analytical investigation of the relevant data follows, made possible by the iterative addition, refinement and transformation (wrangling – more on this in later posts) of the exploratory data in hand.  At this stage, the Data Scientist applies analytical tools identified by their chosen analytical method.  The model building stage seeks to create explicit, formal, and reusable models that articulate the patterns and structures found during investigation.  When validation of newly created analytical models is necessary, the Data Scientist uses appropriate data – typically data that was not part of explicit model creation.  Finally, training data is sometimes necessary to put models into production – either using them for further steps in analytical workflows (which can be very complex), or in business operations outside the analytical context.
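The point about validation data is easy to show in code.  As a small sketch, assuming a working set of simple records and an arbitrary 25% hold-back, the split below keeps the validation records out of model building entirely.  The record layout, the `validation_share` value, and the fixed `seed` are illustrative assumptions, not part of the concept model.

```python
import random


def split_for_validation(records, validation_share=0.25, seed=42):
    """Shuffle a copy of the records and hold back a share for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * (1 - validation_share))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical working set: (user_id, accepted_recommendation) pairs.
working_set = [(user_id, user_id % 3 == 0) for user_id in range(20)]
model_building, validation = split_for_validation(working_set)

# The two sets never overlap, so validation uses data the model has not seen.
assert not set(model_building) & set(validation)
print(len(model_building), len(validation))
```

A random split is only one option; when records are ordered in time, holding back the most recent period is often a more honest test.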

Because so much discovery activity requires transformation of the data before or during analysis, there is great interest in the Data Science and business analytics industries in how Data Scientists and sensemakers work with data at these various stages.  Much of this attention focuses on the need for better tools for transforming data in order to make analysis possible.  This model does not explicitly represent wrangling as an activity, because it is not directly a part of the empirical discovery method; transformation is done only as and when needed to make analysis possible.  However, understanding the nature of wrangling and transformation activities is a very important topic for grasping discovery, so I’ll address it in later postings. (We have a good model for this too…)

Empirical discovery efforts aim to create one or more of the three types of outcomes shown in orange: insights, models, and data products.  Insights, as we’ve defined them previously, are discoveries that change people’s perspective or understanding, not simply the results of analytical activity, such as the end values of analytical calculations, the generation of reports, or the retrieval and aggregation of stored information.

One of the most valuable outcomes of discovery efforts is the creation of externalized models that describe behavior, structure or relationships in clear and quantified terms.  The models that result from empirical discovery efforts can take many forms (google ‘predictive model’ for a sense of the tremendous variation in what people active in business analytics consider to be a useful model), but their defining characteristic is that a model always describes aspects of a subject of discovery and analysis that are not directly present in the data itself.  For example, if given the node and edge data identifying all of the connections between people in the social network above, one possible model resulting from analysis of the network structure is a descriptive readout of the topology of the network as scale-free, with some set of subgraphs, a range of node centrality values, a matrix of possible shortest paths between nodes or subgraphs, etc.  It is possible to make sense of, interpret, or circulate a model independently of the data it describes and is derived from.
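As a rough illustration of such a descriptive readout, the sketch below derives a few of the quantities mentioned above (node centrality values, subgraphs as connected components, and shortest path lengths) from nothing but an edge list.  The edge list and all function names are hypothetical, degree centrality stands in for the many available centrality measures, and the sketch assumes an undirected network with no duplicate edges.  It makes no attempt to test whether a network is scale-free.

```python
from collections import defaultdict, deque

# Hypothetical edge list for a tiny undirected network (illustrative only).
edges = [("ana", "ben"), ("ben", "cat"), ("cat", "dev"), ("ben", "dev"), ("eve", "fay")]


def build_adjacency(edge_list):
    """Map each node to the set of nodes it is directly connected to."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(adjacency)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: hop counts from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in distances:
                distances[nbr] = distances[node] + 1
                queue.append(nbr)
    return distances


def connected_components(adjacency):
    """Group nodes into subgraphs with no paths between them."""
    seen, components = set(), []
    for node in adjacency:
        if node not in seen:
            component = set(shortest_path_lengths(adjacency, node))
            seen |= component
            components.append(component)
    return components


def describe(edge_list):
    """Summarize network structure in a form that can circulate without the raw data."""
    adjacency = build_adjacency(edge_list)
    return {
        "nodes": len(adjacency),
        "edges": len(edge_list),
        "degree_centrality": degree_centrality(adjacency),
        "components": connected_components(adjacency),
        "path_lengths": {n: shortest_path_lengths(adjacency, n) for n in adjacency},
    }


model = describe(edges)
print(model["nodes"], model["edges"], len(model["components"]))
```

The dictionary returned by `describe` contains none of the original edges, which is the sense in which a model can be interpreted and shared independently of the data it was derived from.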

Data Scientists also engage with models in distinct and recognizable ways during discovery efforts.  Reference models, determined by the domain of investigation, often guide exploratory analysis of discovery subjects by providing Data Scientists with general explanations and quantifications for processes and relationships common to the domain.  And the models generated as insight and understanding accumulate during discovery evolve in stages from initial articulation through validation to readiness for production implementation, which means being put into effect directly on the operations of the business.

Data products are best understood as ‘packages’ of data which have utility for other analytical or business purposes, such as a list of users in the social network who will form new connections in response to system-generated suggestions of other similar users.  Data products are not literally finished products that the business offers for external sale or consumption.  And as background, we assume operationalization or ‘implementation’ of the outcomes of empirical discovery efforts to change the functioning of the business is the goal of different business processes, such as product development.  While empirical discovery focuses on achieving understanding, rather than making things, this is not the only thing Data Scientists do for the business.  The classic definition of Data Science as aimed at creating new products based on data which impact the business, is a broad mandate, and many of the position descriptions for data science jobs require participation in product development efforts.

Two or more kinds of outcomes are often bundled together as the results of a genuinely successful discovery effort; for example, an insight that two apparently unconnected business processes are in fact related through mutual feedback loops, and a model explicitly describing and quantifying the nature of the relationships as discovered through analysis.

There’s more to the story, but as one trip through the essential elements of empirical discovery, this is a logical point to pause and ask what might be missing from this model? And how can it be improved?

 



Strata New York Video: Designing Big Data Interactions With the Language of Discovery

December 6th, 2013 — 12:00am

I’m late in making it available here, but O’Reilly Media published the video recording of my presentation on The Language of Discovery: A Toolkit For Designing Big Data Interactions from last year’s (2012) Strata conference in NY.

Looking back at this, I’m happy to say that while my thinking on several of the key ideas has advanced quite a bit in the past 12 months (see our more recent materials), the core ideas and concepts remain vital.

Those are, briefly:

  • Big Data is useless unless people can engage with it effectively
  • Discovery is a critical and inadequately acknowledged aspect of sense making that is core to realizing value from Big Data
  • Discovery is literally the most important human/machine interaction in the emerging Age of Insight
  • Providing discovery capability requires understanding people’s needs and goals
  • The Language of Discovery is an effective tool for understanding discovery needs and activities, and designing solutions
  • There are known patterns and structure in discovery activities that you can use to create discovery solutions

I’ve posted it to Vimeo for easier viewing – slides are here /user-experience-ux/strata-new-york-slides-new-discovery-patterns for those who wish to follow along – enjoy!


Understanding Data Science: Two Recent Studies

October 22nd, 2013 — 12:00am

If you need a deeper understanding of data science than Drew Conway’s popular Venn diagram model, or Josh Wills’ tongue-in-cheek characterization, “Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician,” two relatively recent studies are worth reading.

‘Analyzing the Analyzers,’ an O’Reilly e-book by Harlan Harris, Sean Patrick Murphy, and Marck Vaisman, suggests four distinct types of data scientists (effectively personas, in a design sense) based on analysis of self-identified skills among practitioners.  The scenario format dramatizes the different personas, making what could be a dry statistical readout of survey data more engaging.  The survey-only nature of the data, the restriction of scope to just skills, and the suggested models of skill-profiles make this feel like the sort of exercise that data scientists undertake as an everyday task: collecting data, analyzing it using a mix of statistical techniques, and sharing the model that emerges from the data mining exercise.  That’s not an indictment, simply an observation about the consistent feel of the effort as a product of data scientists, about data science.

And the paper ‘Enterprise Data Analysis and Visualization: An Interview Study’ by researchers Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer considers data science within the larger context of industrial data analysis, examining analytical workflows, skills, and the challenges common to enterprise analysis efforts, and identifying three archetypes of data scientist.  As an interview-based study, the data the researchers collected is richer, and there’s correspondingly greater depth in the synthesis.  The scope of the study included a broader set of roles than data scientist (enterprise analysts) and involved questions of workflow and organizational context for analytical efforts in general.  I’d suggest this is useful as a primer on analytical work and workers in enterprise settings for those who need a baseline understanding; it also offers some genuinely interesting nuggets for those already familiar with discovery work.

We’ve undertaken a considerable amount of research into discovery, analytical work/ers, and data science over the past three years — part of our programmatic approach to laying a foundation for product strategy and highlighting innovation opportunities — and both studies complement and confirm much of the direct research into data science that we conducted. There were a few important differences in our findings, which I’ll share and discuss in upcoming posts.



Discovery and the Age of Insight

August 21st, 2013 — 12:00am

Several weeks ago, I was invited to speak to an audience of IT and business leaders at Walmart about the Language of Discovery.  Every presentation is a feedback opportunity as much as a chance to broadcast our latest thinking (a tenet of what I call lean strategy practice – musicians call it trying out new material), so I make a point to share evolving ideas and synthesize what we’ve learned since the last instance of public dialog.

For the audience at Walmart, as part of the broader framing for the Age of Insight, I took the opportunity to share findings from some of the recent research we’ve done on Data Science (that’s right, we’re studying data science).  We’ve engaged consistently with data science practitioners for several years now (some of the field’s leaders are alumni of Endeca), as part of our ongoing effort to understand the changing nature of analytical and sense making activities, the people undertaking them, and the contexts in which they take place.  We’ve seen the discipline emerge from an esoteric specialty into full mainstream visibility for the business community.  Interpreting what we’ve learned about data science through a structural and historic perspective led me to draw a broad parallel between data science now and natural philosophy at its early stages of evolution.

We also shared some exciting new models for enterprise information engagement; crafting scenarios using the language of discovery to describe information needs and activity at the level of discovery architecture, IT portfolio planning, and knowledge management (which correspond to UX, technology, and business perspectives as applied to larger scales and via business dialog) – demonstrating the versatility of the language as a source of linkage across separate disciplines.

But the primary message I wanted to share is that discovery is the most important organizational capability for the era.  More on this in follow up postings that focus on smaller chunks of the thinking encapsulated in the full deck of slides.

Discovery and the Age of Insight: Walmart EIM Open House 2013 from Joe Lamantia

