Big data: Ten level taxonomy in learning
Big Data, at all sorts of levels in learning, reveals secrets we never imagined we could discover. It reveals things to you the user, searcher, buyer and learner. It also reveals things about you to the seller, ad vendors, tech giants and educational institutions. Big data is now big business, where megabytes mean megabucks. Given that less than 2% of all information is now non-digital, it is clear where the data mining will unearth its treasure: online. As we do more online (searching, buying, selling, communicating, dating, banking, socializing and learning), we create more and more data that provides fuel for algorithms that improve with big numbers. The more you feed these algorithms, the more useful they become.
Among the fascinating examples is Google’s success with big data in its translation service, where a trillion-word data set provides the feed for translations between over a dozen languages. Amazon’s recommendation engine looks at what you bought, what you didn’t buy, how long you looked at things and what books are bought together. This big-data-driven engine accounts for a third of all Amazon sales. Netflix’s recommendation engine accounts for an astonishing three quarters of all new orders. Target, the US retailer, knows (creepily) when someone is pregnant without the mother-to-be telling them. This led to an irate father threatening legal action when his daughter received a mail voucher for baby clothes. He returned a few days later, sheepishly apologizing!
Why is Big Data such a big deal in learning?
Online learning, by definition, is data; it can also produce data. This is one of the great advantages of being online: it is a two-way form of communication. For many years data has been gathered and used in online learning. De facto standards even emerged to make this data interoperable, namely SCORM and now the Tin Can API.
However, something new has happened: the awareness that the data produced by online learning is much more powerful than we ever imagined. It can be gathered and used to solve all sorts of difficult problems in learning, problems that have long plagued education and training, such as formative assessment, drop-out, course improvement, productivity and cost reduction.
Learning = Large data
So how relevant is big data to learning? We need to start with an admission: big data in learning is really just ‘Large data’. We’re not dealing with the unimaginable amounts of relevant data that Google brings to bear when you search or translate. The datasets we’re talking about come from individual learners, courses and individual institutions; sometimes, but rarely, from groups of institutions, national tests and examinations; and, rarer still, from international tests or large complexes of institutions.
Ten level taxonomy of data
Data can be harvested at 10 different levels:
1. Data on brain
We’ve seen the commercial launch of some primitive toys using brain sensors (see my previous post) but we’ve yet to see brain and situational sensing really hit the world of learning. Learning is wholly about changing the brain, so one would expect brain research, at some time, to accelerate learning through cheap, consumer brain- and body-based technology. South Korea is developing software and hardware that may profoundly change the way we learn. With the development of an ‘emotional sensor set’ that measures EEG, EKG and, in total, 7 kinds of biosignals, along with a situational sensor set that measures temperature, acceleration, gyroscopic motion and GPS position, they want to literally read our brains and bodies to accelerate learning. There are problems with this approach. It’s not yet clear that EEG and other brain data gathered by sensors measure much more than ‘cognitive noise’ and general increases in attention or stress, nor how we can causally relate these physiological states to learning, other than through the simple reduction of stress. The measures are like simple temperature gauges that go up and down. However, the promise is that a combination of these variables will do the job.
2. Data on learner
This is perhaps the most fruitful type of data, as it is the foundation for both learners and teachers to improve the speed and efficacy of learning. At the simplest level, one can have conditional branches that take input from the learner and other data sources to branch the course and provide routes and feedback to the learner (and teacher). Beyond this, rule-sets and algorithms can be used to provide much more sophisticated systems that present, screen-by-screen, the content of the learning experience. There are many ways in which adaptive learning can be executed. See this paper from Jim Thompson on Types of Adaptive Learning. In adaptive learning systems, the software acts as a sort of satnav, in that it knows who you are, what you know, what you don’t know, where you’re having difficulty and a host of other useful learner-specific variables. These variables can be used by the software, learner or teacher to improve the learning journey.
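The simplest level described above, conditional branching on learner data, can be sketched in a few lines of Python. This is only an illustration: the `LearnerState` record, the score thresholds and the `diagnostic`/`remedial`/`practice`/`advance` step names are all assumptions made up for the example, not part of any particular adaptive learning product.

```python
from dataclasses import dataclass, field


@dataclass
class LearnerState:
    """Hypothetical record of what the system knows about one learner."""
    name: str
    scores: dict = field(default_factory=dict)  # topic -> latest score in [0, 1]


def next_step(learner: LearnerState, topic: str, pass_mark: float = 0.7) -> str:
    """Pick the next screen for a topic using simple conditional branches."""
    score = learner.scores.get(topic)
    if score is None:
        return f"diagnostic:{topic}"  # nothing known yet, so test first
    if score < 0.4:
        return f"remedial:{topic}"    # well below the pass mark, so reteach
    if score < pass_mark:
        return f"practice:{topic}"    # close to the pass mark, so practise more
    return f"advance:{topic}"         # at or above the pass mark, so move on


learner = LearnerState("Ana", {"fractions": 0.35, "decimals": 0.8})
for topic in ("fractions", "decimals", "ratios"):
    print(topic, "->", next_step(learner, topic))
```

A real adaptive system would replace these fixed thresholds with rule-sets or algorithms fed by far more learner-specific variables, but the satnav idea is the same: the route depends on where the data says you are.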
3. Data on course components
One can look at specific learning-experience components in a course, such as video, use of forums, specific assessment items and so on. Peter Kese of Viidea is an expert in the analytics from recorded lectures and his results are fascinating. Gathering data from recorded lectures improves lectures, as one can spot the points at which attention drops and where key images, points and slides raise attention and keep the learners engaged. When Andrew Ng, the co-founder of Coursera, looked at the data from his ‘Machine Learning’ MOOC, he noticed that around 2000 students had all given the same wrong answer: they had inverted two algebraic equations. What was wrong, of course, was the question. This is a simple example of an anomaly in a relatively small but complete data set that can be used to improve a course. The next stage is to look for weaknesses in the course in a more systematic way, using algorithms designed to look specifically for repeatedly failed test items. At this level we can pinpoint learner disengagement and weak or even erroneous test items, leading to course improvement. At a more sophisticated level, in a networked learning solution where the learning experiences are presented to the learner screen-by-screen based on algorithms, items can be promoted or demoted within the network.
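Looking systematically for items like the one Andrew Ng spotted can be sketched as follows. This is a minimal illustration, not the method Coursera uses: the function name `flag_suspect_items`, the `(learner_id, item_id, answer)` tuple format and the 50% threshold are all assumptions chosen for the example.

```python
from collections import Counter, defaultdict


def flag_suspect_items(responses, answer_key, min_responses=5, threshold=0.5):
    """Flag items where a single wrong answer dominates the responses.

    responses: iterable of (learner_id, item_id, answer) tuples.
    answer_key: dict mapping item_id to the answer marked as correct.
    Returns {item_id: (most_common_wrong_answer, share_of_all_responses)}.
    """
    by_item = defaultdict(list)
    for _learner, item, answer in responses:
        by_item[item].append(answer)

    suspects = {}
    for item, answers in by_item.items():
        if len(answers) < min_responses:
            continue  # too few responses to judge this item
        wrong = Counter(a for a in answers if a != answer_key[item])
        if not wrong:
            continue  # everyone answered correctly
        answer, count = wrong.most_common(1)[0]
        share = count / len(answers)
        if share >= threshold:
            suspects[item] = (answer, share)
    return suspects


# Toy data: on q2 most learners agree on the same "wrong" answer.
responses = (
    [(f"s{i}", "q1", "A") for i in range(8)]
    + [("s8", "q1", "B"), ("s9", "q1", "C")]
    + [(f"s{i}", "q2", "D") for i in range(7)]
    + [(f"s{i}", "q2", "B") for i in range(7, 10)]
)
answer_key = {"q1": "A", "q2": "B"}
print(flag_suspect_items(responses, answer_key))
```

When many learners converge on the same wrong answer, the fault may lie in the question or the teaching rather than in the learners, so a flagged item is a prompt for a human to review it, not a verdict.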
4. Data on course
A course can produce data that also shows weak spots. It can also show dropout rates and perhaps indications of the cause of those dropouts. One can gather pre-course data about the nature of the learners (age, gender, ethnicity, geographical location, educational background, employment profile and existing competences). During the course, one can gather data on time taken on tasks, note taking, and when learning takes place and for how long, along with physiological data such as eye tracking and signals from the brain. The pre-course or initial diagnostic data can be used to determine what is presented in the course. At a more sophisticated level, data can be used as the course progresses, much as a satnav provides continuous data when you drive. Course output data from summative assessment is also useful. However, the big data approach pushes us towards not relying solely on this, as was so often the case in the past. This is important for two groups: the learners themselves, who need to know what they have and have not achieved, and the tutor, teacher or trainer, who can use personal data to provide formative assessment, interventions and advice. In one sales course for a major US retailer, sales staff are given sales training in a 3D simulation which delivers sales scenarios with a wide range of customers and customer needs. Individual competences are taught, practiced and tracked, so that the actual performance of the learners is measured within the simulation. Sales in the stores where staff received the simulation training were 6% greater than in the control group stores, where staff did traditional training. This is a good example of fine-grained data being gathered.
5. Data on groups of courses
MOOCs, in particular, have raised the stakes in data-driven design and delivery of courses. In truth, less data is gathered about learners than one would imagine by the likes of Coursera and Udacity, but MOOC mania has accelerated the interest in data-driven reflection. The University of Edinburgh has produced a data-heavy report on its six 2013 Coursera MOOCs, taken by over 300,000 learners. The report has good data, tries to separate out active learners from window shoppers and is not short on surprises. It’s a rich resource and a follow-up report is promised. This is in the true spirit of Higher Education: open, transparent and looking to innovate and improve. Rather than summarise the report, I’ve plucked out the Top Ten surprises that point towards the future development of MOOCs. If I were looking at MOOCs, I’d pore over this data carefully. That, combined with the useful information on resources expended by the University, is an invaluable business planning tool. Lori Breslow, Director of the MIT Teaching and Learning Laboratory, has looked at how data generated by MOOC users provides clues on how to design the future of learning, using massive data from “Circuits and Electronics” (6.002x), edX’s MOOC launched in March 2012. The data includes the IP addresses of 155,000 enrolled students, clickstream data on each of the 230 million interactions students had with the platform, scores on homework assignments, labs and exams, 96,000 individual posts on a discussion forum and an end-of-course survey to which over 7,000 students responded.
6. Data on institution
At this organizational level, it is vital that institutions gather data that is much more fine-grained than just assessment scores and numbers of students who leave. Many institutions, arguably most, have problems with drop-outs, either across the institution or on specific courses. One way to tackle this issue is to gather data to identify deep root causes, as well as to spot points at which interventions can be planned.
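As a rough sketch of how one might spot such intervention points, the snippet below computes week-by-week attrition from each learner’s last week of activity. The function `weekly_attrition` and the toy data are hypothetical; a real institution would draw on far richer records, and a high attrition rate only shows where learners leave, not why.

```python
def weekly_attrition(last_active_week, total_weeks):
    """Return, per week, the share of still-enrolled learners who left after it.

    last_active_week: dict mapping learner_id to the last week (1-based)
    in which that learner showed any activity. Learners whose last active
    week is the final week are treated as having completed the course.
    """
    remaining = len(last_active_week)
    rates = {}
    for week in range(1, total_weeks):
        left = sum(1 for w in last_active_week.values() if w == week)
        rates[week] = left / remaining if remaining else 0.0
        remaining -= left
    return rates


# Toy data: ten learners on a four-week course.
last_active = {f"s{i}": w for i, w in enumerate([1, 1, 2, 2, 2, 2, 4, 4, 4, 4])}
rates = weekly_attrition(last_active, total_weeks=4)
worst_week = max(rates, key=rates.get)
print(rates, "-> plan an intervention around week", worst_week)
```

Here half of the learners still enrolled at the start of week 2 never return after it, which makes that week the obvious place to look for root causes.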
7. Data on groups of institutions
Perhaps we should be a bit realistic about the word ‘big’ in an educational context, as it is unlikely that many, other than a few large, multinational, private companies, will have truly ‘big’ data. Skillsoft, Blackboard, Laureate and others may be able to muster massive data sets, but a typical school, college or university may not. The MOOC providers, such as Coursera and Udacity, are another group that have the ability and reach to gather significantly large amounts of data about learners.
8. Data on national
National data is gathered by Governments and organisations to diagnose problems and successes and reflect on whether policies are working. This is most often input data, such as numbers of students applying for courses and who those students are and so on. Then there’s output data, usually measured in terms of exams and certification. This misses much, in terms of actual improvement and often leads to an obsession with testing that takes attention away from the more useful data about the processes of learning and teaching.
9. Data on international
At international level the United Nations, UNESCO, the OECD and others collect data, such as PISA and PIAAC, produced to compare countries’ performance. It is not at all clear that this data is as reliable as its authors claim. Within countries, politicians then take these statistics, exaggerate their significance, cherry-pick the comparative countries (Singapore but not Finland) and use them to design and implement policies that can, potentially, do great harm. PISA, for example, has huge differences in demographics, socio-economic ranges and linguistic diversity within the tested nations. The skews in the data include the selection of one flagship city (Shanghai) to compare against entire nations. Immigration skews include numbers of immigrants, the effect of selective immigration, migration towards English-speaking nations and first-generation language issues. There’s also the issue of taking longer to read irregular languages and selectivity in the curriculum (see Leaning Tower of PISA: 7 skews).
10. Data on web
Google, Amazon, Wikipedia, YouTube, Facebook and others gather huge amounts of data from users of their services. This data is then used to improve the service. Indeed, I have argued that Google search, Google Translate, Wikipedia, Amazon and other services now play an important pedagogic role in real learning. There are lessons here for education in terms of the importance of data. One should always be looking to gather data on online learning, and Google Analytics is a wonderful tool.
Big data is changing learning by providing a sound basis for learners, teachers, managers and policy makers to improve their systems. Too much is hidden so more and more open data is needed. Data must be open. Data must be searchable. Data must also be governed and managed. There is also the issue of visualization. Big data is about decision making by the learner, teacher or at an organizational, national or international level and must be understood through visualization. However, data is also being used to do great harm. Big data in the hands of small minds can be dangerous (see When Big Data goes bad: 6 Epic fails).