Transcript

Hello, I will be speaking in English, so prepare. I know we are running long. Can you hear me okay? Yes? Okay. So, my name is Michael Amundsen. This is me, this is my handle, if you want to contact me on Twitter or GitHub, and I will post the slides from today on this Twitter account. And then I will also share them with the speaker council.

I work for a group called API Academy. I won’t say more, but I just want you to know it’s a group of very exciting and very intelligent individuals in various places in the world. Some of the things that I will be saying today are not really from me; they’re from the entire group. It’s a great group that I’m very proud to be a part of.

So I wanted to talk about this idea of big data. Not in the sense of what it is, but what it means to us, and some things that I was thinking about on the way here. I took the fast train from Paris and it was a wonderful experience. It reminded me that very often we’re going very fast in order to get to where we want to go. Sometimes we miss looking around. We miss paying attention to what’s really happening, and this is a concern I have about where we are today. We’re in a great rush to focus on what we want to get to, but I think there are some things we need to think about at the same time.

So, a very famous quote: "Those who cannot remember the past are condemned to repeat it." This is the lesson we try to remember so as not to repeat mistakes. George Santayana, 1905, a very great quote.

This is actually my favorite quote on this subject: "Those who ignore the mistakes of the future are bound to make them." It’s a very interesting idea, almost 100 years later, right? Joseph Miller is a biologist and a future thinker. What is he telling us? Mistakes of the future. If we pay attention to the road ahead, we can see a turn. If we are going too fast, we miss the turn, we make mistakes. Sometimes it’s important for us to look far enough ahead to understand what may happen.

There are three things I want to try to talk about today. I want to talk about data and storage, and what this means for us ahead. Not just today, not just trying to get the hospitals and doctors and apps to work, but what happens in 10 years, 20 years, 50 years. I want to talk about modeling information, which is very different from data. How do we make sure we know what we’re talking about together? How will we make sure machines know what they’re talking about together? And finally, I want to talk about what I call the ravages of time. Very often we don’t think about time as an element in designing our systems, but time is with us always and time affects us always. When I come back here next year, time will have affected me, and it will affect all of us. It will have affected our data. It will affect us in 20, 50, 100 years from now.

Storage

So let’s talk about storage, or data. I’m going to talk about the idea of data just as symbols and numbers and characters. This is data. This is what data is. This is what we deal with. It’s called a database, not an information base. We store data and we share data together. This idea of data separate from its meaning, separate from information, comes from a gentleman by the name of Claude Shannon.

Claude Shannon comes up with this idea. He’s working on military communications at the end of the Second World War, and he wants to make sure messages traveling from place to place are the same. How can he tell that the message received is the same message that was sent, without ever understanding the message? He invents the parity bit. The idea that we can use mathematics to understand that this is the same message, without ever understanding what this message means, is incredibly important, and we all rely on it every day. We rely on it for filters and video encoding. We rely on it for cryptography, all these ideas that say I can do mathematics on data without understanding it. So, this whole notion is about how things can affect data as it travels, and we want to make sure that we can ensure its validity. That is just valid data.
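To make the parity idea concrete, here is a minimal sketch in Python with a made-up message. Real systems use richer checks such as CRCs, but the principle is the same: the math validates the bytes without ever interpreting them.

```python
# Even parity: add one bit so the total count of 1-bits is even.
# The receiver can detect a single flipped bit without ever
# understanding what the message means.

def parity_bit(data: bytes) -> int:
    """Return 0 or 1 so that the 1-bit count, parity included, is even."""
    ones = sum(bin(b).count("1") for b in data)
    return ones % 2

def is_valid(data: bytes, parity: int) -> bool:
    # Says nothing about what the bytes *mean*, only whether
    # they arrived unchanged.
    return parity_bit(data) == parity

msg = "systolic=120".encode("utf-8")       # hypothetical message
p = parity_bit(msg)
assert is_valid(msg, p)                    # message intact
corrupted = bytes([msg[0] ^ 1]) + msg[1:]  # flip a single bit
assert not is_valid(corrupted, p)          # the flip is detected
```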

So now we have lots of big data, very big data. We heard some talk about it already. So just for a reference: in 2003, if we compared the number of connected devices to the population of the world, we had less than one device per ten people on earth. In 2010, already two devices for every person on earth. Today, the estimate is almost three and a half devices per person on earth. Keep in mind there are millions of people that have no devices. In 2020, possibly as many as seven devices per person on earth. With my watch and my phone and my tablet and my blood pressure monitor and my scale, I’m doing my part to keep up that number.

Many of us will have many, many devices, and that’s a lot of data. So much data that by 2020 it’s estimated to be 40 zettabytes; that’s 40,000 exabytes. It’s a huge number we can’t even think about. Maybe we think we need one place to store it all, just one big place. But of course that never happens. In fact, most of the data we generate, we give away. We generate it at Twitter and Facebook and LinkedIn and all these other places, and it becomes theirs. But in many cases, especially in health, we want to keep that data. We want to hold on to that data. It’s estimated we each generate about one gigabyte of data per day today. As we add more devices, we will increase that. One gigabyte of data printed on paper would fill a truck. That’s 365 trucks of data per year for each person here; that’s a lot of trucks.

In the U.S. we’re building special places to store data. This is in the U.S. state of Utah. There’s a place called Bumblehive where our government is beginning to store data. It will hold one yottabyte of data; that’s a thousand zettabytes. Remember, we just said that the world will produce 40 zettabytes by 2020. How long will this last? Not long.

So you might ask yourself, how does the brain deal with data? Where does the data that we take in go? How much data do we store? How much can the brain keep? There are lots of estimates; 100 terabytes is the estimate of the storage capacity of the human brain today, and it varies by orders of magnitude in either direction. By the way, that’s 100,000 gigabytes. Remember, I said we generate about one gigabyte a day. So that means we have more than 250 years of storage in this brain; that’s pretty amazing. So does our brain actually store every bit of sense information? The answer is no. Our brains do not store every bit of information. Instead, every day they collect information, move it into working storage, and finally into long-term memory through pruning. We actually ignore most of the information that we take in, and we create memories, or remembrances, in long-term memory and attach them to things. We get rid of most of the data that we see.
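As a quick sanity check on those numbers, here is the arithmetic, using only the rough estimates quoted above:

```python
# Back-of-envelope check of the figures in the talk. These are
# order-of-magnitude estimates, not precise measurements.

BRAIN_CAPACITY_GB = 100_000  # ~100 terabytes, estimated brain capacity
DATA_PER_DAY_GB = 1          # ~1 gigabyte generated per person per day

days = BRAIN_CAPACITY_GB / DATA_PER_DAY_GB
print(f"{days / 365:.0f} years of storage")  # -> roughly 274 years
```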

Now you might think that this is a bad thing. By the way, there is a great book called "The Secret World of Sleep," which talks a great deal about this. It’s a very, very good book about how we prune data. In fact, if we remembered everything, it would be terrible. It would constantly bother us: every little thing, every scrap of paper, every light, every blink, every person, every place that we had ever seen. In fact, forgetting helps us become more efficient.

Now why do I mention that? Because we are going to have to deal with this particular problem in big data. We’re not going to be able to keep every byte of it. There is just no way. Decade after decade, we’re not going to be able to keep it. It turns out also that it’s not a good idea to have a lot of data in front of us, because learning is hard. Learning to choose data is very hard. In fact, as Barry Schwartz would say, "Learning to choose well in a world of unlimited possibilities is too hard."

So, our brains are not set up to be able to look at every possibility. Actually, there are studies that say that when we have more options, we regret the choice we make more. Our brains are designed to make quick choices, not to deal with a lot of them.

There’s a great book by Barry Schwartz called "The Paradox of Choice" which I think was very important. Choice is intimidating; choice can cause problems. We do not want to offer too many choices to doctors or hospitals or patients, because that will actually make things worse. So, the brain has this process, and it usually uses sleep to do this, to actually prune data and then turn it into useful memory. Data is lost every day, and our memories are really just fabrications of what we want to remember. We are going to have to do something like this for our data.

There are lots of possibilities ahead of us. There are some technologies ready for pruning this. Machine learning, deep learning, and all these other kinds of things are important, but we need to prepare today to think about how we’re going to do that, where we are going to do that, and who is going to be in charge of the process of deciding what we should teach machines to prune away. Doctors, physicians, and patients need to collaborate on this idea as well.

Now, it’s important that we don’t think about the notion of every bit of data aggregated together into one big pool, where we kind of narrow it down and combine lots of bits. What’s important instead is a thing called a data lake, which is a very interesting idea from James Dixon. James Dixon talks about this idea where all the data goes in, but then you have filters and flows out of the data for the people who want to use it.

So a doctor may need a different view or a different flow of that data than the hospital or the patient. Lots of times it’s important that I don’t know things that my doctor knows; it’s just not a good idea. We need to be prepared for this. Those of us that are in charge of building computer systems have to set this up and make this available to the experts who will make decisions about how to use it.
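Here is a minimal sketch of that idea: one shared pool of records, with a different filtered view for each audience. The record fields and roles are hypothetical, purely to illustrate the pattern.

```python
# A data lake in miniature: everything flows in raw, and each
# audience gets its own filtered flow out.

lake = [
    {"patient": "p1", "kind": "blood_pressure", "value": "120/80", "billing_code": "X12"},
    {"patient": "p1", "kind": "lab_note",       "value": "...",    "billing_code": "Y34"},
]

def doctor_view(records, patient):
    # Doctors see clinical content for their patient, not billing detail.
    return [{k: r[k] for k in ("kind", "value")}
            for r in records if r["patient"] == patient]

def billing_view(records):
    # The hospital's billing office sees codes, not clinical detail.
    return [{"patient": r["patient"], "billing_code": r["billing_code"]}
            for r in records]

print(doctor_view(lake, "p1"))
print(billing_view(lake))
```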

So what am I saying? Our challenge here is to support pruning strategies in the designs of our systems, to implement these data lake ideas, and to make sure we reduce overload. It is our responsibility to make sure this is successful.
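And a sketch of one possible pruning strategy, in the same spirit as the brain’s nightly pruning: keep recent raw readings, collapse older ones into daily summaries. The 30-day cutoff and the record shape are assumptions for illustration.

```python
# Prune raw readings older than a cutoff into per-day summaries.

from collections import defaultdict
from datetime import date, timedelta

def prune(readings, today, keep_days=30):
    """readings: list of (date, value). Returns (raw, summaries)."""
    cutoff = today - timedelta(days=keep_days)
    raw = [(d, v) for d, v in readings if d >= cutoff]
    old = defaultdict(list)
    for d, v in readings:
        if d < cutoff:
            old[d].append(v)
    # Each old day collapses to (min, max, mean): the detail is gone,
    # the remembrance stays.
    summaries = {d: (min(vs), max(vs), sum(vs) / len(vs))
                 for d, vs in old.items()}
    return raw, summaries

today = date(2015, 6, 1)
data = [(today - timedelta(days=i), 70 + i % 5) for i in range(90)]
raw, summaries = prune(data, today)
print(len(raw), "raw readings kept,", len(summaries), "days summarized")
```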

Models

I want to talk a little about modeling data. The model is very different. A model is an example we follow, or a way we think about something. Remember the data I talked about before? Well, this is a model for that data. This is a model that tells us how we can use all those bits and pieces. Usually, our models are computer programs. That means if everyone is using the same program, they all use the same model. The problem is, doctors don’t need the same models as patients, and insurance providers and device providers need different models as well, but they all need the same data.

So, models are a way for us to add meaning to data. We take in lots of inputs in our world, and we have ideas about what those inputs mean. What color is red? What color is green? Who said what and what happened, if I’m a witness to an accident?

Shared models are best; that’s when I know that I understand each of you and you understand me. Since we don’t share such a good model, I have an interpreter. This is my representation today. The proof that we all have different models in our heads is right here on the screen. Optical illusions prove that the same data can mean different things to different people, whether that’s an old woman looking down or a young woman looking away, two faces or a vase, a woman with long hair or a man smoking a large pipe. These are all things that we apply to the same data. Models allow us to read the data the way we wish.

Data plus model is how we get information. There is no information in the bits and pieces; there’s only information in our model. So, information is what’s conveyed, it’s what we share. This was the information from the model that I was just talking about a moment ago, remember? The data, the model, the information. This is what we’re responsible for when we create information systems. This is where we need to help professionals understand how they can give us the models we need to get the information they are looking for. Remember this data? This is the same data; I’ve added a simple model of some color and some indentation, and it makes it much easier for us.
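A tiny sketch of that point: the same two numbers, read through two different hypothetical models, become two different pieces of information.

```python
# Same data, different models, different information.

raw = (120, 80)

# Model 1: a clinician's model -- a blood pressure reading in mmHg.
systolic, diastolic = raw
print(f"BP {systolic}/{diastolic} mmHg")

# Model 2: a designer's model -- a position on screen in pixels.
x, y = raw
print(f"point at ({x}, {y}) px")

# Information = data + model; the bits alone carry neither meaning.
```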

We can improve the usability of data messages and there are three primary ways to do it.

Selecting Formats

There are lots of formats of data. I’ve already heard conversations today about how we were surprised that different formats exist and we can’t get them to work together. It’s important for us to have a shared understanding when using these formats to send messages. Typically our geeks would say they are going to use JSON, or application/json, but there is little affordance, little extra meaning, in this. This is JSON. There’s another kind of JSON that is a rich data format. It’s called Collection+JSON. This is the same information done differently, with a different model, a model that helps us understand it better.
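For illustration, here is the same small record sketched both ways as Python dicts: first as plain application/json, then wrapped in the Collection+JSON envelope (collection, items, data, with name/value/prompt pairs). The field names and URLs are made up.

```python
import json

# Plain application/json: compact, but no built-in affordances.
plain = {"bp": "120/80", "date": "2015-04-20"}

# application/vnd.collection+json: the same data in a richer model
# that carries prompts and links along with the values.
collection = {
    "collection": {
        "version": "1.0",
        "href": "http://example.org/readings/",      # hypothetical
        "items": [{
            "href": "http://example.org/readings/1", # hypothetical
            "data": [
                {"name": "bp",   "value": "120/80",     "prompt": "Blood pressure"},
                {"name": "date", "value": "2015-04-20", "prompt": "Date taken"},
            ],
        }],
    }
}
print(json.dumps(collection, indent=2))
```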

Selecting Protocols

Protocols in the world are very, very important as well. There are lots of possible protocols. Most of us today focus on one of them, HTTP, but there are many. Protocols let us take action on data. We heard before that Brazil actually lets us read and write a lot of the medical records. In Germany, I can’t even find the medical records, because I’m not a German and I haven’t registered. So, action is incredibly important, because that tells us what we can do. It turns out that there are several of these protocols that are very good at action. Two of them you may not have heard of: CoAP and MQTT. These will become more and more important. It’s important for those of us who create applications to be prepared to use more than one protocol.
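As a sketch, here is the same reading sent over two different protocols in Python. The endpoint, broker, and topic are hypothetical, and this assumes the `requests` and `paho-mqtt` packages are installed.

```python
import json
import requests
import paho.mqtt.publish as mqtt_publish

reading = json.dumps({"bp": "120/80", "date": "2015-04-20"})

# HTTP: request/response -- good for an app asking a server for data.
requests.post("http://example.org/readings/",          # hypothetical URL
              data=reading,
              headers={"Content-Type": "application/json"})

# MQTT: lightweight publish/subscribe -- good for devices pushing
# readings to whoever has subscribed to the topic.
mqtt_publish.single("patients/p1/bp",                  # hypothetical topic
                    payload=reading,
                    hostname="broker.example.org")     # hypothetical broker
```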

Selecting Semantics

Now, I’ve still just talked about data. I haven’t actually talked about meaning. Semantics is how we get meaning. On the web, that model is actually shared through semantic indicators. You can see on the screen that there are bits and pieces that say what this data is about. That’s the meaning. There are lots of sources of meaning that exist today through various registries, and some of them are specific vocabularies for the medical profession. It’s important that we treat semantics separately.
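One way this is done on the web today is in the style of JSON-LD, where a @context maps local field names onto a shared vocabulary. The mapping below, onto schema.org terms, is only illustrative.

```python
# Semantics kept separate from the data: the @context tells any
# system that shares the vocabulary what the local names mean.

record = {
    "@context": {
        "bp":   "http://schema.org/MedicalTest",   # illustrative mapping
        "date": "http://schema.org/dateCreated",
    },
    "bp": "120/80",
    "date": "2015-04-20",
}
# Two systems can agree on what "bp" means without running the same
# program -- the semantics travel alongside the data.
```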

We represent data in formats, we support multiple protocols, and we treat semantics separately. Very often we combine these too much in our technology, and that limits our possibilities and causes us to have to change our technology over and over again.

Time

I only have a few minutes left and I want to try to cover one more topic, and that is time. Time is really a killer. Time is always with us. The arrow of time is just terrible. Everything changes and nothing stands still. Think of all of the computer languages we’ve seen in the last 50 years. All of the storage formats that we’ve seen over the centuries. How many people remember this? Right? How many people still have these in the house somewhere? How many people can still use them? Just a few.

You do not want this to happen to your data. This is the risk we run. How long will we keep this data? A year? Five years? I plan on living a while. My children will live quite a while. It’s quite possible that some people in this room will live past 100. We need to be able to use that data for centuries to come. This is our big challenge.

Think of how many programs have changed, even just word processor programs, in the last 30 or 40 years. Just the formats for online markup, the HTML versions, that have changed in the last 20 years. There’s a reason that one of the biggest standards bodies on the internet still prints everything in simple text. Why? Because you don’t need a program to read it. What happened to all of those word processor files from WordStar and from Lotus Symphony? Where are they now? Who can use them? What’s going to happen to your medical data in 30 years?

Storage formats are not transfer formats. The way we communicate is not the way we need to store information. This is really, really important. There are lots of possibilities for long-term storage. CSV, comma-separated values, is very popular. XML is still popular, and JSON; there is even a series of formats called RDF, based on triples; this is a thing called Notation 3. Somewhere along the way we have to select a storage format that’s going to last for a century or more. We have to commit to this. There are several in the running. I actually recommend either XML or JSON today, because they have an extra tool called a schema, a way that we can add separate semantics to the data, as we talked about before.
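As a sketch of what that pairing looks like with JSON: a schema stored alongside the records carries the structure and some of the semantics, so a future reader can validate the data without the original application. This assumes the `jsonschema` package; the record shape is made up.

```python
import jsonschema

# The schema is stored next to the records; it describes and
# constrains them independently of any one program.
schema = {
    "type": "object",
    "properties": {
        "bp":   {"type": "string", "description": "blood pressure, mmHg"},
        "date": {"type": "string", "format": "date"},
    },
    "required": ["bp", "date"],
}

record = {"bp": "120/80", "date": "2015-04-20"}
jsonschema.validate(record, schema)  # raises ValidationError if malformed
```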

It’s also important to remember that the media itself is volatile. You know all of those CDs you are storing information on? Every day a few of those bits disappear. They will eventually wear out and be no good. There are lots of things ahead that we don’t have enough time to talk about, such as storing in carbon nanoballs, storing at the level of molecules, even storing in quantum bits and pieces. But the one I think is most interesting is actually storing in DNA.

Researchers in Zurich can now store information in DNA that can last a million years. They can turn bits into DNA strings and store them in a DNA gel. That can stay stable for more than a million years, and then they can come back and get the bits again. All sorts of things are going to change our data designs, and our systems are going to have to be prepared for this.

Does anybody remember the movie The Time Machine from 1960, from the H.G. Wells book from the turn of the century? They even had this idea that there were going to be these talking rings that you’d spin. The spin was actually the energy, giving it the power to talk to you. How are we going to read these formats? Are we going to have the same kind of power? The same kind of applications?

The beauty of the Rosetta Stone is that it gave us more than one way to read the same bit of data. It let us decipher it. The Rosetta spacecraft from the European Space Agency, which just landed on the comet 67P last fall, actually carried a Rosetta object from the Rosetta Project: thirteen thousand pages of information in twelve hundred languages. For anybody who finds it, this is the key to deciphering lots of languages. Game programmers know this problem already. They’ve created emulators for games that are 20 years old so we can still play them today. They put them on small little cards that you can carry around.

What are we going to do in 200 years with all our medical data? Will we have the programs available to us? We need to learn from this. We need to be prepared to hold onto this data that we are creating, this big data that we are so excited about, for centuries to come. Be ready to migrate it to new media when necessary, and we may even need to save applications in the process, save applications with the data. Okay, just a few more slides, I promise.

A Lot Ahead of Us

We have a lot ahead of us and we are going very fast. It’s important that we pay attention. It’s fantastic that we’re working today on open data, but we have responsibilities centuries ahead. Vint Cerf, one of the people that helped us build the internet, says, "If we don’t want our digital lives to fade away, we need to make sure the objects we create can be rendered a long time from now." We need to make sure that we don’t make the mistakes of the future. The future is fantastic. There is no telling what we are going to see. We just need to keep our eyes open as we go along. And that is what I have. Thank you very much.

Q and A

Man 1: Thank you very much. Do you have some questions? I’m sure you do. There.

Man 2: [Inaudible]

Michael: The question was, where are the shared models of healthcare? The one that I know really well right now is something called HL7, which is used in the States, but it’s not the only one. I can’t give you a collection now, but I’ve actually been trying to build one up, and I will post my list as I find them, and maybe we can create a place to start finding these shared vocabularies. It’s a really key point, and that’s very, very good. I’ll start a list and I’ll post it on Twitter. Yes?

Man 1: Another question?

Man 3: Hi, everyone. Even if we are storing something like an XML format in a DNA strand, don’t we have to worry about mutation and things breaking down over time?

Michael: Yes. So the question has to do with the Zurich project that actually uses DNA as storage, and whether we are concerned about breakage. I don’t understand all of the technical details of the Zurich project; it was just announced a couple of months ago. But they were claiming that because DNA itself deals with degradation and keeps copies, they can take advantage of that. I don’t understand all of the science, but it certainly is a problem. The idea of repetition helps us with degradation, both at runtime and in storage, and that’s a real key point. But I will include a reference to the article and you can take a look. Yes.

Man 1: Another question?

Man 4: Hi, everyone. I saw in a hospital a strange system for archiving their records: they use microfilm. Do you believe that the only way to store digital data is not through digital means but through something like microfilm?

Michael: Yes. So, the question is whether the best way to store digital data is actually not digital form but, in this example, microfilm or microfiche, which have been with us for about a hundred years. I don’t think that we’re stuck with just that. In fact, I think what’s going to happen is we’ll go through many forms, many types. When I do research in libraries, I still use microfilm or microfiche. So, one of our problems is we actually have about a hundred years of information stored in that format. So, we either need to transfer it or keep the machines working for years to come so that we can read it.