Keynote - Scale Oriented Architecture
39:00 with John Sheehan. John is a co-founder and CEO of Runscope and an API fanatic. As an early employee at Twilio, John led the developer evangelism program and worked as a Product Manager for Developer Experience. After Twilio, John was Platform Lead at IFTTT, working with API providers to create new channels. John is also the creator of RestSharp, API Changelog, API Digest, and API Jobs, and co-host of Traffic and Weather, an API and cloud podcast.
-
0:00
[MUSIC]
-
0:03
[MUSIC]
-
0:07
[MUSIC]
-
0:12
[MUSIC]
-
0:23
[SOUND].
-
0:25
[MUSIC]
-
1:11
Thanks everybody, thanks for hanging around. I know it's dinner time so,
-
1:14
I'm gonna talk really fast cuz you're probably hungry, I'm probably hungry, plus I
-
1:18
don't, I only have one gear.
-
1:20
So, I can only talk one speed. I want to talk about scale-oriented architecture. So,
-
1:24
we really like repurposing acronyms at Runscope because, so,
-
1:28
service-oriented architecture is boring and enterprise and so 2005.
-
1:33
And so, like, we thought, man, we could really spice it up by redefining it.
-
1:39
Okay, let me tell you a little bit about what Runscope does because it's, it's
-
1:43
actually sort of important to understand what we're building and the tools that we
-
1:47
offer at a very high level, I'm not going to try and sell you anything,
-
1:50
to understand why we built the infrastructure behind it the way we did. So
-
1:54
we let you do three things: log, monitor, and measure your API usage to
-
1:57
solve API problems fast. And so what we're trying to do is help developers who
-
2:01
consume APIs understand the performance and characteristics of those API calls.
-
2:06
so they can build better applications. That's essentially what we do.
-
2:09
There's three ways we do that. Logging, Keith just showed you that.
-
2:12
If you were in this room you can get a nice log for all the traffic that, for
-
2:15
all the calls you're making. Monitoring: we do ongoing API
-
2:18
monitoring through a product called Runscope Radar that hits your API as often as
-
2:22
every minute from nine locations around the world or from behind your firewall.
-
2:26
And then metrics, which is our newest: usage and
-
2:29
performance reports for the API consumption that your app uses. So
-
2:33
over 20,000 developers use Runscope. We launched just about 18 months ago.
-
2:38
So it's kind of crazy 18 months for, for traction and
-
2:42
it's been growing very quickly. In fact, here's our, our graph for traffic.
-
2:46
I used to show this graph and I used to cheat, and I used an [UNKNOWN] graph,
-
2:49
which is really nice cuz it, it's always up and to the right no matter,
-
2:52
you know, how little traffic you had, it's always going that way.
-
2:55
This is now actually our weekly total graph so you can see that
-
2:58
in the last couple months traffic has really, really started taking off.
-
3:01
And, and, and
-
3:02
all of these sort of early infrastructure decisions that we made are really starting
-
3:06
to come into focus very quickly as we've been starting to grow very quickly.
-
3:11
Let's go back to the very beginning of Runscope. So
-
3:13
it was 2000, late 2012, early 2013.
-
3:16
My co-founder and I, we both worked together at Twilio.
-
3:19
He was on the API team, I was on the evangelism and pr, developer experience
-
3:24
team, and we started building Runscope, and what we started out with was essentially,
-
3:32
if you can read this, two components to start with. So there was Runscope.com, which
-
3:38
was the dashboard for where everybody would interact with their account data.
-
3:41
And then there was these Runscope URLs.
-
3:43
So a Runscope URL is a special URL that every traffic-
-
3:47
or every request you make through it gets a snapshot taken by Runscope, but
-
3:51
then sent to the upstream API and then, on the way back, we capture the response.
-
3:54
And then back to your app.
-
3:55
So you don't make any app changes,
-
3:57
all you do is swap out the host name and you can start getting an inspection for
-
4:00
any API, any language or framework, any environment, that sorta thing.
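As a rough sketch of the host name swap described here: only the host changes, while the scheme, path, and query stay the same. The function name, the bucket key, and the way the original host is encoded into a `runscope.net` host are assumptions made for illustration and are not taken from the talk.

```python
from urllib.parse import urlsplit, urlunsplit


def to_inspection_url(url: str, bucket_key: str, proxy_domain: str = "runscope.net") -> str:
    """Return the same URL with only the host swapped for a proxy host."""
    parts = urlsplit(url)
    if parts.hostname is None:
        raise ValueError(f"not an absolute URL: {url!r}")
    # Assumed encoding: existing dashes are doubled, then dots become dashes,
    # so the original host could be recovered on the proxy side.
    encoded_host = parts.hostname.replace("-", "--").replace(".", "-")
    proxy_host = f"{encoded_host}-{bucket_key}.{proxy_domain}"
    # Scheme, path, query and fragment pass through unchanged.
    # Any explicit port or userinfo is dropped in this sketch.
    return urlunsplit((parts.scheme, proxy_host, parts.path, parts.query, parts.fragment))
```

Because nothing but the host is rewritten, the calling code keeps using whatever HTTP library it already uses.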
-
4:04
Well, these two components obviously needed to talk to each other.
-
4:06
Once the Runscope URL captured traffic, we needed to get it into
-
4:11
the Runscope dashboard. And so, we started building some services and
-
4:15
the first two that we built are Identity and Request Vault.
-
4:17
So, Identity keeps track of people and Request Vault keeps track of requests so,
-
4:22
that's all of the API traffic that's going back and forth.
-
4:25
Excuse me. And so when we first started building this, you know, Frank was
-
4:29
working on [UNKNOWN] and I was working on Dashboard.
-
4:31
We needed to start talking to each other and
-
4:33
I actually really, really hate databases, like a lot I mean,
-
4:37
I've been writing SQL since SQL Server 6.5 and I can, you know?
-
4:41
You know, sling SQL with you know the mediocre best of them [LAUGH] and
-
4:45
so, but I just really hate databases.
-
4:47
But I really like APIs. In fact, in 2010 when I started this, or 2008,
-
4:52
2009 when I discovered Twilio it really like altered my view on
-
4:55
how software should work and how different components should talk to each other.
-
4:58
And I really fell in love with that really simple unified interface that HTTP gives us
-
5:01
with a really well designed API. And so, what we decided to do was, for
-
5:05
each of these two services we were just gonna put an API in front of them.
-
5:08
So, we had an API service for identity and
-
5:10
an API for Request Vault. Identity was backed by Postgres and
-
5:14
Request Vault was backed by Redis at the time, which we'll come back to in a moment.
-
5:19
This worked really well. I got to build Dashboard, I could actually sort of
-
5:22
build to an API contract that we agreed to ahead of time. I didn't have
-
5:25
to wait to, till the data was in there. You know, we just sort of figured out what
-
5:27
the interface should be and started talking back and forth.
-
5:30
And I built Runscope, again, only using API calls, and to this day, almost two
-
5:35
years later from that point, Runscope still only makes API calls. It's
-
5:39
never talked to a database directly. But then we started adding components.
-
5:42
So we added a public API so you could interact with your data [UNKNOWN] so
-
5:46
that started talking to these two services as well, using the exact same thing,
-
5:49
identity and Request Vault.
-
5:51
If you imagine that there's arrows between all of the things but
-
5:54
it gets kind of crowded if I put it on a slide.
-
5:55
But imagine there's another set of arrows there between api.runscope.com and
-
5:59
the ones in the middle and then we added a new service to handle customer
-
6:05
lifecycle events and that was just another API that talked to everything.
-
6:08
And pretty soon we just started adding services and adding services.
-
6:12
Each one with a small job to be done, right? So Courier sends
-
6:15
emails externally, FileCabinet manages file storage for, for data.
-
6:20
[UNKNOWN] Service eventually took over a bunch of parts from Identity and
-
6:23
handles things like SAML, that sort of stuff. So we have all these services and
-
6:27
then we started adding more consumers like our internal admin.
-
6:30
We added a remote proxy so
-
6:32
you can actually run Runscope URLs on premise, behind your firewall.
-
6:37
We also expanded to run it from one Amazon service region to nine around the world, so
-
6:41
that piece, like, needed to be very independent.
-
6:44
We added something for handling webhooks.
-
6:46
And then we ended up doing another on-premises test runner. Pretty soon
-
6:51
the whole thing had really blossomed into like I guess, sort of behind without us
-
6:55
really even paying attention to it, it really turned into something big.
-
6:57
In fact, to this day, err, now we have over 40 internal services and
-
7:00
only seven engineers.
-
7:02
Our service-to-engineer ratio has been greater than four for
-
7:07
almost the entire time the company has existed.
-
7:10
So we have lots and lots of little services,
-
7:13
all focused around each one does one job to be done, and does it well, and
-
7:17
uses the technology behind it that best serves that service's job.
-
7:24
All right so let's talk about,
-
7:24
I wanna talk about two ways that we took this infrastructure and we used it to,
-
7:28
to scale our infrastructure. The first one is scaling the technology,
-
7:31
the next one, we'll cover in a little bit, is scaling the team. So when we scale
-
7:35
the stack, there was a bunch of benefits that we started to get from this so
-
7:38
called microservice architecture.
-
7:40
I know microservices has now reached peak buzzword where it
-
7:43
means everything to everybody and nothing to everybody all at the same time. And
-
7:47
so when I say it I think what I just described to you is that loose network of
-
7:50
small services all serving a single job.
-
7:53
That's how we talk about micro-services, and so
-
7:54
when I say that, that's what we mean.
-
7:57
So we started to get some really good benefits for,
-
7:58
from this micro-service architecture and
-
8:00
the first one was independent deployability. So the previous,
-
8:03
one of the previous jobs I had was a large monolithic Ruby application.
-
8:07
Something you're probably all familiar with, and every time we wanted to deploy
-
8:11
a change, somebody would literally yell "deploying to prod now," and the whole office
-
8:16
would know to stop committing changes for a moment to master, and then deploy.
-
8:20
How, how, how familiar does that sound to people? Like, maybe it's not a verbal yell,
-
8:24
but maybe a HipChat or a Slack message that says hey, stop shipping stuff, right?
-
8:29
And we've never really done that. All of our sort of "we're deploying now"
-
8:33
have been informative and
-
8:34
not so much a warning shot to not screw things up or to not break the build.
-
8:38
Uh, one of the first things I also told Frank when we started was, as much as I
-
8:42
hate databases, I hated deploying from the command line. I'm like an okay developer.
-
8:48
I just don't feel like I should be a command line expert in order to push code.
-
8:52
And so Frank built a service called Prometheus.
-
8:54
And so Prometheus is our internal deploy tool and
-
8:56
there are a couple of interesting things on this slide here.
-
8:59
You can see there's a list of all of our GitHub branches.
-
9:02
And we're in our realm, our test realm here, that we run as a mirror, staging, and
-
9:06
then the branch is here, so we can deploy any branch.
-
9:08
You can see build status for any given branch.
-
9:10
And that's summarized up here.
-
9:12
And you can see what the last deploy was, who did it, what got deployed.
-
9:16
And we'll get to deploy history in a moment, but
-
9:18
I have never deployed from the command line. So I've,
-
9:20
I've only ever used Prometheus to deploy Runscope tools, and what I can do is,
-
9:24
I can go in and say, all right, I need to ship a new email template in Courier.
-
9:28
I commit that to Courier.
-
9:29
I go on to Prometheus, deploy Courier to test.
-
9:32
Test it, deploy to [UNKNOWN], and nobody else knows the difference. Nobody else had
-
9:35
to stop what they were doing cuz they weren't likely working on
-
9:38
email template code. And we can ship that very frequently. In fact,
-
9:42
across everything we deploy more than 30 times a day.
-
9:46
We've deployed over 15 times a day from ba, going back to when we were two people.
-
9:50
So we've sort of maintained that velocity as we've,
-
9:53
as we've gotten bigger. But Prometheus gives us a lot of other nice things too.
-
9:57
So you can see when somebody else is deploying, so as it's in progress so
-
10:00
you don't both sort of step on each other.
-
10:02
It actually prevents you from doing anything dangerous now. So
-
10:05
if I hit deploy while one is running it'll just say, sorry there's one running.
-
10:08
You have to wait for that to finish and then the other thing we,
-
10:11
you can see the console output as it's deploying.
-
10:14
And that shows host output from all of the EC2 instances that, that is being deployed
-
10:18
to and so, if something goes wrong it's really nice to have the raw console log.
-
10:23
To really dig into what happened in,
-
10:24
in the deploy. We also have, like, one-click re-deploy as a rollback.
-
10:28
So, if we deploy something that breaks,
-
10:31
we just, we have like this company policy that as soon as you
-
10:34
think something's wrong,
-
10:35
you should immediately roll it back and then figure out what happened.
-
10:38
Don't try to assess the situation with something that's potentially broken in
-
10:41
prod. But you can go back to any point in time, essentially, for any deploy
-
10:46
and commit that we've ever made in GitHub and
-
10:47
roll back to that at any given point. All right.
-
10:52
So we've got really good independent deployability.
-
10:54
This allowed us to move faster.
-
10:56
We also started to get really good service-level isolation.
-
10:59
So this used to say modularity, and, you know,
-
11:03
I got dinged by people who were like hey.
-
11:05
My DLLs are modular, and hey.
-
11:07
My JARs are modular, and, and they are, and your gems are too,
-
11:10
and I don't wanna take anything away from that.
-
11:13
So what I'm saying is, when you get isolation, you get to do things with your sort of
-
11:17
modules that you wouldn't necessarily be able to do with a code base module.
-
11:21
So we swapped out a major component that we were running that was written
-
11:26
in Python, we rewrote it in Go, and now we've gotta go to production with this,
-
11:30
and we, we consider ourselves infrastructure.
-
11:32
And so we can't go down. We're trying to achieve 100% uptime SLA across all of
-
11:36
our, our proxies.
-
11:37
So shipping a new huge change like that is not something that we
-
11:40
can just say, all right, we're gonna do a scheduled maintenance window,
-
11:43
your API traffic is gonna break for 20 minutes while we,
-
11:46
you know, try to deploy this new Go thing. It had to be done, gradually worked in.
-
11:51
Because this change was sort of isolated at the service level and, and
-
11:53
encapsulated behind an HTTP interface, we could use existing HTTP tools that
-
11:57
are really good at sort of phasing things in, like load balancers.
-
12:00
So what we did is we brought up the new, the new nodes with the new code on it.
-
12:04
And then behind a load balancer we would bring up just 5% of traffic and
-
12:09
then shuttle that over to the Go instances and make sure that was working.
-
12:12
All right, that worked, let's do it in another region. So
-
12:14
we then ended up with 5% across all of our regions.
-
12:16
And then we just slowly brought the traffic up watching,
-
12:19
making sure that we were getting the exact same responses and
-
12:21
the exact same characteristics that we were expecting from the old code.
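A minimal sketch of the phased rollout just described, assuming a router that sends a fixed percentage of requests to the new Go instances and the rest to the existing Python ones. The talk only says a load balancer was used; the hash-based bucketing, the function name, and the backend labels here are illustrative assumptions.

```python
import hashlib


def pick_backend(request_id: str, canary_percent: int) -> str:
    """Route roughly `canary_percent` percent of requests to the new backend."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the request id into a stable bucket from 0 to 99, so the same
    # request id always lands on the same side for a given percentage.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "go" if bucket < canary_percent else "python"
```

Raising `canary_percent` from 5 toward 100 only ever moves requests from the old backend to the new one, which makes it easier to compare responses between the two while both are running.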
-
12:25
We ran them both for
-
12:26
about a couple days after we got them both up to 100%, just load balancing across them.
-
12:30
All they did on the backend was talk to an API, so it really didn't matter.
-
12:33
There was no sort of work for getting data into the system once we collected it.
-
12:38
Once everything was good, we just pulled the Python ones out,
-
12:40
nobody knew the difference.
-
12:41
So we did a completely transparent major rewrite of our most important component,
-
12:46
by keeping that, those services isolated, and
-
12:48
letting us move it in sort of in pieces without affecting everything else.
-
12:52
[SOUND] We also get really nice Independent Scalability.
-
12:55
So if you go back to the database thing, so our first sort of
-
12:58
naive implementation of Request Vault was just dumping everything in Redis.
-
13:03
Redis is good for a lot of things but, pretty soon you start trying to do
-
13:06
relational things with it and it doesn't really work for
-
13:08
that so well since it's not really what it was designed for.
-
13:10
And so what we did is we actually realized that Postgres would be good for it, but
-
13:15
we just had to run Postgres with a little more horsepower than we were used to.
-
13:18
So not only did we rewrite Request Vault to switch over to Postgres, once we
-
13:22
did that we were able to scale up Request Vault independent of anything else,
-
13:25
right? So we put a little bit more horsepower behind it.
-
13:28
Identity is now our most popular internal service. It makes hundreds of millions of
-
13:31
calls a month, mostly all internal.
-
13:34
And we're able to scale that without scaling Courier along with it.
-
13:38
Courier makes a couple thousand calls a month, right?
-
13:40
And so we throw more instances, more hardware at the services that need it and
-
13:44
we don't have to worry about, like, wasting resources on,
-
13:48
on parts of the stack that don't need it.
-
13:50
[BLANK_AUDIO]
-
13:55
All right. We also get network resiliency.
-
13:57
So, a lot of the services are distributed across availability zones in Amazon.
-
14:02
So, again, using HTTP load balancing it's really nice that you
-
14:05
can spread out the load for the given work
-
14:08
across even boundaries like service providers.
-
14:11
So, we run a couple of proxies in rack, in Rackspace.
-
14:14
And if all of Amazon's service regions went down we
-
14:17
would still be proxying your traffic through Rackspace.
-
14:19
And we hope to add, you know, Azure and Google Cloud Compute Engine, whatever it's
-
14:23
called, and SoftLayer and DigitalOcean and all these others in the future and
-
14:26
get complete resiliency across service providers.
-
14:29
Even though it's unlikely that Amazon, Amazon's never lost two regions at once.
-
14:33
[SOUND] So, my ops guy will, you know, be upset that I jinxed that into happening.
-
14:39
All right, so we get a lot of sort of like technical benefits from the ser,
-
14:43
microservice architecture, that's allowed us to keep moving quickly, and
-
14:47
to keep up with the load as we're going.
-
14:49
But we also get a lot of human benefits sparkles and
-
14:55
we've been able to scale the team a lot faster as well.
-
14:57
So we're 13 people now but we, you know, 18 months ago we were two people.
-
15:01
And we wanted to make sure that we kept that sort of initial startup
-
15:05
velocity as we kept adding people, and as there were more moving parts, and
-
15:09
as there was more code and more components being shipped.
-
15:12
So here's a couple benefits that we've got on the, on the people side.
-
15:16
So the first one is that it puts us into a network mindset.
-
15:19
So we work with APIs and the first rule of distributed systems isn't probably this
-
15:23
officially, but the network is fallible, right? The network is unreliable.
-
15:26
And that has never been so true in our
-
15:29
lives until we started building all these API-driven applications.
-
15:33
I'm gonna talk more about that in a little bit,
-
15:35
but when you start building behind all
-
15:38
these services you're constantly thinking this call is going to fail.
-
15:42
It's not, not if it fails or it might fail, it's going to fail at some point.
-
15:46
And so you start building your applications with this sort of fallible
-
15:49
network in mind and you build a lot more resilience into the system by default.
-
15:52
When we talk about smart client later on,
-
15:55
that's one of the things that really stems from our thinking that every call we
-
15:59
make to a service could fail at any given point, and we wanted to, like,
-
16:03
build a client that thinks about failures as well as successes sort of equally.
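To make the "every call could fail" mindset concrete, here is a minimal sketch of a retrying call wrapper with exponential backoff. It is not Runscope's smart client, whose details are not given in this part of the talk; the function name, retry count, delays, and the choice of which exceptions count as retryable are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on network-style errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc
            # Wait longer after each failure, but not after the final one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

The `sleep` parameter exists only so the waiting can be replaced in tests; failure handling is part of the normal code path rather than an afterthought.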
-
16:10
The next one is it helps us isolate breakage. So if I break Courier, if I ship
-
16:14
bad code to Courier, which is again making a few thousand,
-
16:17
sending a few thousand emails a month,
-
16:20
it's not gonna take down the whole system.
-
16:21
In fact, all of the dependent services to Courier will just send a message to it and
-
16:25
be like, if Courier is broken they'll just keep retrying until it works again.
-
16:30
And so when I break code, cuz as the CEO who
-
16:34
still codes I'm the most likely culprit for breakage, it really
-
16:39
minimizes the impact that I'm going to have across this system, especially once
-
16:42
you've sort of built all the systems with that network fallibility in mind.
-
16:46
Some of our new employees who are a little more junior were really,
-
16:50
really scared about their first deploys.
-
16:52
And we were like, you know what, don't worry about it, like, you can't break everything.
-
16:55
It would be very difficult to break 40 services with one deploy.
-
17:00
In fact, in our infrastructure it would be nearly impossible.
-
17:04
So the first time I, I gave this talk I used this term human modularity and I
-
17:07
promised I would never use it again but I haven't come up with anything better yet.
-
17:10
So this is the third time I've, I'll say that I'll never use this term again.
-
17:15
It's been easier to organize our team and tasks around services instead of,
-
17:21
[UNKNOWN] around like having to understand the whole system.
-
17:24
The amount of information you need to keep in memory, in your human memory, not
-
17:28
computer memory, in order to fix a problem is much smaller when,
-
17:31
when the problem is boiled down to go add this feature to this service,
-
17:35
instead of go try to figure out the dependency chains for all the calls across
-
17:39
all of the you know methods and all of the consumers of this module.
-
17:42
It really like reduces the amount of,
-
17:44
of in process things that you have to remember in order to ship a change.
-
17:49
And so, we can actually also structure teams around the most popular services. So,
-
17:54
our Calculon test runner, it processes all the test data.
-
17:59
We basically have a team that sort of sits and works on that, and
-
18:01
as long as they maintain their API contracts and their performance SLAs,
-
18:05
it doesn't really matter what changes they make behind the scenes and they, they
-
18:09
don't have to worry about how that affects the rest of the services around it.
-
18:12
So this, I mean, obviously it aligns very well with the tech benefits of isolation,
-
18:17
but you can also you know, group people around those problems and
-
18:20
services as well.
-
18:22
And my favorite one, going back to why, about hating data, databases,
-
18:26
is you get a uniform interface. So no matter where I am working in the code,
-
18:29
I know that if I want to send an
-
18:31
email all I need to do is make an HTTP POST to Courier and it will get sent.
-
18:34
I don't need to know a bunch of different database,
-
18:37
you know, query languages. I don't need to know the protocol for
-
18:40
Redis versus Postgres versus Elasticsearch versus DynamoDB.
-
18:44
All I need to know is how to make an HTTP request and
-
18:46
I can interact with any part of this system from any other part of the system.
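As a sketch of what that uniform interface can look like from a caller's side, the function below builds an HTTP POST asking an email service to send a message. The `courier.internal` host, the `/emails` path, and the payload fields are hypothetical; the talk does not describe Courier's actual API.

```python
import json
import urllib.request


def build_send_email_request(
    to: str,
    template: str,
    base_url: str = "http://courier.internal",  # hypothetical internal host
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST asking the email service to send a message."""
    payload = json.dumps({"to": to, "template": template}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/emails",  # hypothetical resource path
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. The caller never needs to know what datastore or mail provider sits behind the service.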
-
18:49
And so there's a lot less to learn again for new developers because all
-
18:52
they have to know is one way to access and write, to read and write data.
-
18:59
Also later on as we have devs that only work higher in the stack, this is like,
-
19:03
this just removes a huge amount of cognitive overhead for them to try to
-
19:06
understand how to ship a new feature or what technology to, to use to build it.
-
19:12
All right, so we spend a lot of time investing in infrastructure between
-
19:15
Prometheus, between all the, [UNKNOWN] orchestration system,
-
19:19
our atlas realm, config centralization, that sort of stuff.
-
19:25
And without that we really wouldn't have been able to get all the benefits that we
-
19:28
got from these microservices.
-
19:29
So I would say if you're not willing to invest in the infrastructure to
-
19:32
make this possible then don't invest in microservices.
-
19:34
Most of the complaints I hear are, I don't know how to deploy 40 services
-
19:38
using [UNKNOWN]. Isn't there a lot of management overhead for that?
-
19:40
Isn't there a lot of operations overhead for that? And there absolutely is. So,
-
19:43
if you're not willing to invest in automating that and
-
19:46
abstracting that away from, from your developers,
-
19:48
you're probably gonna have a harder time getting value out of microservices.
-
19:53
All right, so, so far I've talked about internal APIs for Runscope.
-
19:58
But if you look at an application from a broader perspective,
-
20:02
that's really just one component of what we do, right? So
-
20:04
I've mentioned this Courier service a bunch that talks to Mandrill.
-
20:07
We have a bunch of other services that talk to other third parties.
-
20:10
And ultimately we have all of these different pieces that sort of make up
-
20:13
our application experience, right?
-
20:15
We have the third-party APIs we depend on. We have our, we don't have native apps, but
-
20:18
imagine you know, we were shipping a customer [UNKNOWN] stuff.
-
20:21
We'd have mobile apps, we'd have desktop apps, we'd have a website.
-
20:23
We'd have sensors and hardware dumping data in the system.
-
20:26
All of these things are talking to each other over APIs and
-
20:30
it's really like these APIs, oh, not the slide I thought it was gonna be.
-
20:35
Anyway, get to watch the animation again.
-
20:37
So all of these pieces really encompass what the,
-
20:41
the technical components or pieces of the actual user experience.
-
20:45
It, with any one of these pieces not talking to each other or
-
20:47
not working well with each other, you get a bad user experience.
-
20:50
So this was never more true for me than when I worked at If This Then That.
-
20:53
So IFTTT connects services you use. You can do things like when the U,
-
20:57
team USA wins the gold medal then you know, turn on a light, a disco ball, and
-
21:01
play music, right?
-
21:02
You can do all sorts of hardware hacking and combinations. You can
-
21:05
combine different services and,
-
21:08
If we got bad data from the Olympic API for that integration,
-
21:11
then somebody's lightbulb doesn't get lit up in the service.
-
21:14
We look bad because we promised somebody that they would be able to do
-
21:16
fun things like that, right?
-
21:17
So, all these pieces combined really comprise your application experience.
-
21:23
So, one, once you have this sort of API driven application or
-
21:26
distributed application,
-
21:27
you run into sorta these greater challenges.
-
21:29
So I'm gonna talk about three and
-
21:32
yeah, we'll just start with the first one, all right.
-
21:36
[LAUGH] The first challenge is getting a complete picture of your app.
-
21:39
So we run all those internal services, and I am ashamed to tell you that I
-
21:43
still don't, can't tell you how many API calls we make on the whole every month.
-
21:47
So we're, we're working towards that. It's hard to
-
21:50
use our own tool on itself sometimes and, and so we get in these looping situations.
-
21:54
But our goal is to make it so that you will always know how many calls you make,
-
21:58
and to get a, make it more easy to get a complete picture of your app.
-
22:03
Remote services tend to have like an ad,
-
22:05
a really strong adverse effect on your user experience. So I mentioned the one
-
22:09
about, you know, not turning on the lights when the USA wins a gold medal.
-
22:13
We had another integration with a social network where every other Friday night at about
-
22:18
11 p.m. the responses would start going to about 15 to 20 second response times, but
-
22:22
then it would work.
-
22:23
So it, it wasn't like downtime. Downtime is super easy to spot, right?
-
22:27
Like, we tried to make the call and
-
22:28
we got an exception, it was broken. Slow is way harder to find.
-
22:31
And so, because this is one of our most popular integrations at IFTTT,
-
22:35
it would basically lock up all our workers for 15 to 20 seconds at a time.
-
22:38
Something that was taking, like, normally a couple of hundred milliseconds.
-
22:42
And it would just grind the whole system to a halt and
-
22:43
then all our queues would back up and then everything would break.
-
22:46
And it took us a really long time to identify what it was. In fact,
-
22:50
right as we identified what it was, the service provider fixed it.
-
22:54
[LAUGH] So.
-
22:55
But we didn't have any information or really good complete picture of
-
22:58
all the calls going back and forth to even go to them and say, hey, here's the pattern
-
23:02
we're seeing, that it's this every-other-week thing at 11 o'clock on Friday.
-
23:05
So, getting that complete picture is still difficult but
-
23:09
we're, we're trying to make that easier.
-
23:14
Also like traditional monitoring tools don't really get into this very well.
-
23:17
So a lot of like APM tools will tell you that something was slow or broken.
-
23:22
But they don't really consider like what the data was on the wire.
-
23:25
So if you've ever gotten in your bug exception tracker a, like, JSON parse
-
23:29
exception, that's probably the most common API downtime notification that exists.
-
23:36
A JSON parse exception says nothing about the network or
-
23:38
why that JSON couldn't parse.
-
23:40
But usually what happens is the API you're trying to hit is returning a 503 which
-
23:43
is an eight, in like, typically the default, like, nginx or the Apache page
-
23:47
which comes back as HTML and your JSON parser can't do anything with it, right?
-
23:50
But you're thinking my JSON code is, my JSON parsing code, or my JSON
-
23:54
library is broken when in fact the remote service is sending you invalid data.
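One way to keep a failing upstream from surfacing as a bare JSON parse exception is to look at the status code and content type before parsing. The sketch below is an illustration under assumed names (`UpstreamError`, `parse_api_response`), not something described in the talk.

```python
import json


class UpstreamError(Exception):
    """Raised when the remote API did not return a usable JSON response."""

    def __init__(self, status: int, content_type: str, body_preview: str):
        super().__init__(f"upstream returned {status} ({content_type or 'no content type'})")
        self.status = status
        self.content_type = content_type
        self.body_preview = body_preview


def parse_api_response(status: int, content_type: str, body: bytes):
    """Parse a JSON body only when the response actually looks like JSON."""
    media_type = content_type.split(";")[0].strip().lower()
    is_json = media_type == "application/json" or media_type.endswith("+json")
    if status >= 400 or not is_json:
        # Keep a short preview of what was on the wire for the error log.
        raise UpstreamError(status, content_type, body[:200].decode("utf-8", "replace"))
    return json.loads(body)
```

With a check like this, a 503 that comes back as an HTML error page is reported as an upstream failure, with the status and a preview of the body attached, instead of looking like a bug in the parsing code.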
-
23:58
A lot of the APM tools and the error logs,
-
24:00
they don't really describe them very well.
-
24:02
So it's important to try to get a better grasp on like, what the actual traffic is
-
24:06
and what the actual responses are, in both good and bad situations.
-
24:10
So our strategy for that is that we watch and
-
24:12
log everything. So we do this: everything third party, we log.
-
24:16
Thankfully we have a really great tool for doing that.
-
24:19
But every call we make outbound to an API we, we shuttle that through Runscope so
-
24:22
we can see what's going back and forth.
-
24:25
We were getting, really weird, what we thought were weird MailChimp errors, and
-
24:29
then, cuz we got 500s back and we went and looked and
-
24:31
it turns out MailChimp returns 500 for basically any error.
-
24:35
[LAUGH] If you saw King's talk just before this,
-
24:37
it would be a classic example of how not to use status codes.
-
24:41
All of our inbound webhook requests are also run through Runscope URLs.
-
24:45
So if our server for some reason couldn't process it, we could go to Runscope and
-
24:48
we can click retry and rerun that webhook and we don't miss anything.
-
24:52
This has really sort of altered how we view APIs even though we
-
24:56
knew going in that this was like really important.
-
24:59
Actually having done it now for all of our third party service integrations has
-
25:02
really made us a lot more skeptical of even the providers that we've chosen who
-
25:05
had good reputations or that we thought had really good reputations.
-
25:09
All right, next challenge in, in dealing with distributed applications is
-
25:12
managing change. So, if you've been doing software for more than five minutes you
-
25:15
know that this is not a problem that is limited to API driven applications.
-
25:19
But I'm gonna talk about it from the perspective of, of using APIs.
-
25:24
API version changes are very difficult to change, right, or to track.
-
25:29
Largely you have some HTTP servers that may or may not be using, like, versions in
-
25:34
the resources or version content types or media types is unlikely to be notifying
-
25:40
you every time they add something and your app is like dependent on,
-
25:43
on all of these changes not being, being made in a broken fashion.
-
25:48
A lot of changes are made accidentally. A popular social network API used to
-
25:53
accidentally capitalize an attribute name temporarily until they realized that
-
25:57
whatever was generating their objects was doing this.
-
26:00
And that would just cause, I mean, one letter changing you think would not be
-
26:04
the end of the world, and it would hose up our entire application, cuz
-
26:08
we were looking for essentially the wrong key and
-
26:10
couldn't find that data, but all the rest of the data was there, right?
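One defensive pattern against this kind of accidental casing change can be sketched as a case-insensitive key lookup. This is an illustration only, not something described in the talk; the function name `get_ci` and the `screen_name` key are hypothetical.

```python
def get_ci(payload, key, default=None):
    """Look up `key` in a decoded JSON object, tolerating a change in letter case.

    Hypothetical helper: an exact match wins; otherwise the first key that
    matches case-insensitively is used.
    """
    if key in payload:
        return payload[key]
    wanted = key.lower()
    for candidate, value in payload.items():
        if isinstance(candidate, str) and candidate.lower() == wanted:
            return value
    return default


# The provider accidentally starts sending "Screen_name" instead of "screen_name".
response = {"id": 7, "Screen_name": "example_user"}
name = get_ci(response, "screen_name")
```

The trade-off is that a tolerant reader can hide a provider's mistake, so it is worth logging whenever the fallback path is taken.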
-
26:12
So, that's a change that they didn't intend and that's sort of
-
26:16
like the bad change but there are a lot of other subtle changes too that
-
26:19
companies sort of accidentally introduce.
-
26:22
It turns out SDKs, which are sort of the way that API providers will
-
26:27
tell you that they are getting around these sort of
-
26:29
changing problems by abstracting away the underlying API, actually only
-
26:34
end up exacerbating the problem in the long run instead of solving it.
-
26:38
So, this is probably my favorite thing about APIs: I hate SDKs.
-
26:43
They are terrible. They're not terrible for everything, but
-
26:47
they're terrible for maintaining software or writing maintainable software.
-
26:51
Which is where you spend about 99% of the time with software that gets out of
-
26:54
the development stage, right.
-
26:56
Is maintaining it.
-
26:57
So I wanna give you a couple reasons why you
-
26:58
should think twice about taking a dependency on an SDK.
-
27:00
Okay, so let's look at the first case, which is managing version changes.
-
27:04
So this is a typical conversation between your code and an API and an SDK.
-
27:09
The API might be on version one and the SDK might be on version two.
-
27:14
And now you're watching two versions, right?
-
27:16
So, the underlying API may get new features, but
-
27:19
now you've gotta go check to see if your SDK supports those new features.
-
27:22
And in a lot of cases they don't right away.
-
27:25
So now you're tracking the state of two versions.
-
27:27
Oh, I thought there was another part of that slide.
-
27:29
But anyway, you're starting to get compounding version management, and
-
27:33
compounding things that you have to watch in order
-
27:35
to keep up to date with that API.
-
27:38
This is just the case if you're using one.
-
27:39
So if you're using two you're actually introducing a third layer of
-
27:42
version management that you definitely don't wanna be dealing with.
-
27:45
So in a lot of languages there aren't that many HTTP clients or
-
27:48
JSON parsers or XML parsers.
-
27:50
They're sort of these core things, typically open source or
-
27:54
in the standard library,
-
27:56
that everyone depends on, right? An HTTP client is a classic one.
-
28:00
So your code is dealing with APIs.
-
28:02
Those have a version, and then SDKs have another version, which may or
-
28:06
may not be up to date with the API version.
-
28:08
They are also depending on versions of these dependent pieces that
-
28:10
get shared across a lot of different SDK libraries.
-
28:15
And so what happens when you have an HTTP client
-
28:19
with two consuming SDKs, and one requires version one and
-
28:22
one requires version two of that client?
-
28:24
Now if you're in Node you're probably lucky because you can
-
28:26
have multiple versions, dependencies with different versions.
-
28:29
If you're in Python or Ruby or
-
28:31
pretty much anything else, or .NET, you're kinda screwed here.
-
28:34
So if that underlying SDK changes what it's depending on,
-
28:38
you now have to go research whether or not
-
28:40
you can bring the SDK version up to date, its dependency up to date.
-
28:45
And again, this is code you don't maintain.
-
28:46
You're now multiple levels removed from code that you actually care about and
-
28:50
are writing yourself.
-
28:51
And you've gotta start making decisions: can we submit a pull request to
-
28:54
get this updated, can we run it with the wrong version, can we
-
28:57
override the hard link that it has to a specific version?
-
29:00
Again, none of this is delivering any value to your customers or
-
29:03
your users of your application, right?
-
29:04
You're way down the dependency chain here dealing with problems that you don't
-
29:07
want to deal with.
-
29:08
And I thought this was sort of a rare problem because at IFTTT we
-
29:11
had 65 API integrations in a Gemfile a mile long, and
-
29:15
it turns out it doesn't take very long, especially in something like Python or
-
29:18
Ruby, before you start running into this problem.
-
29:22
Also what happens when something goes wrong right?
-
29:23
When that conversation starts breaking, is it between your code and the SDK?
-
29:28
Is it between the SDK and the API?
-
29:30
This is actually very hard to determine in a lot of cases so
-
29:33
we've tried to make it possible to use our tools to see what was on the network so
-
29:36
you can start removing variables.
-
29:38
But I can't tell you how many times I've been trying to
-
29:41
debug, like, where is the breakage happening? Is it after I got the data, or
-
29:45
is it before the data reached my code, or which layer is it in?
-
29:49
All right, so SDKs cause lots of problems,
-
29:54
especially if you use more than one, but there are times that they're okay.
-
29:56
So, one time that it's really okay is if you're doing prototyping.
-
29:59
Right? So you just wanna figure out what is
-
30:00
the capability of this API that I'm using, and
-
30:03
will it solve this problem that I have, right?
-
30:05
So if that's the case, sure, pull it in, prototype it.
-
30:08
But I would do that with like a very temporary mindset and
-
30:11
I know there is no temporary in software.
-
30:12
The, you know, temporary tends to be permanent very quickly when you're writing
-
30:16
an application, but if you're prototyping, it can be OK to use an SDK there.
-
30:22
Another good time to use an SDK is if you don't have a good HTTP client.
-
30:25
So this used to be like one of the biggest reasons why people made SDKs is because if
-
30:29
you looked at languages like, I don't know,
-
30:31
let's say .NET, that did not have a very good HTTP client for a long time.
-
30:35
You didn't want to have to tell your customers of your API, yeah you have to
-
30:39
use this really horrible HttpWebRequest API in order to talk to our API, right?
-
30:43
You wanted to sort of make a smoother experience there.
-
30:45
That's why I wrote RestSharp for .NET, that's why I wrote the Twilio C# SDK.
-
30:49
We were basically hiding away HttpWebRequest.
-
30:51
Thankfully .NET has a much better HTTP client now.
-
30:54
And so that's less important there but there are other languages that
-
30:57
continue to lack good HTTP clients in the standard library.
-
31:00
Another good time is if you're building an entire client.
-
31:03
So the sort of modern RESTful API SDK really got popular
-
31:08
with things like Twitter.
-
31:09
And the reason it did is because people were building Twitter clients and
-
31:12
they needed every API call to exist.
-
31:14
So if you're building something that is gonna use 100% of an API then sure.
-
31:19
Go ahead and use an SDK cuz you're gonna end up building that anyway.
-
31:21
If you're using only some portion of it, you probably don't need to carry that weight
-
31:25
around; you can make it work without using an SDK.
-
31:30
If you're using complex APIs like something that does sync or
-
31:32
something that's not over HTTP, you know, Thrift, WebSockets,
-
31:36
that sorta thing, sure, go ahead and use one for that.
-
31:38
Or if you're using native APIs on a mobile app that tie into, you know,
-
31:42
photo library, camera, or hardware sensors, or that sorta thing, then sure,
-
31:45
pull in the SDK for that, cuz you probably don't wanna write that yourself.
-
31:48
The danger zone is when you're using more than one because of
-
31:50
the dependency management problem.
-
31:53
Or if they're community-built, because community-built SDKs tend to
-
31:58
go stale very, very quickly.
-
32:00
And again if this is like, core to your product and
-
32:02
core to your experience, you don't wanna be reliant on some open source
-
32:05
developer who may not care that your app is no longer working, and
-
32:09
who doesn't care that an underlying dependency changed.
-
32:12
Going back to community-built:
-
32:14
it's actually easier for somebody making an open source SDK to pin to specific
-
32:18
versions of dependencies to minimize the amount of support requests that they get.
-
32:22
But they're basically just handing off the problem of
-
32:26
who has to deal with that to somebody else.
-
32:30
Another time that it's dangerous is
-
32:32
if it has lots of dependencies, which we covered, or if it's inactive.
-
32:36
So if you go to an API provider and the SDK hasn't been updated in, I'd say,
-
32:41
you know, the last 90 days or
-
32:43
six months, I'd be very wary about pulling that into your application.
-
32:47
In some cases that can mean stable and
-
32:49
in other cases it could just mean like, not up to date.
-
32:52
So it's hard, you have to make that, that judgement call for yourself.
-
32:55
If you're an API provider, you actually must create SDKs like,
-
32:59
you cannot produce a public API these days that does not
-
33:03
have good SDKs if you expect developers to get up and running quickly.
-
33:06
And that first five minutes after a developer reaches your site is the single most
-
33:09
important time for
-
33:09
their entire success on your platform, and an SDK gets them over that hump quickly.
-
33:13
So, if you're going to build them,
-
33:15
and you should if you're providing an API, build them for
-
33:17
as many platforms as possible, as native as possible. They should feel like
-
33:21
the language they're written in. Don't have a PHP developer write a C# SDK.
-
33:25
And make sure they're well documented, because what you're saying is, hey,
-
33:28
we have an API,
-
33:29
we need SDKs, but now you have 17 APIs.
-
33:31
And if you don't believe me, I enumerated them here on this slide.
-
33:35
This is the big six, the current sort of landscape of which languages you need to
-
33:40
support, because those are where all the developers are.
-
33:42
Here are all the versions within those that you need to support.
-
33:44
Thankfully Ruby 1.8.6 is dead now, you don't need to do that anymore, so
-
33:49
let's go with 16 and then something like Golang comes along and
-
33:53
now you've got to learn a new language.
-
33:54
And you're gonna have a lot of developers clamoring for a wrapper there.
-
33:57
And if every one of these is not documented as well as your REST API,
-
34:00
then you've created a horrible first run experience for
-
34:03
somebody and you're going to lose that customer.
-
34:06
So, good luck with that.
-
34:09
All right.
-
34:10
It's not fun. I mean,
-
34:11
that was part of my job at Twilio, maintaining the SDKs, and it was
-
34:15
more time spent managing versions and
-
34:17
dependencies than actually adding value to the SDKs.
-
34:20
So I'm gonna breeze through this really quick cuz I'm running out of time.
-
34:23
This is how we do it at Runscope.
-
34:25
So we wrote a smart thing we call Smart Client, and
-
34:27
the goal is to get around some of these sort of problems.
-
34:30
Now, we use this mostly with internal services.
-
34:32
We do a little bit of this with external services,
-
34:34
depending on how the API is designed. But in our case what we're trying to do
-
34:37
is avoid hard-coded sort of paths to services in there.
-
34:41
You can see the EC-ugly, EC2 URL in there.
-
34:43
We're also trying to make the most common error handling cases sort of work
-
34:48
by default. So what we do is we have something called Smart Client, and
-
34:52
it does service location, so
-
34:54
this ties in with a lot of infrastructure investments that I was talking about.
-
34:57
So that whatever realm I'm in, this knows where to find it.
-
34:59
If I'm local, it knows it's on local, it's 5003. If I'm in prod, it knows it's at
-
35:04
this DNS entry, which is a load balancer in front of like seven instances.
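The realm-based lookup described here might be sketched as a small registry keyed by realm and service name. This is a hypothetical illustration, not Smart Client's actual code; the `REALM` environment variable, the `identity` service name, and both URLs are assumptions (only the local port 5003 comes from the talk).

```python
import os

# Hypothetical registry: realm -> service name -> base URL.
SERVICE_LOCATIONS = {
    "local": {"identity": "http://localhost:5003"},
    # In prod, a DNS name that fronts a load balancer (placeholder hostname).
    "prod": {"identity": "https://identity.prod.example.internal"},
}


def locate(service, realm=None):
    """Return the base URL for `service` in the given (or current) realm."""
    realm = realm or os.environ.get("REALM", "local")
    try:
        return SERVICE_LOCATIONS[realm][service]
    except KeyError:
        raise LookupError(f"no location for service {service!r} in realm {realm!r}")


local_url = locate("identity", realm="local")
prod_url = locate("identity", realm="prod")
```

Calling code asks for a service by name, so no EC2 hostname or port ends up hard-coded at the call site.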
-
35:08
We also try to make as many things as we can idempotent. I really like that word, so
-
35:11
I'm gonna say it about three more times.
-
35:13
So what this does is say hey, if this is down or we couldn't get that resource,
-
35:17
we're gonna retry it smartly, to make sure that we can get that resource
-
35:21
without having to worry about writing a lot of repetitive retry logic.
-
35:25
And so we like to do [UNKNOWN] creates. All of our creates
-
35:29
internally are done through PUTs instead of POSTs, so
-
35:31
that we can reissue it as many times as we want, and
-
35:33
the PUT will take effect when that API responds properly and it doesn't matter if
-
35:37
we sent it more than once, so we get really nice retries on creates.
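A create done as a retryable PUT could look roughly like the sketch below. It assumes the client picks the resource ID up front (here a UUID) so that every retry targets the same URL; the `send` callable, `TransientError`, and the `/users/...` path are hypothetical stand-ins, not Runscope's internal API.

```python
import time
import uuid


class TransientError(Exception):
    """Raised by `send` when a call is safe to retry (timeout, 5xx, and so on)."""


def put_with_retry(send, url, body, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Issue the same PUT until it succeeds, backing off exponentially.

    Because PUT to a fixed URL is idempotent, sending it more than once
    leaves the server in the same state as sending it once.
    """
    for attempt in range(attempts):
        try:
            return send("PUT", url, body)
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * (2 ** attempt))


def create_user(send, email):
    """Create a user with a client-chosen ID so retries hit the same resource."""
    user_id = str(uuid.uuid4())
    return put_with_retry(send, f"/users/{user_id}", {"email": email})
```

With a POST, a retry after a lost response could create a duplicate; with a PUT to a client-chosen URL, the repeat simply lands on the same resource.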
-
35:40
Our sign up page is our most important page in our entire company, and
-
35:43
if for some reason somebody hits sign up and it doesn't work,
-
35:47
it's bad, but
-
35:47
if it takes five seconds, that's less bad, right?
-
35:51
And we want that sign-up to go through sorta regardless of
-
35:54
what the underlying problems are.
-
35:56
Then we put these in thin wrappers.
-
35:57
The thin wrappers thing is something that we definitely apply to third-party APIs
-
36:01
so that we can swap out underlying integrations.
-
36:03
If we do a new prototype that uses an SDK we try to put a thin wrapper around that.
-
36:06
So when it's time to pull it out and
-
36:08
we're gonna make calls directly, we don't have to change the calling code.
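The thin-wrapper idea can be sketched as a small class that owns the only reference to the integration, so the backend can be swapped without touching callers. Everything here is hypothetical: `Mailer`, `sdk_backend`, and `http_backend` are illustrative names, and both backends are stubs rather than real vendor calls.

```python
class Mailer:
    """Thin wrapper: application code calls this, never the vendor SDK directly."""

    def __init__(self, backend):
        self._backend = backend

    def send_welcome(self, address):
        return self._backend(address, "Welcome!")


def sdk_backend(address, subject):
    # Stand-in for a call into a vendor SDK used during prototyping.
    return {"via": "sdk", "to": address, "subject": subject}


def http_backend(address, subject):
    # Stand-in for a direct HTTP call that replaces the SDK later.
    return {"via": "http", "to": address, "subject": subject}


# The calling code is identical before and after the swap.
prototype = Mailer(sdk_backend).send_welcome("dev@example.com")
production = Mailer(http_backend).send_welcome("dev@example.com")
```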
-
36:12
All right.
-
36:14
So challenge number three is testing and monitoring.
-
36:16
So APIs have this unique property that
-
36:20
you can monitor them the same way you test them.
-
36:23
Does anybody run unit tests on production servers in your production environment?
-
36:30
Right. Nobody does that, right?
-
36:31
You run it on your staging server, or
-
36:32
you run it locally, you run it on your CI server.
-
36:35
But once it goes to prod you're not, like, doing ongoing monitoring or
-
36:38
ongoing testing using your unit tests or functional tests or any of that, right?
-
36:42
You gotta test it externally somehow.
-
36:44
Well, APIs are nice cuz you can apply the exact same way that you would test it,
-
36:47
which is making HTTP requests against it,
-
36:49
to your local environment, to your staging environment,
-
36:51
to your production environment all using the exact same test plan.
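Running one test plan against several environments might be sketched as below. This is not how Runscope Radar is implemented; the plan format, the base URLs, and the injected `fetch` callable (which returns an HTTP status code) are assumptions made for illustration.

```python
import time

# Hypothetical test plan: the same steps are reused for every environment.
TEST_PLAN = [
    {"method": "GET", "path": "/health", "expect_status": 200},
    {"method": "GET", "path": "/users/42", "expect_status": 200},
]


def run_plan(base_url, plan, fetch, clock=time.perf_counter):
    """Run each step against `base_url`, recording correctness and timing."""
    results = []
    for step in plan:
        url = base_url.rstrip("/") + step["path"]
        started = clock()
        status = fetch(step["method"], url)
        elapsed = clock() - started
        results.append({
            "url": url,
            "passed": status == step["expect_status"],  # correctness
            "seconds": elapsed,                         # performance
        })
    return results


def fake_fetch(method, url):
    # Stand-in for a real HTTP call; pretends staging's user endpoint is broken.
    return 500 if url.startswith("https://staging") and "/users/" in url else 200


for base in ("http://localhost:5003", "https://staging.example.com", "https://api.example.com"):
    report = run_plan(base, TEST_PLAN, fake_fetch)
```

In development the `passed` field is the interesting one; in production the same run also yields availability and latency data from `seconds`.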
-
36:54
And so, what we did, is we built a product that tries to address both of these cases
-
37:00
and removes this arbitrary dotted line and makes it easy to test in development and
-
37:04
production: looking for functionality and correctness in development, but
-
37:08
then in production looking for performance and availability, right?
-
37:11
Locally, when we're building an API, we wanna make sure
-
37:14
it's right, that we're getting the right data out of it.
-
37:16
And then when we go to production using that exact same test plan,
-
37:19
we wanna make sure that it's performing properly and that it's up.
-
37:21
And so we took all these ideas from how to do API testing in a way that you
-
37:26
couldn't really do previously, and we made a product out of it. I'll skip that point.
-
37:31
It's called Runscope Radar, and it does all of those things.
-
37:34
It runs from nine locations around the world, or
-
37:36
behind your firewall, or on your local machine.
-
37:38
You can take one test plan and run it against multiple environments very easily.
-
37:41
So, if you're looking to get sort of a better grasp on your internal services and
-
37:45
just understanding are things correct, or
-
37:47
are things up, Radar is really a great tool for that.
-
37:49
We run it against all of our services. We have hundreds and hundreds of these Radar tests, and
-
37:54
we know very quickly if a deploy hurt something or took something down.
-
37:59
All right, so I'd love for you to try Runscope.
-
38:01
You can actually try Runscope now without even giving me your email address.
-
38:04
You go to our homepage and click the link that says "Try it now, sign up later,"
-
38:08
you can get a free one day account that's completely anonymous.
-
38:11
If you do sign up and you do give us your email address I will mail one of
-
38:15
these "Everything is going to be 200 OK" T-shirts.
-
38:17
Just sign up and email me at john@runscope.com and
-
38:20
we'll send it to you.
-
38:20
And if you like working on distributed systems and
-
38:23
you like APIs and you like building great developer tools and
-
38:26
helping developers be more productive, runscope.com/jobs.
-
38:31
We're hiring for DevOps engineers and
-
38:34
salespeople and product engineers and pretty much everything.
-
38:37
We're really fun.
-
38:38
I promise.
-
38:40
[LAUGH] We just played croquet the other day and it was amazing.
-
38:44
[LAUGH] Anyway so, I've got a minute 16 left for questions.
-
38:52
>> Well let's all take a deep breath.
-
38:54
>> [LAUGH] >> Holy crap.
-
38:56
And let's put our hands together for John too.
-
39:00
[APPLAUSE]