

<p>System Design</p><p>What are database isolation levels? What are they used for? 4</p><p>What is IaaS/PaaS/SaaS? 6</p><p>Most popular programming languages 7</p><p>What is the future of online payments? 9</p><p>What is SSO (Single Sign-On)? 11</p><p>How to store passwords safely in the database? 13</p><p>How does HTTPS work? 16</p><p>How to learn design patterns? 18</p><p>A visual guide on how to choose the right Database 20</p><p>Do you know how to generate globally unique IDs? 22</p><p>How does Twitter work? 24</p><p>What is the difference between Process and Thread? 26</p><p>Interview Question: design Google Docs 28</p><p>Deployment strategies 30</p><p>Flowchart of how slack decides to send a notification 32</p><p>How does Amazon build and operate the software? 33</p><p>How to design a secure web API access for your website? 35</p><p>How do microservices collaborate and interact with each other? 38</p><p>What are the differences between Virtualization (VMware) and</p><p>Containerization (Docker)? 40</p><p>Which cloud provider should be used when building a big data</p><p>solution? 42</p><p>How to avoid crawling duplicate URLs at Google scale? 44</p><p>Why is a solid-state drive (SSD) fast? 47</p><p>Handling a large-scale outage 49</p><p>AWS Lambda behind the scenes 51</p><p>1</p><p>HTTP 1.0 -> HTTP 1.1 -> HTTP 2.0 -> HTTP 3.0 (QUIC). 53</p><p>How to scale a website to support millions of users? 55</p><p>DevOps Books 58</p><p>Why is Kafka fast? 60</p><p>SOAP vs REST vs GraphQL vs RPC. 62</p><p>How do modern browsers work? 63</p><p>Redis vs Memcached 64</p><p>Optimistic locking 65</p><p>Tradeoff between latency and consistency 67</p><p>Cache miss attack 68</p><p>How to diagnose a mysterious process that’s taking too much CPU,</p><p>memory, IO, etc? 70</p><p>What are the top cache strategies? 71</p><p>Upload large files 74</p><p>Why is Redis so Fast? 76</p><p>SWIFT payment network 77</p><p>At-most once, at-least once, and exactly once 80</p><p>Vertical partitioning and Horizontal partitioning 82</p><p>CDN 84</p><p>Erasure coding 87</p><p>Foreign exchange in payment 89</p><p>Block storage, file storage and object storage 94</p><p>Block storage, file storage and object storage 95</p><p>Domain Name System (DNS) lookup 97</p><p>What happens when you type a URL into your browser? 99</p><p>AI Coding engine 101</p><p>Read replica pattern 103</p><p>2</p><p>Read replica pattern 105</p><p>Email receiving flow 107</p><p>Email sending flow 109</p><p>Interview Question: Design Gmail 111</p><p>Map rendering 113</p><p>Interview Question: Design Google Maps 115</p><p>Pull vs push models 117</p><p>Money movement 119</p><p>Reconciliation 122</p><p>Which database shall I use for the metrics collecting system? 126</p><p>Metrics monitoring and altering system 129</p><p>Reconciliation 131</p><p>Big data papers 134</p><p>Avoid double charge 136</p><p>Payment security 138</p><p>System Design Interview Tip 139</p><p>Big data evolvement 140</p><p>Quadtree 142</p><p>How do we find nearby restaurants on Yelp? 144</p><p>How does a modern stock exchange achieve microsecond latency? 147</p><p>Match buy and sell orders 149</p><p>Stock exchange design 151</p><p>Design a payment system 153</p><p>Design a flash sale system 155</p><p>Back-of-the-envelope estimation 157</p><p>3</p><p>What are database isolation levels? 
What are they used for?

Database isolation allows a transaction to execute as if there were no other concurrently running transactions.

The diagram below illustrates the four isolation levels.

🔹Serializable: This is the highest isolation level. Concurrent transactions are guaranteed to be executed in sequence.

🔹Repeatable Read: Data read during the transaction stays the same as it was when the transaction started.

🔹Read Committed: Data modifications can only be read after the transaction is committed.

🔹Read Uncommitted: Data modifications can be read by other transactions before the transaction is committed.

Isolation is guaranteed by MVCC (Multi-Version Concurrency Control) and locks.

The diagram below takes Repeatable Read as an example to demonstrate how MVCC works:

There are two hidden columns for each row: transaction_id and roll_pointer. When transaction A starts, a new Read View with transaction_id=201 is created. Shortly afterward, transaction B starts, and a new Read View with transaction_id=202 is created.

Now transaction A modifies the balance to 200: a new row is added to the version log, and the roll_pointer points to the old row. Before transaction A commits, transaction B reads the balance data. Transaction B finds that transaction 201 is not committed, so it reads the most recent committed record (transaction_id=200).

Even after transaction A commits, transaction B still reads data based on the Read View created when transaction B started. So transaction B always reads the data with balance=100.
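To make the Read View idea concrete, here is a minimal, self-contained sketch of repeatable-read style version visibility. It is an illustration only, not how any particular database implements MVCC, and all class and field names are made up.

```python
from dataclasses import dataclass

@dataclass
class Version:
    value: int    # e.g., the account balance
    txn_id: int   # id of the transaction that wrote this version

class ReadView:
    """Snapshot taken when a transaction starts (illustrative only)."""

    def __init__(self, own_txn_id, committed_before_start):
        self.own_txn_id = own_txn_id
        self.visible_txns = set(committed_before_start)

    def is_visible(self, version):
        # Visible if we wrote it ourselves, or if its writer had already
        # committed when our snapshot was taken.
        return version.txn_id == self.own_txn_id or version.txn_id in self.visible_txns

    def read(self, version_chain):
        # version_chain is newest-first, following the roll_pointer links.
        for version in version_chain:
            if self.is_visible(version):
                return version.value
        raise LookupError("no visible version")

# The balance row: transaction 201 wrote 200; the older committed version is 100.
chain = [Version(value=200, txn_id=201), Version(value=100, txn_id=200)]

# Transaction B (202) started when only transaction 200 had committed.
view_b = ReadView(own_txn_id=202, committed_before_start=[200])
print(view_b.read(chain))  # 100 - and still 100 even after 201 commits later,
                           # because the snapshot was fixed at B's start.
```

The snapshot is fixed when the transaction starts, which is exactly why transaction B keeps seeing balance=100 in the walkthrough above.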
Over to you: have you seen isolation levels used in the wrong way? Did it cause serious outages?

—

Check out our bestselling system design books.
Paperback: Amazon. Digital: ByteByteGo.

What is IaaS/PaaS/SaaS?

The diagram below illustrates the differences between IaaS (Infrastructure-as-a-Service), PaaS (Platform-as-a-Service), and SaaS (Software-as-a-Service).

For a non-cloud application, we own and manage all the hardware and software. We say the application is on-premises.

With cloud computing, cloud service vendors provide three kinds of models for us to use: IaaS, PaaS, and SaaS.

𝐈𝐚𝐚𝐒 provides us access to cloud vendors' infrastructure, like servers, storage, and networking. We pay for the infrastructure service and install and manage supporting software on it for our application.

𝐏𝐚𝐚𝐒 goes further. It provides a platform with a variety of middleware, frameworks, and tools to build our application. We only focus on application development and data.

𝐒𝐚𝐚𝐒 enables the application to run in the cloud. We pay a monthly or annual fee to use the SaaS product.

Over to you: which IaaS/PaaS/SaaS products have you used? How do you decide which architecture to use?

Image Source: https://www.ibm.com/cloud/learn/iaas-paas-saas

Most popular programming languages

Programming languages come and go. Some stand the test of time. Some are already shooting stars, and some are rising rapidly on the horizon.

I drew a diagram by putting the top 38 most commonly used programming languages in one place, sorted by year. Data source: StackOverflow survey.

1 JavaScript
2 HTML/CSS
3 Python
4 SQL
5 Java
6 Node.js
7 TypeScript
8 C#
9 Bash/Shell
10 C++
11 PHP
12 C
13 PowerShell
14 Go
15 Kotlin
16 Rust
17 Ruby
18 Dart
19 Assembly
20 Swift
21 R
22 VBA
23 Matlab
24 Groovy
25 Objective-C
26 Scala
27 Perl
28 Haskell
29 Delphi
30 Clojure
31 Elixir
32 LISP
33 Julia
34 F#
35 Erlang
36 APL
37 Crystal
38 COBOL

Over to you: what's the first programming language you learned? And what are the other languages you learned over the years?

What is the future of online payments?

I don't know the answer, but I do know one of the candidates is the blockchain.

As a fan of technology, I always seek new solutions to old challenges. A book that explains a lot about an emerging payment system is 'Mastering Bitcoin' by Andreas M. Antonopoulos. I want to share my discovery of this book with you because it explains bitcoin and its underlying blockchain very clearly. This book made me rethink how to renovate payment systems.

Here are the takeaways:

1. The bitcoin wallet balance is calculated on the fly, while the traditional wallet balance is stored in the database. You can check chapter 12 of System Design Interview Volume 2 on how to implement a traditional wallet (https://amzn.to/34G2vmC).

2. The golden source of truth for bitcoin is the blockchain, which is also the journal. It's the same if we use Event Sourcing architecture to build a traditional wallet, although there are other options.

3. There is a small virtual machine for bitcoin - and also Ethereum. The virtual machine defines a set of bytecodes to do basic tasks such as validation.

Over to you: if Elon Musk set up a base on planet Mars, what payment solution would you recommend?

What is SSO (Single Sign-On)?

A friend recently went through the irksome experience of being signed out from a number of websites they use daily. This event will be familiar to millions of web users, and it is a tedious process to fix. It can involve trying to remember multiple long-forgotten passwords, or typing in the names of pets from childhood to answer security questions.

AI Coding engine

DeepMind says its new AI coding engine (AlphaCode) is as good as an average programmer.

The AI bot participated in 10 Codeforces coding competitions and was ranked in the top 54.3%, meaning its score exceeded that of roughly half of the human contestants. If we look at its score for the last 6 months, AlphaCode ranks in the top 28%.

The diagram below explains how the AI bot works:

1. Pre-train the transformer models on GitHub code.

2. Fine-tune the models on the relatively small competitive programming dataset.

3. At evaluation time, create a massive number of solutions for each problem.

4. Filter, cluster, and rerank the solutions to a small set of candidate programs (at most 10), and then submit them for further assessments.

5.
Run the candidate programs against the test cases, evaluate the</p><p>performance, and choose the best one.</p><p>101</p><p>Do you think AI bot will be better at Leetcode or competitive</p><p>programming than software engineers five years from now?</p><p>102</p><p>Read replica pattern</p><p>There are two common ways to implement the read replica pattern:</p><p>1. Embed the routing logic in the application code (explained in the last</p><p>post).</p><p>2. Use database middleware.</p><p>We focus on option 2 here. The middleware provides transparent</p><p>routing between the application and database servers. We can</p><p>customize the routing logic based on difficult rules such as user,</p><p>schema, statement, etc.</p><p>The diagram below illustrates the setup:</p><p>103</p><p>1. When Alice places an order on amazon, the request is sent to Order</p><p>Service.</p><p>2. Order Service does not directly interact with the database. Instead, it</p><p>sends database queries to the database middleware.</p><p>3. The database middleware routes writes to the primary database.</p><p>Data is replicated to two replicas.</p><p>4. Alice views the order details (read). The request is sent through the</p><p>middleware.</p><p>5. Alice views the recent order history (read). The request is sent</p><p>through the middleware.</p><p>The database middleware acts as a proxy between the application and</p><p>databases. It uses standard MySQL network protocol for</p><p>communication.</p><p>Pros:</p><p>- Simplified application code. The application doesn’t need to be aware</p><p>of the database topology and manage access to the database directly.</p><p>- Better compatibility. The middleware uses the MySQL network</p><p>protocol. Any MySQL compatible client can connect to the middleware</p><p>easily. This makes database migration easier.</p><p>Cons:</p><p>- Increased system complexity. A database middleware is a complex</p><p>system. Since all database queries go through the middleware, it</p><p>usually requires a high availability setup to avoid a single point of</p><p>failure.</p><p>- Additional middleware layer means additional network latency.</p><p>Therefore, this layer requires excellent performance.</p><p>104</p><p>Read replica pattern</p><p>In this post, we talk about a simple yet commonly used database</p><p>design pattern (setup): 𝐑𝐞𝐚𝐝 𝐫𝐞𝐩𝐥𝐢𝐜𝐚 𝐩𝐚𝐭𝐭𝐞𝐫𝐧.</p><p>In this setup, all data-modifying commands like insert, delete, or</p><p>update are sent to the primary DB, and reads are sent to read replicas.</p><p>The diagram below illustrates the setup:</p><p>1. When Alice places an order on amazon.com, the request is sent</p><p>to Order Service.</p><p>2. Order Service creates a record about the order in the primary</p><p>DB (write). Data is replicated to two replicas.</p><p>3. Alice views the order details. Data is served from a replica</p><p>(read).</p><p>4. Alice views the recent order history. Data is served from a</p><p>replica (read).</p><p>There is one major problem in this setup: 𝐫𝐞𝐩𝐥𝐢𝐜𝐚𝐭𝐢𝐨𝐧 𝐥𝐚𝐠.</p><p>105</p><p>Under certain circumstances (network delay, server overload, etc.),</p><p>data in replicas might be seconds or even minutes behind. In this case,</p><p>if Alice immediately checks the order status (query is served by the</p><p>replica) after the order is placed, she might not see the order at all.</p><p>This leaves Alice confused. 
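Here is a minimal sketch of the naive routing that produces this anomaly. The class and method names are made up, and replication lag is simulated with a simple tick counter rather than a real network.

```python
class ReplicatedStore:
    """Toy model: writes go to the primary, reads are served by a lagging replica."""

    def __init__(self, replication_lag_ticks=2):
        self.primary = {}
        self.replica = {}
        self.lag = replication_lag_ticks
        self.in_flight = []                 # [ticks_remaining, key, value]

    def write(self, key, value):            # e.g., Order Service creates the order
        self.primary[key] = value
        self.in_flight.append([self.lag, key, value])

    def read(self, key):                    # e.g., Alice checks the order status
        return self.replica.get(key)

    def tick(self):                         # replication slowly catching up
        for entry in self.in_flight:
            entry[0] -= 1
        for _, key, value in [e for e in self.in_flight if e[0] <= 0]:
            self.replica[key] = value
        self.in_flight = [e for e in self.in_flight if e[0] > 0]

store = ReplicatedStore()
store.write("order:1001", "CREATED")
print(store.read("order:1001"))   # None  <- the order Alice just placed is "missing"
store.tick(); store.tick()
print(store.read("order:1001"))   # CREATED <- visible once the replica catches up
```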
In this case, we need “read-after-write”</p><p>consistency.</p><p>Possible solutions to mitigate this problem:</p><p>1⃣ Latency sensitive reads are sent to the primary database.</p><p>2⃣ Reads that immediately follow writes are routed to the primary</p><p>database.</p><p>3⃣ A relational DB generally provides a way to check if a replica is</p><p>caught up with the primary. If data is up to date, query the replica.</p><p>Otherwise, fail the read request or read from the primary.</p><p>106</p><p>Email receiving flow</p><p>The following diagram demonstrates the email receiving flow.</p><p>1. Incoming emails arrive at the SMTP load balancer.</p><p>2. The load balancer distributes traffic among SMTP servers. Email</p><p>acceptance policy can be configured and applied at the</p><p>SMTP-connection level. For example, invalid emails are bounced to</p><p>avoid unnecessary email processing.</p><p>3. If the attachment of an email is too large to put into the queue, we</p><p>can put it into the attachment store (s3).</p><p>4. Emails are put in the incoming email queue. The queue decouples</p><p>mail processing workers from SMTP servers so they can be scaled</p><p>independently. Moreover, the queue serves as a buffer in case the</p><p>email volume surges.</p><p>5. Mail processing workers are responsible for a lot of tasks, including</p><p>filtering out spam mails, stopping viruses, etc. The following steps</p><p>assume an email passed the validation.</p><p>6. The email is stored in the mail storage, cache, and object data store.</p><p>107</p><p>7. If the receiver is currently online, the email is pushed to real-time</p><p>servers.</p><p>8. Real-time servers are WebSocket servers that allow clients to</p><p>receive new emails in real-time.</p><p>9. For offline users, emails are stored in the storage layer. When a user</p><p>comes back online, the webmail client connects to web servers via</p><p>RESTful API.</p><p>10. Web servers pull new emails from the storage layer and return</p><p>them to the client.</p><p>108</p><p>Email sending flow</p><p>In this post, we will take a closer look at the email sending flow.</p><p>1. A user writes an email on webmail and presses the “send” button.</p><p>The request is sent to the load balancer.</p><p>2. The load balancer makes sure it doesn’t exceed the rate limit and</p><p>routes traffic to web servers.</p><p>3. Web servers are responsible for:</p><p>- Basic email validation. Each incoming email is checked against</p><p>pre-defined rules such as email size limit.</p><p>- Checking if the domain of the recipient’s email address is the</p><p>same as the sender. If it is the same, email data is inserted to storage,</p><p>cache, and object store directly. The recipient can fetch the email</p><p>directly via the RESTful API. There is no need to go to step 4.</p><p>4. Message queues.</p><p>109</p><p>4.a. If basic email validation succeeds, the email data is passed to</p><p>the outgoing queue.</p><p>4.b. If basic email validation fails, the email is put in the error</p><p>queue.</p><p>5. SMTP outgoing workers pull events from the outgoing queue and</p><p>make sure emails are spam and virus free.</p><p>6. The outgoing email is stored in the “Sent Folder” of the storage</p><p>layer.</p><p>7. SMTP outgoing workers send the email to the recipient mail server.</p><p>Each message in the outgoing queue contains all the metadata</p><p>required to create an email. A distributed message queue is a critical</p><p>component that allows asynchronous mail processing. 
By decoupling</p><p>SMTP outgoing workers from the web servers, we can scale SMTP</p><p>outgoing workers independently.</p><p>We monitor the size of the outgoing queue very closely. If there are</p><p>many emails stuck in the queue, we need to analyze the cause of the</p><p>issue. Here are some possibilities:</p><p>- The recipient’s mail server is unavailable. In this case, we need to</p><p>retry sending the email at a later time. Exponential backoff might be a</p><p>good retry strategy.</p><p>- Not enough consumers to send emails. In this case, we may need</p><p>more consumers to reduce the processing time.</p><p>110</p><p>Interview Question: Design Gmail</p><p>One picture is worth more than a thousand words. In this post, we will</p><p>take a look at what happens when Alice sends an email to Bob.</p><p>1. Alice logs in to her Outlook client, composes an email, and presses</p><p>“send”. The email is sent to the Outlook mail server. The</p><p>communication protocol between the Outlook client and mail server is</p><p>SMTP.</p><p>2. Outlook mail server queries the DNS (not shown in the diagram) to</p><p>find the address of the recipient’s SMTP server. In this case, it</p><p>is</p><p>Gmail’s SMTP server. Next, it transfers the email to the Gmail mail</p><p>server. The communication protocol between the mail servers is SMTP.</p><p>3. The Gmail server stores the email and makes it available to Bob, the</p><p>recipient.</p><p>111</p><p>4. Gmail client fetches new emails through the IMAP/POP server when</p><p>Bob logs in to Gmail.</p><p>Please keep in mind this is a highly simplified design. Hope it sparks</p><p>your interest and curiosity:) I'll explain each component in more depth</p><p>in the future.</p><p>112</p><p>Map rendering</p><p>Google Maps Continued. Let’s take a look at 𝐌𝐚𝐩 𝐑𝐞𝐧𝐝𝐞𝐫𝐢𝐧𝐠 in this</p><p>post.</p><p>𝐏𝐫𝐞-𝐂𝐨𝐦𝐩𝐮𝐭𝐞𝐝 𝐓𝐢𝐥𝐞𝐬</p><p>One foundational concept in map rendering is tiling. Instead of</p><p>rendering the entire map as one large custom image, the world is</p><p>broken up into smaller tiles. The client only downloads the relevant</p><p>tiles for the area the user is in and stitches them together like a mosaic</p><p>for display. The tiles are pre-computed at different zoom levels. Google</p><p>Maps uses 21 zoom levels.</p><p>For example, at zoom level 0, The entire map is represented by a</p><p>single tile of size 256 * 256 pixels. Then at zoom level 1, the number of</p><p>map tiles doubles in both north-south and east-west directions, while</p><p>each tile stays at 256 * 256 pixels. So we have 4 tiles at zoom level 1,</p><p>and the whole image of zoom level 1 is 512 * 512 pixels. With each</p><p>increment, the entire set of tiles has 4x as many pixels as the previous</p><p>level. The increased pixel count provides an increasing level of detail</p><p>to the user.</p><p>This allows the client to render the map at the best granularities</p><p>depending on the client’s zoom level without consuming excessive</p><p>bandwidth to download tiles with too much detail. This is especially</p><p>important when we are loading the images from mobile clients.</p><p>𝐑𝐨𝐚𝐝 𝐒𝐞𝐠𝐦𝐞𝐧𝐭𝐬</p><p>Now that we have transformed massive maps into tiles, we also need</p><p>to define a data structure for the roads. We divide the world of roads</p><p>into small blocks. We call these blocks road segments. Each road</p><p>segment contains multiple roads, junctions, and other metadata.</p><p>We group nearby segments into super segments. 
This process can be</p><p>applied repeatedly to meet the level of coverage required.</p><p>We then transform the road segments into a data structure that the</p><p>navigation algorithms can use. The typical approach is to convert the</p><p>map into a 𝒈𝒓𝒂𝒑𝒉, where the nodes are road segments, and two nodes</p><p>are connected if the corresponding road segments are reachable</p><p>113</p><p>neighbors. In this way, finding a path between two locations becomes a</p><p>shortest-path problem, where we can leverage Dijkstra or A*</p><p>algorithms.</p><p>114</p><p>Interview Question: Design Google Maps</p><p>Google started project G𝐨𝐨𝐠𝐥𝐞 M𝐚𝐩𝐬 in 2005. As of March 2021, Google</p><p>Maps had one billion daily active users, 99% coverage of the world in</p><p>200 countries.</p><p>Although Google Maps is a very complex system, we can break it</p><p>down into 3 high-level components. In this post, let’s take a look at how</p><p>to design a simplified Google Maps.</p><p>115</p><p>𝐋𝐨𝐜𝐚𝐭𝐢𝐨𝐧 𝐒𝐞𝐫𝐯𝐢𝐜𝐞</p><p>The location service is responsible for recording a user’s location</p><p>update. The Google Map clients send location updates every few</p><p>seconds. The user location data is used in many cases:</p><p>- detect new and recently closed roads</p><p>- improve the accuracy of the map over time</p><p>- used as an input for live traffic data.</p><p>𝐌𝐚𝐩 𝐑𝐞𝐧𝐝𝐞𝐫𝐢𝐧𝐠</p><p>The world’s map is projected into a huge 2D map image. It is broken</p><p>down into small image blocks called “tiles” (see below). The tiles are</p><p>static. They don’t change very often. An efficient way to serve static tile</p><p>files is with a CDN backed by cloud storage like S3. The users can</p><p>load the necessary tiles to compose a map from nearby CDN.</p><p>What if a user is zooming and panning the map viewpoint on the client</p><p>to explore their surroundings?</p><p>An efficient way is to pre-calculate the map blocks with different zoom</p><p>levels and load the images when needed.</p><p>𝐍𝐚𝐯𝐢𝐠𝐚𝐭𝐢𝐨𝐧 𝐒𝐞𝐫𝐯𝐢𝐜𝐞</p><p>This component is responsible for finding a reasonably fast route from</p><p>point A to point B. It calls two services to help with the path calculation:</p><p>1⃣ Geocoding Service: resolve the given address to a latitude/longitude</p><p>pair</p><p>2⃣ Route Planner Service: this service does three things in sequence:</p><p>- Calculate the top-K shortest paths between A and B</p><p>- Calculate the estimation of time for each path based on current</p><p>traffic and historical data</p><p>- Rank the paths by time predictions and user filtering. For example,</p><p>the user doesn’t want to avoid tolls.</p><p>116</p><p>Pull vs push models</p><p>There are two ways metrics data can be collected, pull or push. It is a</p><p>routine debate as to which one is better and there is no clear answer.</p><p>In this post, we will take a look at the pull model.</p><p>117</p><p>Figure 1 shows data collection with a pull model over HTTP. We have</p><p>dedicated metric collectors which pull metrics values from the running</p><p>applications periodically.</p><p>In this approach, the metrics collector needs to know the complete list</p><p>of service endpoints to pull data from. One naive approach is to use a</p><p>file to hold DNS/IP information for every service endpoint on the</p><p>“metric collector” servers. 
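A minimal sketch of that naive approach, with an invented target-file format and made-up addresses; the collector simply loops over the static list and pulls a /metrics endpoint over HTTP.

```python
import json, time, urllib.request

# Static target file maintained by hand on the collector host (illustrative).
TARGETS_JSON = """
[
  {"job": "web",   "address": "10.0.1.12:9100"},
  {"job": "web",   "address": "10.0.1.13:9100"},
  {"job": "order", "address": "10.0.2.7:9100"}
]
"""

def scrape_once(targets, path="/metrics", timeout=2.0):
    samples = {}
    for t in targets:
        url = f"http://{t['address']}{path}"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                samples[t["address"]] = resp.read().decode()
        except OSError as err:
            samples[t["address"]] = f"scrape failed: {err}"
    return samples

if __name__ == "__main__":
    targets = json.loads(TARGETS_JSON)
    while True:                     # pull on a fixed interval
        print(scrape_once(targets))
        time.sleep(15)
```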
While the idea is simple, this approach is</p><p>hard to maintain in a large-scale environment where servers are added</p><p>or removed frequently, and we want to ensure that metric collectors</p><p>don’t miss out on collecting metrics from any new servers.</p><p>The good news is that we have a reliable, scalable, and maintainable</p><p>solution available through Service Discovery, provided by Kubernetes,</p><p>Zookeeper, etc., wherein services register their availability and the</p><p>metrics collector can be notified by the Service Discovery component</p><p>whenever the list of service endpoints changes. Service discovery</p><p>contains configuration rules about when and where to collect metrics</p><p>as shown in Figure 2.</p><p>Figure 3 explains the pull model in detail.</p><p>1⃣ The metrics collector fetches configuration metadata of service</p><p>endpoints from Service Discovery. Metadata include pulling interval, IP</p><p>addresses, timeout and retries parameters, etc.</p><p>2⃣ The metrics collector pulls metrics data via a pre-defined HTTP</p><p>endpoint (for example, /metrics). To expose the endpoint, a client</p><p>library usually needs to be added to the service. In Figure 3, the</p><p>service is Web Servers.</p><p>3⃣ Optionally, the metrics collector registers a change event notification</p><p>with Service Discovery to receive an update whenever the service</p><p>endpoints change. Alternatively, the metrics collector can poll for</p><p>endpoint changes periodically.</p><p>118</p><p>Money movement</p><p>One picture is worth more than a thousand words. This is what</p><p>happens when you buy a product using Paypal/bank card under the</p><p>hood.</p><p>To understand this, we need to digest two concepts: 𝐜𝐥𝐞𝐚𝐫𝐢𝐧𝐠 &</p><p>𝐬𝐞𝐭𝐭𝐥𝐞𝐦𝐞𝐧𝐭. Clearing is a process that calculates who should pay whom</p><p>with how much money; while settlement is a process where real money</p><p>moves between reserves in the settlement bank.</p><p>119</p><p>Let’s say Bob wants to buy an SDI book from Claire’s shop on</p><p>Amazon.</p><p>- Pay-in flow (Bob pays Amazon money):</p><p>1.1 Bob buys a book on Amazon using Paypal.</p><p>1.2 Amazon issues a money transfer request to Paypal.</p><p>1.3 Since the payment token of Bob’s debit card is stored in Paypal,</p><p>Paypal can transfer money, on Bob’s behalf, to Amazon’s bank</p><p>account in Bank A.</p><p>1.4 Both Bank A and Bank B send transaction statements to the</p><p>clearing institution. It reduces the transactions that need to be settled.</p><p>Let’s assume Bank A owns Bank B $100 and Bank B owns bank A</p><p>$500 at the end of the day. When they settle, the net position is that</p><p>Bank B pays Bank A $400.</p><p>1.5 & 1.6 The clearing institution sends clearing and settlement</p><p>information to the settlement bank. Both Bank A and Bank B have</p><p>pre-deposited funds in the settlement bank as money reserves, so</p><p>actual money movement happens between two reserve accounts in</p><p>the settlement bank.</p><p>- Pay-out flow (Amazon pays the money to the seller: Claire):</p><p>2.1 Amazon informs the seller (Claire) that she will get paid soon.</p><p>2.2 Amazon issues a money transfer</p><p>request from its own bank (Bank</p><p>A) to the seller bank (bank C). Here both banks record the</p><p>transactions, but no real money is moved.</p><p>2.3 Both Bank A and Bank C send transaction statements to the</p><p>clearing institution.</p><p>2.4 & 2.5 The clearing institution sends clearing and settlement</p><p>information to the settlement bank. 
Money is transferred from Bank A’s</p><p>reserve to Bank C’s reserve.</p><p>Notice that we have three layers:</p><p>- Transaction layer: where the online purchases happen</p><p>- Payment and clearing layer: where the payment instructions and</p><p>transaction netting happen</p><p>- Settlement layer: where the actual money movement happen</p><p>120</p><p>The first two layers are called information flow, and the settlement layer</p><p>is called fund flow.</p><p>You can see the 𝐢𝐧𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧 𝐟𝐥𝐨𝐰 𝐚𝐧𝐝 𝐟𝐮𝐧𝐝 𝐟𝐥𝐨𝐰 𝐚𝐫𝐞 𝐬𝐞𝐩𝐚𝐫𝐚𝐭𝐞𝐝. In the</p><p>info flow, the money seems to be deducted from one bank account and</p><p>added to another bank account, but the actual money movement</p><p>happens in the settlement bank at the end of the day.</p><p>Because of the asynchronous nature of the info flow and the fund flow,</p><p>reconciliation is very important for data consistency in the systems</p><p>along with the flow.</p><p>It makes things even more interesting when Bob wants to buy a book</p><p>in the Indian market, where Bob pays USD but the seller can only</p><p>receive INR.</p><p>121</p><p>Reconciliation</p><p>My previous post about painful payment reconciliation problems</p><p>sparked lots of interesting discussions. One of the readers shared</p><p>more problems we may face when working with intermediary payment</p><p>processors in the trenches and a potential solution:</p><p>1. Foreign Currency Problem: When you operate a store globally, you</p><p>will come across this problem quite frequently. To go back to the</p><p>example from Paypal - if the transaction happens in a currency</p><p>different from the standard currency of Paypal, this will create another</p><p>layer, where the transaction is first received in that currency and</p><p>exchanged to whatever currency your Paypal is using. There needs to</p><p>be a reliable way to reconcile that currency exchange transaction. It</p><p>certainly does not help that every payment provider handles this</p><p>differently.</p><p>2. Payment providers are only that - intermediaries. Each purchase</p><p>does not trigger two events for a company, but actually at least 4. The</p><p>purchase via Paypal (where both the time and the currency dimension</p><p>can come into play) trigger the debit/credit pair for the transaction and</p><p>then, usually a few days later, another pair when the money is</p><p>transferred from Paypal to a bank account (where there might be yet</p><p>another FX discrepancy to reconcile if, for example, the initial purchase</p><p>was in JPY, Paypal is set up in USD and your bank account is in EUR).</p><p>There needs to be a way to reconcile all of these.</p><p>3. Some problems also pop up on the buyer side that is very</p><p>platform-specific. One example is shadow transaction from Paypal: if</p><p>you buy two items on Paypal with 1 week of time between the two</p><p>transactions, Paypal will first debit money from your bank account for</p><p>transaction A. 
If at the time of transaction B, transaction A has not</p><p>gone through completely or is canceled, there might be a world where</p><p>Paypal will use the money from transaction A to partially pay for</p><p>transaction B, which leads to only a partial amount of transaction B</p><p>being withdrawn from the bank account.</p><p>In practice, this usually looks something like this:</p><p>1) Your shop assigns an order number to the purchase</p><p>122</p><p>2) The order number is carried over to the payment provider</p><p>3) The payment provider creates another internal ID, which is carried</p><p>over across transactions within the system</p><p>4) The payment ID is used when you get the payout on your bank</p><p>account (or the payment provider bundles individual payments, which</p><p>can be reconciled within the payment provider system)</p><p>5) Ideally, your payment provider and your shop have an</p><p>integration/API with the tool you use to (hopefully automatically) create</p><p>invoices. This usually carries over the order id from the shop (closing</p><p>the loop) and sometimes even the payment id to match it with the</p><p>invoice id, which you then can use to reconcile it with your accounts</p><p>receivable/payable. :)</p><p>Credit: A knowledgeable reader who prefers to stay private. Thank</p><p>you!</p><p>123</p><p>Continued: how to choose the right database for metrics collecting</p><p>service?</p><p>There are many storage systems available that are optimized for</p><p>time-series data. The optimization lets us use far fewer servers to</p><p>handle the same volume of data. Many of these databases also have</p><p>custom query interfaces specially designed for the analysis of</p><p>time-series data that are much easier to use than SQL. Some even</p><p>provide features to manage data retention and data aggregation. Here</p><p>are a few examples of time-series databases.</p><p>OpenTSDB is a distributed time-series database, but since it is based</p><p>on Hadoop and HBase, running a Hadoop/HBase cluster adds</p><p>complexity. Twitter uses MetricsDB, and Amazon offers Timestream as</p><p>a time-series database. According to DB-engines, the two most</p><p>popular time-series databases are InfluxDB and Prometheus, which</p><p>are designed to store large volumes of time-series data and quickly</p><p>perform real-time analysis on that data. Both of them primarily rely on</p><p>an in-memory cache and on-disk storage. And they both handle</p><p>durability and performance quite well. According to the benchmark, an</p><p>InfluxDB with 8 cores and 32GB RAM can handle over 250,000 writes</p><p>per second.</p><p>124</p><p>Since a time-series database is a specialized database, you are not</p><p>expected to understand the internals in an interview unless you</p><p>explicitly mentioned it in your resume. For the purpose of an interview,</p><p>it’s important to understand the metrics data are time-series in nature</p><p>and we can select time-series databases such as InfluxDB for storage</p><p>to store them.</p><p>Another feature of a strong time-series database is efficient</p><p>aggregation and analysis of a large amount of time-series data by</p><p>labels, also known as tags in some databases. For example, InfluxDB</p><p>builds indexes on labels to facilitate the fast lookup of time-series by</p><p>labels. It provides clear best-practice guidelines on how to use labels,</p><p>without overloading the database. 
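To see why this matters, here is a small back-of-the-envelope sketch: every distinct combination of label values becomes its own time series, so the series count multiplies across labels. The label names and numbers below are illustrative.

```python
# Each label multiplies the number of distinct time series for one metric.
def series_count(label_cardinalities):
    total = 1
    for _, distinct_values in label_cardinalities.items():
        total *= distinct_values
    return total

ok = {"host": 500, "endpoint": 20, "status_code": 8}
print(series_count(ok))        # 80,000 series - manageable

risky = dict(ok, user_id=1_000_000)   # a high-cardinality label added
print(series_count(risky))     # 80,000,000,000 series - an index/memory blow-up
```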
The key is to make sure each label</p><p>is of low cardinality (having a small set of possible values). This feature</p><p>is critical for visualization, and it would take a lot of effort to build this</p><p>with a general-purpose database.</p><p>125</p><p>Which database shall I use for the metrics collecting</p><p>system?</p><p>This is one of the most important questions we need to address in an</p><p>interview.</p><p>𝐃𝐚𝐭𝐚 𝐚𝐜𝐜𝐞𝐬𝐬 𝐩𝐚𝐭𝐭𝐞𝐫𝐧</p><p>As shown in the diagram, each label on the y-axis represents a time</p><p>series (uniquely identified by the names and labels) while the x-axis</p><p>represents time.</p><p>The write load is heavy. As you can see, there can be many</p><p>time-series data points written at any moment. There are millions of</p><p>operational metrics written per day, and many metrics are collected at</p><p>high frequency, so the traffic is undoubtedly write-heavy.</p><p>At the same time, the read load is spiky. Both visualization and alert</p><p>services send queries to the database and depending on the access</p><p>patterns of the graphs and alerts, the read volume could be bursty.</p><p>𝐂𝐡𝐨𝐨𝐬𝐞 𝐭𝐡𝐞 𝐫𝐢𝐠𝐡𝐭 𝐝𝐚𝐭𝐚𝐛𝐚𝐬𝐞</p><p>The data storage system is the heart of the design. It’s not</p><p>recommended to build your own storage system or use a</p><p>general-purpose storage system (MySQL) for this job.</p><p>A general-purpose database, in theory, could support time-series data,</p><p>but it would require expert-level tuning to make it work at our scale.</p><p>Specifically, a relational database is not optimized for operations you</p><p>would commonly perform against time-series data. For example,</p><p>computing the moving average in a rolling time window requires</p><p>complicated SQL that is difficult to read (there is an example of this in</p><p>the deep dive section). Besides, to support tagging/labeling data, we</p><p>need to add an index for each tag. Moreover, a general-purpose</p><p>relational database does not perform well under constant heavy write</p><p>load. At our scale, we would need to expend significant effort in tuning</p><p>the database, and even then, it might not perform well.</p><p>126</p><p>How about NoSQL? In theory, a few NoSQL databases on the market</p><p>could handle time-series data effectively. For example, Cassandra and</p><p>Bigtable can both be used for time series data. However, this would</p><p>require deep knowledge of the internal workings of each NoSQL to</p><p>devise a scalable schema for effectively storing and querying</p><p>time-series data. With industrial-scale time-series databases readily</p><p>available, using a general purpose NoSQL database is not appealing.</p><p>There are many storage systems available that are optimized for</p><p>time-series data. The optimization lets us use far fewer servers to</p><p>handle the same volume of data. Many of these databases also have</p><p>custom query interfaces specially designed for the analysis of</p><p>time-series data that are much easier to use than SQL. Some even</p><p>provide features to manage data retention and data aggregation. Here</p><p>are a few examples of time-series databases.</p><p>OpenTSDB is a distributed time-series database, but since it is based</p><p>on Hadoop and HBase, running a Hadoop/HBase cluster adds</p><p>complexity. Twitter uses MetricsDB, and Amazon offers Timestream as</p><p>a time-series database. 
According to DB-engines, the two most</p><p>popular time-series databases are InfluxDB and Prometheus, which</p><p>are designed to store large volumes of time-series data and quickly</p><p>perform real-time analysis on that data. Both of them primarily rely on</p><p>an in-memory cache and on-disk storage. And they both handle</p><p>durability and performance quite well. According to the benchmark</p><p>listed on InfluxDB website, a DB server with 8 cores and 32GB RAM</p><p>can handle over 250,000 writes per second.</p><p>Since a time-series database is a specialized database, you are not</p><p>expected to understand the internals in an interview unless you</p><p>explicitly mentioned it in your resume. For the purpose of an interview,</p><p>it’s important to understand the metrics data are time-series in nature</p><p>and we can select time-series databases such as InfluxDB for storage</p><p>to store them.</p><p>Another feature of a strong time-series database is efficient</p><p>aggregation and analysis of a large amount of time-series data by</p><p>labels, also known as tags in some databases. For example, InfluxDB</p><p>builds indexes on labels to facilitate the fast lookup of time-series by</p><p>127</p><p>labels. It provides clear best-practice guidelines on how to use labels,</p><p>without overloading the database. The key is to make sure each label</p><p>is of low cardinality (having a small set of possible values). This feature</p><p>is critical for visualization, and it would take a lot of effort to build this</p><p>with a general-purpose database.</p><p>128</p><p>Metrics monitoring and altering system</p><p>A well-designed 𝐦𝐞𝐭𝐫𝐢𝐜𝐬 𝐦𝐨𝐧𝐢𝐭𝐨𝐫𝐢𝐧𝐠 and alerting system plays a key</p><p>role in providing clear visibility into the health of the infrastructure to</p><p>ensure high availability and reliability. The diagram below explains how</p><p>it works at a high level.</p><p>Metrics source: This can be application servers, SQL databases,</p><p>message queues, etc.</p><p>Metrics collector: It gathers metrics data and writes data into the</p><p>time-series database.</p><p>Time-series database: This stores metrics data as time series. It</p><p>usually provides a custom query interface for analyzing and</p><p>summarizing a large amount of time-series data. It maintains indexes</p><p>on labels to facilitate the fast lookup of time-series data by labels.</p><p>Kafka: Kafka is used as a highly reliable and scalable distributed</p><p>messaging platform. It decouples the data collection and data</p><p>processing services from each other.</p><p>129</p><p>Consumers: Consumers or streaming processing services such as</p><p>Apache Storm, Flink and Spark, process and push data to the</p><p>time-series database.</p><p>Query service: The query service makes it easy to query and retrieve</p><p>data from the time-series database. This should be a very thin wrapper</p><p>if we choose a good time-series database. 
It could also be entirely</p><p>replaced by the time-series database’s own query interface.</p><p>Alerting system: This sends alert notifications to various alerting</p><p>destinations.</p><p>Visualization system: This shows metrics in the form of various</p><p>graphs/charts.</p><p>130</p><p>Reconciliation</p><p>𝐑𝐞𝐜𝐨𝐧𝐜𝐢𝐥𝐢𝐚𝐭𝐢𝐨𝐧 might be the most painful process in a payment system.</p><p>It is the process of comparing records in different systems to make</p><p>sure the amounts match each other.</p><p>For example, if you pay $200 to buy a watch with Paypal:</p><p>- The eCommerce website should have a record about the purchase</p><p>order of $200.</p><p>- There should be a transaction record of $200 in Paypal (marked with</p><p>2 in the diagram).</p><p>- The Ledger should record a debit of $200 dollars for the buyer, and a</p><p>credit of $200 for the seller. This is called double-entry bookkeeping</p><p>(see the table below).</p><p>Let’s take a look at some pain points and how we can address them:</p><p>131</p><p>𝐏𝐫𝐨𝐛𝐥𝐞𝐦 1: Data normalization. When comparing records in different</p><p>systems, they come in different formats. For example, the timestamp</p><p>can be “2022/01/01” in one system and “Jan 1, 2022” in another.</p><p>𝐏𝐨𝐬𝐬𝐢𝐛𝐥𝐞 𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧: we can add a layer to transform different formats into</p><p>the same format.</p><p>𝐏𝐫𝐨𝐛𝐥𝐞𝐦 2: Massive data volume</p><p>𝐏𝐨𝐬𝐬𝐢𝐛𝐥𝐞 𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧: we can use big data processing techniques to speed</p><p>up data comparisons. If we need near real-time reconciliation, a</p><p>streaming platform such as Flink is used; otherwise, end-of-day batch</p><p>processing such as Hadoop is enough.</p><p>𝐏𝐫𝐨𝐛𝐥𝐞𝐦 3: Cut-off time issue. For example, if we choose 00:00:00 as</p><p>the daily cut-off time, one record is stamped with 23:59:55 in the</p><p>internal system, but might be stamped 00:00:30 in the external system</p><p>(Paypal), which is the next day. In this case, we couldn’t find this record</p><p>in today’s Paypal records. It causes a discrepancy.</p><p>𝐏𝐨𝐬𝐬𝐢𝐛𝐥𝐞 𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧: we need to categorize this break as a “temporary</p><p>break” and run it later against the next day’s Paypal records. If we find</p><p>a match in the next day’s Paypal records, the break is cleared, and no</p><p>more action is needed.</p><p>You may argue that if we have exactly-once semantics in the system,</p><p>there shouldn’t be any discrepancies. But the truth is, there are so</p><p>many places that can go wrong. Having a reconciliation system is</p><p>always necessary. It is like having a safety net to keep you sleeping</p><p>well at night.</p><p>132</p><p>Which database shall I use? This is one of the most important</p><p>questions we usually need to address in an interview.</p><p>Choosing the right database is hard. Google Cloud recently posted a</p><p>great article that summarized different database options available in</p><p>Google Cloud and explained which use cases are best suited for each</p><p>database option.</p><p>133</p><p>Big data papers</p><p>Below is a timeline of important big data papers and how the</p><p>techniques evolved over time.</p><p>The green highlighted boxes are the famous 3 Google papers, which</p><p>established the foundation of the big data framework. At the high-level:</p><p>𝘉𝘪𝘨 𝘋𝘢𝘵𝘢 𝘛𝘦𝘤𝘩𝘯𝘪𝘲𝘶𝘦𝘴 = 𝘔𝘢𝘴𝘴𝘪𝘷𝘦 𝘥𝘢𝘵𝘢 + 𝘔𝘢𝘴𝘴𝘪𝘷𝘦 𝘤𝘢𝘭𝘤𝘶𝘭𝘢𝘵𝘪𝘰𝘯</p><p>Let’s look at the 𝐎𝐋𝐓𝐏 evolution. BigTable provided a distributed</p><p>storage system for structured data but dropped some characteristics of</p><p>relational DB. 
Then Megastore brought back schema and simple</p><p>transactions; Spanner brought back data consistency.</p><p>Now let’s look at the 𝐎𝐋𝐀𝐏 evolution. MapReduce was not easy to</p><p>program, so Hive solved this by introducing a SQL-like query</p><p>134</p><p>language. But Hive still used MapReduce under the hood, so it’s not</p><p>very responsive. In 2010, Dremel provided an interactive query engine.</p><p>𝐒𝐭𝐫𝐞𝐚𝐦𝐢𝐧𝐠 𝐩𝐫𝐨𝐜𝐞𝐬𝐬𝐢𝐧𝐠 was born to further solve the latency issue in</p><p>OLAP. The famous 𝒍𝒂𝒎𝒃𝒅𝒂 architecture was based on Storm and</p><p>MapReduce, where streaming processing and batch processing have</p><p>different processing flows. Then people started to build streaming</p><p>processing with apache Kafka. 𝑲𝒂𝒑𝒑𝒂 architecture was proposed in</p><p>2014, where streaming and batching processings were merged into</p><p>one</p><p>flow. Google published The Dataflow Model in 2015, which was an</p><p>abstraction standard for streaming processing, and Flink implemented</p><p>this model.</p><p>To manage a big crowd of commodity server resources, we need</p><p>resource management Kubernetes.</p><p>135</p><p>Avoid double charge</p><p>One of the most serious problems a payment system can have is to</p><p>𝐝𝐨𝐮𝐛𝐥𝐞 𝐜𝐡𝐚𝐫𝐠𝐞 𝐚 𝐜𝐮𝐬𝐭𝐨𝐦𝐞𝐫. When we design the payment system, it is</p><p>important to guarantee that the payment system executes a payment</p><p>order exactly-once.</p><p>136</p><p>At the first glance, exactly-once delivery seems very hard to tackle, but</p><p>if we divide the problem into two parts, it is much easier to solve.</p><p>Mathematically, an operation is executed exactly-once if:</p><p>1. It is executed at least once.</p><p>2. At the same time, it is executed at most once.</p><p>We now explain how to implement at least once using retry and at</p><p>most once using idempotency check.</p><p>𝐑𝐞𝐭𝐫𝐲</p><p>Occasionally, we need to retry a payment transaction due to network</p><p>errors or timeout. Retry provides the at-least-once guarantee. For</p><p>example, as shown in Figure 10, the client tries to make a $10</p><p>payment, but the payment keeps failing due to a poor network</p><p>connection. Considering the network condition might get better, the</p><p>client retries the request and this payment finally succeeds at the</p><p>fourth attempt.</p><p>𝐈𝐝𝐞𝐦𝐩𝐨𝐭𝐞𝐧𝐜𝐲</p><p>From an API standpoint, idempotency means clients can make the</p><p>same call repeatedly and produce the same result.</p><p>For communication between clients (web and mobile applications) and</p><p>servers, an idempotency key is usually a unique value that is</p><p>generated by clients and expires after a certain period of time. A UUID</p><p>is commonly used as an idempotency key and it is recommended by</p><p>many tech companies such as Stripe and PayPal. To perform an</p><p>idempotent payment request, an idempotency key is added to the</p><p>HTTP header: .</p><p>137</p><p>Payment security</p><p>A few weeks ago, I posted the high-level design for the payment</p><p>system. Today, I’ll continue the discussion and focus on payment</p><p>security.</p><p>The table below summarizes techniques that are commonly used in</p><p>payment security. If you have any questions or I missed anything,</p><p>please leave a comment.</p><p>138</p><p>System Design Interview Tip</p><p>One pro tip for acing a system design interview is to read the</p><p>engineering blog of the company you are interviewing with. 
You can</p><p>get a good sense of what technology they use, why the technology</p><p>was chosen over others, and learn what issues are important to</p><p>engineers.</p><p>For example, here are 4 blog posts Twitter Engineering recommends:</p><p>1. The Infrastructure Behind Twitter: Scale</p><p>2. Discovery and Consumption of Analytics Data at Twitter</p><p>3. The what and why of product experimentation at Twitter</p><p>4. Twitter experimentation: technical overview</p><p>139</p><p>Big data evolvement</p><p>I hope everyone has a great time with friends and family during the</p><p>holidays. If you are looking for some readings, classic engineering</p><p>papers are a good start.</p><p>A lot of times when we are busy with work, we only focus on scattered</p><p>information, telling us “how” and “what” to get our immediate needs to</p><p>get things done.</p><p>However, reading the classics helps us know “why” behind the scenes,</p><p>and teaches us how to solve problems, make better decisions, or even</p><p>contribute to open source projects.</p><p>Let’s take big data as an example.</p><p>Big data area has progressed a lot over the past 20 years. It started</p><p>from 3 Google papers (see the links in the comment), which tackled</p><p>real engineering challenges at Google scale:</p><p>- GFS (2003) - big data storage</p><p>- MapReduce (2004) - calculation model</p><p>- BigTable (2006) - online services</p><p>The diagram below shows the functionalities and limitations of the 3</p><p>techniques, and how they evolve over time into two streams: OLTP and</p><p>OLAP. Each evolved product was trying to solve the limitations of the</p><p>140</p><p>last generation. For example, “Hive - support SQL” means Hive was</p><p>trying to solve the lack of SQL in MapReduce.</p><p>If you want to learn more, you can refer to the papers for details. What</p><p>other classics would you recommend?</p><p>141</p><p>Quadtree</p><p>In this post, let’s explore another data structure to find nearby</p><p>restaurants on Yelp or Google Maps.</p><p>A quadtree is a data structure that is commonly used to partition a</p><p>two-dimensional space by recursively subdividing it into four quadrants</p><p>(grids) until the contents of the grids meet certain criteria (see the first</p><p>diagram).</p><p>142</p><p>Quadtree is an 𝐢𝐧-𝐦𝐞𝐦𝐨𝐫𝐲 𝐝𝐚𝐭𝐚 𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞 and it is not a database</p><p>solution. It runs on each LBS (Location-Based Service, see last week’s</p><p>post) server, and the data structure is built at server start-up time.</p><p>The second diagram explains the quadtree building process in more</p><p>detail. The root node represents the whole world map. The root node is</p><p>𝐫𝐞𝐜𝐮𝐫𝐬𝐢𝐯𝐞𝐥𝐲 broken down into 4 quadrants until no nodes are left with</p><p>more than 100 businesses.</p><p>𝐇𝐨𝐰 𝐭𝐨 𝐠𝐞𝐭 𝐧𝐞𝐚𝐫𝐛𝐲 𝐛𝐮𝐬𝐢𝐧𝐞𝐬𝐬𝐞𝐬 𝐰𝐢𝐭𝐡 𝐪𝐮𝐚𝐝𝐭𝐫𝐞𝐞?</p><p>- Build the quadtree in memory.</p><p>- After the quadtree is built, start searching from the root and traverse</p><p>the tree, until we find the leaf node where the search origin is.</p><p>- If that leaf node has 100 businesses, return the node. Otherwise, add</p><p>businesses from its neighbors until enough businesses are returned.</p><p>𝐔𝐩𝐝𝐚𝐭𝐞 𝐋𝐁𝐒 𝐬𝐞𝐫𝐯𝐞𝐫 𝐚𝐧𝐝 𝐫𝐞𝐛𝐮𝐢𝐥𝐝 𝐪𝐮𝐚𝐝𝐭𝐫𝐞𝐞</p><p>- It may take a few minutes to build a quadtree in memory with 200</p><p>million businesses at the server start-up time.</p><p>- While the quadtree is being built, the server cannot serve traffic.</p><p>- Therefore, we should roll out a new release of the server</p><p>incrementally to 𝐚 𝐬𝐦𝐚𝐥𝐥 𝐬𝐮𝐛𝐬𝐞𝐭 of servers at a time. 
This avoids taking a</p><p>large swathe of the server cluster offline and causes service brownout.</p><p>143</p><p>How do we find nearby restaurants on Yelp?</p><p>Here are some design details behind the scenes.</p><p>There are two key services (see the diagram below):</p><p>- 𝐁𝐮𝐬𝐢𝐧𝐞𝐬𝐬 𝐒𝐞𝐫𝐯𝐢𝐜𝐞</p><p>144</p><p>- Add/delete/update restaurant information</p><p>- Customers view restaurant details</p><p>- 𝐋𝐨𝐜𝐚𝐥-𝐛𝐚𝐬𝐞𝐝 𝐒𝐞𝐫𝐯𝐢𝐜𝐞 (𝐋𝐁𝐒)</p><p>- Given a radius and location, return a list of nearby restaurants</p><p>How are the restaurant locations stored in the database so that LBS</p><p>can return nearby restaurants efficiently?</p><p>Store the latitude and longitude of restaurants in the database? The</p><p>query will be very inefficient when you need to calculate the distance</p><p>between you and every restaurant.</p><p>One way to speed up the search is using the 𝐠𝐞𝐨𝐡𝐚𝐬𝐡 𝐚𝐥𝐠𝐨𝐫𝐢𝐭𝐡𝐦.</p><p>First, divide the planet into four quadrants along with the prime</p><p>meridian and equator:</p><p>- Latitude range [-90, 0] is represented by 0</p><p>- Latitude range [0, 90] is represented by 1</p><p>- Longitude range [-180, 0] is represented by 0</p><p>- Longitude range [0, 180] is represented by 1</p><p>Second, divide each grid into four smaller grids. Each grid can be</p><p>represented by alternating between longitude bit and latitude bit.</p><p>So when you want to search for the nearby restaurants in the</p><p>red-highlighted block, you can write SQL like:</p><p>SELECT * FROM geohash_index WHERE geohash LIKE `01%`</p><p>Geohash has some limitations. There can be a lot of restaurants in one</p><p>block (downtown New York), but none in another block (ocean). So</p><p>there are other more complicated algorithms to optimize the process.</p><p>Let me know if you are interested in the details.</p><p>145</p><p>One picture is worth more than a thousand words. Log4j from attack to</p><p>prevention in one illustration.</p><p>Credit GovCERT</p><p>Link:</p><p>https://www.govcert.ch/blog/zero-day-exploit-targeting-popular-java-libr</p><p>ary-log4j/</p><p>146</p><p>How does a modern stock exchange achieve</p><p>microsecond latency?</p><p>The principal is:</p><p>𝐃𝐨 𝐥𝐞𝐬𝐬 𝐨𝐧 𝐭𝐡𝐞 𝐜𝐫𝐢𝐭𝐢𝐜𝐚𝐥 𝐩𝐚𝐭𝐡!</p><p>- Fewer tasks on the critical path</p><p>- Less time on each task</p><p>- Fewer network hops</p><p>- Less disk usage</p><p>For the stock exchange, the critical path is:</p><p>- 𝐬𝐭𝐚𝐫𝐭: an order comes into the order manager</p><p>- mandatory risk checks</p><p>- the order gets matched and the execution is sent back</p><p>- 𝐞𝐧𝐝: the execution comes</p><p>out of the order manager</p><p>Other non-critical tasks should be removed from the critical path.</p><p>We put together a design as shown in the diagram:</p><p>147</p><p>- deploy all the components in a single giant server (no containers)</p><p>- use shared memory as an event bus to communicate among the</p><p>components, no hard disk</p><p>- key components like Order Manager and Matching Engine are</p><p>single-threaded on the critical path, and each pinned to a CPU so that</p><p>there is 𝐧𝐨 𝐜𝐨𝐧𝐭𝐞𝐱𝐭 𝐬𝐰𝐢𝐭𝐜𝐡 and 𝐧𝐨 𝐥𝐨𝐜𝐤𝐬</p><p>- the single-threaded application loop executes tasks one by one in</p><p>sequence</p><p>- other components listen on the event bus and react accordingly</p><p>148</p><p>Match buy and sell orders</p><p>Stocks go up and down. Do you know what data structure is used to</p><p>efficiently match buy and sell orders?</p><p>Stock exchanges use the data structure called 𝐨𝐫𝐝𝐞𝐫 𝐛𝐨𝐨𝐤𝐬. 
An order</p><p>book is an electronic list of buy and sell orders, organized by price</p><p>levels. It has a buy book and a sell book, where each side of the book</p><p>contains a bunch of price levels, and each price level contains a list of</p><p>orders (first in first out).</p><p>The image is an example of price levels and the queued quantity in</p><p>each price level.</p><p>So what happens when you place a market order to buy 2700 shares</p><p>in the diagram?</p><p>- The buy order is matched with all the sell onrders at price 100.10,</p><p>and the first order at price 100.11 (illustrated in light red).</p><p>149</p><p>- Now because of the big buy order which “eats up” the first price level</p><p>on the sell book, the best ask price goes up from 100.10 to 100.11.</p><p>- So when the market is bullish, people tend to buy stocks, and the</p><p>price goes up and up.</p><p>An efficient data structure for an order book must satisfy:</p><p>- Constant lookup time. Operations include: get volume at a price level</p><p>or between price levels, query best bid/ask.</p><p>- Fast add/cancel/execute/update operations, preferably O(1) time</p><p>complexity. Operations include: place a new order, cancel an order,</p><p>and match an order.</p><p>150</p><p>Stock exchange design</p><p>The stock market has been volatile recently.</p><p>Coincidentally, we just finished a new chapter “Design a stock</p><p>exchange”. I’ll use plain English to explain what happens when you</p><p>place a stock buying order. The focus is on the exchange side.</p><p>Step 1: client places an order via the broker’s web or mobile app.</p><p>Step 2: broker sends the order to the exchange.</p><p>151</p><p>Step 3: the exchange client gateway performs operations such as</p><p>validation, rate limiting, authentication, normalization, etc, and sends</p><p>the order to the order manager.</p><p>Step 4: the order manager performs risk checks based on rules set by</p><p>the risk manager.</p><p>Step 5: once risk checks pass, the order manager checks if there is</p><p>enough balance in the wallet.</p><p>Step 6-7: the order is sent to the matching engine. The matching</p><p>engine sends back the execution result if a match is found. Both order</p><p>and execution results need to be sequenced first in the sequencer so</p><p>that matching determinism is guaranteed.</p><p>Step 8 - 10: execution result is passed all the way back to the client.</p><p>Step 11-12: market data (including the candlestick chart and order</p><p>book) are sent to the data service for consolidation. Brokers query the</p><p>data service to get the market data.</p><p>Step 13: the reporter composes all the necessary reporting fields (e.g.</p><p>client_id, price, quantity, order_type, filled_quantity,</p><p>remaining_quantity) and writes the data to the database for</p><p>persistence</p><p>A stock exchange requires 𝐞𝐱𝐭𝐫𝐞𝐦𝐞𝐥𝐲 𝐥𝐨𝐰 𝐥𝐚𝐭𝐞𝐧𝐜𝐲. While most web</p><p>applications are ok with hundreds of milliseconds latency, a stock</p><p>exchange requires 𝐦𝐢𝐜𝐫𝐨-𝐬𝐞𝐜𝐨𝐧𝐝 𝐥𝐞𝐯𝐞𝐥 𝐥𝐚𝐭𝐞𝐧𝐜𝐲. I’ll leave the latency</p><p>discussion for a separate post since the post is already long.</p><p>152</p><p>Design a payment system</p><p>Today is Cyber Monday. Here is how money moves when you click the</p><p>Buy button on Amazon or any of your favorite shopping websites.</p><p>I posted the same diagram last week for an overview and a few people</p><p>asked me about the detailed steps, so here you go:</p><p>1. 
1. When a user clicks the "Buy" button, a payment event is generated and sent to the payment service.

2. The payment service stores the payment event in the database.

3. Sometimes a single payment event may contain several payment orders. For example, you may select products from multiple sellers in a single checkout process. The payment service will call the payment executor for each payment order.

4. The payment executor stores the payment order in the database.

5. The payment executor calls an external PSP to finish the credit card payment.

6. After the payment executor has successfully executed the payment, the payment service will update the wallet to record how much money a given seller has.

7. The wallet service stores the updated balance information in the database.

8. After the wallet service has successfully updated the seller's balance information, the payment service will call the ledger to update it.

9. The ledger service appends the new ledger information to the database.

10. Every night the PSP or banks send settlement files to their clients. The settlement file contains the balance of the bank account, together with all the transactions that took place on this bank account during the day.

Design a flash sale system

Black Friday is coming. Designing a system with extremely high concurrency, high availability, and quick responsiveness needs to consider many aspects 𝐚𝐥𝐥 𝐭𝐡𝐞 𝐰𝐚𝐲 𝐟𝐫𝐨𝐦 𝐟𝐫𝐨𝐧𝐭𝐞𝐧𝐝 𝐭𝐨 𝐛𝐚𝐜𝐤𝐞𝐧𝐝. See the picture below for details:

𝐃𝐞𝐬𝐢𝐠𝐧 𝐩𝐫𝐢𝐧𝐜𝐢𝐩𝐥𝐞𝐬:

1. Less is more - fewer elements on the web page, fewer data queries to the database, fewer web requests, fewer system dependencies
2. Short critical path - fewer hops among services, or merge several services into one
3. Async - use message queues to handle high TPS
4. Isolation - isolate static and dynamic contents, isolate processes and databases for rare items
5. Overselling is bad. Deciding when to decrease the inventory is important
6. User experience is important. We definitely don't want to inform users that they have successfully placed orders but later tell them no items are actually available

Back-of-the-envelope estimation

Recently, a few engineers asked me whether we really need back-of-the-envelope estimation in a system design interview. I think it would be helpful to clarify.

Estimations are important because we need them to understand the scale of the system and justify the design. They help answer questions like:

- Do we really need a distributed solution?
- Is a cache layer necessary?
- Shall we choose data replication or sharding?

Here is an example of how the estimations shape the design decision. One interview question is to design a proximity service, and how to scale the geospatial index is a key part of it. Here are a few paragraphs we wrote to show why jumping to a sharding design without estimations is a bad idea:

"One common mistake about scaling the geospatial index is to quickly jump to a sharding scheme without considering the actual data size of the table. In our case, the full dataset for the geospatial index table is not large (the quadtree index only takes 1.71G of memory, and the storage requirement for the geohash index is similar).
The whole geospatial index</p><p>can easily fit in the working set of a modern database server. However,</p><p>depending on the read volume, a single database server might not</p><p>have enough CPU or network bandwidth to service all read requests. If</p><p>that is the case, it will be necessary to spread the read load among</p><p>multiple database servers.</p><p>There are two general approaches to spread the load of a relational</p><p>database server. We can add read replicas or shard the database.</p><p>Many engineers like to talk about sharding during interviews. However,</p><p>it might not be a good fit for the geohash table. Sharding is</p><p>complicated. The sharding logic has to be added to the application</p><p>layer. Sometimes, sharding is the only option. In this case though,</p><p>since everything can fit in the working set of</p><p>a database server, there is</p><p>no strong technical reason to shard the data among multiple servers.</p><p>157</p><p>A better approach, in this case, is to have a series of read replicas to</p><p>help with the read load. This method is much simpler to develop and</p><p>maintain. Thus, we recommend scaling the geospatial index table</p><p>through replicas.”</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>158</p><p>passwords, or typing in the</p><p>names of pets from childhood to answer security questions. SSO</p><p>removes this inconvenience and makes life online better. But how does</p><p>it work?</p><p>Basically, Single Sign-On (SSO) is an authentication scheme. It allows</p><p>a user to log in to different systems using a single ID.</p><p>The diagram below illustrates how SSO works.</p><p>Step 1: A user visits Gmail, or any email service. Gmail finds the user</p><p>is not logged in and so redirects them to the SSO authentication</p><p>server, which also finds the user is not logged in. As a result, the user</p><p>11</p><p>is redirected to the SSO login page, where they enter their login</p><p>credentials.</p><p>Steps 2-3: The SSO authentication server validates the credentials,</p><p>creates the global session for the user, and creates a token.</p><p>Steps 4-7: Gmail validates the token in the SSO authentication server.</p><p>The authentication server registers the Gmail system, and returns</p><p>“valid.” Gmail returns the protected resource to the user.</p><p>Step 8: From Gmail, the user navigates to another Google-owned</p><p>website, for example, YouTube.</p><p>Steps 9-10: YouTube finds the user is not logged in, and then requests</p><p>authentication. The SSO authentication server finds the user is already</p><p>logged in and returns the token.</p><p>Step 11-14: YouTube validates the token in the SSO authentication</p><p>server. The authentication server registers the YouTube system, and</p><p>returns “valid.” YouTube returns the protected resource to the user.</p><p>The process is complete and the user gets back access to their</p><p>account.</p><p>Over to you:</p><p>Question 1: have you implemented SSO in your projects? 
What is the most difficult part?

Question 2: what's your favorite sign-in method and why?

—

Check out our bestselling system design books.

Paperback: Amazon Digital: ByteByteGo.

How to store passwords safely in the database?

Let's take a look.

𝐓𝐡𝐢𝐧𝐠𝐬 𝐍𝐎𝐓 𝐭𝐨 𝐝𝐨

🔹 Storing passwords in plain text is not a good idea because anyone with internal access can see them.

🔹 Storing password hashes directly is not sufficient because it is prone to precomputation attacks, such as rainbow tables.

🔹 To mitigate precomputation attacks, we salt the passwords.

𝐖𝐡𝐚𝐭 𝐢𝐬 𝐬𝐚𝐥𝐭?

According to OWASP guidelines, "a salt is a unique, randomly generated string that is added to each password as part of the hashing process".

𝐇𝐨𝐰 𝐭𝐨 𝐬𝐭𝐨𝐫𝐞 𝐚 𝐩𝐚𝐬𝐬𝐰𝐨𝐫𝐝 𝐚𝐧𝐝 𝐬𝐚𝐥𝐭?

1⃣ A salt is not meant to be secret, and it can be stored in plain text in the database. It is used to ensure the hash result is unique to each password.

2⃣ The password can be stored in the database using the following format: 𝘩𝘢𝘴𝘩(𝘱𝘢𝘴𝘴𝘸𝘰𝘳𝘥 + 𝘴𝘢𝘭𝘵).

𝐇𝐨𝐰 𝐭𝐨 𝐯𝐚𝐥𝐢𝐝𝐚𝐭𝐞 𝐚 𝐩𝐚𝐬𝐬𝐰𝐨𝐫𝐝?

To validate a password, it can go through the following process:

1⃣ A client enters the password.

2⃣ The system fetches the corresponding salt from the database.

3⃣ The system appends the salt to the password and hashes it. Let's call the hashed value H1.

4⃣ The system compares H1 and H2, where H2 is the hash stored in the database. If they are the same, the password is valid.

Over to you: what other mechanisms can we use to ensure password safety?

—

Check out our bestselling system design books.

Paperback: Amazon Digital: ByteByteGo.

How does HTTPS work?

Hypertext Transfer Protocol Secure (HTTPS) is an extension of the Hypertext Transfer Protocol (HTTP). HTTPS transmits encrypted data using Transport Layer Security (TLS). If the data is hijacked online, all the hijacker gets is binary code.

How is the data encrypted and decrypted?

Step 1 - The client (browser) and the server establish a TCP connection.

Step 2 - The client sends a "client hello" to the server. The message contains a set of necessary encryption algorithms (cipher suites) and the latest TLS version it can support. The server responds with a "server hello" so the browser knows whether it can support the algorithms and TLS version.

The server then sends the SSL certificate to the client. The certificate contains the public key, host name, expiry dates, etc. The client validates the certificate.

Step 3 - After validating the SSL certificate, the client generates a session key and encrypts it using the public key. The server receives the encrypted session key and decrypts it with the private key.

Step 4 - Now that both the client and the server hold the same session key (symmetric encryption), the encrypted data is transmitted in a secure bi-directional channel.

Why does HTTPS switch to symmetric encryption during data transmission? There are two main reasons:

1. Security: The asymmetric encryption goes only one way. This means that if the server tries to send the encrypted data back to the client, anyone can decrypt the data using the public key.

2.
Server resources: The asymmetric encryption adds quite a lot of</p><p>mathematical overhead. It is not suitable for data transmissions in long</p><p>sessions.</p><p>Over to you: how much performance overhead does HTTPS add,</p><p>compared to HTTP?</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>17</p><p>How to learn design patterns?</p><p>Besides reading a lot of well-written code, a good book guides us like a</p><p>good teacher.</p><p>𝐇𝐞𝐚𝐝 𝐅𝐢𝐫𝐬𝐭 𝐃𝐞𝐬𝐢𝐠𝐧 𝐏𝐚𝐭𝐭𝐞𝐫𝐧𝐬, second edition, is the one I would</p><p>recommend.</p><p>When I began my journey in software engineering, I found it hard to</p><p>understand the classic textbook, 𝐃𝐞𝐬𝐢𝐠𝐧 𝐏𝐚𝐭𝐭𝐞𝐫𝐧𝐬, by the Gang of Four.</p><p>Luckily, I discovered Head First Design Patterns in the school library.</p><p>This book solved a lot of puzzles for me. When I went back to the</p><p>Design Patterns book, everything looked familiar and more</p><p>understandable.</p><p>Last year, I bought the second edition of Head First Design Patterns</p><p>and read through it. Here are a few things I like about the book:</p><p>18</p><p>🔹 This book solves the challenge of software’s abstract, “invisible”</p><p>nature. Software is difficult to build because we cannot see its</p><p>architecture; its details are embedded in the code and binary files. It is</p><p>even harder to understand software design patterns because these are</p><p>higher-level abstractions of the software. The book fixes this by using</p><p>visualization. There are lots of diagrams, arrows, and comments on</p><p>almost every page. If I do not understand the text, it’s no problem. The</p><p>diagrams explain things very well.</p><p>🔹We all have questions we are afraid to ask when we first learn a</p><p>new skill. Maybe we think it’s an easy one. This book is good at</p><p>tackling design patterns from the student’s point of view. It guides us by</p><p>asking our questions and clearly answering them. There is a Guru in</p><p>the book and there’s also a Student.</p><p>Over to you: which book helped you understand a challenging topic?</p><p>Why do you like it?</p><p>19</p><p>A visual guide on how to choose the right Database</p><p>Picking a database is a long-term commitment so the decision</p><p>shouldn’t be made lightly. 
The important thing to keep in mind is to</p><p>choose the right database for the right job.</p><p>20</p><p>Data can be structured (SQL table schema), semi-structured (JSON,</p><p>XML, etc.), and unstructured (Blob).</p><p>Common database categories include:</p><p>🔹 Relational</p><p>🔹 Columnar</p><p>🔹 Key-value</p><p>🔹 In-memory</p><p>🔹 Wide column</p><p>🔹 Time Series</p><p>🔹 Immutable ledger</p><p>🔹Geospatial</p><p>🔹Graph</p><p>🔹Document</p><p>🔹Text search</p><p>🔹Blob</p><p>Thanks, Satish Chandra Gupta</p><p>Over to you - Which database have you used for which workload?</p><p>21</p><p>Do you know how to generate globally unique IDs?</p><p>In this post, we will explore common requirements for IDs that are used</p><p>in social media such as Facebook, Twitter, and LinkedIn.</p><p>Requirements:</p><p>🔹Globally unique</p><p>🔹Roughly sorted by time</p><p>🔹Numerical values only</p><p>🔹64 bits</p><p>🔹Highly scalable, low latency</p><p>22</p><p>The implementation details of the algorithms can be found online so</p><p>we will not go into detail here.</p><p>Over to you: What kind of ID generators have you used?</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>23</p><p>How does Twitter work?</p><p>This post is a summary of a tech talk given by Twitter</p><p>in 2013. Let’s</p><p>take a look.</p><p>𝐓𝐡𝐞 𝐋𝐢𝐟𝐞 𝐨𝐟 𝐚 𝐓𝐰𝐞𝐞𝐭:</p><p>1⃣ A tweet comes in through the Write API.</p><p>2⃣ The Write API routes the request to the Fanout service.</p><p>3⃣ The Fanout service does a lot of processing and stores them in the</p><p>Redis cache.</p><p>24</p><p>4⃣ The Timeline service is used to find the Redis server that has the</p><p>home timeline on it.</p><p>5⃣ A user pulls their home timeline through the Timeline service.</p><p>𝐒𝐞𝐚𝐫𝐜𝐡 & 𝐃𝐢𝐬𝐜𝐨𝐯𝐞𝐫𝐲</p><p>🔹 Ingester: annotates and tokenizes Tweets so the data can be</p><p>indexed.</p><p>🔹 Earlybird: stores search index.</p><p>🔹 Blender: creates the search and discovery timelines.</p><p>𝐏𝐮𝐬𝐡 𝐂𝐨𝐦𝐩𝐮𝐭𝐞</p><p>🔹HTTP push</p><p>🔹Mobile push</p><p>Disclaimer: This article is based on the tech talk given by Twitter in</p><p>2013 (https://bit.ly/3vNfjRp). Even though many years have passed, it’s</p><p>still quite relevant. I redraw the diagram as the original diagram is</p><p>difficult to read.</p><p>Over to you:</p><p>Do you use Twitter? What are some of the biggest differences between</p><p>LinkedIn and Twitter that might shape their system architectures?</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>25</p><p>What is the difference between Process and Thread?</p><p>To better understand this question, let’s first take a look at what is a</p><p>Program. A 𝐏𝐫𝐨𝐠𝐫𝐚𝐦 is an executable file containing a set of</p><p>instructions and passively stored on disk. One program can have</p><p>multiple processes. For example, the Chrome browser creates a</p><p>different process for every single tab.</p><p>A 𝐏𝐫𝐨𝐜𝐞𝐬𝐬 means a program is in execution. When a program is loaded</p><p>into the memory and becomes active, the program becomes a</p><p>process. The process requires some essential resources such as</p><p>registers, program counter, and stack.</p><p>26</p><p>A 𝐓𝐡𝐫𝐞𝐚𝐝 is the smallest unit of execution within a process.</p><p>The following process explains the relationship between program,</p><p>process, and thread.</p><p>1. The program contains a set of instructions.</p><p>2. The program is loaded into memory. It becomes one or more</p><p>running processes.</p><p>3. 
When a process starts, it is assigned memory and resources. A</p><p>process can have one or more threads. For example, in the Microsoft</p><p>Word app, a thread might be responsible for spelling checking and the</p><p>other thread for inserting text into the doc.</p><p>Main differences between process and thread:</p><p>🔹 Processes are usually independent, while threads exist as subsets</p><p>of a process.</p><p>🔹 Each process has its own memory space. Threads that belong to</p><p>the same process share the same memory.</p><p>🔹 A process is a heavyweight operation. It takes more time to create</p><p>and terminate.</p><p>🔹 Context switching is more expensive between processes.</p><p>🔹 Inter-thread communication is faster for threads.</p><p>Over to you:</p><p>1). Some programming languages support coroutine. What is the</p><p>difference between coroutine and thread?</p><p>2). How to list running processes in Linux?</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>27</p><p>Interview Question: design Google Docs</p><p>1⃣ Clients send document editing operations to the WebSocket Server.</p><p>2⃣ The real-time communication is handled by the WebSocket Server.</p><p>3⃣ Documents operations are persisted in the Message Queue.</p><p>28</p><p>4⃣ The File Operation Server consumes operations produced by clients</p><p>and generates transformed operations using collaboration algorithms.</p><p>5⃣ Three types of data are stored: file metadata, file content, and</p><p>operations.</p><p>One of the biggest challenges is real-time conflict resolution. Common</p><p>algorithms include:</p><p>🔹 Operational transformation (OT)</p><p>🔹 Differential Synchronization (DS)</p><p>🔹 Conflict-free replicated data type (CRDT)</p><p>Google Doc uses OT according to its Wikipedia page and CRDT is an</p><p>active area of research for real-time concurrent editing.</p><p>Over to you - Have you encountered any issues while using Google</p><p>Docs? If so, what do you think might have caused the issue?</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>29</p><p>Deployment strategies</p><p>Deploying or upgrading services is risky. In this post, we explore risk</p><p>mitigation strategies.</p><p>The diagram below illustrates the common ones.</p><p>𝐌𝐮𝐥𝐭𝐢-𝐒𝐞𝐫𝐯𝐢𝐜𝐞 𝐃𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭</p><p>In this model, we deploy new changes to multiple services</p><p>simultaneously. This approach is easy to implement. But since all the</p><p>services are upgraded at the same time, it is hard to manage and test</p><p>dependencies. It’s also hard to rollback safely.</p><p>30</p><p>𝐁𝐥𝐮𝐞-𝐆𝐫𝐞𝐞𝐧 𝐃𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭</p><p>With blue-green deployment, we have two identical environments: one</p><p>is staging (blue) and the other is production (green). The staging</p><p>environment is one version ahead of production. Once testing is done</p><p>in the staging environment, user traffic is switched to the staging</p><p>environment, and the staging becomes the production. This</p><p>deployment strategy is simple to perform rollback, but having two</p><p>identical production quality environments could be expensive.</p><p>𝐂𝐚𝐧𝐚𝐫𝐲 𝐃𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭</p><p>A canary deployment upgrades services gradually, each time to a</p><p>subset of users. It is cheaper than blue-green deployment and easy to</p><p>perform rollback. However, since there is no staging environment, we</p><p>have to test on production. 
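One way to picture the gradual rollout is a weighted router in front of the old and new versions. The sketch below is a toy illustration with invented service names and ramp percentages, not a real load balancer configuration:

import random

def pick_backend(canary_weight_percent: int) -> str:
    """Send roughly canary_weight_percent of requests to the canary release."""
    return "v2-canary" if random.uniform(0, 100) < canary_weight_percent else "v1-stable"

# Ramp-up plan: start at 1% of traffic and widen the slice step by step.
for weight in (1, 5, 25, 50, 100):
    hits = sum(pick_backend(weight) == "v2-canary" for _ in range(10_000))
    print(f"target {weight}% -> observed ~{hits / 100:.1f}% of requests on the canary")

Each ramp step would normally be gated on the canary's error rate and latency staying healthy before the weight is increased.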
This process is more complicated because</p><p>we need to monitor the canary while gradually migrating more and</p><p>more users away from the old version.</p><p>𝐀/𝐁 𝐓𝐞𝐬𝐭</p><p>In the A/B test, different versions of services run in production</p><p>simultaneously. Each version runs an “experiment” for a subset of</p><p>users. A/B test is a cheap method to test new features in production.</p><p>We need to control the deployment process in case some features are</p><p>pushed to users by accident.</p><p>Over to you - Which deployment strategy have you used? Did you</p><p>witness any deployment-related outages in production and why did</p><p>they happen?</p><p>31</p><p>Flowchart of how slack decides to send a notification</p><p>It is a great example of why a simple feature may take much longer to</p><p>develop than many people think.</p><p>When we have a great design, users may not notice the complexity</p><p>because it feels like the feature is just working as intended.</p><p>What’s your takeaway from this diagram?</p><p>Image source:</p><p>https://slack.engineering/reducing-slacks-memory-footprint/</p><p>32</p><p>How does Amazon build and operate the software?</p><p>In 2019, Amazon released The Amazon Builders' Library. It contains</p><p>architecture-based articles that describe how Amazon architects,</p><p>releases, and operates technology.</p><p>As of today, it published 26 articles. It took me two weekends to go</p><p>through all the articles. I’ve had great fun and learned a lot. Here are</p><p>some of my favorites:</p><p>🔹Making retries safe with idempotent APIs</p><p>🔹Timeouts, retries, and backoff with jitter</p><p>🔹Beyond five 9s: Lessons from our highest available data planes</p><p>🔹Caching challenges and strategies</p><p>🔹Ensuring rollback safety during deployments</p><p>🔹Going faster with continuous delivery</p><p>33</p><p>🔹Challenges with distributed systems</p><p>🔹Amazon's approach to high-availability deployment</p><p>Over to you: what’s your favorite place to learn system design and</p><p>design principles?</p><p>Link to The Amazon Builders' Library: aws.amazon.com/builders-library</p><p>34</p><p>How to design a secure web API access for your</p><p>website?</p><p>When we open web API access to users, we need to make sure each</p><p>API call is authenticated. This means the user must be who they claim</p><p>to be.</p><p>In this post, we explore two common ways:</p><p>1. Token based authentication</p><p>2. HMAC (Hash-based Message Authentication Code) authentication</p><p>The diagram below illustrates how they work.</p><p>35</p><p>𝐓𝐨𝐤𝐞𝐧 𝐛𝐚𝐬𝐞𝐝</p><p>Step 1 - the user enters their password into the client, and the client</p><p>sends the password to the Authentication Server.</p><p>36</p><p>Step 2 - the Authentication Server authenticates the credentials and</p><p>generates a token with an expiry time.</p><p>Steps 3 and 4 - now the client can send requests to access server</p><p>resources with the token in the HTTP header. This access is valid until</p><p>the token expires.</p><p>𝐇𝐌𝐀𝐂 𝐛𝐚𝐬𝐞𝐝</p><p>This mechanism generates a Message Authentication Code</p><p>(signature) by using</p><p>a hash function (SHA256 or MD5).</p><p>Steps 1 and 2 - the server generates two keys, one is Public APP ID</p><p>(public key) and the other one is API Key (private key).</p><p>Step 3 - we now generate a HMAC signature on the client side (hmac</p><p>A). 
This signature is generated with a set of attributes listed in the</p><p>diagram.</p><p>Step 4 - the client sends requests to access server resources with</p><p>hmac A in the HTTP header.</p><p>Step 5 - the server receives the request which contains the request</p><p>data and the authentication header. It extracts the necessary attributes</p><p>from the request and uses the API key that’s stored on the server side</p><p>to generate a signature (hmac B.)</p><p>Steps 6 and 7 - the server compares hmac A (generated on the client</p><p>side) and hmac B (generated on the server side). If they are matched,</p><p>the requested resource will be returned to the client.</p><p>Question - How does HMAC authentication ensure data integrity? Why</p><p>do we include “request timestamp” in HMAC signature generation?</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>37</p><p>How do microservices collaborate and interact with each</p><p>other?</p><p>There are two ways: 𝐨𝐫𝐜𝐡𝐞𝐬𝐭𝐫𝐚𝐭𝐢𝐨𝐧 and 𝐜𝐡𝐨𝐫𝐞𝐨𝐠𝐫𝐚𝐩𝐡𝐲.</p><p>The diagram below illustrates the collaboration of microservices.</p><p>Choreography is like having a choreographer set all the rules. Then the</p><p>dancers on stage (the microservices) interact according to them.</p><p>Service choreography describes this exchange of messages and the</p><p>rules by which the microservices interact.</p><p>Orchestration is different. The orchestrator acts as a center of</p><p>authority. It is responsible for invoking and combining the services. It</p><p>38</p><p>describes the interactions between all the participating services. It is</p><p>just like a conductor leading the musicians in a musical symphony. The</p><p>orchestration pattern also includes the transaction management</p><p>among different services.</p><p>The benefits of orchestration:</p><p>1. Reliability - orchestration has built-in transaction management and</p><p>error handling, while choreography is point-to-point communications</p><p>and the fault tolerance scenarios are much more complicated.</p><p>2. Scalability - when adding a new service into orchestration, only the</p><p>orchestrator needs to modify the interaction rules, while in</p><p>choreography all the interacting services need to be modified.</p><p>Some limitations of orchestration:</p><p>1. Performance - all the services talk via a centralized orchestrator, so</p><p>latency is higher than it is with choreography. Also, the throughput is</p><p>bound to the capacity of the orchestrator.</p><p>2. Single point of failure - if the orchestrator goes down, no services</p><p>can talk to each other. To mitigate this, the orchestrator must be highly</p><p>available.</p><p>Real-world use case: Netflix Conductor is a microservice orchestrator</p><p>and you can read more details on the orchestrator design.</p><p>Question - Have you used orchestrator products in production? 
What</p><p>are their pros & cons?</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>39</p><p>What are the differences between Virtualization</p><p>(VMware) and Containerization (Docker)?</p><p>The diagram below illustrates the layered architecture of virtualization</p><p>and containerization.</p><p>“Virtualization is a technology that allows you to create multiple</p><p>simulated environments or dedicated resources from a single, physical</p><p>hardware system” [1].</p><p>“Containerization is the packaging together of software code with all its</p><p>necessary components like libraries, frameworks, and other</p><p>dependencies so that they are isolated in their own "container" [2].</p><p>The major differences are:</p><p>🔹 In virtualization, the hypervisor creates an abstraction layer over</p><p>hardware, so that multiple operating systems can run alongside each</p><p>other. This technique is considered to be the first generation of cloud</p><p>computing.</p><p>🔹Containerization is considered to be a lightweight version of</p><p>virtualization, which virtualizes the operating system instead of</p><p>hardware. Without the hypervisor, the containers enjoy faster resource</p><p>provisioning. All the resources (including code, dependencies) that are</p><p>40</p><p>needed to run the application or microservice are packaged together,</p><p>so that the applications can run anywhere.</p><p>Question: how much performance differences have you observed in</p><p>production between virtualization, containerization, and bare-metal?</p><p>Image Source: https://lnkd.in/gaPYcGTz</p><p>Sources:</p><p>[1] Understanding virtualization: https://lnkd.in/gtQY9gkx</p><p>[2] What is containerization?: https://lnkd.in/gm4Qv_x2</p><p>41</p><p>Which cloud provider should be used when building a</p><p>big data solution?</p><p>The diagram below illustrates the detailed comparison of AWS, Google</p><p>Cloud, and Microsoft Azure.</p><p>42</p><p>The common parts of the solutions:</p><p>1. Data ingestion of structured or unstructured data.</p><p>2. Raw data storage.</p><p>3. Data processing, including filtering, transformation, normalization,</p><p>etc.</p><p>4. Data warehouse, including key-value storage, relational database,</p><p>OLAP database, etc.</p><p>5. Presentation layer with dashboards and real-time notifications.</p><p>It is interesting to see different cloud vendors have different names for</p><p>the same type of products.</p><p>For example, the first step and the last step both use the serverless</p><p>product. The product is called “lambda” in AWS, and “function” in</p><p>Azure and Google Cloud.</p><p>Question - which products have you used in production? What kind of</p><p>application did you use it for?</p><p>Source: S.C. Gupta’s post</p><p>43</p><p>How to avoid crawling duplicate URLs at Google scale?</p><p>Option 1: Use a Set data structure to check if a URL already exists or</p><p>not. Set is fast, but it is not space-efficient.</p><p>Option 2: Store URLs in a database and check if a new URL is in the</p><p>database. This can work but the load to the database will be very high.</p><p>Option 3: 𝐁𝐥𝐨𝐨𝐦 𝐟𝐢𝐥𝐭𝐞𝐫. This option is preferred. Bloom filter was</p><p>proposed by Burton Howard Bloom in 1970. 
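The next paragraphs explain how it works. As a companion, here is a minimal sketch of the idea: a bit array plus several seeded hash functions (three here, matching the example in this section). The array size and the SHA-256-based hashing are arbitrary choices for illustration only:

import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k seeded hash functions."""
    def __init__(self, size_in_bits=1024, num_hashes=3):
        self.size = size_in_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_in_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not seen"; True means "probably seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

crawled = BloomFilter()
crawled.add("www.myweb1.com")
crawled.add("www.myweb2.com")
print(crawled.might_contain("www.myweb1.com"))   # True: probably already crawled
print(crawled.might_contain("www.other.com"))    # False: definitely not crawled (barring a rare false positive for other keys)

Production implementations use fast non-cryptographic hashes such as murmur or xxhash (mentioned later in this section) and size the bit array and the number of hash functions based on the expected number of elements and the acceptable false-positive rate.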
It is a probabilistic data</p><p>structure that is used to test whether an element is a member of a set.</p><p>🔹 false: the element is definitely not in the set.</p><p>🔹 true: the element is probably in the set.</p><p>False-positive matches are possible, but false negatives are not.</p><p>The diagram below illustrates how the Bloom filter works. The basic</p><p>data structure for the Bloom filter is Bit Vector. Each bit represents a</p><p>hashed value.</p><p>44</p><p>Step 1: To add an element to the bloom filter, we feed it to 3 different</p><p>hash functions (A, B, and C) and set the bits at the resulting positions.</p><p>Note that both “www.myweb1.com” and “www.myweb2.com” mark the</p><p>same bit with 1 at index 5. False positives are possible because a bit</p><p>might be set by another element.</p><p>Step 2: When testing the existence of a URL string, the same hash</p><p>functions A, B, and C are applied to the URL string. If all three bits are</p><p>45</p><p>1, then the URL may exist in the dataset; if any of the bits is 0, then the</p><p>URL definitely does not exist in the dataset.</p><p>Hash function choices are important. They must be uniformly</p><p>distributed and fast. For example, RedisBloom and Apache Spark use</p><p>murmur, and InfluxDB uses xxhash.</p><p>Question - In our example, we used three hash functions. How many</p><p>hash functions should we use in reality? What are the trade-offs?</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>46</p><p>Why is a solid-state drive (SSD) fast?</p><p>“A solid state drive reads up to 10 times faster and writes up to 20</p><p>times faster than a hard disk drive.” [1].</p><p>“An SSD is a flash-memory based data storage device. Bits are stored</p><p>into cells, which are made of floating-gate transistors. SSDs are made</p><p>entirely of electronic components, there are no moving or mechanical</p><p>parts like in hard drives (HDD)” [2].</p><p>The diagram below illustrates the SSD architecture.</p><p>47</p><p>Step 1: “Commands come from the user through the host interface” [2].</p><p>The interface can be Serial ATA (SATA) or PCI Express (PCIe).</p><p>Step 2: “The processor in the SSD controller takes the commands and</p><p>passes them to the flash controller” [2].</p><p>Step 3: “SSDs also have embedded RAM memory, generally for</p><p>caching</p><p>purposes and to store mapping information” [2].</p><p>Step 4: “The packages of NAND flash memory are organized in gangs,</p><p>over multiple channels” [2].</p><p>The second diagram illustrates how the logical and physical pages are</p><p>mapped, and why this architecture is fast.</p><p>SSD controller operates multiple FLASH particles in parallel, greatly</p><p>improving the underlying bandwidth. When we need to write more than</p><p>one page, the SSD controller can write them in parallel [3], whereas</p><p>the HDD has a single head and it can only read from one head at a</p><p>time.</p><p>Every time a HOST Page is written, the SSD controller finds a Physical</p><p>Page to write the data and this mapping is recorded. 
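As a toy model of that mapping table (often called the flash translation layer), the sketch below redirects each logical page write to a fresh physical page and remembers where the latest copy lives. It is only meant to illustrate the idea of the mapping, not how a real SSD controller allocates pages or runs garbage collection:

class ToyFlashTranslationLayer:
    """Maps host (logical) page numbers to physical flash pages."""
    def __init__(self, total_physical_pages):
        self.mapping = {}                          # host page -> physical page
        self.free_pages = list(range(total_physical_pages))
        self.flash = {}                            # physical page -> data

    def write(self, host_page, data):
        physical = self.free_pages.pop(0)          # always write to a fresh page
        self.flash[physical] = data
        old = self.mapping.get(host_page)
        self.mapping[host_page] = physical         # record the new mapping
        if old is not None:
            # The old physical page now holds stale data; a real SSD reclaims it
            # later via garbage collection.
            self.flash.pop(old, None)

    def read(self, host_page):
        return self.flash[self.mapping[host_page]]

ftl = ToyFlashTranslationLayer(total_physical_pages=8)
ftl.write(0, "v1")
ftl.write(0, "v2")        # an overwrite goes to a new physical page
print(ftl.read(0))        # "v2", served via the updated mapping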
With this mapping, the next time HOST reads a HOST Page, the SSD knows where to read the data from FLASH [3].

Question - What are the main differences between SSD and HDD?

If you are interested in the architecture, I recommend reading Coding for SSDs by Emmanuel Goossaert in reference [2].

Sources:
[1] SSD or HDD: Which Is Right for You?: https://www.avg.com/en/signal/ssd-hdd-which-is-best
[2] Coding for SSDs: https://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/
[3] Overview of SSD Structure and Basic Working Principle: https://www.elinfor.com/knowledge/overview-of-ssd-structure-and-basic-working-principle1-p-11203

Handling a large-scale outage

This is a true story about handling a large-scale outage, written by Sahn Lam, a Staff Engineer at Discord.

About 10 years ago, I witnessed the most impactful UI bug of my career.

It was 9PM on a Friday. I was on the team responsible for one of the largest social games at the time. It had about 30 million DAU. I just so happened to glance at the operational dashboard before shutting down for the night.

Every line on the dashboard was at zero.

At that very moment, I got a phone call from my boss. He said the entire game was down. Firefighting mode. Full on.

Everything had shut down. Every single instance on AWS was terminated. HA proxy instances, PHP web servers, MySQL databases, Memcache nodes, everything.

It took 50 people 10 hours to bring everything back up. It was quite a feat. That in itself is a story for another day.

We used a cloud management software vendor to manage our AWS deployment. This was before Infrastructure as Code was a thing. There was no Terraform. It was so early in cloud computing, and we were so big, that AWS required advance warning before we scaled up.

What had gone wrong? The software vendor had introduced a bug that week in their confirmation dialog flow. When terminating a subset of nodes in the UI, it would correctly show in the confirmation dialog box the list of nodes to be terminated, but under the hood, it terminated everything.

Shortly before 9PM that fateful evening, one of our poor SREs fulfilled a routine request and terminated an unused Memcache pool. I could only imagine the horror and the phone conversation that ensued.

What kind of code structure could allow this disastrous bug to slip through? We could only guess. We never received a full explanation.

What are some of the most impactful software bugs you encountered in your career?

AWS Lambda behind the scenes

Serverless is one of the hottest topics in cloud services. How does AWS Lambda work behind the scenes?

Lambda is a 𝐬𝐞𝐫𝐯𝐞𝐫𝐥𝐞𝐬𝐬 computing service provided by Amazon Web Services (AWS), which runs functions in response to events.

𝐅𝐢𝐫𝐞𝐜𝐫𝐚𝐜𝐤𝐞𝐫 𝐌𝐢𝐜𝐫𝐨𝐕𝐌

Firecracker is the engine powering all of the Lambda functions [1]. It is a virtualization technology developed at Amazon and written in Rust.

The diagram below illustrates the isolation model for AWS Lambda Workers.

Lambda functions run within a sandbox, which provides a minimal Linux userland, some common libraries and utilities.
It creates the</p><p>Execution environment (worker) on EC2 instances.</p><p>How are lambdas initiated and invoked? There are two ways.</p><p>𝐒𝐲𝐧𝐜𝐡𝐫𝐨𝐧𝐨𝐮𝐬 𝐞𝐱𝐞𝐜𝐮𝐭𝐢𝐨𝐧</p><p>Step1: "The Worker Manager communicates with a Placement Service</p><p>which is responsible to place a workload on a location for the given</p><p>host (it’s provisioning the sandbox) and returns that to the Worker</p><p>Manager" [2].</p><p>Step 2: "The Worker Manager can then call 𝘐𝘯𝘪𝘵 to initialize the function</p><p>for execution by downloading the Lambda package from S3 and</p><p>setting up the Lambda runtime" [2]</p><p>Step 3: The Frontend Worker is now able to call 𝘐𝘯𝘷𝘰𝘬𝘦 [2].</p><p>𝐀𝐬𝐲𝐧𝐜𝐡𝐫𝐨𝐧𝐨𝐮𝐬 𝐞𝐱𝐞𝐜𝐮𝐭𝐢𝐨𝐧</p><p>Step 1: The Application Load Balancer forwards the invocation to an</p><p>available Frontend which places the event onto an internal</p><p>queue(SQS).</p><p>Step 2: There is "a set of pollers assigned to this internal queue which</p><p>are responsible for polling it and moving the event onto a Frontend</p><p>synchronously. After it’s been placed onto the Frontend it follows the</p><p>synchronous invocation call pattern which we covered earlier" [2].</p><p>Question: Can you think of any use cases for AWS Lambda?</p><p>Sources:</p><p>[1] AWS Lambda whitepaper:</p><p>https://docs.aws.amazon.com/whitepapers/latest/security-overview-aw</p><p>s-lambda/lambda-executions.html</p><p>[2] Behind the scenes, Lambda:</p><p>https://www.bschaatsbergen.com/behind-the-scenes-lambda/</p><p>Image source: [1] [2]</p><p>52</p><p>HTTP 1.0 -> HTTP 1.1 -> HTTP 2.0 -> HTTP 3.0 (QUIC).</p><p>What problem does each generation of HTTP solve?</p><p>The diagram below illustrates the key features.</p><p>🔹HTTP 1.0 was finalized and fully documented in 1996. Every</p><p>request to the same server requires a separate TCP connection.</p><p>🔹HTTP 1.1 was published in 1997. A TCP connection can be left</p><p>open for reuse (persistent connection), but it doesn’t solve the HOL</p><p>(head-of-line) blocking issue.</p><p>HOL blocking - when the number of allowed parallel requests in the</p><p>browser is used up, subsequent requests need to wait for the former</p><p>ones to complete.</p><p>53</p><p>🔹HTTP 2.0 was published in 2015. It addresses HOL issue through</p><p>request multiplexing, which eliminates HOL blocking at the application</p><p>layer, but HOL still exists at the transport (TCP) layer.</p><p>As you can see in the diagram, HTTP 2.0 introduced the concept of</p><p>HTTP “streams”: an abstraction that allows multiplexing different HTTP</p><p>exchanges onto the same TCP connection. Each stream doesn’t need</p><p>to be sent in order.</p><p>🔹HTTP 3.0 first draft was published in 2020. It is the proposed</p><p>successor to HTTP 2.0. It uses QUIC instead of TCP for the underlying</p><p>transport protocol, thus removing HOL blocking in the transport layer.</p><p>QUIC is based on UDP. It introduces streams as first-class citizens at</p><p>the transport layer. QUIC streams share the same QUIC connection,</p><p>so no additional handshakes and slow starts are required to create</p><p>new ones, but QUIC streams are delivered independently such that in</p><p>most cases packet loss affecting one stream doesn't affect others.</p><p>Question: When shall we upgrade to HTTP 3.0? 
Any pros & cons you</p><p>can think of?</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>54</p><p>How to scale a website to support millions of users?</p><p>We will explain this step-by-step.</p><p>The diagram below illustrates the evolution of a simplified eCommerce</p><p>website. It goes from a monolithic design on one single server, to a</p><p>service-oriented/microservice architecture.</p><p>55</p><p>56</p><p>Suppose we have two services: inventory service (handles product</p><p>descriptions and inventory management) and user service (handles</p><p>user information, registration, login, etc.).</p><p>Step 1 - With the growth of the user base, one single application server</p><p>cannot handle the traffic anymore. We put the application server and</p><p>the database server into two separate servers.</p><p>Step 2 - The business continues to grow, and a single application</p><p>server is no longer enough. So we deploy a cluster of application</p><p>servers.</p><p>Step 3 - Now the incoming requests have to be routed to multiple</p><p>application servers, how can we ensure each application server gets</p><p>an even load? The load balancer handles this nicely.</p><p>Step 4 - With the business continuing to grow, the database might</p><p>become</p><p>the bottleneck. To mitigate this, we separate reads and writes</p><p>in a way that frequent read queries go to read replicas. With this setup,</p><p>the throughput for the database writes can be greatly increased.</p><p>Step 5 - Suppose the business continues to grow. One single database</p><p>cannot handle the load on both the inventory table and user table. We</p><p>have a few options:</p><p>1. Vertical partition. Adding more power (CPU, RAM, etc.) to the</p><p>database server. It has a hard limit.</p><p>2. Horizontal partition by adding more database servers.</p><p>3. Adding a caching layer to offload read requests.</p><p>Step 6 - Now we can modularize the functions into different services.</p><p>The architecture becomes service-oriented / microservice.</p><p>Question: what else do we need to support an e-commerce website at</p><p>Amazon’s scale?</p><p>57</p><p>DevOps Books</p><p>Some 𝐃𝐞𝐯𝐎𝐩𝐬 books I find enlightening:</p><p>🔹Accelerate - presents both the findings and the science behind</p><p>measuring software delivery performance.</p><p>🔹Continuous Delivery - introduces automated architecture</p><p>management and data migration. It also pointed out key problems and</p><p>optimal solutions in each area.</p><p>🔹Site Reliability Engineering - famous Google SRE book. It explains</p><p>the whole life cycle of Google’s development, deployment, and</p><p>monitoring, and how to manage the world’s biggest software systems.</p><p>🔹Effective DevOps - provides effective ways to improve team</p><p>coordination.</p><p>58</p><p>🔹The Phoenix Project - a classic novel about effectiveness and</p><p>communications. IT work is like manufacturing plant work, and a</p><p>system must be established to streamline the workflow. Very</p><p>interesting read!</p><p>🔹The DevOps Handbook - introduces product development, quality</p><p>assurance, IT operations, and information security.</p><p>What’s your favorite dev-ops book?</p><p>59</p><p>Why is Kafka fast?</p><p>Kafka achieves low latency message delivery through Sequential I/O</p><p>and Zero Copy Principle. 
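The step-by-step comparison below shows exactly where the copies happen. As a small illustration of the zero-copy half, this sketch streams a file to a connected socket with os.sendfile on Linux, which asks the kernel to move bytes from the page cache toward the network without bouncing them through user space; Kafka achieves the equivalent through Java's FileChannel.transferTo. The function name is invented and error handling is omitted:

import os
import socket

def serve_file_zero_copy(conn: socket.socket, path: str) -> None:
    """Send a file to a connected socket without copying it through user space."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            # The kernel moves the bytes directly; the application never touches them.
            sent = os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)
            if sent == 0:
                break
            offset += sent

A portable fallback would read the file in chunks and write each chunk to the socket, which is exactly the extra copying that the non-zero-copy path in the diagram goes through.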
The same techniques are commonly used in</p><p>many other messaging/streaming platforms.</p><p>The diagram below illustrates how the data is transmitted between</p><p>producer and consumer, and what zero-copy means.</p><p>🔹Step 1.1 - 1.3: Producer writes data to the disk</p><p>60</p><p>🔹Step 2: Consumer reads data without zero-copy</p><p>2.1: The data is loaded from disk to OS cache</p><p>2.2 The data is copied from OS cache to Kafka application</p><p>2.3 Kafka application copies the data into the socket buffer</p><p>2.4 The data is copied from socket buffer to network card</p><p>2.5 The network card sends data out to the consumer</p><p>🔹Step 3: Consumer reads data with zero-copy</p><p>3.1: The data is loaded from disk to OS cache</p><p>3.2 OS cache directly copies the data to the network card via sendfile()</p><p>command</p><p>3.3 The network card sends data out to the consumer</p><p>Zero copy is a shortcut to save the multiple data copies between</p><p>application context and kernel context. This approach brings down the</p><p>time by approximately 65%.</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>61</p><p>SOAP vs REST vs GraphQL vs RPC.</p><p>The diagram below illustrates the API timeline and API styles</p><p>comparison.</p><p>Over time, different API architectural styles are released. Each of them</p><p>has its own patterns of standardizing data exchange.</p><p>You can check out the use cases of each style in the diagram.</p><p>Source: https://lnkd.in/gFgi33RY I combined a few diagrams together.</p><p>The credit all goes to AltexSoft.</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>62</p><p>How do modern browsers work?</p><p>Google published a series of articles about "Inside look at modern web</p><p>browser". It's a great read.</p><p>Links:</p><p>https://developer.chrome.com/blog/inside-browser-part1/</p><p>https://developer.chrome.com/blog/inside-browser-part2/</p><p>https://developer.chrome.com/blog/inside-browser-part3/</p><p>https://developer.chrome.com/blog/inside-browser-part4/</p><p>63</p><p>Redis vs Memcached</p><p>The diagram below illustrates the key differences.</p><p>The advantages on data structures make Redis a good choice for:</p><p>🔹 Recording the number of clicks and comments for each post (hash)</p><p>🔹 Sorting the commented user list and deduping the users (zset)</p><p>🔹 Caching user behavior history and filtering malicious behaviors</p><p>(zset, hash)</p><p>🔹 Storing boolean information of extremely large data into small</p><p>space. For example, login status, membership status. (bitmap)</p><p>64</p><p>Optimistic locking</p><p>Optimistic locking, also referred to as optimistic concurrency control,</p><p>allows multiple concurrent users to attempt to update the same</p><p>resource.</p><p>There are two common ways to implement optimistic locking: version</p><p>number and timestamp. Version number is generally considered to be</p><p>a better option because the server clock can be inaccurate over time.</p><p>We explain how optimistic locking works with version number.</p><p>The diagram below shows a successful case and a failure case.</p><p>1. A new column called “version” is added to the database table.</p><p>2. Before a user modifies a database row, the application reads the</p><p>version number of the row.</p><p>3. 
When the user updates the row, the application increases the</p><p>version number by 1 and writes it back to the database.</p><p>4. A database validation check is put in place; the next version number</p><p>should exceed the current version number by 1. The transaction aborts</p><p>if the validation fails and the user tries again from step 2.</p><p>65</p><p>Optimistic locking is usually faster than pessimistic locking because we</p><p>do not lock the database. However, the performance of optimistic</p><p>locking drops dramatically when concurrency is high.</p><p>To understand why, consider the case when many clients try to reserve</p><p>a hotel room at the same time. Because there is no limit on how many</p><p>clients can read the available room count, all of them read back the</p><p>same available room count and the current version number. When</p><p>different clients make reservations and write back the results to the</p><p>database, only one of them will succeed, and the rest of the clients</p><p>receive a version check failure message. These clients have to retry. In</p><p>the subsequent round of retries, there is only one successful client</p><p>again, and the rest have to retry. Although the end result is correct,</p><p>repeated retries cause a very unpleasant user experience.</p><p>Question: what are the possible ways of solving race conditions?</p><p>66</p><p>Tradeoff between latency and consistency</p><p>Understanding the 𝐭𝐫𝐚𝐝𝐞𝐨𝐟𝐟𝐬 is very important not only in system design</p><p>interviews but also designing real-world systems. When we talk about</p><p>data replication, there is a fundamental tradeoff between 𝐥𝐚𝐭𝐞𝐧𝐜𝐲 and</p><p>𝐜𝐨𝐧𝐬𝐢𝐬𝐭𝐞𝐧𝐜𝐲. It is illustrated by the diagram below.</p><p>67</p><p>Cache miss attack</p><p>Caching is awesome but it doesn’t come without a cost, just like many</p><p>things in life.</p><p>One of the issues is 𝐂𝐚𝐜𝐡𝐞 𝐌𝐢𝐬𝐬 𝐀𝐭𝐭𝐚𝐜𝐤. Correct me if this is not the</p><p>right term. It refers to the scenario where data to fetch doesn't exist in</p><p>the database and the data isn’t cached either. So every request hits</p><p>the database eventually, defeating the purpose of using a cache. If a</p><p>malicious user initiates lots of queries with such keys, the database</p><p>can easily be overloaded.</p><p>The diagram below illustrates the process.</p><p>Two approaches are commonly used to solve this problem:</p><p>68</p><p>🔹Cache keys with null value. Set a short TTL (Time to Live) for keys</p><p>with null value.</p><p>🔹Using Bloom filter. A Bloom filter is a data structure that can rapidly</p><p>tell us whether an element is present in a set or not. If the key exists,</p><p>the request first goes to the cache and then queries the database if</p><p>needed. If the key doesn't exist in the data set, it means the key</p><p>doesn’t exist in the cache/database. 
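A compact sketch that combines the two mitigations might look like the following. The cache, database, and Bloom filter objects are placeholders (the filter only needs a might_contain-style check, as in the Bloom filter sketch earlier in this document), and the 60-second negative TTL is an arbitrary choice:

NEGATIVE_TTL_SECONDS = 60          # short TTL for "key does not exist" entries
NOT_FOUND = object()               # sentinel cached for missing keys

def get_record(key, cache, db, seen_keys_bloom):
    # 1) Reject keys that were never written: a Bloom filter has no false
    #    negatives, so "not in filter" means the key cannot exist.
    if not seen_keys_bloom.might_contain(key):
        return None

    # 2) Serve from cache, including cached "not found" markers.
    value = cache.get(key)
    if value is NOT_FOUND:
        return None
    if value is not None:
        return value

    # 3) Fall through to the database, and cache a null marker on a miss so
    #    repeated lookups for the same missing key stop hitting the database.
    row = db.find(key)
    if row is None:
        cache.set(key, NOT_FOUND, ttl=NEGATIVE_TTL_SECONDS)
        return None
    cache.set(key, row)
    return row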
In this case, the query will not hit</p><p>the cache or database layer.</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>69</p><p>How to diagnose a mysterious process that’s taking too</p><p>much CPU, memory, IO, etc?</p><p>The diagram below illustrates helpful tools in a Linux system.</p><p>🔹‘vmstat’ - reports information about processes, memory, paging,</p><p>block IO, traps, and CPU activity.</p><p>🔹‘iostat’ - reports CPU and input/output statistics</p><p>of the system.</p><p>🔹‘netstat’ - displays statistical data related to IP, TCP, UDP, and ICMP</p><p>protocols.</p><p>🔹‘lsof’ - lists open files of the current system.</p><p>🔹‘pidstat’ - monitors the utilization of system resources by all or</p><p>specified processes, including CPU, memory, device IO, task</p><p>switching, threads, etc.</p><p>70</p><p>What are the top cache strategies?</p><p>Read data from the system:</p><p>🔹 Cache aside</p><p>🔹 Read through</p><p>Write data to the system:</p><p>🔹 Write around</p><p>🔹 Write back</p><p>🔹 Write through</p><p>The diagram below illustrates how those 5 strategies work. Some of</p><p>the caching strategies can be used together.</p><p>71</p><p>I left out a lot of details as that will make the post very long. Feel free to</p><p>leave a comment so we can learn from each other.</p><p>72</p><p>Question: What are the pros and cons of each caching strategy? How</p><p>to choose the right one to use?</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>73</p><p>Upload large files</p><p>How can we optimize performance when we 𝐮𝐩𝐥𝐨𝐚𝐝 𝐥𝐚𝐫𝐠𝐞 𝐟𝐢𝐥𝐞𝐬 to object</p><p>storage service such as S3?</p><p>Before we answer this question, let's take a look at why we need to</p><p>optimize this process. Some files might be larger than a few GBs. It is</p><p>possible to upload such a large object file directly, but it could take a</p><p>long time. If the network connection fails in the middle of the upload,</p><p>we have to start over. A better solution is to slice a large object into</p><p>smaller parts and upload them independently. After all the parts are</p><p>uploaded, the object store re-assembles the object from the parts. This</p><p>process is called 𝐦𝐮𝐥𝐭𝐢𝐩𝐚𝐫𝐭 𝐮𝐩𝐥𝐨𝐚𝐝.</p><p>The diagram below illustrates how multipart upload works:</p><p>74</p><p>1. The client calls the object storage to initiate a multipart upload.</p><p>2. The data store returns an uploadID, which uniquely identifies the</p><p>upload.</p><p>3. The client splits the large file into small objects and starts uploading.</p><p>Let’s assume the size of the file is 1.6GB and the client splits it into 8</p><p>parts, so each part is 200 MB in size. The client uploads the first part to</p><p>the data store together with the uploadID it received in step 2.</p><p>4. When a part is uploaded, the data store returns an ETag, which is</p><p>essentially the md5 checksum of that part. It is used to verify multipart</p><p>uploads.</p><p>5. After all parts are uploaded, the client sends a complete multipart</p><p>upload request, which includes the uploadID, part numbers, and</p><p>ETags.</p><p>6. The data store reassembles the object from its parts based on the</p><p>part number. Since the object is really large, this process may take a</p><p>few minutes. 
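The client-side half of this flow, using Amazon S3's multipart upload API through boto3 as an example, might look like the sketch below. The bucket, key, and the 200 MB part size are placeholders taken from the example above, and retries and parallelism are omitted for brevity:

import boto3

PART_SIZE = 200 * 1024 * 1024          # 200 MB parts, as in the example above

def multipart_upload(path, bucket, key):
    s3 = boto3.client("s3")
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = upload["UploadId"]

    parts = []
    part_number = 1
    with open(path, "rb") as f:
        while True:
            chunk = f.read(PART_SIZE)
            if not chunk:
                break
            resp = s3.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=part_number, Body=chunk,
            )
            # The returned ETag (an MD5-style checksum) is needed to complete the upload.
            parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
            part_number += 1

    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )

In practice the parts are usually uploaded in parallel, which is where most of the speedup and the resumability come from: a failed part can be retried on its own instead of restarting the whole file.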
After reassembly is complete, it returns a success</p><p>message to the client.</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>75</p><p>Why is Redis so Fast?</p><p>There are 3 main reasons as shown in the diagram below.</p><p>1. Redis is a RAM-based database. RAM access is at least 1000 times</p><p>faster than random disk access.</p><p>2. Redis leverages IO multiplexing and single-threaded execution loop</p><p>for execution efficiency.</p><p>3. Redis leverages several efficient lower-level data structures.</p><p>Question: Another popular in-memory store is Memcached. Do you</p><p>know the differences between Redis and Memcached?</p><p>You might have noticed the style of this diagram is different from my</p><p>previous posts. Please let me know which one you prefer.</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>76</p><p>SWIFT payment network</p><p>You probably heard about 𝐒𝐖𝐈𝐅𝐓. What is SWIFT? What role does it</p><p>play in cross-border payments? You can find answers to those</p><p>questions in this post.</p><p>The Society for Worldwide Interbank Financial Telecommunication</p><p>(SWIFT) is the main secure 𝐦𝐞𝐬𝐬𝐚𝐠𝐢𝐧𝐠 𝐬𝐲𝐬𝐭𝐞𝐦 that links the world’s</p><p>banks.</p><p>The Belgium-based system is run by its member banks and handles</p><p>millions of payment messages per day. The diagram below illustrates</p><p>how payment messages are transmitted from Bank A (in New York) to</p><p>Bank B (in London).</p><p>Step 1: Bank A sends a message with transfer details to Regional</p><p>Processor A in New York. The destination is Bank B.</p><p>77</p><p>Step 2: Regional processor validates the format and sends it to Slice</p><p>Processor A. The Regional Processor is responsible for input message</p><p>validation and output message queuing. The Slice Processor is</p><p>responsible for storing and routing messages safely.</p><p>Step 3: Slice Processor A stores the message.</p><p>Step 4: Slice Processor A informs Regional Processor A the message</p><p>is stored.</p><p>Step 5: Regional Processor A sends ACK/NAK to Bank A. ACK means</p><p>a message will be sent to Bank B. NAK means the message will NOT</p><p>be sent to Bank B.</p><p>Step 6: Slice Processor A sends the message to Regional Processor B</p><p>in London.</p><p>Step 7: Regional Processor B stores the message temporarily.</p><p>Step 8: Regional Processor B assigns a unique ID MON (Message</p><p>Output Number) to the message and sends it to Slice Processor B</p><p>Step 9: Slice Processor B validates MON.</p><p>Step 10: Slice Processor B authorizes Regional Processor B to send</p><p>the message to Bank B.</p><p>Step 11: Regional Processor B sends the message to Bank B.</p><p>Step 12: Bank B receives the message and stores it.</p><p>Step 13: Bank B sends UAK/UNK to Regional Processor B. UAK (user</p><p>positive acknowledgment) means Bank B received the message</p><p>without error; UNK (user negative acknowledgment) means Bank B</p><p>received checksum failure.</p><p>Step 14: Regional Processor B creates a report based on Bank B’s</p><p>response, and sends it to Slice Processor B.</p><p>78</p><p>Step 15: Slice Processor B stores the report.</p><p>Step 16 - 17: Slice Processor B sends a copy of the report to Slice</p><p>Processor A. 
Slice Processor A stores the report.</p><p>—</p><p>Check out our bestselling system design books.</p><p>Paperback: Amazon Digital: ByteByteGo.</p><p>79</p><p>At-most once, at-least once, and exactly once</p><p>In modern architecture, systems are broken up into small and</p><p>independent building blocks with well-defined interfaces between them.</p><p>Message queues provide communication and coordination for those</p><p>building blocks. Today, let’s discuss different delivery semantics:</p><p>at-most once, at-least once, and exactly once.</p><p>𝐀𝐭-𝐦𝐨𝐬𝐭 𝐨𝐧𝐜𝐞</p><p>As the name suggests, at-most once means a message will be</p><p>delivered not more than once. Messages may be lost but are not</p><p>redelivered. This is how at-most once delivery works at the high level.</p><p>Use cases: It is suitable for use cases like monitoring metrics, where a</p><p>small amount of data loss is acceptable.</p><p>𝐀𝐭-𝐥𝐞𝐚𝐬𝐭 𝐨𝐧𝐜𝐞</p><p>With this data delivery semantic, it’s acceptable to deliver a message</p><p>more than once, but no message should be lost.</p><p>Use cases: With at-least once, messages won’t be lost but the same</p><p>message might be delivered multiple times. While not ideal from a user</p><p>perspective, at-least once delivery semantics are usually good enough</p><p>for use cases where data duplication is not a big issue or deduplication</p><p>80</p><p>is possible on the consumer side. For example, with a unique key in</p><p>each message, a message can be rejected when writing duplicate data</p><p>to the database.</p><p>𝐄𝐱𝐚𝐜𝐭𝐥𝐲 𝐨𝐧𝐜𝐞</p><p>Exactly once is the most difficult delivery semantic to implement. It is</p><p>friendly to users, but it has a high cost for the system’s performance</p><p>and complexity.</p><p>Use cases: Financial-related use cases (payment, trading, accounting,</p><p>etc.). Exactly once is especially important when duplication is not</p><p>acceptable and the downstream service or third party doesn’t support</p><p>idempotency.</p><p>Question: what is the difference between message queues vs event</p><p>streaming platforms such as Kafka, Apache Pulsar, etc?</p><p>81</p><p>Vertical partitioning and Horizontal partitioning</p><p>In many large-scale applications, data is divided into partitions that can</p><p>be accessed separately. There are two typical strategies for partitioning</p><p>data.</p><p>🔹 Vertical partitioning: it means some columns are moved to new</p><p>tables. Each table contains the same number of rows but fewer</p><p>columns (see diagram below).</p><p>🔹 Horizontal partitioning (often called sharding): it divides a table into</p><p>multiple smaller tables. Each table is a separate data store,</p><p>and it</p><p>contains the same number of columns, but fewer rows (see diagram</p><p>below).</p><p>82</p><p>Horizontal partitioning is widely used so let’s take a closer look.</p><p>𝐑𝐨𝐮𝐭𝐢𝐧𝐠 𝐚𝐥𝐠𝐨𝐫𝐢𝐭𝐡𝐦</p><p>Routing algorithm decides which partition (shard) stores the data.</p><p>🔹 Range-based sharding. This algorithm uses ordered columns, such</p><p>as integers, longs, timestamps, to separate the rows. For example, the</p><p>diagram below uses the User ID column for range partition: User IDs 1</p><p>and 2 are in shard 1, User IDs 3 and 4 are in shard 2.</p><p>🔹 Hash-based sharding. This algorithm applies a hash function to one</p><p>column or several columns to decide which row goes to which table.</p><p>For example, the diagram below uses 𝐔𝐬𝐞𝐫 𝐈𝐃 𝐦𝐨𝐝 2 as a hash</p><p>function. 
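In code, that routing decision is a one-liner. The sketch below just makes the mapping from this example explicit, with shards numbered 1 and 2 as in the diagram:

def shard_for(user_id: int, num_shards: int = 2) -> int:
    """Hash-based sharding with user_id mod num_shards, numbering shards from 1."""
    remainder = user_id % num_shards
    return remainder if remainder != 0 else num_shards

for uid in (1, 2, 3, 4):
    print(uid, "->", "shard", shard_for(uid))   # 1 and 3 -> shard 1, 2 and 4 -> shard 2

One thing to keep in mind: with a plain modulus, changing the number of shards remaps most existing rows, so the routing scheme should be chosen with future resharding in mind.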
Vertical partitioning and Horizontal partitioning

In many large-scale applications, data is divided into partitions that can be accessed separately. There are two typical strategies for partitioning data.

🔹 Vertical partitioning: some columns are moved to new tables. Each table contains the same number of rows but fewer columns (see diagram below).

🔹 Horizontal partitioning (often called sharding): it divides a table into multiple smaller tables. Each table is a separate data store, and it contains the same number of columns, but fewer rows (see diagram below).

Horizontal partitioning is widely used, so let’s take a closer look.

𝐑𝐨𝐮𝐭𝐢𝐧𝐠 𝐚𝐥𝐠𝐨𝐫𝐢𝐭𝐡𝐦

The routing algorithm decides which partition (shard) stores the data. A short code sketch of both approaches follows this section.

🔹 Range-based sharding. This algorithm uses ordered columns, such as integers, longs, and timestamps, to separate the rows. For example, the diagram below uses the User ID column for range partitioning: User IDs 1 and 2 are in shard 1, and User IDs 3 and 4 are in shard 2.

🔹 Hash-based sharding. This algorithm applies a hash function to one column or several columns to decide which row goes to which table. For example, the diagram below uses 𝐔𝐬𝐞𝐫 𝐈𝐃 𝐦𝐨𝐝 2 as the hash function: User IDs 1 and 3 are in shard 1, and User IDs 2 and 4 are in shard 2.

𝐁𝐞𝐧𝐞𝐟𝐢𝐭𝐬

🔹 Facilitates horizontal scaling. Sharding makes it possible to add more machines to spread out the load.

🔹 Shortens response time. By sharding one table into multiple tables, queries go over fewer rows, and results are returned much more quickly.

𝐃𝐫𝐚𝐰𝐛𝐚𝐜𝐤𝐬

🔹 The ORDER BY operation is more complicated. Usually, we need to fetch data from different shards and sort the data in the application’s code.

🔹 Uneven distribution. Some shards may contain more data than others (this is also called the hotspot problem).

This topic is very big, and I’m sure I missed a lot of important details. What else do you think is important for data partitioning?
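Below is a minimal Python sketch of the two routing algorithms, using the same four User IDs and two shards as the example above. The shard count and the ID ranges are assumptions taken from that example, not a general-purpose router.

# Range-based vs hash-based shard routing (illustrative only).
NUM_SHARDS = 2

def route_by_range(user_id: int) -> int:
    # Range-based: User IDs 1-2 go to shard 1, User IDs 3-4 go to shard 2.
    return 1 if user_id <= 2 else 2

def route_by_hash(user_id: int) -> int:
    # Hash-based: User ID mod 2; odd IDs land in shard 1, even IDs in shard 2.
    return 1 if user_id % NUM_SHARDS == 1 else 2

for uid in [1, 2, 3, 4]:
    print(uid, route_by_range(uid), route_by_hash(uid))
# uid  range  hash
#  1     1     1
#  2     1     2
#  3     2     1
#  4     2     2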
CDN

A content delivery network (CDN) refers to geographically distributed servers (also called edge servers) that provide fast delivery of static and dynamic content. Let’s take a look at how it works.

Suppose Bob, who lives in New York, wants to visit an eCommerce website that is deployed in London. If the request goes to servers located in London, the response will be quite slow. So we deploy CDN servers close to where Bob lives, and the content will be loaded from the nearby CDN server.

The diagram below illustrates the process:

1. Bob types www.myshop.com into the browser. The browser looks up the domain name in the local DNS cache.

2. If the domain name does not exist in the local DNS cache, the browser goes to the DNS resolver to resolve the name. The DNS resolver usually sits in the Internet Service Provider (ISP).

3. The DNS resolver recursively resolves the domain name (see my previous post for details). Finally, it asks the authoritative name server to resolve the domain name.

4. If we don’t use a CDN, the authoritative name server returns the IP address for www.myshop.com. But with a CDN, the authoritative name server has an alias pointing to www.myshop.cdn.com (the domain name of the CDN server).

5. The DNS resolver asks the authoritative name server to resolve www.myshop.cdn.com.

6. The authoritative name server returns the domain name of the CDN load balancer, www.myshop.lb.com.

7. The DNS resolver asks the CDN load balancer to resolve www.myshop.lb.com. The load balancer chooses an optimal CDN edge server based on the user’s IP address, the user’s ISP, the content requested, and the server load.

8. The CDN load balancer returns the CDN edge server’s IP address for www.myshop.lb.com.

9. Now we finally have the actual IP address to visit. The DNS resolver returns the IP address to the browser.

10. The browser visits the CDN edge server to load the content. There are two types of content cached on the CDN servers: static content and dynamic content. The former includes static pages, pictures, and videos; the latter includes the results of edge computing.

11. If the edge CDN server cache doesn’t contain the content, it goes upward to the regional CDN server. If the content is still not found, it goes upward to the central CDN server, or even to the origin - the London web server. This is called the CDN distribution network, where the servers are deployed geographically.

Over to you: How do you prevent videos cached on CDN from being pirated?

Erasure coding

A really cool technique that’s commonly used in object storage such as S3 to improve durability is called 𝐄𝐫𝐚𝐬𝐮𝐫𝐞 𝐂𝐨𝐝𝐢𝐧𝐠. Let’s take a look at how it works.

Erasure coding deals with data durability differently from replication. It chunks data into smaller pieces (placed on different servers) and creates parities for redundancy. In the event of failures, we can use the chunk data and parities to reconstruct the data. Let’s take a look at a concrete example (4 + 2 erasure coding), as shown in Figure 1.

1⃣ Data is broken up into four even-sized data chunks d1, d2, d3, and d4.

2⃣ A mathematical formula is used to calculate the parities p1 and p2. To give a much simplified example, p1 = d1 + 2*d2 - d3 + 4*d4 and p2 = -d1 + 5*d2 + d3 - 3*d4.

3⃣ Data d3 and d4 are lost due to node crashes.

4⃣ The mathematical formula is used to reconstruct the lost data d3 and d4, using the known values of d1, d2, p1, and p2 (a short sketch of this reconstruction follows this section).

How much extra space does erasure coding need? For every two chunks of data, we need one parity block, so the storage overhead is 50% (Figure 2). In 3-copy replication, the storage overhead is 200% (Figure 2).

Does erasure coding increase data durability? Let’s assume a node has a 0.81% annual failure rate. According to the calculation done by Backblaze, erasure coding can achieve 11 nines of durability, while 3-copy replication achieves 6 nines of durability.

What other techniques do you think are important to improve the scalability and durability of an object store such as S3?
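Using the simplified parity formulas above, the two lost chunks can be recovered by solving the two parity equations for the two unknowns. Here is a short Python sketch; the chunk values are made up for illustration, and real systems use Reed-Solomon codes over finite fields rather than plain integer arithmetic.

# 4 + 2 erasure coding with the simplified parity formulas from the example.
def encode(d1, d2, d3, d4):
    p1 = d1 + 2*d2 - d3 + 4*d4
    p2 = -d1 + 5*d2 + d3 - 3*d4
    return p1, p2

def reconstruct_d3_d4(d1, d2, p1, p2):
    # Adding the two parity equations eliminates d1 and d3:
    #   p1 + p2 = 7*d2 + d4  ->  d4 = p1 + p2 - 7*d2
    # and d3 then follows from the p1 equation.
    d4 = p1 + p2 - 7*d2
    d3 = d1 + 2*d2 + 4*d4 - p1
    return d3, d4

d1, d2, d3, d4 = 3, 7, 5, 9               # four even-sized data chunks (toy values)
p1, p2 = encode(d1, d2, d3, d4)           # parities stored on two extra nodes
print(reconstruct_d3_d4(d1, d2, p1, p2))  # -> (5, 9), the lost chunks are recovered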
Foreign exchange in payment

Have you wondered what happens under the hood when you pay with USD online and the seller from Europe receives EUR (euros)? This process is called foreign exchange.

Suppose Bob (the buyer) needs to pay 100 USD to Alice (the seller), and Alice can only receive EUR. The diagram below illustrates the process.

1. Bob sends 100 USD via a third-party payment provider. In our example, it is Paypal. The money is transferred from Bob’s bank account (Bank B) to Paypal’s account in Bank P1.

2. Paypal needs to convert USD to EUR. It leverages the foreign exchange provider (Bank E). Paypal sends 100 USD to its USD account in Bank E.

3. 100 USD is sold to Bank E’s funding pool.

4. Bank E’s funding pool provides 88 EUR in exchange for 100 USD. The money is put into Paypal’s EUR account in Bank E.

5. Paypal’s EUR account in Bank P2 receives 88 EUR.

6. 88 EUR is paid to Alice’s EUR account in Bank A.

Now let’s take a closer look at the foreign exchange (forex) market. It has 3 layers:

🔹 Retail market. Funding pools are part of the retail market. To improve efficiency, Paypal usually buys a certain amount of foreign currency in advance.

🔹 Wholesale market. The wholesale business is composed of investment banks, commercial banks, and foreign exchange providers. It usually handles accumulated orders from the retail market.

🔹 Top-level participants. They are multinational commercial banks that hold a large number of certificates of deposit from different countries. They exchange these certificates for foreign exchange trading.

When Bank E’s funding pool needs more EUR, it goes upward to the wholesale market to sell USD and buy EUR. When the wholesale market accumulates enough orders, it goes upward to the top-level participants. Steps 3.1-3.3 and 4.1-4.3 explain how it works.

If you have any questions, please leave a comment. What foreign currency did you find difficult to exchange? And what company have you used for foreign currency exchange?

Interview Question: Design S3

What happens when you upload a file to Amazon S3? Let’s design an S3-like object storage system.

Before we dive into the design, let’s define some terms.

𝐁𝐮𝐜𝐤𝐞𝐭. A logical container for objects. The bucket name is globally unique. To upload data to S3, we must first create a bucket.

𝐎𝐛𝐣𝐞𝐜𝐭. An object is an individual piece of data we store in a bucket. It contains object data (also called payload) and metadata. Object data can be any sequence of bytes we want to store. The metadata is a set of name-value pairs that describe the object.

An S3 object consists of (Figure 1):

🔹 Metadata. It is mutable and contains attributes such as ID, bucket name, object name, etc.

🔹 Object data. It is immutable and contains the actual data.

In S3, an object resides in a bucket. The path looks like this: /bucket-to-share/script.txt. The bucket only has metadata. The object has metadata and the actual data.

The diagram below (Figure 2) illustrates how file uploading works. In this example, we first create a bucket named “bucket-to-share” and then upload a file named “script.txt” to the bucket.

1. The client sends an HTTP PUT request to create a bucket named “bucket-to-share”. The request is forwarded to the API service.

2. The API service calls Identity and Access Management (IAM) to ensure the user is authorized and has WRITE permission.

3. The API service calls the metadata store to create an entry with the bucket info in the metadata database. Once the entry is created, a success message is returned to the client.

4. After the bucket is created, the client sends an HTTP PUT request to create an object named “script.txt”.

5. The API service verifies the user’s identity and ensures the user has WRITE permission on the bucket.

6. Once validation succeeds, the API service sends the object data in the HTTP PUT payload to the data store. The data store persists the payload as an object and returns the UUID of the object.

7. The API service calls the metadata store to create a new entry in the metadata database. It contains important metadata such as the object_id (UUID), bucket_id (which bucket the object belongs to), object_name, etc.
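Here is a minimal Python sketch of the object upload path (steps 4-7), with in-memory stand-ins for the IAM check, the data store, and the metadata store. All names and fields are illustrative assumptions, not S3’s real internals.

# Object upload path (steps 4-7) with in-memory stand-ins (illustrative only).
import uuid

data_store = {}     # object_id -> payload bytes (immutable object data)
metadata_db = []    # rows of {object_id, bucket_id, object_name}

def has_write_permission(user: str, bucket_id: str) -> bool:
    return True     # stand-in for the IAM authorization check (steps 2 and 5)

def put_object(user: str, bucket_id: str, object_name: str, payload: bytes) -> str:
    if not has_write_permission(user, bucket_id):
        raise PermissionError("WRITE permission denied")
    object_id = str(uuid.uuid4())
    data_store[object_id] = payload        # step 6: data store persists the payload
    metadata_db.append({                   # step 7: metadata entry for the object
        "object_id": object_id,
        "bucket_id": bucket_id,
        "object_name": object_name,
    })
    return object_id

oid = put_object("bob", "bucket-to-share", "script.txt", b"echo hello")
print(oid, metadata_db[0]["object_name"])  # -> <uuid> script.txt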
Block storage, file storage and object storage

Yesterday, I posted the definitions of block storage, file storage, and object storage. Let’s continue the discussion and compare those 3 options.

Block storage, file storage and object storage

In this post, let’s review storage systems in general. Storage systems fall into three broad categories:

🔹 Block storage
🔹 File storage
🔹 Object storage

The diagram below illustrates the comparison of different storage systems.

𝐁𝐥𝐨𝐜𝐤 𝐬𝐭𝐨𝐫𝐚𝐠𝐞

Block storage came first, in the 1960s. Common storage devices like hard disk drives (HDD) and solid-state drives (SSD) that are physically attached to servers are all considered block storage.

Block storage presents the raw blocks to the server as a volume. This is the most flexible and versatile form of storage. The server can format the raw blocks and use them as a file system, or it can hand control of those blocks to an application. Some applications, like a database or a virtual machine engine, manage these blocks directly in order to squeeze every drop of performance out of them.

Block storage is not limited to physically attached storage. It can be connected to a server over a high-speed network or over industry-standard connectivity protocols like Fibre Channel (FC) and iSCSI. Conceptually, network-attached block storage still presents raw blocks. To the servers, it works the same as physically attached block storage. Whether attached over a network or physically, block storage is fully owned by a single server. It is not a shared resource.

𝐅𝐢𝐥𝐞 𝐬𝐭𝐨𝐫𝐚𝐠𝐞

File storage is built on top of block storage. It provides a higher-level abstraction to make it easier to handle files and directories. Data is stored as files under a hierarchical directory structure. File storage is the most common general-purpose storage solution. File storage can be made accessible to a large number of servers using common file-level network protocols like SMB/CIFS and NFS. The servers accessing file storage do not need to deal with the complexity of managing the blocks, formatting volumes, etc. The simplicity of file storage makes it a great solution for sharing a large number of files and folders within an organization.

𝐎𝐛𝐣𝐞𝐜𝐭 𝐬𝐭𝐨𝐫𝐚𝐠𝐞

Object storage is new. It makes a very deliberate tradeoff to sacrifice performance for high durability, vast scale, and low cost. It targets relatively “cold” data and is mainly used for archival and backup. Object storage stores all data as objects in a flat structure. There is no hierarchical directory structure. Data access is normally provided via a RESTful API. It is relatively slow compared to other storage types. Most public cloud service providers have an object storage offering, such as AWS S3, Google Cloud Storage, and Azure Blob Storage.

Domain Name System (DNS) lookup

DNS acts as an address book. It translates human-readable domain names (google.com) into machine-readable IP addresses (142.251.46.238).

To achieve better scalability, the DNS servers are organized in a hierarchical tree structure. There are 3 basic levels of DNS servers:

1. Root name server (.). It stores the IP addresses of Top Level Domain (TLD) name servers. There are 13 logical root name servers globally.

2. TLD name server. It stores the IP addresses of authoritative name servers. There are several types of TLD names, for example, generic TLDs (.com, .org), country code TLDs (.us), and test TLDs (.test).

3. Authoritative name server. It provides actual answers to the DNS query. You can register authoritative name servers with a domain name registrar such as GoDaddy, Namecheap, etc.

The diagram below illustrates how DNS lookup works under the hood:

1. google.com is typed into the browser, and the browser sends the domain name to the DNS resolver.

2. The resolver queries a DNS root name server.

3. The root server responds to the resolver with the address of a TLD DNS server. In this case, it is .com.

4. The resolver then makes a request to the .com TLD server.

5. The TLD server responds with the IP address of the domain’s name server, google.com (the authoritative name server).

6. The DNS resolver sends a query to the domain’s name server.

7. The IP address for google.com is then returned to the resolver from the name server.

8. The DNS resolver responds to the web browser with the IP address (142.251.46.238) of the domain requested initially.

DNS lookups take between 20 and 120 milliseconds on average to complete (according to YSlow).
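To make the delegation chain in steps 2-7 concrete, here is a toy Python model of iterative resolution. The hard-coded tables stand in for the root, TLD, and authoritative name servers; the server names are made up, and only the final IP address comes from the example above.

# Toy model of iterative DNS resolution (steps 2-7). No real network calls are made;
# the dictionaries below stand in for root, TLD, and authoritative name servers.
ROOT = {"com": "tld-server-for-com"}
TLD_SERVERS = {"tld-server-for-com": {"google.com": "ns1.google.example"}}
AUTHORITATIVE = {"ns1.google.example": {"google.com": "142.251.46.238"}}

def resolve(domain: str) -> str:
    tld = domain.rsplit(".", 1)[-1]                # "com"
    tld_server = ROOT[tld]                         # step 3: root returns the TLD server
    auth_server = TLD_SERVERS[tld_server][domain]  # step 5: TLD returns the authoritative server
    return AUTHORITATIVE[auth_server][domain]      # step 7: authoritative server returns the IP

print(resolve("google.com"))   # -> 142.251.46.238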
What happens when you type a URL into your browser?

The diagram below illustrates the steps.

1. Bob enters a URL into the browser and hits Enter. In this example, the URL is composed of 4 parts:

🔹 scheme - 𝒉𝒕𝒕𝒑𝒔://. This tells the browser to connect to the server using HTTPS.
🔹 domain - 𝒆𝒙𝒂𝒎𝒑𝒍𝒆.𝒄𝒐𝒎. This is the domain name of the site.
🔹 path - 𝒑𝒓𝒐𝒅𝒖𝒄𝒕/𝒆𝒍𝒆𝒄𝒕𝒓𝒊𝒄. It is the path on the server to the requested resource: phone.
🔹 resource - 𝒑𝒉𝒐𝒏𝒆. It is the name of the resource Bob wants to visit.

2. The browser looks up the IP address for the domain with a Domain Name System (DNS) lookup. To make the lookup process fast, data is cached at different layers: the browser cache, OS cache, local network cache, and ISP cache.

2.1 If the IP address cannot be found in any of the caches, the browser goes to DNS servers to do a recursive DNS lookup until the IP address is found (this will be covered in another post).

3. Now that we have the IP address of the server, the browser establishes a TCP connection with the server.

4. The browser sends an HTTP request to the server. The request looks like this:

𝘎𝘌𝘛 /𝘱𝘩𝘰𝘯𝘦 𝘏𝘛𝘛𝘗/1.1
𝘏𝘰𝘴𝘵: 𝘦𝘹𝘢𝘮𝘱𝘭𝘦.𝘤𝘰𝘮

5. The server processes the request and sends back the response. For a successful response, the status code is 200. The HTML response might look like this:

𝘏𝘛𝘛𝘗/1.1 200 𝘖𝘒
𝘋𝘢𝘵𝘦: 𝘚𝘶𝘯, 30 𝘑𝘢𝘯 2022 00:01:01 𝘎𝘔𝘛
𝘚𝘦𝘳𝘷𝘦𝘳: 𝘈𝘱𝘢𝘤𝘩𝘦
𝘊𝘰𝘯𝘵𝘦𝘯𝘵-𝘛𝘺𝘱𝘦: 𝘵𝘦𝘹𝘵/𝘩𝘵𝘮𝘭; 𝘤𝘩𝘢𝘳𝘴𝘦𝘵=𝘶𝘵𝘧-8

𝘏𝘦𝘭𝘭𝘰 𝘸𝘰𝘳𝘭𝘥

6. The browser renders the HTML content.

AI Coding engine