AWS Big Data Specialty Exam Tips and Tricks

05 Dec 2018

If you’re planning on taking the AWS Big Data Specialty exam, I’ve compiled a quick list of tips that you may want to remember headed into the exam.

I passed the exam on December 6, 2018 with a score of 76%. In my opinion, this exam is more difficult than the AWS Solutions Architect Pro!

  • You really, really need to understand Redshift distribution strategies. Here are some things to remember:
    • Automatic Distribution: the default option; Redshift manages the distribution strategy for you, shifting from an initial ALL strategy (for smaller tables) to EVEN distribution (for larger tables). Note: Redshift will not automatically switch back from EVEN to ALL.
    • Even Distribution: With EVEN distribution, the leader node distributes rows across the slices in a round-robin fashion. This is appropriate for tables that do not participate in joins.
    • Key Distribution: With the KEY distribution, rows are distributed according to a selected column. Tables that share common join keys are physically co-located for performance.
    • All Distribution: A copy of the entire table is stored on each node. This multiplies storage requirements and slows down loading, inserting, and updating. This distribution style is only appropriate for small, slow-moving (rarely updated) tables.
  • You need to know the DynamoDB partition sizing formula by heart: (Desired RCU / 3,000 RCU) + (Desired WCU / 1,000 WCU) = # of partitions needed (a worked example follows this list).
  • AWS Machine Learning does not support unsupervised learning - you will need Apache Spark and its MLlib library for real-time anomaly detection.
  • AWS IoT accepts four forms of identity verification: X.509 certificates, IAM users/roles, Cognito identities, and Federated identities. “Typically, AWS IoT devices use X.509 certificates, while mobile applications use Amazon Cognito identities. Web and desktop applications use IAM or federated identities. CLI commands use IAM.”
  • In the context of evaluating a Redshift query plan, DS_DIST_NONE and DS_DIST_ALL_NONE are good. They indicate that no distribution was required for that step because all of the joins are co-located.
  • DS_DIST_INNER means that the step will probably have a relatively high cost because the inner table is being redistributed to the nodes. DS_DIST_ALL_INNER, DS_BCAST_INNER and DS_DIST_BOTH are not good. (Source)
  • You must disable cross-region snapshots for Redshift before changing the encryption type of the cluster.
  • Amazon recommends allocating three dedicated master nodes for each production Elasticsearch domain.
  • Read up on DAX and DynamoDB: in most cases, DynamoDB response times can be measured in single-digit milliseconds. However, for use cases that require response times in microseconds, DynamoDB Accelerator (DAX) delivers fast response times for accessing eventually consistent data.
  • After a Redshift load operation is complete, query the STL_LOAD_COMMITS table to verify that the expected files were loaded.
  • Mahout is a machine learning library with tools for clustering, classification, and several types of recommenders, including tools to calculate most-similar items or build item recommendations for users. Use it to carry out Machine Learning work on top of Hadoop.
  • When preparing to use a Lambda/Kinesis combination, make sure to optimize your Lambda memory and batch size, and adjust the number of shards used by the Kinesis streams.
  • Triggers do not exist in Redshift.
  • Amazon Kinesis Aggregators is a Java framework that enables the automatic creation of real-time aggregated time series data from Kinesis streams. (Source)
  • You can’t encrypt an existing DynamoDB table. You need to create a new, encrypted table and transfer your data over. (Source)
  • Presto is a fast SQL query engine designed for interactive analytic queries over large datasets from multiple sources. (Source)
  • Redshift creates the following log types:
    • Connection log — logs authentication attempts, connections, and disconnections.
    • User log — logs information about changes to database user definitions.
    • User activity log — logs each query before it is run on the database.
  • Kinesis Data Firehose can send records to S3, Redshift, or Elasticsearch. It cannot send records to DynamoDB. (Source)
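
To make the DynamoDB partition formula above concrete, here is a minimal sketch of the calculation. The helper name and the sample throughput numbers are mine, purely for illustration - the exam just expects you to apply the formula by hand.

  // Hypothetical helper: estimates the number of partitions DynamoDB needs
  // to serve a given provisioned throughput, using the rule of thumb above.
  function estimatePartitionsForThroughput(desiredRcu, desiredWcu) {
    return Math.ceil(desiredRcu / 3000 + desiredWcu / 1000)
  }

  // Example: 7,500 RCU and 2,500 WCU -> ceil(2.5 + 2.5) = 5 partitions
  console.log(estimatePartitionsForThroughput(7500, 2500)) // 5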

(Reference architecture: a serverless solution to ingest, query, and visualize Big Data using Kinesis, Glue, Athena, and QuickSight.)
  • Use Spark for general purpose Amazon EMR operations, use Presto for interactive queries, and use Hive for batch operations.
  • Use Athena generally to query existing data in S3. You should be aware that Redshift Spectrum exists, and that it can query data in S3.
  • If a question is asking how to handle joins or manipulations on millions of rows in DynamoDB, there’s a good chance that EMR with Hive is the answer.
  • When using Spark, you should aim for a memory-optimized instance type.
  • To improve query performance and reduce cost, AWS recommends partitioning data used for Athena, and storing your data in Apache Parquet or ORC form - not .csv!
  • Use the COPY command to transfer data from DynamoDB to Redshift. Use UNLOAD to transfer the results of a query from Redshift to S3.
  • Redshift clusters have two types of nodes: Leader nodes and Compute nodes. If a Leader node goes down, the cluster health will suffer.
  • Not to be confused with EMR, which uses Master, Core, and Task nodes. EMR stores log files on the Master node by default.
  • Use Elasticsearch to analyze data stream updates from other services, such as Kinesis Streams and DynamoDB.
  • The AWS Schema Conversion Tool is sometimes referred to as SCT.
  • Review EMR security and encryption.

Training Materials I Used

Whitepapers I Read


AWS DevOps Engineer Professional Exam Tips and Tricks

30 Nov 2018

If you’re planning on taking the AWS DevOps Engineer Professional exam, I’ve compiled a quick list of tips that you may want to remember headed into the exam.

I passed the exam on November 30th, 2018. Before taking this exam, I had already achieved my SA Pro. There is a fair amount of overlap between the two tests - I would highly recommend taking the SA Pro before this exam. I scored an 85% on the exam.

This exam is fairly difficult, although I would argue it’s slightly easier than the Solutions Architect Professional exam. There is a greater focus on implementation and debugging in the DevOps exam - the SA Pro exam is more about breadth of knowledge across AWS services.

  1. Remember to solve exactly what the question is asking. If they want performance, give them read replicas, sharding, PIOPS, etc. If they want elasticity and high availability, start using Multi-AZ, cross-region replication, Auto Scaling groups, ELBs, etc. If they are worried about cost, remember the S3 storage classes, and use Auto Scaling groups to scale down unnecessary instances.
  2. The basic stages of a Continuous Integration pipeline are Source Control -> Build -> Staging -> Production.
  3. ELB can be configured to publish logs at a 5 or 60 minute interval. Access logs are disabled by default.
  4. When an instance is launched by Auto Scaling, the instance can be in the pending state for up to 60 minutes by default.
  5. CloudWatch Logs can be streamed to Kinesis, Lambda, or Elasticsearch for further processing.
  6. The fastest deployment times in Elastic Beanstalk are achieved using the All At Once deployment method.
  7. Remember this OpsWorks trivia: “For Linux stacks, if you want to associate an Amazon RDS service layer with your app, you must add the appropriate driver package to the associated app server layer”
  8. Generally, if you see Docker, think Elastic Beanstalk!
  9. AWS OpsWorks Stacks agents communicate regularly. If an agent does not communicate with the service for more than approximately five minutes, AWS OpsWorks Stacks considers the instance to have failed.
  10. You can configure an OpsWorks stack to accept custom cookbooks after creation. Just enable the option.
  11. Speaking of custom cookbooks, remember that the custom cookbook option is enabled at the stack level, not the layer level.
  12. The CodeDeploy service relies on an appspec.yml file included with your source code binaries on an EC2/On-Premises Compute Platform.
  13. A nested stack is a stack that you create within another stack by using the AWS::CloudFormation::Stack resource.
  14. When attempting to secure data in transit through an ELB, remember you can use either an HTTPS or SSL listener.
  15. AWS CodeDeploy can deploy application content stored in Amazon S3 buckets, GitHub repositories, or Bitbucket repositories. Note: Subversion is not supported.
  16. CodeDeploy offers two deployment types: In-place deployment and Blue-Green deployment.
  17. A Dockerrun.aws.json file is an Elastic Beanstalk–specific JSON file that describes how to deploy a set of Docker containers as an Elastic Beanstalk application.
  18. This exam is extremely tough to parse at first glance. There are lots of grammatical twists and turns that are not present in typical exams. You may need to check your proposed architecture against the question 4-5 times to ensure that you are solving the right problem.
  19. Remember to use the NoEcho property in CloudFormation templates to prevent a describe command from listing sensitive values, such as passwords.
  20. You really need to understand the lifecycle of an EC2 instance (pictured below) being brought into service.
EC2 Auto Scaling Lifecycle

Training Materials I Used

Whitepapers I Read


AWS Solutions Architect Professional Exam Tips and Tricks

26 Nov 2018

If you’re planning on taking the AWS Solutions Architect Professional exam, I’ve compiled a quick list of tips that you may want to remember headed into the exam.

I passed the exam on November 24th, 2018. Before taking this exam, I held all three Associate certifications and the Security Specialty certification. I passed with an 80% score, and it took 69 minutes.

This exam is very difficult - on par with the Security Specialty exam! You will not accidentally pass this exam. :)

STS and Identity Broker Diagram
  1. You need to know your STS use cases inside and out. Remember that you generally need to develop an identity broker: the user authenticates against LDAP via the broker first, and only then does the broker request temporary credentials from STS. Your application will not authenticate directly against STS first! (A minimal broker sketch follows this list.)
  2. Understand how cross-account access, and granting that access, works.
  3. Remember to turn on the CloudTrail “global services” option in order to track IAM usage.
  4. Prefer CloudFormation for version-controlled infrastructure configuration.
  5. You need to understand what BGP does (and what it doesn’t do) in relation to Direct Connect. The ACloudGuru course did not cover eBGP or weighting policies very heavily, so you will need to do additional research. I recommend watching this video before taking the exam.
  6. Use SQS queues placed in front of RDS/DynamoDB to reduce the load on your databases!
  7. Use Kinesis for large amounts of incoming data, especially when it’s coming from multiple sources.
  8. Any time you see the words “mobile app”, “social networking login”, “Login with Amazon”, or “Facebook” on the exam, you should immediately be able to narrow down the answers - hint: the correct answer will probably involve Web Identity Federation.
  9. Play around with this CIDR calculator if you have trouble understanding VPC/subnet CIDR ranges and possibilities. Always remember that AWS reserves five IP addresses in each subnet for their own use (the first four IPs and the last IP) - for example, a /24 subnet has 256 addresses, of which 251 are usable.
  10. There are usually two blatantly incorrect answers, and two answers that could be right. Narrow down your choices.
  11. Understand the various best practices for encrypting data at rest and in transit.
  12. You need to understand public VIFs vs. private VIFs, and which services use which type of VIF.
  13. You need to understand ELBs, Auto Scaling, and how to architect a scalable architecture. By this point in your AWS studies (assuming you already have an Associate certificate), you should have this pretty down pat. I recommend watching this video for a cool overview of setting up low latency multiplayer servers globally.
  14. Even though Data Pipeline has been pretty much superseded by Lambda, it features fairly heavily on the exam. Definitely watch the ACloudGuru videos on Data Pipeline (and follow the labs) before taking the exam.
  15. AWS Connector for vCenter shows up in one or two questions, and is worth knowing about.
  16. You need to know the different instance types and which usage scenarios are appropriate for them. Remember that if you receive a capacity error when resizing instances in a placement group, you just need to stop and restart the instances in the placement group - this will migrate the group to a new physical cluster with enough capacity for the resized instances.
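
To make the identity broker flow from item 1 concrete, here is a minimal Node.js sketch of the pattern. The authenticateAgainstLdap helper and the sample policy are made up for illustration; only the STS call itself is a real AWS SDK method.

  const AWS = require('aws-sdk')

  const sts = new AWS.STS()

  async function brokerLogin(username, password) {
    // 1. The broker authenticates the user against the corporate directory first.
    //    (authenticateAgainstLdap is a hypothetical helper - the app never talks to STS before this step.)
    await authenticateAgainstLdap(username, password)

    // 2. Only after a successful LDAP login does the broker ask STS for temporary credentials.
    const { Credentials } = await sts.getFederationToken({
      Name: username,
      DurationSeconds: 3600,
      Policy: JSON.stringify({
        Version: '2012-10-17',
        Statement: [{ Effect: 'Allow', Action: 's3:GetObject', Resource: '*' }],
      }),
    }).promise()

    // 3. The temporary AccessKeyId / SecretAccessKey / SessionToken go back to the application.
    return Credentials
  }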

Training Materials I Used

Videos I Watched (in order of importance)

Whitepapers I Read


React Conf '18 - I'm Hooked

28 Oct 2018

I am flying home from React Conf 2018, hosted at the Westin Hotel in Lake Las Vegas. Some of the key takeaways of this year’s conference:

Hooks are coming

The conference started out strong with Dan Abramov demonstrating new React hooks to the crowd. I heard many, many audible gasps from the crowd as Dan walked us through using hooks to replace traditional React classes. I had the feeling that I was watching a new version of React being demoed right before our eyes. Although the API is unstable and experimental at this point (it’s just an RFC right now), it is immediately clear that this is the future of React. I am very grateful I was there to hear the announcement in person.

So, what are hooks going to do for you? Well, first of all, you’re going to be writing a lot less boilerplate class code. React 16.7 (alpha) allows us to use functions rather than classes as our components.

State management will be handled by React’s useState (docs) function. I will say that the ergonomics of useState felt a little strange to me (you have to call it sequentially; ordering matters). I think that for complex components that require lots of state updates, I would still write traditional classes. This useState pattern seems best suited for simpler components that only hold a couple of values in state.
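
Here is a minimal sketch of what that looks like - a hypothetical counter component, written against the hooks RFC as it was presented:

  import React, { useState } from 'react'

  export default function Counter() {
    // useState returns the current value and a setter function.
    // Hooks must be called in the same order on every render, which is the
    // ordering constraint mentioned above.
    const [count, setCount] = useState(0)

    return (
      <button onClick={() => setCount(count + 1)}>
        Clicked {count} times
      </button>
    )
  }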

Instead of manually tracking side effects of components using componentDidMount and componentWillUnmount, we can use (no pun intended) the new useEffect (docs) function that React provides in 16.7 alpha. This is probably the most promising feature I saw from Dan Abramov’s presentation.

Rather than segmenting application logic throughout these various lifecycle methods:

  componentDidMount(){
    // Do something here because we mounted our component
    add_event_listener('foo_event')
  }
  componentWillUnmount() {
    // Make sure we unload whatever we did when we mounted!
    remove_event_listener('foo_event')
  }

We will now be able to import the useEffect function and use it like this:

import { useEffect } from 'react'

export default function () {
    useEffect(() => {
        // Do a thing here.. maybe add an event listener
        add_event_listener('foo_event')

        // If we want the event listener to be removed when our component is unmounted,
        // we just need to return a cleanup function containing the actions we want performed
        return () => {
            remove_event_listener('foo_event')
        }
    })

    return null // this example renders nothing; it just wires up the event listener
}

That’s really it. It may seem a bit magical (it certainly feels like it), but React will now perform the same duties that it used to, with all of the logic nicely grouped together! This makes writing and debugging side effects much easier.

One more big change.

One of the worst parts of React was dealing with boilerplate code, particularly when it comes to shouldComponentUpdate. shouldComponentUpdate is rarely used for much more than checking whether prevProps differ from nextProps based on some criteria.

This is some very typical shouldComponentUpdate boilerplate that I’m sure you’re familiar with:

    shouldComponentUpdate(nextProps) {
        if (this.props.id !== nextProps.id) {
            return true
        }
        return false
    }

We are just checking if the incoming props have a different id than the props we already have. This is a fairly standard check for a lot of React components. What if React could just diff our props for us, and only update when necessary?

// Example taken from https://reactjs.org/docs/hooks-reference.html#conditionally-firing-an-effect
useEffect(
  () => {
    const subscription = props.source.subscribe();
    return () => {
      subscription.unsubscribe();
    };
  },
  [props.source], // Only re-run the effect if props.source changes
);

Did I mention that when you use function components, there is no more binding to this? That alone is a compelling reason to begin exploring using React Hooks.
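
To illustrate, here is a quick before-and-after sketch (a made-up toggle component, purely for illustration):

  import React, { useState } from 'react'

  // Class version: the handler has to be explicitly bound (or written as a class field)
  class ToggleButton extends React.Component {
    constructor(props) {
      super(props)
      this.state = { on: false }
      this.handleClick = this.handleClick.bind(this) // easy to forget
    }

    handleClick() {
      this.setState(prev => ({ on: !prev.on }))
    }

    render() {
      return <button onClick={this.handleClick}>{this.state.on ? 'On' : 'Off'}</button>
    }
  }

  // Hook version: there is no `this` to bind - the handler just closes over local state
  function Toggle() {
    const [on, setOn] = useState(false)
    return <button onClick={() => setOn(!on)}>{on ? 'On' : 'Off'}</button>
  }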

React Native isn’t there… yet

I really enjoyed James Long’s talk “Go Ahead, Block the Main Thread”, where he argued against a lot of common wisdom regarding JavaScript. James talked about the viral impact of async functions - once a single function is async, everything in the codebase eventually follows suit. That’s never personally been a problem for me (I greatly enjoy using the asynchronous features of JS).

James argued that the asynchronous nature of React Native’s interaction with native APIs was harming UX. He showed some compelling examples of janky scrolling behavior that occurs when React Native’s asynchronous processes fall behind the screen render.

His solution: Block the main thread. Don’t let other tasks interrupt crucial animation rendering. What’s the best way to do that? Get rid of async, and allow synchronous flow between native APIs and React Native.

GraphQL is in vogue

Speaking of talks I enjoyed, I greatly enjoyed Conor Hasting’s talk about GraphQL.

In a typical REST API setup, a consumer requests data from an endpoint. The consumer has little to no control over what is delivered to them. To use Conor’s analogy, it’s like calling a pizza parlor for pizza delivery, and since they don’t know what toppings you like, they put 40 different toppings on the pizza, deliver it to your house, and tell you to pick off the ingredients you don’t like.

When you’re working on a front-end application and constantly consuming APIs for varying amounts of data, this can get exhausting. Want to get only the id and timestamp of a given set of rows? Too f’ing bad. Now your front-end application is stuck having to munge, filter, and extract data, even though we know exactly what we want. It’s like calling the pizza parlor, asking for pepperoni, and getting 40 toppings.

GraphQL seeks to enforce the concept of getting only what you need, when you need it. This concept is not limited to any sort of technology stack or implementation - it is simply (in my eyes) a philosophy of API design. With GraphQL, your frontend can intelligently query the API for only the data it wants.
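
As a rough sketch of what that looks like from the front end (the endpoint, query name, and fields here are made up for illustration):

  // Ask the (hypothetical) GraphQL endpoint for exactly the two fields we care about
  const query = `
    query RecentOrders {
      orders(last: 20) {
        id
        timestamp
      }
    }
  `

  fetch('/graphql', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query }),
  })
    .then(response => response.json())
    .then(({ data }) => console.log(data.orders)) // only id and timestamp come back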

This saves time in two huge ways:

  1. Less data over the wire. Your API is no longer attempting to cram unnecessary information into a response.
  2. Less processing/filtering by the front-end. Your front-end doesn’t really need to know or care about how the API works. It just wants some data.

Good Captioning

As someone who has a hard time hearing, I really, really appreciated the real-time captions provided by the conference. They were incredibly precise, accurate, and they made my conference experience a lot better. I am used to only hearing 50-60% of a speaker’s talk, and I really loved being able to look to the caption monitors and follow along.