Automatic Distribution: The default option. Redshift automatically manages your distribution strategy for you, starting with an ALL strategy for smaller tables and shifting to EVEN distribution as the table grows. Note: Redshift will not automatically switch back from EVEN to ALL.
Even Distribution: With the EVEN distribution, the leader node distributes rows equally across all slices. This is appropriate for tables that do not participate in joins.
Key Distribution: With the KEY distribution, rows are distributed according to a selected column. Tables that share common join keys are physically co-located for performance.
All Distribution: A copy of the entire table is stored on every node. This multiplies storage requirements and slows down loading, inserting, and updating. This distribution method is only appropriate for small or rarely-updated data sets.
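To make the four styles concrete, a minimal sketch of the corresponding DDL, issued here through the Redshift Data API with boto3 (cluster, database, and user names are placeholders, not part of the notes above):

```python
import boto3

rsd = boto3.client("redshift-data")

# Each CREATE TABLE picks a distribution style explicitly;
# omitting DISTSTYLE falls back to AUTO, the default described above.
for ddl in [
    "CREATE TABLE t_auto (id INT) DISTSTYLE AUTO;",
    "CREATE TABLE t_even (id INT) DISTSTYLE EVEN;",
    "CREATE TABLE t_key  (id INT) DISTSTYLE KEY DISTKEY (id);",
    "CREATE TABLE t_all  (id INT) DISTSTYLE ALL;",
]:
    rsd.execute_statement(
        ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser", Sql=ddl
    )
```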
You need to know the DynamoDB partition sizing formula by heart: (Desired RCU / 3,000 RCU) + (Desired WCU / 1,000 WCU) = # of partitions needed, rounded up.
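As a quick sanity check, a minimal Python sketch of that formula (the per-partition limits of 3,000 RCU and 1,000 WCU come from the note above; the function name is my own):

```python
import math

def partitions_for_throughput(rcu: int, wcu: int) -> int:
    """Estimate DynamoDB partitions needed for the desired throughput:
    each partition supports up to 3,000 RCU and 1,000 WCU."""
    return math.ceil(rcu / 3000 + wcu / 1000)

# Example: 7,500 RCU and 2,000 WCU -> ceil(2.5 + 2.0) = 5 partitions.
print(partitions_for_throughput(7500, 2000))  # 5
```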
AWS Machine Learning does not support unsupervised learning - you will need Apache Spark or Spark MLlib for real-time anomaly detection.
In the context of evaluating a Redshift query plan, DS_DIST_NONE and DS_DIST_ALL_NONE are good. They indicate that no distribution was required for that step because all of the joins are co-located.
DS_DIST_INNER means that the step will probably have a relatively high cost because the inner table is being redistributed to the nodes. DS_DIST_ALL_INNER, DS_BCAST_INNER, and DS_DIST_BOTH are not good. (Source)
You must disable cross-region snapshots for Redshift before changing the encryption type of the cluster.
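A hedged boto3 sketch of that ordering (cluster identifier and KMS key are placeholders):

```python
import boto3

redshift = boto3.client("redshift")

# 1. Turn off cross-region snapshot copy first...
redshift.disable_snapshot_copy(ClusterIdentifier="my-cluster")

# 2. ...then change the encryption setting on the cluster.
redshift.modify_cluster(
    ClusterIdentifier="my-cluster",
    Encrypted=True,
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/placeholder",
)
```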
Amazon recommends allocating three dedicated master nodes for each production Elasticsearch domain.
DynamoDB: Remember to use “write sharding” to allow writes to be distributed evenly across partitions. There are two methods for this: Random Suffixes and Calculated Suffixes.
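A minimal sketch of the calculated-suffix method, assuming a hot date-based partition key: hashing a stable attribute (here, a hypothetical order ID) onto a fixed number of suffixes spreads writes across partitions while keeping each item's shard deterministic. Table and attribute names are made up for illustration:

```python
import hashlib

NUM_SHARDS = 10  # fixed number of suffixes to spread writes across

def sharded_partition_key(base_key: str, shard_attr: str) -> str:
    """Calculated suffix: derive a deterministic shard from a stable
    attribute so the same item always lands on the same shard."""
    shard = int(hashlib.md5(shard_attr.encode()).hexdigest(), 16) % NUM_SHARDS
    return f"{base_key}#{shard}"

# A hot key like "2019-07-09" becomes "2019-07-09#3", distributing
# writes for that date across 10 logical partitions.
print(sharded_partition_key("2019-07-09", "order-12345"))
```

Random suffixes would use `random.randrange(NUM_SHARDS)` instead, which spreads writes further but forces reads to scatter-gather across every shard.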
To make your data searchable in CloudSearch, you need to format it in JSON or XML.
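For illustration, a hedged boto3 sketch of uploading a JSON document batch; the domain endpoint and fields are placeholders, and CloudSearch batches are lists of add/delete operations:

```python
import json
import boto3

# The cloudsearchdomain client talks to one domain's document endpoint.
client = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-mydomain-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com",
)

# CloudSearch batch format: a list of "add"/"delete" operations.
batch = [
    {"type": "add", "id": "movie-1", "fields": {"title": "Example", "year": 2019}},
]

client.upload_documents(
    documents=json.dumps(batch).encode("utf-8"),
    contentType="application/json",
)
```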
You can use popular BI tools like Excel, MicroStrategy, QlikView, and Tableau with EMR to explore and visualize your data. Many of these tools require an ODBC (Open Database Connectivity) or JDBC (Java Database Connectivity) driver. (Source)
Mahout is a machine learning library with tools for clustering, classification, and several types of recommenders, including tools to calculate most-similar items or build item recommendations for users. Use it to carry out machine learning work on top of Hadoop.
When pairing Lambda with Kinesis, make sure to optimize your Lambda memory allocation and batch size, and adjust the number of shards in the Kinesis stream.
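A sketch of wiring those knobs up with boto3 (stream and function names are placeholders; the batch size, memory size, and shard count shown are illustrative, not recommendations):

```python
import boto3

lambda_client = boto3.client("lambda")

# Batch size controls how many Kinesis records each invocation receives.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
    FunctionName="my-processor",
    BatchSize=500,
    StartingPosition="LATEST",
)

# Memory size (which also scales CPU) is tuned on the function itself.
lambda_client.update_function_configuration(
    FunctionName="my-processor",
    MemorySize=512,
)

# Shard count is adjusted on the stream itself.
boto3.client("kinesis").update_shard_count(
    StreamName="my-stream",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)
```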
Triggers do not exist in Redshift.
Amazon Kinesis Aggregators is a Java framework that enables the automatic creation of real-time aggregated time series data from Kinesis streams. (Source)
You can’t encrypt an existing DynamoDB table. You need to create a new, encrypted table and transfer your data over. (Source)
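A hedged sketch of creating the new encrypted table with boto3 before migrating the data over (table and key names are placeholders):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Create the replacement table with encryption at rest enabled,
# then copy items over from the unencrypted original.
dynamodb.create_table(
    TableName="my-table-encrypted",
    AttributeDefinitions=[{"AttributeName": "pk", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "pk", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
    SSESpecification={"Enabled": True, "SSEType": "KMS"},
)
```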
Presto is a fast SQL query engine designed for interactive analytic queries over large datasets from multiple sources. (Source)
Redshift audit logging produces three log types:
Connection log — logs authentication attempts, connections, and disconnections.
User log — logs information about changes to database user definitions.
User activity log — logs each query before it is run on the database.
Kinesis Data Firehose can send records to S3, Redshift, or Elasticsearch. It cannot send records to DynamoDB. (Source)
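For reference, a minimal boto3 sketch of putting a record onto a delivery stream (the stream name is a placeholder; the destination is configured on the stream, not per record):

```python
import json
import boto3

firehose = boto3.client("firehose")

# The destination (S3 / Redshift / Elasticsearch) is set on the
# delivery stream itself; producers just put records.
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",
    Record={"Data": (json.dumps({"event": "click", "ts": 1562630400}) + "\n").encode()},
)
```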
For a serverless solution to ingest, query, and visualize Big Data, combine Kinesis, Glue, Athena, and QuickSight.
Use Spark for general-purpose Amazon EMR operations, Presto for interactive queries, and Hive for batch operations.
Use Athena generally to query existing data in S3. You should be aware that Redshift Spectrum exists, and that it can query data in S3.
If a question is asking how to handle joins or manipulations on millions of rows in DynamoDB, there’s a good chance that EMR with Hive is the answer.
When using Spark, you should aim for a memory-optimized instance type.
To improve query performance and reduce cost, AWS recommends partitioning data used for Athena and storing it in Apache Parquet or ORC format - not CSV!
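One common way to get from CSV to Parquet is an Athena CTAS statement; a hedged boto3 sketch, where the database, tables, bucket, and partition column are all hypothetical:

```python
import boto3

athena = boto3.client("athena")

# CTAS rewrites the CSV-backed table as partitioned Parquet
# (assumes the dt partition column is the last column of logs_csv).
ctas = """
CREATE TABLE logs_parquet
WITH (format = 'PARQUET',
      external_location = 's3://my-bucket/logs-parquet/',
      partitioned_by = ARRAY['dt'])
AS SELECT * FROM logs_csv
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```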
Use the COPY command to transfer data from DynamoDB to Redshift. Use UNLOAD to transfer the results of a query from Redshift to S3.
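One way to issue both commands from code is the Redshift Data API; a hedged boto3 sketch (cluster, table, role, and bucket names are placeholders):

```python
import boto3

rsd = boto3.client("redshift-data")

# COPY: load a DynamoDB table into Redshift. READRATIO caps how much
# of the table's provisioned read capacity the load may consume.
rsd.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""COPY my_table FROM 'dynamodb://my-dynamo-table'
           IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
           READRATIO 50;""",
)

# UNLOAD: write query results from Redshift to S3.
rsd.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""UNLOAD ('SELECT * FROM my_table')
           TO 's3://my-bucket/unload/'
           IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';""",
)
```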
Redshift clusters have two types of nodes: Leader nodes and Compute nodes.
Not to be confused with EMR, which uses Master, Core, and Task nodes.
Use Elasticsearch to analyze data stream updates from other services, such as Kinesis Streams and DynamoDB.
The AWS Schema Conversion Tool is sometimes referred to as SCT.