Introducing Amazon Omics – A Purpose-Built Service to Store, Query, and Analyze Genomic and Biological Data at Scale

{"value":"You might learn in high school biology class that the human genome is composed of over three billion letters of code using adenine (A), guanine (G), cytosine (C), and thymine (T) paired in the deoxyribonucleic acid (DNA). The human genome acts as the biological blueprint of every human cell. And that’s only the foundation for what makes us human.\n\n\n![image.png](https://dev-media.amazoncloud.cn/07d73ec56f2742e794702827bd6c822e_image.png)\n\nHealthcare and life sciences organizations collect myriad types of biological data to improve patient care and drive scientific research. These organizations map an individual’s genetic predisposition to disease, identify new drug targets based on protein structure and function, profile tumors based on what genes are expressed in a specific cell, or investigate how gut bacteria can influence human health. Collectively, these studies are often known as “omics”.\n\nAWS has helped [healthcare and life sciences](https://aws.amazon.com/health/) organizations accelerate the translation of this data into actionable insights for over a decade. Industry leaders such as as [Ancestry](https://aws.amazon.com/solutions/case-studies/ancestry-case-study/), [AstraZeneca](https://aws.amazon.com/solutions/case-studies/astrazeneca/), [Illumina](https://aws.amazon.com/solutions/case-studies/illumina-case-study/), [DNAnexus](https://aws.amazon.com/solutions/case-studies/dnanexus/), [Genomics England](https://aws.amazon.com/solutions/case-studies/genomics-england/), and [GRAIL ](https://www.youtube.com/watch?v=eexsAyOJML4&t=786s) leverage AWS to accelerate time to discovery while concurrently reducing costs and enhancing security.\n\nThe scale these customers, and others, operate at continues to increase rapidly. When omics data across thousand or hundreds of thousands (or more!) of individuals are compared and analyzed, new insights for predicting disease and the efficacy of different drug treatments are possible.\n\nHowever, this scale, which can be many petabytes of data, can add complexity. When I studied medical informatics in my Ph.D course, I experienced this complexity in data access, processing, and tooling. You need a way to store omics data that is cost-efficient and easy to access. You need to scale compute across millions of biological samples while preserving accuracy and reliability. You also need specialized tools to analyze genetic patterns across populations and train machine learning (ML) models to predict diseases.\n\nToday I’m excited to announce the general availability of **[Amazon Omics](https://aws.amazon.com/omics/)**, a purpose-built service to help bioinformaticians, researchers, and scientists store, query, and analyze genomic, transcriptomic, and other omics data and then generate insights from that data to improve health and advance scientific discoveries.\n\nWith just a few clicks in the Omics console, you can import and normalize petabytes of data into formats optimized for analysis. Amazon Omics provides scalable workflows and integrated tools for preparing and analyzing omics data and automatically provisions and scales the underlying cloud infrastructure. 
### ++Omics Data Storage++
Omics data storage helps you store and share petabytes of omics data efficiently. You can create data stores and import sample data in the Omics console, and you can do the same job with the AWS Command Line Interface (AWS CLI).

![image.png](https://dev-media.amazoncloud.cn/2d06894c0bca41cebcc72efdb5a6448c_image.png)

Let’s make a reference store and import a reference genome. This example uses [Genome Reference Consortium Human Build 38](https://registry.opendata.aws/broad-references/) (hg38), which is open access and available from the following Amazon S3 bucket: `s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta`.

As prerequisites, you need to create an [Amazon S3](https://aws.amazon.com/s3/) bucket in your preferred Region and the necessary IAM permissions to access S3 buckets. In the Omics console, you can easily create and select an IAM role during the Omics storage setup.

Use the following AWS CLI commands to create your reference store, copy the genome data to your S3 bucket, and import the data into your reference store.

```
// Create your reference store
$ aws omics create-reference-store --name "Reference Store"

// Copy the reference genome to your own S3 bucket
$ aws s3 cp s3://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta s3://channy-omics

// Import the reference data into your reference store
$ aws omics start-reference-import-job --sources sourceFile=s3://channy-omics/Homo_sapiens_assembly38.fasta,name=hg38 --reference-store-id 123456789 --role-arn arn:aws:iam::01234567890:role/OmicsImportRole
```

You can see the result in your console too.

![image.png](https://dev-media.amazoncloud.cn/558061e671ac4156971bf79301ee176d_image.png)
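The same two Omics operations can be driven from Python. The following is a sketch under the same assumptions as the CLI example above: the `channy-omics` bucket already holds the copied FASTA file, and `OmicsImportRole` is an IAM role that Omics can assume to read it.

```
import boto3

omics = boto3.client("omics")

# Create the reference store and keep its ID for the import job.
store = omics.create_reference_store(name="Reference Store")

# Import the FASTA file that was copied to our own S3 bucket.
job = omics.start_reference_import_job(
    referenceStoreId=store["id"],
    roleArn="arn:aws:iam::01234567890:role/OmicsImportRole",
    sources=[{
        "sourceFile": "s3://channy-omics/Homo_sapiens_assembly38.fasta",
        "name": "hg38",
    }],
)
print("Reference import job:", job["id"], job["status"])
```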
Now you can create a sequence store. A sequence store is similar to an S3 bucket. Each object in a sequence store is known as a “read set”. A read set is an abstraction of a set of genomics file types:

- **[FASTQ](https://en.wikipedia.org/wiki/FASTQ_format)** – A text-based file format that stores information about a base (sequence letter) from a sequencer and the corresponding quality information.
- **[BAM](https://en.wikipedia.org/wiki/Binary_Alignment_Map)** – The compressed binary version of raw reads and their mapping to a reference genome.
- **[CRAM](https://samtools.github.io/hts-specs/CRAMv3.pdf)** – Similar to BAM, but uses the reference genome information to aid in compression.

Amazon Omics allows you to attach domain-specific metadata to the read sets you import. These metadata fields are searchable and are defined when you start a read set import job.

As an example, we will use the [1000 Genomes Project](https://registry.opendata.aws/1000-genomes/), a highly detailed catalogue of more than 80 million human genetic variants from more than 400 billion data points, collected from over 2,500 individuals. Let’s make a sequence store and then import genome sequence files into it.

```
// Create your sequence store
$ aws omics create-sequence-store --name "MySequenceStore"

// Copy the sequence data to your own S3 bucket
$ aws s3 cp s3://1000genomes/phase3/data/HG00146/sequence_read/SRR233106_1.filt.fastq.gz s3://channy-omics
$ aws s3 cp s3://1000genomes/phase3/data/HG00146/sequence_read/SRR233106_2.filt.fastq.gz s3://channy-omics

// Import the sequence data into your sequence store
$ aws omics start-read-set-import-job --cli-input-json '
{
  "sourceFiles":
  {
    "source1": "s3://channy-omics/SRR233106_1.filt.fastq.gz",
    "source2": "s3://channy-omics/SRR233106_2.filt.fastq.gz"
  },
  "sourceFileType": "FASTQ",
  "subjectId": "mySubject2",
  "sampleId": "mySample2",
  "referenceArn": "arn:aws:omics:us-east-1:123456789012:referenceStore/123467890",
  "name": "HG00100"
}'
```

You can see the result in your console again.

![image.png](https://dev-media.amazoncloud.cn/77af05dd0b724541a3ace61534aa73cf_image.png)
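Because the metadata you attach at import time (such as `subjectId` and `sampleId`) is stored with each read set, you can also inspect it programmatically. A small sketch follows; the sequence store ID is a placeholder for the one returned by `create-sequence-store`:

```
import boto3

omics = boto3.client("omics")

# List read sets in a sequence store (placeholder store ID) and show
# the searchable metadata recorded during the import job.
response = omics.list_read_sets(sequenceStoreId="1234567890")
for read_set in response["readSets"]:
    print(read_set["name"], read_set.get("subjectId"),
          read_set.get("sampleId"), read_set["status"])
```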
### ++Analytics Transformations++
You can store variant data, which refers to a mutation: a difference between what the sequencer read at a position and the known reference. You can also store annotation data: known information about a location or variant in a genome, such as whether it may cause disease.

A variant store supports both variant call format (VCF) files, where there is a called variant, and gVCF inputs, with records covering every position in a genome. An annotation store supports generic feature format (GFF3), tab-separated values (TSV), or VCF files. An annotation store can be mapped to the same coordinate system as variant stores during an import.

![image.png](https://dev-media.amazoncloud.cn/02a546d53c764817bed471e386a8690c_image.png)

Once you’ve imported your data, you can run queries like the following, which searches for single nucleotide variants (SNVs), the most common type of genetic variation among people, on human chromosome 1.

```
SELECT
 sampleid,
 contigname,
 start,
 referenceallele,
 alternatealleles
FROM "myvariantstore"."myvariantstore"
WHERE
 contigname = 'chr1'
 and cardinality(alternatealleles) = 1
 and length(alternatealleles[1]) = 1
 and length(referenceallele) = 1
LIMIT 10
```

You can see the output of this query:

```
#	sampleid	contigname	start	referenceallele	alternatealleles
1	NA20858	chr1	10096	T	[A]
2	NA19347	chr1	10096	T	[A]
3	NA19735	chr1	10096	T	[A]
4	NA20827	chr1	10102	T	[A]
5	HG04132	chr1	10102	T	[A]
6	HG01961	chr1	10102	T	[A]
7	HG02314	chr1	10102	T	[A]
8	HG02837	chr1	10102	T	[A]
9	HG01111	chr1	10102	T	[A]
10	NA19205	chr1	10108	A	[T]
```

You can view, manage, and query those data by integrating with existing analytics engines such as [Amazon Athena](https://aws.amazon.com/athena). These query results can be used to train ML models in [Amazon SageMaker](https://aws.amazon.com/sagemaker).
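For instance, from a SageMaker notebook you could submit that SNV query through Athena with boto3. This is a sketch rather than the only possible wiring: the database and table names follow the query above, while the results bucket is a placeholder you would replace with your own.

```
import time
import boto3

athena = boto3.client("athena")

SNV_QUERY = """
SELECT sampleid, contigname, start, referenceallele, alternatealleles
FROM "myvariantstore"."myvariantstore"
WHERE contigname = 'chr1'
  AND cardinality(alternatealleles) = 1
  AND length(alternatealleles[1]) = 1
  AND length(referenceallele) = 1
LIMIT 10
"""

# Submit the query and poll until it reaches a terminal state.
execution = athena.start_query_execution(
    QueryString=SNV_QUERY,
    QueryExecutionContext={"Database": "myvariantstore"},
    ResultConfiguration={"OutputLocation": "s3://channy-omics/athena-results/"},
)
query_id = execution["QueryExecutionId"]
state = "QUEUED"
while state in ("QUEUED", "RUNNING"):
    time.sleep(2)
    state = athena.get_query_execution(
        QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]

# Print the result rows (the first row holds the column headers).
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:
        print([column.get("VarCharValue") for column in row["Data"]])
```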
### ++Bioinformatics Workflows++
Amazon Omics allows you to perform bioinformatics workflow analysis, such as variant calling or gene expression analysis, on AWS. These compute workloads are defined using workflow languages like [Workflow Description Language](https://github.com/openwdl/wdl) (WDL) and [Nextflow](https://github.com/nextflow-io/nextflow), domain-specific languages that specify multiple compute tasks and their input and output dependencies.

![image.png](https://dev-media.amazoncloud.cn/ab52520e4b81431289e2bcdc022d98ce_image.png)

You can define and execute a workflow using a few simple CLI commands. As an example, create a `main.wdl` file with the following WDL code to define a simple workflow with one task that creates a copy of a file.

```
version 1.0
workflow Test {
	input {
		File input_file
	}
	call FileCopy {
		input:
			input_file = input_file,
	}
	output {
		File output_file = FileCopy.output_file
	}
}
task FileCopy {
	input {
		File input_file
	}
	command {
		echo "copying ~{input_file}" >&2
		cat ~{input_file} > output
	}
	output {
		File output_file = "output"
	}
}
```

Then zip up your workflow definition and register it with Amazon Omics using the AWS CLI:

```
$ zip my-wdl-workflow.zip main.wdl
$ aws omics create-workflow \
    --name MyWDLWorkflow \
    --description "My WDL Workflow" \
    --definition-zip file://my-wdl-workflow.zip \
    --parameter-template '{"input_file": "input test file to copy"}'
```

To run the workflow you just created, use the following command:

```
aws omics start-run \
    --workflow-id // ID of the workflow we just created \
    --role-arn // ARN of the IAM role to run the workflow with \
    --parameters '{"input_file": "s3://bucket/path/to/file"}' \
    --output-uri s3://bucket/path/to/results
```

Once the workflow completes, you can use the results in `s3://bucket/path/to/results` for downstream analyses in the Omics variant store.

A run is a single invocation of a workflow with defined input data and compute specifications. An individual run acts on your defined input data and produces an output. Runs can also have priorities associated with them, which allow specific runs to take execution precedence over other submitted and concurrent runs. For example, you can specify that a high-priority run will be executed before one that is lower priority.

![image.png](https://dev-media.amazoncloud.cn/25f87050ff004668bccc12fcf413e6ca_image.png)

You can optionally use a run group, a group of runs for which you can set a maximum vCPU count and a maximum run duration to help limit the compute resources used per run. Run groups can help you partition users who may need access to different workflows running on different data. They can also be used as a budget control or resource fairness mechanism by isolating users to specific run groups.
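Putting the workflow steps together from Python might look like the following sketch: it registers the zipped WDL definition, waits for the workflow to become active, starts a run, and polls for completion. The role ARN and bucket paths are placeholders, and `parameterTemplate` here uses the API’s dictionary form rather than the CLI shorthand above.

```
import time
import boto3

omics = boto3.client("omics")

# Register the zipped WDL definition (created with: zip my-wdl-workflow.zip main.wdl).
with open("my-wdl-workflow.zip", "rb") as f:
    workflow = omics.create_workflow(
        name="MyWDLWorkflow",
        description="My WDL Workflow",
        definitionZip=f.read(),
        parameterTemplate={"input_file": {"description": "input test file to copy"}},
    )

# Workflow creation is asynchronous; wait until it is ready to run.
while omics.get_workflow(id=workflow["id"])["status"] == "CREATING":
    time.sleep(5)

# Start a run (placeholder role and bucket paths) and poll until it finishes.
run = omics.start_run(
    workflowId=workflow["id"],
    roleArn="arn:aws:iam::01234567890:role/OmicsWorkflowRole",
    parameters={"input_file": "s3://bucket/path/to/file"},
    outputUri="s3://bucket/path/to/results",
)
status = run["status"]
while status not in ("COMPLETED", "FAILED", "CANCELLED"):
    time.sleep(30)
    status = omics.get_run(id=run["id"])["status"]
print("Run finished with status:", status)
```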
As you saw, Amazon Omics gives you a managed service in which a couple of clicks, simple commands, and APIs let you analyze large-scale omics data, such as human genome samples, so you can derive meaningful insights from this data in hours rather than weeks. We also provide [tutorial SageMaker notebooks](https://github.com/aws-samples/amazon-omics-tutorials/) that you can use in [Amazon SageMaker](https://aws.amazon.com/sagemaker) to help you get started.

![image.png](https://dev-media.amazoncloud.cn/0560658e69bf4fb68f1ac647206c8f54_image.png)

In terms of data security, Amazon Omics helps ensure that your data remains secure and patient privacy is protected with customer-managed encryption keys and [HIPAA eligibility](https://aws.amazon.com/compliance/hipaa-compliance/).
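For example, you can supply your own KMS key when you create a store. A minimal sketch follows; the key ARN is a placeholder, and the `sseConfig` shape reflects the Omics API’s server-side encryption configuration as I understand it.

```
import boto3

omics = boto3.client("omics")

# Create a sequence store encrypted with a customer-managed KMS key (placeholder ARN).
store = omics.create_sequence_store(
    name="MySecureSequenceStore",
    sseConfig={
        "type": "KMS",
        "keyArn": "arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
    },
)
print("Sequence store:", store["arn"])
```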
### ++Customer and Partner Voices++
Customers and partners in the healthcare and life sciences industry have shared how they are using Amazon Omics to accelerate scientific insights.

**[Children’s Hospital of Philadelphia (CHOP)](https://aws.amazon.com/solutions/case-studies/chop/)** is the oldest hospital in the United States dedicated exclusively to pediatrics and strives to advance healthcare for children with the integration of excellent patient care and innovative research. AWS has worked with the CHOP Research Institute for many years as they’ve led the way in utilizing data and technology to solve challenging problems in child health.

“At Children’s Hospital of Philadelphia, we know that getting a comprehensive view of our patients is crucial to delivering the best possible care, based on the most innovative research. Combining multiple clinical modalities is foundational to achieving this. With Amazon Omics, we can expand our understanding of our patients’ health, all the way down to their DNA.” – Jeff Pennington, Associate Vice President & Chief Research Informatics Officer, Children’s Hospital of Philadelphia

**[G42 Healthcare](https://www.g42.ai/resources/news/g42-healthcare-aws-collaborate-to-offer-global-genomics-services)** enables AI-powered healthcare that uses data and emerging technologies to personalize preventative care.

“Amazon Omics allows G42 to accelerate a competitive and deployable end-to-end service with globally leading data governance. We’re able to leverage the extensive omics data management and bioinformatics solutions hosted globally on AWS, at our customers’ fingertips. Our collaboration with AWS is much more than data – it’s about value.” – Ashish Koshi, CEO, G42 Healthcare

**[C2i Genomics](https://c2i-genomics.com/)** brings together researchers, physicians, and patients to utilize ultra-sensitive whole-genome cancer detection to personalize medicine, reduce cancer treatment costs, and accelerate drug development.

“In C2i Genomics, we empower our data scientists by providing them cloud-based computational solutions to run high-scale, customizable genomic pipelines, allowing them to focus on method development and clinical performance, while the company’s engineering teams are responsible for the operations, security and privacy aspects of the workloads. Amazon Omics allows researchers to use tools and languages from their own domain, and considerably reduces the engineering maintenance effort while taking care of cost and resource allocation considerations, which in turn reduces time-to-market and NRE costs of new features and algorithmic improvements.” – Ury Alon, VP Engineering, C2i Genomics

We are excited to work hand in hand with our AWS partners to build scalable, multi-modal solutions that enable the conversion of raw sequencing data into insights.

**[Lifebit](https://aws.amazon.com/solutions/case-studies/lifebit/)** builds enterprise data platforms for organizations with complex and sensitive biomedical datasets, empowering customers across the life sciences sector to transform how they use sensitive biomedical data.

“At Lifebit, we’re on a mission to connect the world’s biomedical data to obtain novel therapeutic insights. Our customers work with vast cohorts of linked genomic, multi-omics and clinical data – and these data volumes are expanding rapidly. With Amazon Omics they will have access to optimised analytics and storage for this large-scale data, allowing us to provide even more scalable bioinformatics solutions. Our customers will benefit from significantly lower cost per gigabase of data, essentially achieving hot storage performance at cold storage prices, removing cost as a barrier to generating insights from their population-scale biomedical data.” – Thorben Seeger, Chief Business Development Officer, Lifebit

To hear more customer and partner voices, see the Amazon Omics Customers page.

### ++Now Available++
Amazon Omics is now available in the US East (N. Virginia), US West (Oregon), Europe (Ireland), Europe (London), Europe (Frankfurt), and Asia Pacific (Singapore) Regions.

To learn more, see the [Amazon Omics page](https://aws.amazon.com/omics/), the [Amazon Omics User Guide](https://docs.aws.amazon.com/omics/latest/dev/what-is-service.html), [Genomics on AWS](https://aws.amazon.com/health/genomics/), and [Healthcare & Life Sciences on AWS](https://aws.amazon.com/health/). Give it a try, and please [contact the AWS genomics team](https://pages.awscloud.com/GenomicsContactSales.html) or send feedback through your usual AWS Support contacts.

– [Channy](https://twitter.com/)

![4cb1206ddc6551fff41080fcf865b37.png](https://dev-media.amazoncloud.cn/bb9d929fe59e479eb5e6bc47133e1213_4cb1206ddc6551fff41080fcf865b37.png)

### **[Channy Yun](https://aws.amazon.com/blogs/aws/author/channy-yun/)**
Channy Yun is a Principal Developer Advocate for AWS, passionate about helping developers build modern applications on the latest AWS services. A pragmatic developer and blogger at heart, he loves community-driven learning and sharing of technology, which has funneled developers into global AWS User Groups. His main topics are open source, containers, storage, networking & security, and IoT. Follow him on Twitter at @channyun.