Max M. He, Ph.D.

  Primary Appointment:
      Center for Human Genetics at Marshfield Clinic Research Foundation
  Joint Appointments:
      Biomedical Informatics Research Center at Marshfield Clinic Research Foundation
      Computation and Informatics in Biology and Medicine at University of Wisconsin - Madison
  Academic Rank:
      Assistant Professor of Human Genetics & Biomedical Informatics
  Email:
      he dot max at marshfieldclinic dot org

Research

    The research in my lab focuses on the development of advanced machine learning approaches, statistical methods, and efficient computational tools on a Big Data infrastructure for precision medicine:

  • Developing Advanced Statistical Methods and Reliable Computational Tools to discover clinically actionable genetic variants and clinical/environmental factors for disease diagnosis and personalized treatment;
  • Developing Advanced Machine Learning Approaches to Mine Large-Scale Scientific Articles to discover disease-gene and/or drug-gene associations and other relevant relationships;
  • Developing New Statistical Methodologies and Computational Tools to detect disease-associated genetic variants by genome-wide association studies (GWAS) and/or phenome-wide association studies (PheWAS);
  • Creating a Big Data System to store, manipulate, and analyze diverse genomic data and comprehensive clinical data derived from electronic health records (EHRs).

Software

  1. SeqHBase: a big data toolset for family-based sequencing data analysis: SeqHBase is a big data toolset built on the Apache Hadoop and HBase infrastructure. It is designed for analyzing family-based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations. SeqHBase takes as input BAM files (for coverage of the 3 billion sites of a genome), VCF files (for variant calls), and functional annotations (for variant prioritization). SeqHBase works in a distributed and fully parallel manner across multiple data nodes. We applied SeqHBase to a 5-member nuclear family and a 10-member three-generation family with whole genome sequencing (WGS) data, as well as a 4-member nuclear family with whole exome sequencing (WES) data. Analysis times were linearly scalable with the number of data nodes. With 20 data nodes, SeqHBase took about 5 seconds to analyze the WES familial data and approximately 1 minute to analyze the 10-member WGS familial data. These results demonstrate SeqHBase's high efficiency and scalability. The system is distributed and customizable, and it can be scaled to match the available data volume: as more data become available, more data nodes can be added and are incorporated seamlessly into the existing system. SeqHBase can be applied to manipulate and analyze millions of WGS data sets.
  2. SparkText: an efficient toolset for data mining large-scale scientific literature: Text mining is a specialized data mining method that extracts information (e.g., facts, biological processes, or diseases) from text, such as scientific literature. We used natural language processing (NLP), machine learning, and Big Data infrastructure to design and develop a distributed and scalable framework that extracts information related to breast, prostate, and/or lung cancers, and then builds prediction models to classify the information extracted from more than 29,437 full-text articles downloaded from PubMed Central. We employed three classification algorithms, Naive Bayes, Support Vector Machine (SVM), and Logistic Regression, to build prediction models using 5-fold cross-validation on the 29,437 full-text articles. The framework was developed on a Big Data infrastructure comprising an Apache Hadoop cluster together with Apache Spark and a Cassandra database. Mining the more than 29,437 full-text articles took about 6 minutes on the Big Data platform, compared with more than 11 hours without any Big Data infrastructure, showing that mining large-scale biomedical articles can be significantly accelerated on such infrastructure. The accuracy, precision, and recall of predicting a cancer type using any of the three machine learning methods on the 29,437 full-text articles were comparable to or better than those obtained with other libraries, such as the Weka library and TagHelper Tools. Both the time efficiency and the accuracy of our scalable framework are promising, and this strategy will provide tangible benefits to medical research.
  3. Collaborative development of HadoopCNV with Dr. Kai Wang at USC: HadoopCNV is a highly scalable solution for accurate detection of copy number variations (CNVs) from WGS data. It infers aberration events, such as copy number changes and loss of heterozygosity (LOH), from information encoded in both allelic and overall read depth. Resolving small regions in samples with deep coverage can be very time-consuming because of the massive I/O cost. Our implementation is built on the Hadoop MapReduce paradigm, enabling multiple processors to efficiently process separate regions in parallel. We employed a Viterbi scoring algorithm to infer the most likely copy number/heterozygosity state for each region of the genome. We applied HadoopCNV to a 10-member pedigree sequenced on the Illumina HiSeq platform. Our method has an overall lower Mendelian inconsistency rate than competing approaches and shows comparable performance on the NA12878 individual from the 1000 Genomes Project. Most importantly, our method takes only 1.3 hours from BAM files to CNV output, while other methods take more than 13 hours.
  4. Association Tests for Annotated Variants (ATAV): ATAV is a statistical toolset that is designed to detect complex disease-associated rare genetic variants by performing association analysis, trio analysis, and/or linkage analysis on whole-genome or whole-exome sequencing data.
  5. A User-Friendly Software Tool for Population Stratification Adjustment in Genome-Wide Association Studies: Population stratification is characterized by systematic differences in allele frequencies between sub-populations. If differences in disease burden between sub-populations are also present, population stratification can result in false-positive associations between the disease and genetic variants. The "stratification score" approach of Epstein, Allen, and Satten has been proposed to address this problem. The basic idea is to develop strata within which individuals have similar baseline probabilities of disease conditional on genomic information. Stratified association tests using these strata have been shown to have both the correct type I error rate and good power. Here we present a user-friendly software tool that implements the "stratification score" approach and can handle genome-wide association data. The tool allows users to import data in many popular formats and performs several other useful functions, including the calculation and visualization of principal components. Both Web-based and standalone versions of the tool are implemented. The Web-based tool allows research groups to operate under a client/server model in which users interact with the tool remotely and can receive results via email if they wish.
  6. Statistical Analysis of Antigen Receptor Spectratype Data: Spectratype analysis (SpA) is a method used in clinical and basic immunological settings in which antigen receptor length diversity is assessed as a surrogate for functional diversity. We have developed statistical methods for comparing multiple spectratypes in a variety of ways. The fundamental statistic for these comparisons and statistical tests is the completeness, an information-theoretic quantity that arises naturally in the statistical derivations. The completeness is closely related to the entropy as a measure of the diversity of the antigen receptor repertoire and serves as a sensitive and objective measure of the state of the repertoire. Several of the statistical tests based on the completeness are performed automatically upon data submission, and additional tests are available to the user online through SpA. Specialized statistical tools, developed for hypothesis testing and modeling for multiple spectratypes, are also available through the SpA interface. In addition to the specific procedures provided by SpA, the powerful, general-purpose data analysis package R is integrated into the SpA system for more specialized procedures (Bioinformatics, 2005, 21, 3394-3400; Bioinformatics, 2005, 21, 3697-3699). SpA is used both on campus and throughout the world.
  7. OmicShare: OmicShare is a collaborative work environment that enables users to easily store, manage, and share all types of instrumental and analytical data files for project management in biomedical research. It facilitates research collaboration and reduces the risk of data loss. OmicShare has a user-friendly interface accessed through an Internet browser. Data files are uploaded to the system, which is backed by a robust relational database (such as Oracle, MySQL, or PostgreSQL), by selecting, copying, or simply dragging and dropping files. OmicShare allows users to upload or download multiple subfolders and files with a single click. The data supplier or system administrator can grant collaborators different permissions on folders or files. OmicShare allows users to share files with collaborators quickly, easily, and professionally. Users can securely and quickly navigate to the projects in which they are involved to communicate with collaborators inside and outside their organizations, upload or download one or more data files with one click, and download analyses. Click here to evaluate the software.
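
The family-based filtering that SeqHBase (item 1 above) performs can be illustrated on a single parent-child trio. The following is a minimal Python sketch, not SeqHBase's actual implementation: the genotype coding (0, 1, or 2 alternate alleles), the function names, the gene labels, and the variant records are all assumptions made for illustration, and the real toolset runs its analysis in a distributed manner on Hadoop and HBase.

```python
from collections import defaultdict

# Genotypes are coded as the number of alternate alleles: 0, 1, or 2.
# All records below are made-up illustrative data, not real variant calls.
variants = [
    {"id": "v1", "gene": "GENE_A", "child": 1, "father": 0, "mother": 0},
    {"id": "v2", "gene": "GENE_B", "child": 1, "father": 1, "mother": 0},
    {"id": "v3", "gene": "GENE_B", "child": 1, "father": 0, "mother": 1},
    {"id": "v4", "gene": "GENE_C", "child": 2, "father": 1, "mother": 1},
]


def is_de_novo(v):
    """Child carries an alternate allele that neither parent carries."""
    return v["child"] > 0 and v["father"] == 0 and v["mother"] == 0


def is_inherited_homozygous(v):
    """Child is homozygous alternate and each parent carries at least one copy."""
    return v["child"] == 2 and v["father"] >= 1 and v["mother"] >= 1


def compound_het_genes(records):
    """Genes where the child is heterozygous for at least one variant seen
    only in the father and at least one variant seen only in the mother."""
    paternal, maternal = defaultdict(list), defaultdict(list)
    for v in records:
        if v["child"] != 1:
            continue
        if v["father"] == 1 and v["mother"] == 0:
            paternal[v["gene"]].append(v["id"])
        elif v["father"] == 0 and v["mother"] == 1:
            maternal[v["gene"]].append(v["id"])
    return {g: (paternal[g], maternal[g]) for g in paternal if g in maternal}


de_novo_ids = [v["id"] for v in variants if is_de_novo(v)]
homozygous_ids = [v["id"] for v in variants if is_inherited_homozygous(v)]
compound_het = compound_het_genes(variants)

print("de novo:", de_novo_ids)
print("inherited homozygous:", homozygous_ids)
print("compound heterozygous:", compound_het)
```

A production pipeline would also need to account for coverage (to rule out missed parental alleles), genotype quality, and phasing, none of which this sketch attempts.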
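
The Viterbi step mentioned for HadoopCNV (item 3 above) can be sketched on a toy example. This is a simplified, single-machine illustration and not HadoopCNV's code: it uses only a normalized read-depth ratio per region (the tool itself also uses allelic depth), and the three copy-number states, the Gaussian emission model, the transition probabilities, and the depth values are all assumed for the example.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely hidden-state path for the observations."""
    # best[i][s]: log-score of the best path that ends in state s at position i
    best = [{s: log_start[s] + log_emit(s, observations[0]) for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit(s, obs)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Assumed model: copy number 1, 2, or 3; expected depth ratio = copies / 2.
states = [1, 2, 3]
SD = 0.15  # assumed spread of the depth ratio around its expected value


def log_emit(copies, depth_ratio):
    # Gaussian log-density up to a constant shared by all states.
    return -((depth_ratio - copies / 2) ** 2) / (2 * SD ** 2)


log_start = {s: math.log(1 / len(states)) for s in states}
log_trans = {
    p: {s: math.log(0.98 if p == s else 0.01) for s in states} for p in states
}

# Made-up depth ratios for eight consecutive regions; the middle three
# regions sit near 0.5, as expected for a single-copy deletion.
depth_ratios = [1.0, 1.05, 0.95, 0.5, 0.55, 0.45, 1.0, 1.02]
path = viterbi(depth_ratios, states, log_start, log_trans, log_emit)
print(path)
```

The high self-transition probability discourages state changes on isolated noisy regions, so a call is made only when several adjacent regions support it.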

Selected Publications

  1. Ye Z, Tafti AP, He KY, Wang K*, He MM*. SparkText: Biomedical Text Mining on Big Data Framework. PLoS ONE, 2016, 11(9), e0162721. doi:10.1371/journal.pone.0162721 [View PubMed]
  2. Tafti AP, Holz JD, Baghaie A, Owen HA, He MM, Yu Z. 3DSEM++: adaptive and intelligent 3D SEM surface reconstruction. Micron, 2016, 87, 33-45. [View PubMed]
  3. Carter TC, He MM*. Challenges of identifying clinically actionable genetic variants for precision medicine. Journal of Healthcare Engineering, 2016. doi:10.1155/2016/3617572. [View PubMed]
  4. Van Driest SL, Wells QS, Stallings S, et al. Association of Arrhythmia-Related Genetic Variants With Phenotypes Documented in Electronic Medical Records. JAMA, 2016, 315(1), 47-57. doi:10.1001/jama.2015.17701. [View PubMed]
  5. Zhang W, Yu Y, Hertwig F, et al. Comparison of RNA-seq and microarray-based models for clinical endpoint prediction. Genome Biol, 2015, 16, 133. doi:10.1186/s13059-015-0694-1. [View PubMed]
  6. Schrodi SJ, Debarber A, He M, Ye Z, Peissig PL, VanWormer JJ, Haws R, Brilliant MH, Steiner RD. Prevalence estimation for monogenic autosomal recessive diseases using population-based genetic data. Human Genetics, 2015, 134(6), 659-669. [View PubMed]
  7. He M*, Person TN, Hebbring SJ, Heinzen E, Ye Z, Schrodi SJ, McPherson EW, Lin SM, Peissig PL, Brilliant MH, O'Rawe J, Robison RJ, Lyon GJ, Wang K*. SeqHBase: a big data toolset for family based sequencing data analysis. Journal of Medical Genetics, 2015, 52(4), 282-288. [View PubMed]
  8. Ye Z, Mayer J, Ivacic L, Zhou Y, He M, Schrodi SJ, Page CD, Brilliant MH, Hebbring SJ. Phenome-wide association studies (PheWASs) for functional variants. European Journal of Human Genetics, 2015, 23(4), 523-529. [View PubMed]
  9. Mayer J, Kitchner T, Ye Z, Zhou ZY, He M, Schrodi SJ, Hebbring SJ. Use of an Electronic Medical Record to Create the Marshfield Clinic Twin/Multiple Birth Cohort. Genetic Epidemiology, 2014, 38(8), 692-698. [View PubMed]
  10. Chute CG, Ullman-Cullere M, Wood GM, Lin SM, He M, Pathak J. Some experiences and opportunities for big data in translational research. Genetics in Medicine, 2013, 15, 802-809. [View PubMed]
  11. Zhu Q, Ge D, Heinzen EL, Dickson SP, Urban TJ, Zhu M, Maia JM, He M, Zhao Q, Shianna KV, Goldstein DB. Prioritizing genetic variants for causality on the basis of preferential linkage disequilibrium. Am J Hum Genet., 2012, 91, 422-434. [View PubMed]
  12. Zhu M, Need AC, Han Y, Ge D, Maia JM, Zhu Q, Heinzen EL, Cirulli ET, Pelak K, He M, Ruzzo EK, Gumbs C, Singh A, Feng S, Shianna KV, Goldstein DB. Using ERDS to Infer Copy-Number Variants in High-Coverage Genomes. Am J Hum Genet., 2012, 91, 408-421. [View PubMed]
  13. Need AC, McEvoy JP, Gennarelli M, Heinzen EL, Ge D, Maia JM, Shianna KV, He M, Cirulli ET, Gumbs CE, Zhao Q, Rosenquist P, Levy DL, Meltzer HM, Goldstein DB. Exome sequencing followed by large-scale genotyping suggests a limited role for moderately rare risk factors of strong effect in schizophrenia. Am J Hum Genet., 2012, 91, 303-312. [View PubMed]
  14. Heinzen EL, Depondt C, Cavalleri G, Ruzzo EK, Walley NM, Need AC, Ge D, He M, Cirulli ET, Zhao Q, Cronin KD, Gumbs CE, Campbell CR, Hong LK, Maia JM, Shianna KV, McCormack M, Radtke RA, Mikati MA, Gallentine WB, Husain AM, Sinha SR, Puranam RS, McNamara JO, Ottman R, Sisodiya SM, Delanty N, Goldstein DB. Exome sequencing followed by large scale genotyping fails to identify single rare variants of large effect in idiopathic generalized epilepsy. Am J Hum Genet., 2012, 91, 293-302. [View PubMed]
  15. Epstein MP, Duncan R, Broadaway KA, He M, Allen AS, Satten GA. Stratification Score Matching Improves Correction for Confounding by Population Stratification in Case-Control Association Studies. Genetic Epidemiology, 2012, 36, 195-205. [View PubMed]
  16. Ge D*, Ruzzo EK*, Shianna KV, He M, Pelak K, Heinzen EL, Need AC, Cirulli ET, Maia JM, Zhu M, Singh A, Allen AS, Goldstein DB. SVA: Software for Annotating and Visualizing Sequenced Human Genomes. Bioinformatics, 2011, 27, 1998-2000. [View PubMed]
  17. He M, Allen AS. Testing gene-treatment interactions in pharmacogenetic studies. Journal of Biopharmaceutical Statistics, 2010, 20(2), 301-314. [View PubMed]
  18. Markert ML, Devlin BH, Alexieff MJ, Li J, McCarthy EA, Gupton SE, Chinn IK, Hale LP, Kepler TB, He M, Sarzotti M, Skinner MA, Rice HE, Hoehner JC. Review of 54 patients with complete DiGeorge anomaly enrolled in protocols for thymus transplantation: outcome of 44 consecutive transplants. Blood, 2007, 109, 4539-4547. [View PubMed]
  19. He M, Devlin BH, Markert ML, Sarzotti M, Kepler TB. SpA: Web-accessible Spectratype Analysis: application to investigate the development of TCR diversity in a patient with complete DiGeorge syndrome. Proceedings of The 2006 International Conference on Bioinformatics & Computational Biology - BIOCOMP'06 in Las Vegas, Nevada, Jun 26-29, 2006, 503-510. [View PDF]
  20. Liu CX, He M, Rooney B, Kepler TB, Chao NJ. Longitudinal Analysis of T-Cell Receptor Variable Beta Chain Repertoire in Patients with Acute Graft-versus-Host Disease after Allogeneic Stem Cell Transplantation. Biology of Blood and Marrow Transplantation, 2006, 12, 335-345. [View PubMed]
  21. He M, Tomfohr JK, Devlin BH, Markert ML, Sarzotti M, Kepler TB. SpA: web-accessible spectratype analysis: data management, statistical analysis and visualization. Bioinformatics, 2005, 21, 3697-3699. [View PubMed]
  22. Kepler TB, He M, Tomfohr JK, Devlin BH, Sarzotti M, Markert ML. Statistical analysis of antigen receptor spectratype data. Bioinformatics, 2005, 21, 3394-3400. [View PubMed]
  23. He M, Yan XJ, Zhou JJ, Xie GR. Traditional Chinese Medicine Database and Application on the Web. J. Chem. Inf. Comput. Sci., 2001, 41, 273-277. [View PubMed]
The full list of publications can be accessed here.

Press Releases
    March 21, 2016 - Marshfield Clinic Study of Dead Patients' Genomes Reveals Multiple Actionable Mutations

    October 15, 2015 - Sequencing the Genomes of Dead People

    February 12, 2015 - Marshfield Clinic's SeqHBase Offers Toolset for Familial Sequencing-based Data Analysis