
MPIP: MLSQL Table Cache Cleaner #1031

Closed
allwefantasy opened this issue Apr 17, 2019 · 1 comment

allwefantasy commented Apr 17, 2019

MPIP: MLSQL Table Cache Cleaner

We have provided the ET CacheExt so that we can cache tables in MLSQL, but it has some disadvantages:

  1. People need to uncache the table manually, otherwise the system will leak memory.
  2. Some misuse will also cause memory leaks. Check the following code:

 select * from b as table1;
 !cache table1;
 select a from table1 as table1;
 !uncache table1;

Here the third statement rebinds the name table1 to a new table, so !uncache no longer reaches the originally cached table and its memory is leaked.

This means the table cache is hard to use and sometimes a little dangerous. We should provide a mechanism which guarantees that cache memory is released automatically, and the system should also warn users when they change the reference of a table that has been cached.

In order to deal with this, a potential solution is to assign a lifetime to each cache. There are three kinds of lifetime:

  1. script
  2. session
  3. application

Script lifetime: once the script is finished, the cache will be released.
Session lifetime: once the session logs out or times out, all caches created in this session will be released.
Application lifetime: the user should release the cache manually.
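To make the distinction concrete, the three lifetime kinds could be modeled as a small sealed trait; the names below are illustrative sketches, not part of the proposal:

```scala
// Illustrative sketch (names are assumptions): cache lifetimes as a sealed
// trait, so the cleaner can decide when a cached table must be released.
sealed trait CacheLifetime
case object ScriptLifetime extends CacheLifetime      // released when the script's job finishes
case object SessionLifetime extends CacheLifetime     // released on session logout or timeout
case object ApplicationLifetime extends CacheLifetime // released only by an explicit !uncache

object CacheLifetime {
  // Describe which event triggers release for each lifetime kind.
  def releaseTrigger(lifetime: CacheLifetime): String = lifetime match {
    case ScriptLifetime      => "job finished"
    case SessionLifetime     => "session logout or timeout"
    case ApplicationLifetime => "manual uncache"
  }
}
```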

In this proposal, we will focus on how to implement the script lifetime.

Script lifetime should be bound to the job life cycle. This means we should provide an MLSQL job listener so that developers can implement the listener to hook in the clean action.

Notice that when we run a stream job, there is no need to clean the cache.

First we should provide a JobListener; it should look like this:

package tech.mlsql.job

abstract class JobListener {

  import JobListener._

  def onJobStarted(event: JobStartedEvent): Unit

  def onJobFinished(event: JobFinishedEvent): Unit

}

object JobListener {

  trait JobEvent

  class JobStartedEvent(val groupId:String) extends JobEvent

  class JobFinishedEvent(val groupId:String) extends JobEvent

}

Then we need to collect the tables that have been cached and provide a function to clean them.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

object SQLCacheExt {
  // Maps a job's groupId to the tables it has cached.
  val cache = new java.util.concurrent.ConcurrentHashMap[String, ArrayBuffer[TableCacheItem]]()

  def addCache(tci: TableCacheItem): Unit = {
    synchronized {
      val items = cache.getOrDefault(tci.groupId, ArrayBuffer[TableCacheItem]())
      items += tci
      cache.put(tci.groupId, items)
    }
  }

  def cleanCache(session: SparkSession, groupId: String): Unit = {
    // remove is atomic, so each group's caches are released exactly once
    val items = cache.remove(groupId)
    if (items != null) {
      items.foreach { item =>
        SparkExposure.cleanCache(session, item.planToCache)
      }
    }
  }
}

case class TableCacheItem(groupId: String, tableName: String, planToCache: LogicalPlan, cacheStartTime: Long)

Next, hook addCache into the ET SQLCacheExt implementation:

val df = if (path.isEmpty) _df else _df.sparkSession.table(path)
val exe = params.get(execute.name).getOrElse {
  "cache"
}
val __dfname__ = params("__dfname__")
val _isEager = params.get(isEager.name).map(f => f.toBoolean).getOrElse(false)

if (!execute.isValid(exe)) {
  throw new MLSQLException(s"${execute.name} should be cache or uncache")
}

val context = ScriptSQLExec.contextGetOrForTest()

if (exe == "cache") {
  df.persist()
  // register the cached plan so it can be released when the job finishes
  SQLCacheExt.addCache(TableCacheItem(context.groupId, __dfname__, df.queryExecution.logical, System.currentTimeMillis()))
} else {
  df.unpersist()
}

if (_isEager) {
  df.count()
}
df

The job manager should invoke the listener callbacks:

def run(session: SparkSession, job: MLSQLJobInfo, f: () => Unit): Unit = {
  try {
    _jobListeners.foreach { l => l.onJobStarted(new JobStartedEvent(job.groupId)) }
    if (_jobManager == null) {
      f()
    } else {
      session.sparkContext.setJobGroup(job.groupId, job.jobName, true)
      _jobManager.groupIdToMLSQLJobInfo.put(job.groupId, job)
      f()
    }
    _jobListeners.foreach { l => l.onJobFinished(new JobFinishedEvent(job.groupId)) }
  } finally {
    handleJobDone(job.groupId)
    session.sparkContext.clearJobGroup()
  }
}

Finally, implement the JobListener to wire the cache cleanup to the job life cycle.
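A minimal, self-contained sketch of such a listener is shown below. The name ScriptCacheCleanListener is an assumption, and the Spark-specific pieces (SparkSession, LogicalPlan, SparkExposure) are replaced by a simple per-groupId registry of table names so the example stands alone:

```scala
import scala.collection.mutable.ArrayBuffer

// Mirrors the proposal's JobListener contract.
object JobListener {
  trait JobEvent
  class JobStartedEvent(val groupId: String) extends JobEvent
  class JobFinishedEvent(val groupId: String) extends JobEvent
}

abstract class JobListener {
  import JobListener._
  def onJobStarted(event: JobStartedEvent): Unit
  def onJobFinished(event: JobFinishedEvent): Unit
}

// Stand-in for SQLCacheExt: tracks cached table names per groupId instead of
// Spark logical plans, so the sketch runs without a SparkSession.
object CacheRegistry {
  val cache = new java.util.concurrent.ConcurrentHashMap[String, ArrayBuffer[String]]()

  def addCache(groupId: String, tableName: String): Unit = synchronized {
    val items = cache.getOrDefault(groupId, ArrayBuffer[String]())
    items += tableName
    cache.put(groupId, items)
  }

  // Atomically removes and returns everything cached under this groupId.
  def cleanCache(groupId: String): Seq[String] = {
    val items = cache.remove(groupId)
    if (items != null) items.toSeq else Seq.empty
  }
}

// The cleaner itself: when a batch job finishes, release every cache it created.
class ScriptCacheCleanListener extends JobListener {
  import JobListener._
  override def onJobStarted(event: JobStartedEvent): Unit = ()
  override def onJobFinished(event: JobFinishedEvent): Unit = {
    CacheRegistry.cleanCache(event.groupId)
  }
}
```

In the real implementation, cleanCache would call SparkExposure.cleanCache on each cached plan, the listener would be registered with the job manager so it fires from run's onJobFinished callback, and stream jobs would be skipped as noted above.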

allwefantasy commented:
The branch is MPIP-1031.
