Description
MPIP: MLSQL Table Cache Cleaner
We have provided the ET CacheExt so tables can be cached in MLSQL. But there are some disadvantages:
- People need to uncache the table manually, otherwise the system will leak memory.
- Some misuse will also cause a memory leak; check the following code:
select * from b as table1;
!cache table1;
select a from table1 as table1;
!uncache table1;
In the example above, the second select rebinds the name table1, so the !uncache statement targets the new table while the originally cached plan is never released. This means table cache is hard to use and sometimes a little dangerous. We should provide a mechanism which guarantees that cache memory is released automatically, and the system should also warn users when they change the reference of a table that has been cached.
To deal with this, a potential solution is to assign a lifetime to each cache. There are three kinds of lifetime:
- script
- session
- application
Script lifetime: once the script finishes, the cache is released.
Session lifetime: once the session logs out or times out, all caches created in that session are released.
Application lifetime: the user must release the cache manually.
In this proposal, we will focus on how to implement script lifetime.
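The three lifetimes above could be modeled as a small ADT. The sketch below is illustrative only; these names are not taken from the MLSQL codebase:

```scala
// Hypothetical sketch: the three cache lifetimes as a sealed ADT.
sealed trait CacheLifetime

object CacheLifetime {
  case object Script      extends CacheLifetime // released when the script finishes
  case object Session     extends CacheLifetime // released on session logout/timeout
  case object Application extends CacheLifetime // released only by an explicit uncache

  // Parse a user-supplied lifetime string, case-insensitively.
  def parse(s: String): CacheLifetime = s.toLowerCase match {
    case "script"      => Script
    case "session"     => Session
    case "application" => Application
    case other         => throw new IllegalArgumentException(s"unknown lifetime: $other")
  }
}
```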
Script lifetime should be bound to the job life cycle, which means we should provide an MLSQL job listener so that developers can implement the listener to hook in the clean action.
Notice that when we run a stream job, there is no need to clean the cache.
First we should provide a JobListener; it should look like this:
package tech.mlsql.job

abstract class JobListener {

  import JobListener._

  def onJobStarted(event: JobStartedEvent): Unit

  def onJobFinished(event: JobFinishedEvent): Unit
}

object JobListener {

  trait JobEvent

  class JobStartedEvent(val groupId: String) extends JobEvent

  class JobFinishedEvent(val groupId: String) extends JobEvent
}
Then, we need to collect the tables that are cached and provide a function to clean them.
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

object SQLCacheExt {

  // Cached tables grouped by the job's groupId, so a whole group can be released at once.
  val cache = new java.util.concurrent.ConcurrentHashMap[String, ArrayBuffer[TableCacheItem]]()

  def addCache(tci: TableCacheItem) = {
    synchronized {
      val items = cache.getOrDefault(tci.groupId, ArrayBuffer[TableCacheItem]())
      items += tci
      cache.put(tci.groupId, items)
    }
  }

  def cleanCache(session: SparkSession, groupId: String) = {
    val items = cache.remove(groupId)
    if (items != null) {
      items.foreach { item =>
        // SparkExposure is an MLSQL helper that reaches Spark's cache-cleaning API.
        SparkExposure.cleanCache(session, item.planToCache)
      }
    }
  }
}
case class TableCacheItem(groupId: String, owner: String, tableName: String, planToCache: LogicalPlan, cacheStartTime: Long)
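The registry semantics above (group cached items by groupId, release the whole group at once) can be exercised without Spark. The following is a simplified, self-contained sketch in which a plain String stands in for the LogicalPlan; the names CacheRegistry and CacheItem are hypothetical:

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified model of the registry: planToCache is a String stand-in for LogicalPlan.
case class CacheItem(groupId: String, tableName: String, planToCache: String)

object CacheRegistry {
  private val cache =
    new java.util.concurrent.ConcurrentHashMap[String, ArrayBuffer[CacheItem]]()

  def addCache(item: CacheItem): Unit = synchronized {
    val items = cache.getOrDefault(item.groupId, ArrayBuffer[CacheItem]())
    items += item
    cache.put(item.groupId, items)
  }

  // Removes and returns the table names released for this group (empty if none).
  def cleanCache(groupId: String): Seq[String] = {
    val items = cache.remove(groupId)
    if (items != null) items.map(_.tableName).toSeq else Seq.empty
  }
}
```

Because cleanCache removes the group's entry from the map, calling it a second time for the same groupId is a no-op, which is the behavior the job listener relies on.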
Hook addCache into the ET CacheExt implementation:
val df = if (path.isEmpty) _df else _df.sparkSession.table(path)
val exe = params.get(execute.name).getOrElse {
  "cache"
}
val __dfname__ = params("__dfname__")
val _isEager = params.get(isEager.name).map(f => f.toBoolean).getOrElse(false)

if (!execute.isValid(exe)) {
  throw new MLSQLException(s"${execute.name} should be cache or uncache")
}

val context = ScriptSQLExec.contextGetOrForTest()

if (exe == "cache") {
  df.persist()
  SQLCacheExt.addCache(TableCacheItem(context.groupId, context.owner, __dfname__, df.queryExecution.logical, System.currentTimeMillis()))
  // Eager materialization only makes sense when caching.
  if (_isEager) {
    df.count()
  }
} else {
  df.unpersist()
}

df
The job manager should invoke the listener callbacks:
def run(session: SparkSession, job: MLSQLJobInfo, f: () => Unit): Unit = {
  try {
    _jobListeners.foreach { listener => listener.onJobStarted(new JobStartedEvent(job.groupId)) }
    if (_jobManager == null) {
      f()
    } else {
      session.sparkContext.setJobGroup(job.groupId, job.jobName, true)
      _jobManager.groupIdToMLSQLJobInfo.put(job.groupId, job)
      f()
    }
  } finally {
    // Fire onJobFinished in finally so caches are released even when the job fails.
    _jobListeners.foreach { listener => listener.onJobFinished(new JobFinishedEvent(job.groupId)) }
    handleJobDone(job.groupId)
    session.sparkContext.clearJobGroup()
  }
}
Finally, implement the JobListener.
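The cleaning listener itself is a thin wrapper that forwards the groupId to the clean function when the job finishes. A self-contained sketch follows; CleanCacheListener is a hypothetical name, and the injected clean function stands in for SQLCacheExt.cleanCache(session, groupId), which in the real implementation would also be skipped for stream jobs:

```scala
// The listener contract from the proposal, repeated here so the sketch compiles standalone.
abstract class JobListener {

  import JobListener._

  def onJobStarted(event: JobStartedEvent): Unit

  def onJobFinished(event: JobFinishedEvent): Unit
}

object JobListener {

  trait JobEvent

  class JobStartedEvent(val groupId: String) extends JobEvent

  class JobFinishedEvent(val groupId: String) extends JobEvent
}

// Hypothetical listener: releases script-lifetime caches once the job is done.
// `clean` stands in for SQLCacheExt.cleanCache, which also needs the SparkSession.
class CleanCacheListener(clean: String => Unit) extends JobListener {

  import JobListener._

  override def onJobStarted(event: JobStartedEvent): Unit = ()

  override def onJobFinished(event: JobFinishedEvent): Unit = clean(event.groupId)
}
```

Registering an instance of this listener with the job manager closes the loop: every cache created during a script is tied to the script's groupId and released in the job-finished callback, with no manual !uncache required.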
Activity
allwefantasy commented on Apr 17, 2019
The branch is MPIP-1031.